Add otel_monitor #461

Open · wants to merge 1 commit into main

Conversation

@derekkraan (Contributor) commented Sep 19, 2022

Following up on this PR: open-telemetry/opentelemetry-erlang-contrib#109

This PR adds a new module, otel_monitor. We associate the span with a monitor ref in an ETS table and end the span when we detect that the process has died.

Looking forward to feedback on this one.

(note: I haven't looked at tests yet)
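
A minimal sketch of the approach described above, assuming a gen_server that owns an ETS table keyed by monitor ref; the module, table, and function names are illustrative and simplified, not the PR's actual code.

-module(otel_monitor_sketch).
-behaviour(gen_server).

-export([start_link/0, monitor/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(TABLE, ?MODULE).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Associate SpanCtx with a monitor on the calling process.
monitor(SpanCtx) ->
    gen_server:call(?MODULE, {monitor, SpanCtx, self()}).

init([]) ->
    ets:new(?TABLE, [named_table, protected, set]),
    {ok, #{}}.

handle_call({monitor, SpanCtx, Pid}, _From, State) ->
    Ref = erlang:monitor(process, Pid),
    ets:insert(?TABLE, {Ref, SpanCtx}),
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% When the monitored process dies, end the span associated with that ref.
handle_info({'DOWN', Ref, process, _Pid, _Reason}, State) ->
    case ets:take(?TABLE, Ref) of
        [] -> ok;
        [{Ref, SpanCtx}] -> otel_span:end_span(SpanCtx)
    end,
    {noreply, State}.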

codecov bot commented Sep 19, 2022

Codecov Report

Attention: Patch coverage is 20.83333% with 19 lines in your changes missing coverage. Please review.

Project coverage is 72.83%. Comparing base (d791c5b) to head (9cbc113).
Report is 813 commits behind head on main.

Files with missing lines Patch % Lines
apps/opentelemetry/src/otel_monitor.erl 13.63% 19 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #461      +/-   ##
==========================================
- Coverage   73.57%   72.83%   -0.75%     
==========================================
  Files          53       54       +1     
  Lines        1722     1745      +23     
==========================================
+ Hits         1267     1271       +4     
- Misses        455      474      +19     
Flag Coverage Δ
api 68.77% <ø> (ø)
elixir 18.77% <ø> (ø)
erlang 74.29% <20.83%> (-0.80%) ⬇️
exporter 72.87% <ø> (ø)
sdk 77.38% <20.83%> (-1.63%) ⬇️
zipkin 51.47% <ø> (ø)

Flags with carried forward coverage won't be shown.


@tsloughter (Member)

Just had a thought, maybe this should be part of the sweeper?

handle_info({'DOWN', Ref, process, _Pid, normal}, State) ->
    case ets:take(?TABLE, Ref) of
        [] -> nil;
        [{_Ref, SpanCtx}] -> otel_span:end_span(SpanCtx)

Member

Could be more than 1 span being monitored in the same process.

Contributor Author

Yes, but each span will get its own ref, no? I guess this is what I will find out when I write tests...
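
For reference, erlang:monitor/2 does return a distinct reference on every call, even when the same process is monitored more than once, and each 'DOWN' message carries its own ref, so one ref per span works. For example, in a shell session:

%% Two monitors on the same pid always yield two different refs.
Pid = spawn(fun() -> receive stop -> ok end end),
Ref1 = erlang:monitor(process, Pid),
Ref2 = erlang:monitor(process, Pid),
false = (Ref1 =:= Ref2).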

Member

Ooh, my mistake. I was thinking it was like my implementation, which monitors all spans in a process, so it looks up all spans for a pid, not by ref.

Which raises a separate question I have: is it really worth doing it per-span? Are there spans that a user wouldn't want ended when a process crashes?

Contributor Author

If you are asking why I didn't write a PR that monitors every span's process, it's because bigger changes like that are less likely to be accepted. I am aiming for the incremental approach with this PR: first offer the API as opt-in, and if it turns out that everyone uses it all the time, then contemplate making it mandatory.

But if you are comfortable going directly to monitoring every span's process, then I am happy to code that up.

I'm not sure whether it's cheaper to have a single monitor per pid and juggle the spans associated with that pid in an ETS table, or to create a new monitor and ETS row per span.

Member

No, I'm asking why not monitor every span in a single process the user requests.

If the user requests to monitor Span X in process P, do they really not want Span Y, which is also active in process P, to be ended when the process dies? If they do, they have to call monitor(X) and monitor(Y). The API I had in mind was just monitor(), and then every span in self() is monitored. I suppose a monitor(Pid) also makes sense to include if we go that route.
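
A rough sketch of that alternative API surface, just to make the shape concrete; ?SERVER and end_all_spans_for/1 are placeholders relative to the module sketch earlier in the thread, and how to find every active span for a pid is exactly what the following comments discuss.

%% Monitor every span in the calling process.
monitor() ->
    monitor(self()).

%% Monitor every span in Pid.
monitor(Pid) when is_pid(Pid) ->
    gen_server:cast(?SERVER, {monitor, Pid}).

handle_cast({monitor, Pid}, State) ->
    _ = erlang:monitor(process, Pid),
    {noreply, State}.

handle_info({'DOWN', _Ref, process, Pid, _Reason}, State) ->
    %% end_all_spans_for/1 is a placeholder for "end every span still
    %% active in Pid", e.g. via a pid recorded on the span (see below).
    ok = end_all_spans_for(Pid),
    {noreply, State}.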

Member

Also, if we go the route of just monitoring the Pid and all of its spans, we can add a pid to the span itself. In my PR for the monitor I add a pid to the span so that the existing table can just be used to look up the spans to end.
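
Illustrative only: assuming span-table rows shaped like {SpanId, Pid, SpanCtx} (which is not the SDK's real layout), the lookup could be a single select:

%% All span contexts recorded for Pid, under the assumed row shape above.
spans_for(SpanTab, Pid) ->
    ets:select(SpanTab, [{{'_', Pid, '$1'}, [], ['$1']}]).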

Member

I'm asking in Slack. It is hard to know what is best without people's input on how they'd use it, or whether they are already using their own solution, like you have been.

Contributor Author

👍 glad we are having this discussion

Contributor

If the user requests to monitor Span X in process P, do they really not want Span Y, which is also active in process P, to be ended when the process dies? If they do, they have to call monitor(X) and monitor(Y).

I would want to have this applied to every span.

{noreply, State}.

monitor(SpanCtx) ->
    ok = gen_server:call(?SERVER, {monitor, SpanCtx, self()}),

Member

The user probably doesn't want to crash if the monitor fails. I'd say we probably just want a cast, unless you think some users will want to rely on this feature in such a way that it should return false if the call fails?

Contributor Author

I like that idea, returning true/false for success/failure.
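
One way that could look, assuming the ?SERVER name from the diff: wrap the call so a missing or overloaded monitor process degrades to false instead of crashing the caller.

%% Returns true if the span was registered for monitoring, false otherwise.
monitor(SpanCtx) ->
    try gen_server:call(?SERVER, {monitor, SpanCtx, self()}, 1000) of
        ok -> true
    catch
        exit:_ -> false
    end.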

@derekkraan (Contributor Author)

My first instinct would be to keep it self-contained and not include it in the sweeper. It's a different use case, and the usage patterns are quite different: the sweeper does a lot of work occasionally, while the monitor does a small amount of work all of the time. The monitor will also end up in the hot path of people's code, so it should be fast.

Perhaps a cast is also better from the perspective of speed...

@tsloughter (Member)

@derekkraan What do you think of the alternative of monitoring every span in the process that is monitored?

I also wonder about the API being an option passed to with_span/start_span. It would still monitor all spans in the process, but it would make it more obvious to a user looking at the API and at with_span/start_span that they can do it.

@derekkraan (Contributor Author)

Re: monitoring every span: sounds good, and it appears this is what everyone wants. But then I think the exception event should only be added to the outermost one? Or the one that has been explicitly monitored? What do you think?

I do think there should also be an option to with_span/start_span. I was initially treating that as something that could be added later, but if there is already agreement on what it should look like (not that hard perhaps, just monitor: true?), then I can certainly look at adding it in this PR.
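
For illustration, at a call site the opt-in being discussed could read roughly like this; the monitor key is hypothetical and does not exist as a start option today.

-include_lib("opentelemetry_api/include/otel_tracer.hrl").

do_work() ->
    ?with_span(<<"do_work">>, #{monitor => true},  %% `monitor` is a hypothetical option
               fun(_SpanCtx) ->
                       %% ... the actual work ...
                       ok
               end).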

@hkrutzer (Contributor)

the exception event should only be added to the outermost one? Or the one that has been explicitly monitored? What do you think?

Yes, outermost seems fine.

@tsloughter (Member)

I don't think outermost is a good idea. It would be a bit of a pain to implement. And really, I would think the innermost would make the most sense, as it's where the failure is more likely to have happened?

I think it is better to just mark any active span in a process with the information.
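
A hedged sketch of what "mark any active span in the process" could look like in the 'DOWN' handler; the event name, attribute, and pid-keyed table are assumptions, not the PR's code.

handle_info({'DOWN', _Ref, process, Pid, Reason}, State) when Reason =/= normal ->
    %% End every span recorded for Pid and attach an event describing the exit.
    [begin
         otel_span:add_event(SpanCtx, <<"process_exit">>,
                             #{reason => iolist_to_binary(io_lib:format("~tp", [Reason]))}),
         otel_span:end_span(SpanCtx)
     end || {_, SpanCtx} <- ets:take(?TABLE, Pid)],
    {noreply, State};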

@hkrutzer (Contributor)

If the innermost is easier, I think that's fine too; any trace viewer should handle that fine.

@tsloughter (Member)

My latest: #602
