Add otel_monitor #461

Open · wants to merge 1 commit into main

Conversation

@derekkraan (Contributor) commented Sep 19, 2022

Following up on this PR: open-telemetry/opentelemetry-erlang-contrib#109

This PR adds a new module, otel_monitor. We associate the span with a monitor ref in an ETS table and end the span when we detect that the process has died.

Looking forward to feedback on this one.

(note: I haven't looked at tests yet)
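
A minimal sketch of the approach described above, assuming a gen_server that owns an ETS table keyed by monitor ref; the module, table, and function names are illustrative and simplified, not the PR's actual code.

-module(otel_monitor_sketch).
-behaviour(gen_server).

-export([start_link/0, monitor/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(TABLE, ?MODULE).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Associate SpanCtx with a monitor on the calling process.
monitor(SpanCtx) ->
    gen_server:call(?MODULE, {monitor, SpanCtx, self()}).

init([]) ->
    ets:new(?TABLE, [named_table, protected, set]),
    {ok, #{}}.

handle_call({monitor, SpanCtx, Pid}, _From, State) ->
    Ref = erlang:monitor(process, Pid),
    ets:insert(?TABLE, {Ref, SpanCtx}),
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% When the monitored process dies, end the span associated with that ref.
handle_info({'DOWN', Ref, process, _Pid, _Reason}, State) ->
    case ets:take(?TABLE, Ref) of
        [] -> ok;
        [{Ref, SpanCtx}] -> otel_span:end_span(SpanCtx)
    end,
    {noreply, State}.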

codecov bot commented Sep 19, 2022

Codecov Report

Attention: Patch coverage is 20.83333% with 19 lines in your changes missing coverage. Please review.

Project coverage is 72.83%. Comparing base (d791c5b) to head (9cbc113).
Report is 813 commits behind head on main.

Files with missing lines Patch % Lines
apps/opentelemetry/src/otel_monitor.erl 13.63% 19 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #461      +/-   ##
==========================================
- Coverage   73.57%   72.83%   -0.75%     
==========================================
  Files          53       54       +1     
  Lines        1722     1745      +23     
==========================================
+ Hits         1267     1271       +4     
- Misses        455      474      +19     
Flag Coverage Δ
api 68.77% <ø> (ø)
elixir 18.77% <ø> (ø)
erlang 74.29% <20.83%> (-0.80%) ⬇️
exporter 72.87% <ø> (ø)
sdk 77.38% <20.83%> (-1.63%) ⬇️
zipkin 51.47% <ø> (ø)

Flags with carried forward coverage won't be shown.


@tsloughter (Member)

Just had a thought, maybe this should be part of the sweeper?

handle_info({'DOWN', Ref, process, _Pid, normal}, State) ->
    case ets:take(?TABLE, Ref) of
        [] -> nil;
        [{_Ref, SpanCtx}] -> otel_span:end_span(SpanCtx)

Member

Could be more than 1 span being monitored in the same process.

Contributor Author

Yes, but each span will get its own ref, no? I guess this is what I will find out when I write tests...
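
For reference, erlang:monitor/2 does return a distinct reference on every call, even when the same process is monitored more than once, and each 'DOWN' message carries its own ref, so one ref per span works. For example, in a shell session:

%% Two monitors on the same pid always yield two different refs.
Pid = spawn(fun() -> receive stop -> ok end end),
Ref1 = erlang:monitor(process, Pid),
Ref2 = erlang:monitor(process, Pid),
false = (Ref1 =:= Ref2).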

Member

Ooh, my mistake. I was thinking it was like my implementation, which monitors all spans in a process, so it looks up all spans for a pid, not by ref.

Which raises a separate question I have: is it really worth doing it per-span? Are there spans that a user wouldn't want ended when a process crashes?

Contributor Author

If you are asking why I didn't write a PR that monitors every span's process, it's because bigger changes like that are less likely to be accepted. I am aiming for the incremental approach with this PR: first offer the API as opt-in, and if it turns out that everyone uses it all the time, then contemplate making it mandatory.

But if you are comfortable going directly to monitoring every span's process, then I am happy to code that up.

I'm not sure whether it's cheaper to have a single monitor per pid and juggle the spans associated with that pid in an ETS table, or to create a new monitor and ETS row per span.

Member

No, I'm asking why not monitor every span in a single process the user requests.

If the user requests to monitor Span X in process P, do they really not want Span Y, which is also active in process P, to be ended when the process dies? If they do, they have to call monitor(X) and monitor(Y). The API I had in mind was just monitor(), and then every span in self() is monitored. I suppose a monitor(Pid) also makes sense to include if we go that route.
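
A rough sketch of that alternative API surface, just to make the shape concrete; ?SERVER and end_all_spans_for/1 are placeholders relative to the module sketch earlier in the thread, and how to find every active span for a pid is exactly what the following comments discuss.

%% Monitor every span in the calling process.
monitor() ->
    monitor(self()).

%% Monitor every span in Pid.
monitor(Pid) when is_pid(Pid) ->
    gen_server:cast(?SERVER, {monitor, Pid}).

handle_cast({monitor, Pid}, State) ->
    _ = erlang:monitor(process, Pid),
    {noreply, State}.

handle_info({'DOWN', _Ref, process, Pid, _Reason}, State) ->
    %% end_all_spans_for/1 is a placeholder for "end every span still
    %% active in Pid", e.g. via a pid recorded on the span (see below).
    ok = end_all_spans_for(Pid),
    {noreply, State}.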

Member

Also, if we go the route of just monitoring the Pid and all of its spans, we can add a pid to the span itself. In my PR for the monitor I add a pid to the span so that the existing table can just be used to look up the spans to end.
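
Illustrative only: assuming span-table rows shaped like {SpanId, Pid, SpanCtx} (which is not the SDK's real layout), the lookup could be a single select:

%% All span contexts recorded for Pid, under the assumed row shape above.
spans_for(SpanTab, Pid) ->
    ets:select(SpanTab, [{{'_', Pid, '$1'}, [], ['$1']}]).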

Member

I'm asking in Slack. It is hard to know what is best without people's input on how they'd use it, or whether they are already using their own solution, like you have been.

Contributor Author

👍 glad we are having this discussion

Contributor

If the user requests to monitor Span X in process P, do they really not want Span Y, which is also active in process P, to be ended when the process dies? If they do, they have to call monitor(X) and monitor(Y).

I would want to have this applied to every span.

{noreply, State}.

monitor(SpanCtx) ->
    ok = gen_server:call(?SERVER, {monitor, SpanCtx, self()}),

Member

The user probably doesn't want to crash if the monitor fails. I'd say we probably just want a cast, unless you think some users will want to rely on this feature in such a way that it should return false if the call fails?

Contributor Author

I like that idea, returning true/false for success/failure.
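
One way that could look, assuming the ?SERVER name from the diff: wrap the call so a missing or overloaded monitor process degrades to false instead of crashing the caller.

%% Returns true if the span was registered for monitoring, false otherwise.
monitor(SpanCtx) ->
    try gen_server:call(?SERVER, {monitor, SpanCtx, self()}, 1000) of
        ok -> true
    catch
        exit:_ -> false
    end.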

@derekkraan (Contributor Author)

My first instinct would be to keep it self-contained and not include it in the sweeper. It's a different use case, and the usage patterns are quite different: the sweeper does a lot of work occasionally, while the monitor does a small amount of work all of the time. The monitor will also end up in the hot path of people's code, so it should be fast.

Perhaps a cast is also better from the perspective of speed...

@tsloughter (Member)

@derekkraan What do you think of the alternative of monitoring every span in the process that is monitored?

I also wonder about the API being an option passed to with_span/start_span. It would still monitor all spans in the process, but it would make it more obvious to a user looking at the API and at with_span/start_span that they can do it.

@derekkraan (Contributor Author)

Re: monitoring every span: sounds good, and it appears this is what everyone wants. But then I think the exception event should only be added to the outermost one? Or the one that has been explicitly monitored? What do you think?

I do think there should also be an option to with_span/start_span. I was initially treating that as something that could be added later, but if there is already agreement on what it should look like (not that hard perhaps, just monitor: true?), then I can certainly look at adding it in this PR.
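
For illustration, at a call site the opt-in being discussed could read roughly like this; the monitor key is hypothetical and does not exist as a start option today.

-include_lib("opentelemetry_api/include/otel_tracer.hrl").

do_work() ->
    ?with_span(<<"do_work">>, #{monitor => true},  %% `monitor` is a hypothetical option
               fun(_SpanCtx) ->
                       %% ... the actual work ...
                       ok
               end).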

@hkrutzer (Contributor)

the exception event should only be added to the outermost one? Or the one that has been explicitly monitored? What do you think?

Yes, outermost seems fine.

@tsloughter (Member)

I don't think outermost is a good idea. It would be a bit of a pain to implement. And really, I would think the innermost would make the most sense, as it's where the failure is more likely to have happened?

I think it is better to just mark any active span in a process with the information.
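
A hedged sketch of what "mark any active span in the process" could look like in the 'DOWN' handler; the event name, attribute, and pid-keyed table are assumptions, not the PR's code.

handle_info({'DOWN', _Ref, process, Pid, Reason}, State) when Reason =/= normal ->
    %% End every span recorded for Pid and attach an event describing the exit.
    [begin
         otel_span:add_event(SpanCtx, <<"process_exit">>,
                             #{reason => iolist_to_binary(io_lib:format("~tp", [Reason]))}),
         otel_span:end_span(SpanCtx)
     end || {_, SpanCtx} <- ets:take(?TABLE, Pid)],
    {noreply, State};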

@hkrutzer (Contributor)

If the innermost is easier, I think that's fine too; any trace viewer should handle that fine.

@tsloughter (Member)

My latest: #602
