Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add otel_monitor #461

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 8 additions & 1 deletion apps/opentelemetry/src/opentelemetry_sup.erl
Original file line number Diff line number Diff line change
Expand Up @@ -72,9 +72,16 @@ init([Opts]) ->
shutdown => infinity,
modules => [otel_span_sup]},

Monitor = #{id => otel_monitor,
start => {otel_monitor, start_link, []},
restart => permanent,
shutdown => 5000,
type => worker,
modules => [otel_monitor]},

%% `TracerServer' *must* start before the `BatchProcessor'
%% `BatchProcessor' relies on getting the `Resource' from
%% the `TracerServer' process
ChildSpecs = [Detectors, TracerServer, BatchProcessor, SimpleProcessor, SpanSup],
ChildSpecs = [Detectors, TracerServer, BatchProcessor, SimpleProcessor, SpanSup, Monitor],

{ok, {SupFlags, ChildSpecs}}.
77 changes: 77 additions & 0 deletions apps/opentelemetry/src/otel_monitor.erl
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
%%%------------------------------------------------------------------------
%% Copyright 2020, OpenTelemetry Authors
%% Licensed under the Apache License, Version 2.0 (the "License");
%% you may not use this file except in compliance with the License.
%% You may obtain a copy of the License at
%%
%% http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing, software
%% distributed under the License is distributed on an "AS IS" BASIS,
%% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
%% See the License for the specific language governing permissions and
%% limitations under the License.
%%
%% @doc
%% Process that can optionally monitor the process a span is in and end the
%% span if the process stops for any reason with the span still unfinished.
%% @end
%%%-------------------------------------------------------------------------
-module(otel_monitor).

-behaviour(gen_server).

-export([start_link/0,
monitor/1]).

-export([init/1,
handle_call/3,
handle_cast/2,
handle_info/2]).

-include("otel_tracer.hrl").

-define(SERVER, ?MODULE).
-define(TABLE, ?MODULE).

start_link() ->
gen_server:start_link({local, ?SERVER}, ?MODULE, nil, [{name, ?SERVER}]).

init(nil) ->
_TableId = ets:new(?TABLE, [set, named_table]),
{ok, nil}.

handle_call({monitor, SpanCtx, Pid}, _From, State) ->
Ref = erlang:monitor(process, Pid),
true = ets:insert(?TABLE, {Ref, SpanCtx}),
{reply, ok, State}.

Check warning on line 47 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L45-L47

Added lines #L45 - L47 were not covered by tests

handle_info({'DOWN', Ref, process, _Pid, normal}, State) ->
case ets:take(?TABLE, Ref) of
[] -> nil;
[{_Ref, SpanCtx}] -> otel_span:end_span(SpanCtx)

Check warning on line 52 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L50-L52

Added lines #L50 - L52 were not covered by tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be more than 1 span being monitored in the same process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but each span will get its own ref, no? I guess this is what I will find out when I write tests...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, my mistake. I was thinking it was like my implementation which monitors all spans in a process, so it looks up all spans for a pid, not by ref.

Which is a separate question I have, is it really worth doing it per-span, are there spans that a user wouldn't want ended when a process crashes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are asking why I didn't write a PR that monitors every span's process, it's because bigger changes like that are less likely to be accepted. I am aiming with this PR for the incremental approach: first offer the api optionally, if it turns out that everyone uses it all the time, then contemplate making it mandatory.

But if you are comfortable going directly to monitoring every span's process, then I am happy coding that up.

I'm not sure whether it's cheaper to have a single monitor per pid and juggling spans associated with that pid in an ets table, or just creating a new monitor and ets row per span.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'm asking why not monitor every span in a single process the user requests.

If the user requests to monitor Span X in process P do they really not want Span Y that is also active in process P to be ended when the process dies? If they do they have to call monitor(X) and monitor(Y). The API I had in mind was just monitor(), and then every span in self() is monitored. I suppose also a monitor(Pid) makes sense to include if we go that route.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if going the route of just monitoring the Pid and all its spans we can add a pid to the span itself. In my PR for the monitor I add a pid to span so that the existing table can just be used to look up the spans to end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asking in slack. It is hard to know what is best without people's input on how they'd use it, or if they are already using their own solution, like you have been.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 glad we are having this discussion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user requests to monitor Span X in process P do they really not want Span Y that is also active in process P to be ended when the process dies? If they do they have to call monitor(X) and monitor(Y).

I would want to have this applied to every span.

end,
{noreply, State};

Check warning on line 54 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L54

Added line #L54 was not covered by tests

handle_info({'DOWN', Ref, process, _Pid, {shutdown, _}}, State) ->
case ets:take(?TABLE, Ref) of
[] -> nil;
[{_Ref, SpanCtx}] -> otel_span:end_span(SpanCtx)

Check warning on line 59 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L57-L59

Added lines #L57 - L59 were not covered by tests
end,
{noreply, State};

Check warning on line 61 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L61

Added line #L61 was not covered by tests

handle_info({'DOWN', Ref, process, _Pid, Reason}, State) ->
case ets:take(?TABLE, Ref) of
[] -> nil;

Check warning on line 65 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L64-L65

Added lines #L64 - L65 were not covered by tests
[{_Ref, SpanCtx}] ->
otel_span:add_event(SpanCtx, 'Process died', #{<<"reason">> => iolist_to_binary(io_lib:format("~p", [Reason]))}),
otel_span:end_span(SpanCtx)

Check warning on line 68 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L67-L68

Added lines #L67 - L68 were not covered by tests
end,
{noreply, State}.

Check warning on line 70 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L70

Added line #L70 was not covered by tests

handle_cast(_Msg, State) ->
{noreply, State}.

Check warning on line 73 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L73

Added line #L73 was not covered by tests

monitor(SpanCtx) ->
ok = gen_server:call(?SERVER, {monitor, SpanCtx, self()}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User probably doesn't want to crash if the monitor fails. I'd say probably just want a cast unless you think some users will want to be able to rely on this feature in such a way that it should return false if the call fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea, returning true / false based for success / failure.

true.

Check warning on line 77 in apps/opentelemetry/src/otel_monitor.erl

View check run for this annotation

Codecov / codecov/patch

apps/opentelemetry/src/otel_monitor.erl#L76-L77

Added lines #L76 - L77 were not covered by tests