Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add span monitor process to optionally monitor the spans in a process #84

Closed
wants to merge 1 commit into from

Conversation

tsloughter
Copy link
Member

A span can optionally set monitor to true at the time it is created.
This results in a monitor on the process and if that process exits
for any reason all spans started in that process will be ended.

@tsloughter tsloughter requested a review from a team July 9, 2020 14:51
@tsloughter
Copy link
Member Author

This is partly wrong. I realized right before opening the PR that this shouldn't be done for the current process when start_inactive_span is used. This has to be done for the process it becomes active in.

Not sure yet how I want to fix this.

@tsloughter
Copy link
Member Author

Two questions I have are:

  • Should this be only per-span. Right now it will end all spans in the process no matter which span includes the monitor => true option. I could instead have the monitor only end spans that explicitly include the monitor option and leave the others open.
  • Maybe if the solution of having it end all spans (as it is in this PR currently) in the monitored process is agreed on it should then be a call like ot_tracer:monitor_spans instead of being an option on span start?

A span can optionally set `monitor` to true at the time it is created.
This results in a monitor on the process and if that process exits
for any reason all spans started in that process will be ended.
@bryannaegele
Copy link
Contributor

It makes sense that you'd want to close all spans in the monitored process since the process dying is fatal to all of them.

@tsloughter
Copy link
Member Author

@bryannaegele my thought too. But people do/want weird things.

I should probably change it to be a function called to monitor all spans in a process instead of adding it to the span start opts then...

@garthk
Copy link
Contributor

garthk commented Jul 9, 2020

I think I'd want to add this to my tracer config to cover every span it starts. I'd also want a way to turn the exit details into extra span attributes, though the latter'd also apply to any with_child_span construct taking a function argument and catching exceptions.

@davydog187
Copy link
Contributor

@tsloughter is there any remaining work left on this PR?

I'm seeing this manifest when using Task.async_stream

Screen Shot 2020-08-31 at 11 22 58 AM

One of the tasks times out with no child span reported

@tsloughter
Copy link
Member Author

@davydog187 oh, hm, I think I just wanted to offer a function to start monitoring in a process (instead of just being an option to start_span).

Maybe a config value so it could be turned on globally but we probably will want a pool of monitoring processes before supporting that so it'll come in a separate PR.

@tsloughter
Copy link
Member Author

So I want to finish this up and can't decide what the best interface is.

An issue is how to handle inactive spans. I think this means that when attaching a context the active span in that context has to have its pid updated in the ets table...

@tsloughter tsloughter changed the base branch from master to main February 11, 2021 19:38
@derekkraan
Copy link
Contributor

derekkraan commented Oct 28, 2021

I implemented (by peeking at this PR) this in a project I'm working on. In the end I went with the following interface:
OpenTelemetry.Span.monitor/1. This is nice because it allows us to do something like:

OpenTelemetry.Tracer.with_span "my_span" do
  OpenTelemetry.Span.monitor(OpenTelemetry.Tracer.current_span_ctx())
end

Here's the code I ended up with, for posterity:

defmodule OT.Monitor do                                                                        
  use GenServer                                                                                
                                                                                               
  def start_link(_arg) do                                                                      
    GenServer.start_link(__MODULE__, nil, name: __MODULE__)                                    
  end                                                                                          
                                                                                               
  def init(nil) do                                                                             
    _table_id = :ets.new(__MODULE__, [:bag, :public, {:write_concurrency, true}, :named_table])
    {:ok, nil}                                                                                 
  end                                                                                          
                                                                                               
  def handle_call({:monitor, pid}, _from, state) do                                            
    Process.monitor(pid)                                                                       
    {:reply, :ok, state}                                                                       
  end                                                                                          
                                                                                               
  def handle_info({:DOWN, _ref, :process, pid, :normal}, state) do                             
    :ets.take(__MODULE__, pid)                                                                 
    |> Enum.each(fn {_pid, ctx} ->                                                             
      _span_ctx = OpenTelemetry.Tracer.set_current_span(ctx)                                   
      _ = OpenTelemetry.Tracer.end_span()                                                      
    end)                                                                                       
                                                                                               
    {:noreply, state}                                                                          
  end                                                                                          
                                                                                               
  def handle_info({:DOWN, _ref, :process, pid, {:shutdown, _}}, state) do                      
    :ets.take(__MODULE__, pid)                                                                 
    |> Enum.each(fn {_pid, ctx} ->                                                             
      _span_ctx = OpenTelemetry.Tracer.set_current_span(ctx)                                   
      _ = OpenTelemetry.Tracer.end_span()                                                      
    end)                                                                                       
                                                                                               
    {:noreply, state}                                                                          
  end                                                                                          
                                                                                               
  def handle_info({:DOWN, _ref, :process, pid, reason}, state) do                              
    :ets.take(__MODULE__, pid)                                                                 
    |> Enum.each(fn {_pid, ctx} ->                                                             
      _span_ctx = OpenTelemetry.Tracer.set_current_span(ctx)                                   
      _ = OpenTelemetry.Tracer.add_event("Process died", [{"reason", inspect(reason)}])        
      _ = OpenTelemetry.Tracer.end_span()                                                      
    end)                                                                                       
                                                                                               
    {:noreply, state}                                                                          
  end                                                                                          
                                                                                               
  def monitor(span_ctx) do                                                                     
    if Application.fetch_env!(:opentelemetry, :enabled) do                                     
      # monitor first, because the monitor is necessary to clean the ets table.                
      :ok = GenServer.call(__MODULE__, {:monitor, self()})                                     
      true = :ets.insert(__MODULE__, {self(), span_ctx})                                       
    end                                                                                        
  end                                                                                          
end                                                                                            

@hkrutzer
Copy link
Contributor

hkrutzer commented May 3, 2022

I think this functionality is pretty critical. Personally I would want to have this for as many spans as possible, and definitely for active spans. So that means I would prefer not to set it as an option for every span, but rather as a config value, but I'll take any version of this feature I can get 🙂

@tsloughter
Copy link
Member Author

If we either make span processors not a pain or extend them to allow additional functionality attached to them then I think we could easily allow it to be configured for all spans instead of an option per-span.

@fcevado
Copy link
Contributor

fcevado commented Sep 2, 2022

@tsloughter i was thinking if spans actually shoudn't live in a external process, that monitor the process starting the spans, in case of the process dying or raises, it closes all remaining open spans with error status.
I know it would be a big change on otel sdk architecture, should i start a discussion about that?

@tsloughter
Copy link
Member Author

@fcevado that could work. There are other tracing libs that do just that, trying to remember which one... Ah, it is Tapper. And I think New Relic or AppSignal actually use (or used) a separate process for each trace -- which I wouldn't think would scale well, at least compared to Tapper or our use of ETS.

What I'd hope to be able to experiment with is alternate SDKs. Technically it should be possible for an SDK to be created that does make all span operations messages to another process. No user code should have to change, just which SDK they include in their project.

@tsloughter
Copy link
Member Author

Closing for #602

@tsloughter tsloughter closed this Aug 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants