Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Imprecise trace representation #888

Closed
NicolasBissig opened this issue Jul 2, 2024 · 7 comments · Fixed by #890
Closed

Imprecise trace representation #888

NicolasBissig opened this issue Jul 2, 2024 · 7 comments · Fixed by #890
Labels
datasource/ClickHouse good first issue Good for newcomers type/bug Something isn't working

Comments

@NicolasBissig
Copy link

NicolasBissig commented Jul 2, 2024

What happened: When viewing an OpenTelemetry Trace stored in ClickHouse, the trace is displayed with wrong timing information

Pasted image 20240619151413
Figure 1: The ClickHouse trace viewed in grafana

Pasted image 20240619151429
Figure 2: The same trace stored and queried from Tempo viewed in Grafana

The durations of the spans are correct, however the starting positions are not. The tempo trace is correct here.

What you expected to happen: I expect the ClickHouse trace representation to be the correct one.

How to reproduce it (as minimally and precisely as possible): Setup any tracing solution and configure an OpenTelemetry collector to export the data at the same time to the ClickHouse database AND a Tempo Backend. View the same trace in both systems. Use the ClickHouse exporter contained in the contrib version of the collector.

Screenshots

Anything else we need to know?: This seems to relate to #771

This is the default query produced for viewing a trace:

WITH 'd469af085bf7524ba4dd5eb2814a6918' as trace_id, (SELECT min(Start)  
                                                      FROM "otel"."otel_traces_trace_id_ts"  
                                                      WHERE TraceId = trace_id) as trace_start, (SELECT max(End) + 1  
                                                                                                 FROM "otel"."otel_traces_trace_id_ts"  
                                                                                                 WHERE TraceId = trace_id) as trace_end  
SELECT "TraceId"                                                                                           as traceID,  
       "SpanId"                                                                                            as spanID,  
       "ParentSpanId"                                                                                      as parentSpanID,  
       "ServiceName"                                                                                       as serviceName,  
       "SpanName"                                                                                          as operationName,  
       "Timestamp"                                                                                         as startTime,  
       multiply("Duration", 0.000001)                                                                      as duration,  
       arrayMap(key -> map('key', key, 'value', "SpanAttributes"[key]), mapKeys("SpanAttributes"))         as tags,  
       arrayMap(key -> map('key', key, 'value', "ResourceAttributes"[key]),  
                mapKeys("ResourceAttributes"))                                                             as serviceTags  
FROM "otel"."otel_traces"  
WHERE traceID = trace_id  
  AND startTime >= trace_start  
  AND startTime <= trace_end  
LIMIT 1000

If it is adapted to

WITH 'd469af085bf7524ba4dd5eb2814a6918' as trace_id, (SELECT min(Start)  
                                                      FROM "otel"."otel_traces_trace_id_ts"  
                                                      WHERE TraceId = trace_id) as trace_start, (SELECT max(End) + 1  
                                                                                                 FROM "otel"."otel_traces_trace_id_ts"  
                                                                                                 WHERE TraceId = trace_id) as trace_end  
SELECT "TraceId"                                                                                   as traceID,  
       "SpanId"                                                                                    as spanID,  
       "ParentSpanId"                                                                              as parentSpanID,  
       "ServiceName"                                                                               as serviceName,  
       "SpanName"                                                                                  as operationName,  
       "Timestamp"                                                                                 as startTimeNormal,  
       toUnixTimestamp64Nano(Timestamp) / 1000000                                                  as startTime,  
       multiply("Duration", 0.000001)                                                              as duration,  
       arrayMap(key -> map('key', key, 'value', "SpanAttributes"[key]), mapKeys("SpanAttributes")) as tags,  
       arrayMap(key -> map('key', key, 'value', "ResourceAttributes"[key]),  
                mapKeys("ResourceAttributes"))                                                     as serviceTags  
FROM "otel"."otel_traces"  
WHERE traceID = trace_id  
  AND startTimeNormal >= trace_start  
  AND startTimeNormal <= trace_end  
LIMIT 1000

mainly with toUnixTimestamp64Nano(Timestamp) / 1000000 as startTime will result in the correct trace representation:

Pasted image 20240619151547
Figure 3: Trace view from ClickHouse with adapted query, now displayed correctly.

This is just a workaround / solution suggestion, if there are better ways to provide the same result, please let me know

Environment:

  • Grafana version: 11
  • Plugin version: 4.1.0
  • OS Grafana is installed on: Docker or Kubernetes
  • User OS & Browser: Linux / Windows & Brave Browser
  • Others:
@aangelisc
Copy link
Contributor

Hi @NicolasBissig, thank you for the report. Does this happen when the duration unit is in nanoseconds? Is the same behaviour observed if the unit is in milliseconds?

@aangelisc aangelisc moved this from Incoming to Waiting in Partner Datasources Jul 2, 2024
@SpencerTorres
Copy link
Collaborator

It seems like Grafana prefers the float timestamps for tracing purposes since JavaScript Date can't go sub-millisecond.

We will need to do a similar formula to what we do for duration, but this should be an easy change.

@aangelisc changing duration unit only affects duration for now, but now I see we might need to apply it to startTime as a float

@SpencerTorres SpencerTorres added the good first issue Good for newcomers label Jul 2, 2024
@aangelisc
Copy link
Contributor

Thanks for the response @SpencerTorres, perhaps we could generalise this to handle any column appropriately, wdyt?

@SpencerTorres
Copy link
Collaborator

I think it could be a constant function actually, since we would know the precision from the query.

That function is used for when we don't know the intended scale of duration, but if we're the ones converting the startTime then we can always go straight to nanoseconds and then shift the decimal to milliseconds.

Should be an easy change. See #720 for a visualization of this issue occurring on duration

@NicolasBissig
Copy link
Author

@aangelisc as I see that there already is an open pull request, can I assume that the question with the duration unit has been answered? :)

@aangelisc
Copy link
Contributor

Hi @NicolasBissig, yep that question has been answered. I believe the open PR should fix this problem!

@NicolasBissig
Copy link
Author

Cool, thanks for the fast fix guys! 👏

@aangelisc aangelisc moved this from Waiting to In Progress in Partner Datasources Jul 5, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Partner Datasources Jul 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasource/ClickHouse good first issue Good for newcomers type/bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants