Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(trace): show status of span in trace pannel #950

Merged

Conversation

Defman
Copy link
Contributor

@Defman Defman commented Aug 16, 2024

This pr fixes missing status code from spans in the trace panel.

billede

@Defman Defman requested a review from a team as a code owner August 16, 2024 22:23
@Defman Defman force-pushed the fix/missing-status-code-for-traces-panel branch from 2d10f2d to f0dc99b Compare August 16, 2024 22:32
@SpencerTorres
Copy link
Collaborator

SpencerTorres commented Aug 16, 2024

Hey thanks for the contribution!

Can you find a source where this value comes from: STATUS_CODE_ERROR? I looked through the trace spec and the Go SDK, but can't seem to find an exact source for that string. I want to make sure this string isn't unique to a specific language SDK and will properly work for all external sources before they reach our ClickHouse exporter. Perhaps this can also be tied to a version as a value within otel.ts so that it's not hardcoded

Also what does 2 mean in the if/else? I would like to have source links or docs for these in case we need to make changes in the future

@Defman
Copy link
Contributor Author

Defman commented Aug 17, 2024

I don't think that its written in some doc or spec. However its how its defined inside the otel collector.

https://github.com/open-telemetry/opentelemetry-collector/blob/d2ed276a9215053790ddead2d6479a6beacd28a7/pdata/ptrace/status_code.go#L14

@SpencerTorres
Copy link
Collaborator

Appreciate the links to the trace panel, the 2 makes sense now.

I see the first otel link you sent has some constants, but even the String function returns Error and not STATUS_CODE_ERROR. Perhaps its in some proto file somewhere?

I just want to make sure we have a source for this string that we can track in the future:

if("StatusCode" = 'STATUS_CODE_ERROR', 2, 0) as statusCode

@Defman
Copy link
Contributor Author

Defman commented Aug 21, 2024

@SpencerTorres
Copy link
Collaborator

SpencerTorres commented Aug 21, 2024

I found the real source here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/04b3b9898b242c0b3b707bc043c025eb9f6f73ba/internal/coreinternal/traceutil/traceutil.go#L37-L47

This is what is used in the ClickHouse exporter. I'm not sure whether I like this being a String, or if it should be changed to store the integer value. It's LowCardinality so I don't expect it to affect performance too much either way. Maybe an enum would be the best for both?

Edit: After reading the PR linked in the source, it seems like our plugin is using the OLD behavior. We really should be storing Unset, Ok, and Error (see open-telemetry/opentelemetry-collector#6250).

We should consider both values in the condition check for this PR so that when its changed over there, it will support both. I'm not sure I want to put all of this in the SQL though. If we need to, we can modify this value either on the backend or in the frontend as a postprocessor for the data frame.

Let me know if you have any opinions on all of this. I will review this PR as is for now.

@SpencerTorres
Copy link
Collaborator

I have submitted a PR to upgrade these statuses to match the specification: open-telemetry/opentelemetry-collector-contrib#34799

It would be nice if the Grafana code would match on this too, but that's for another PR: https://github.com/grafana/grafana/blob/d3c6e2b203f87d210cdbf346198d960ac00e72ff/public/app/features/explore/TraceView/components/TraceTimelineViewer/utils.tsx#L73

@SpencerTorres
Copy link
Collaborator

The PR has been merged today, going forward we should match on Error for the error case.

@Defman
Copy link
Contributor Author

Defman commented Aug 29, 2024

I have now replaced STATUS_CODE_ERROR with Error.

On another note @SpencerTorres do you by any chance have experience with using hosted Clickhouse and their autoscaler? Having the issue that their autoscaler really does not like the otel workload and wants to scaleup no matter what.

@Defman
Copy link
Contributor Author

Defman commented Aug 29, 2024

Hmm latest backend tests failed with

        	Error Trace:	/home/runner/work/clickhouse-datasource/clickhouse-datasource/pkg/plugin/driver_test.go:216
        	            				/home/runner/work/clickhouse-datasource/clickhouse-datasource/pkg/plugin/driver_test.go:1123
        	Error:      	Received unexpected error:
        	            	code: 44, message: Cannot create column with type 'JSON' because experimental JSON type is not allowed. Set setting allow_experimental_json_type = 1 in order to allow it

@SpencerTorres
Copy link
Collaborator

Appreciate the changes! Will review now.

On another note @SpencerTorres do you by any chance have experience with using hosted Clickhouse and their autoscaler? Having the issue that their autoscaler really does not like the otel workload and wants to scaleup no matter what.

Are you talking about ClickHouse Cloud? We have an excellent support team who will be able to look into those scaling issues. You should be able to open a ticket in our console.

Also, the backend tests failing is a known issue, I think @bossinc will be able to workaround that if we need to merge.

@adamyeats
Copy link
Contributor

CI issue should be fixed now, please rebase the branch from main to get things up and running again! 👍

@Defman Defman force-pushed the fix/missing-status-code-for-traces-panel branch from 1efe263 to 3edd252 Compare August 30, 2024 11:13
@Defman
Copy link
Contributor Author

Defman commented Aug 30, 2024

Are you talking about ClickHouse Cloud? We have an excellent support team who will be able to look into those scaling issues. You should be able to open a ticket in our console.

Yes @SpencerTorres I'm indeed talking about Clickhouse cloud and have opend a support request, however they are not able to help in regards to the OTel workload. As it is now, regardless of what node size I pick it will scale up, even if there is currently no load issue in regards to ingesting traces and etc. Was just wondering if you had experience with the cloud offering with Otel workload.

Kind reards
Jacob

@SpencerTorres
Copy link
Collaborator

Yes @SpencerTorres I'm indeed talking about Clickhouse cloud and have opend a support request...

I would recommend continuing this discussion with the support team in that request, they will have better answers than I do in regards to scaling. I sent them this thread internally so they have more context. If you have any issues they will absolutely be able to sort it out

Copy link
Collaborator

@SpencerTorres SpencerTorres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good and is a great feature addition. Added a few notes.

Thanks for the contribution!

README.md Outdated Show resolved Hide resolved
src/data/sqlGenerator.ts Outdated Show resolved Hide resolved
@Defman
Copy link
Contributor Author

Defman commented Sep 2, 2024

Would using if("StatusCode" = 'STATUS_CODE_ERROR' OR "StatusCode" = 'Error', 2, 0) as statusCode be sufficient for backwards compatibility, such that users do not have to backfill the trace table.

@SpencerTorres
Copy link
Collaborator

Would using if("StatusCode" = 'STATUS_CODE_ERROR' OR "StatusCode" = 'Error', 2, 0) as statusCode be sufficient for backwards compatibility, such that users do not have to backfill the trace table.

I would prefer a single condition, but this option does offer the best compatibility. Yes, let's do it this way for now

@Defman
Copy link
Contributor Author

Defman commented Sep 4, 2024

@SpencerTorres I have now added the backwards compatible if statement.

@SpencerTorres
Copy link
Collaborator

@SpencerTorres I have now added the backwards compatible if statement.

It looks like this was added in the README but not the actual sqlgen code

@Defman Defman requested a review from SpencerTorres September 6, 2024 15:30
Copy link
Collaborator

@SpencerTorres SpencerTorres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, the code looks good now. Let's add a line to the CHANGELOG and then we can merge.

Appreciate your patience and contribution 🚀

Copy link
Contributor

@aangelisc aangelisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Defman!

@aangelisc aangelisc merged commit 8ad9766 into grafana:main Sep 12, 2024
15 checks passed
@aangelisc aangelisc mentioned this pull request Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants