-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-30397 Review existing server spans to match expected conventions #18056
Conversation
I think these now match the intent for open telemetry - i.e. a service name which make sense to aggregate over several requests. A couple of comments added to the code to clarify. |
https://track.hpccsystems.com/browse/HPCC-30397 |
{ | ||
//MORE: The previous code would be better off querying httpHeaders... | ||
StringBuffer spanName; | ||
spanName.append(serviceName); | ||
if (!isEmptyString(methodName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afishbeck depending on how the service function is called, sometimes the methodName is not known at this point (part of the soap envelope?) How can it be cleanly extracted? May require some refactoring, so a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this timing issue related to when we know the name comes up sometimes. You could open a JIRA to refactor to always get the name at the beginning. I actually did similar refactoring in roxie.
One idea would be to do a quick pull parser extraction of the name even if it came from the SOAP/XML/JSON content of the message body. Generally the pull parser would usually only have to walk a few tags to find what it needs. It could then abandon the parsing until the body is fully parsed later.
Another way would be to always parse the body much sooner. Early on we decided to avoid extra overhead in case we rejected the request for other reasons.. for example security. We save a little time and memory, or perhaps there may occasionally be huge requests. I think Roxie parses pretty early but isn't as close to the front lines as ESP.
Either way would be good in the sense that it would let us refactor some other code that has to wait until the method is known.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See QueryNameExtractor in roxie.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HPCC-31013 created.
@@ -3769,6 +3769,11 @@ void WsWuHelpers::submitWsWorkunit(IEspContext& context, IConstWorkUnit* cw, con | |||
} | |||
} | |||
|
|||
ISpan * activeSpan = context.queryActiveSpan(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afishbeck if you submitted a workunit the globalids etc. where not preserved in the workunit. They are now. Should this code be commoned up with the code in ws_ecl_service.cpp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when ws_ecl_service.cpp runs the job as a workunit (as opposed to calling out to roxie) it clones the original workunit. Because of that submitWsEclWorkunit is a bit different from submitWsWorkunit here, BUT it has context available so this bit of code should be able to be commoned up.
@@ -1455,7 +1456,8 @@ class RoxieProtocolMsgContext : implements IHpccProtocolMsgContext, public CInte | |||
|
|||
ensureContextLogger(); | |||
|
|||
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan("request", allHeaders, flags); | |||
const char * spanName = !isEmptyString(queryName) ? queryName : "run_query"; | |||
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan(spanName, allHeaders, flags); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of the deployed roxie query is used as the span name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do wonder if prefixing with the queryset name (aka target, and generally in AKS this is just the roxie name) would help find queries from the same roxie.. but knowing that for sure will come down to experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghalliday these changes look good. Annotating spans w/ associated WUID seems very important. I noticed a few span names were normalized to a common name which self documents the span (run_workunit for example). Is there an attribute we can add to identify the run_workunit code path? (wsecl/SubmitWorkunit, ccdlistener, eclagent)
I am open to debate on any of these - they were a first stab. run_workunit within esp is a client span - so the parent span should provide the context. run_workunit within roxie/hthor should be self-explanatory because of the component. It is the same service provided by different components, so probably should use the same name. |
No need to debate, again the names look right. |
61cc451
to
1d25238
Compare
ISpan * activeSpan = context.queryActiveSpan(); | ||
Owned<ISpan> clientSpan(activeSpan->createClientSpan("run_workunit")); | ||
Owned<IProperties> httpHeaders = ::getClientHeaders(clientSpan); | ||
recordTraceDebugOptions(wu, httpHeaders); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing the headers not being recorded in the workunit revealed another problem - see thorsoapcall for details.
@@ -682,7 +682,7 @@ class CServerSpan : public CSpan | |||
{ | |||
bool success = CSpan::getSpanContext(ctxProps, otelFormatted); | |||
|
|||
if (remoteParentSpanCtx.IsValid()) | |||
if (!otelFormatted && remoteParentSpanCtx.IsValid()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afishbeck please check. remoteParentSpanId shouldn't have been being included in the client headers. I opened HPCC-30927 to split this function into two, because it is confusing as it stands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. For what it's worth, I thought I had 2 independent versions of this function in at least one of my many PR commits.
One function should provide propagation headers the traceparent and tracestate, the other should provide the context as props.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with both of you, these should be 2 functions it's very confusing.
So I guess if blocking remoteParentSpanID when otelFormatted is set prevents it from being sent as a header then this code is right?
if (!isEmptyString(value)) | ||
request.append(key).append(": ").append(value).append("\r\n"); | ||
//If this header is http-global-id, check that global-id hasn't been explicitly added already | ||
const char * altHeader = queryAlternativeHeader(key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a query set "hpcc-global-id" but the headers line set "global-id" or vice-versa you would get both values in the headers potentially causing confusion. Only likely to affect the regression tests - but it did once the ids were preserved where they were previously dropped.
@rpastrana second commit contains fixes for regression test differences. Also rebased onto latest code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghalliday this is fine as is, but I did leave a couple of comments for you to consider.
StringBuffer spanName; | ||
spanName.append(serviceName); | ||
if (!isEmptyString(methodName)) | ||
spanName.append("/").append(methodName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's prob fine, but I wonder if the / delimiter could cause problems downstream. (prometheous, grafana, elastic will likely use this value in UI widgets). I didn't find specific rules on span nomenclature or even attribute naming rules which excludes the slash, but I'd consider another delimiter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the examples on the open telemetry recommendations for span name conventions (https://github.com/open-telemetry/opentelemetry-specification/blob/v1.26.0/specification/trace/api.md#span) uses a / : "get_account/{accountId}" - so I think it can be considered safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough.
@@ -682,7 +682,7 @@ class CServerSpan : public CSpan | |||
{ | |||
bool success = CSpan::getSpanContext(ctxProps, otelFormatted); | |||
|
|||
if (remoteParentSpanCtx.IsValid()) | |||
if (!otelFormatted && remoteParentSpanCtx.IsValid()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. For what it's worth, I thought I had 2 independent versions of this function in at least one of my many PR commits.
One function should provide propagation headers the traceparent and tracestate, the other should provide the context as props.
@@ -1485,7 +1485,9 @@ CJobMaster::CJobMaster(IConstWorkUnit &_workunit, const char *graphName, ILoaded | |||
init(); | |||
|
|||
Owned<IProperties> traceHeaders = extractTraceDebugOptions(workunit); | |||
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan(workunit->queryWuid(), traceHeaders); | |||
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan("run_graph", traceHeaders); | |||
requestSpan->setSpanAttribute("hpcc.wuid", workunit->queryWuid()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like some of the most common attributes should be defined in JTrace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something I'm still not sure about. I can see the argument for having common strings - it catches typing mistakes, but I also find the code harder to read.
I suspect I could be persuaded for strings like this that are likely to be scattered through the code.
@afishbeck please also check the logic behind the last commit changes to globalid.ecl - that ids generated by esp should have precedence over any #options trying to set options. (You could still set debug options when you submit a query to control it.) |
@@ -1485,7 +1485,9 @@ CJobMaster::CJobMaster(IConstWorkUnit &_workunit, const char *graphName, ILoaded | |||
init(); | |||
|
|||
Owned<IProperties> traceHeaders = extractTraceDebugOptions(workunit); | |||
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan(workunit->queryWuid(), traceHeaders); | |||
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan("run_graph", traceHeaders); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghalliday I wonder if workunit spans should be created as PRODUCER spans and then when a component works on that workunit it should be CONSUMER spans. (consumer may sound weird for something like thor, but see the description below):
- SERVER Indicates that the span covers server-side handling of a synchronous RPC or other remote request. This span is often the child of a remote CLIENT span that was expected to wait for a response.
- CLIENT Indicates that the span describes a request to some remote service. This span is usually the parent of a remote SERVER span and does not end until the response is received.
- PRODUCER Indicates that the span describes the initiators of an asynchronous request. This parent span will often end before the corresponding child CONSUMER span, possibly even before the child span starts. In messaging scenarios with batching, tracing individual messages requires a new PRODUCER span per message to be created.
- CONSUMER Indicates that the span describes a child of an asynchronous PRODUCER request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should probably be a server span, but the code that adds the workunit to the execution queue should probably be a producer span. I have added a jira to introduce the new concept and we can revisit existing spans when that is available.
@@ -2001,7 +2001,7 @@ int CWsEclBinding::submitWsEclWorkunit(IEspContext & context, WsEclWuInfo &wsinf | |||
|
|||
Owned<ISpan> clientSpan; | |||
ISpan * activeSpan = context.queryActiveSpan(); | |||
clientSpan.setown(activeSpan->createClientSpan("wsecl/SubmitWorkunit")); | |||
clientSpan.setown(activeSpan->createClientSpan("run_workunit")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the idea of PRODUCER spans gets interesting. I guess the producer span would be created here? or would we do it in workunit.dll when a workunit is created? Then even the most basic eclagent/roxie/hthor job would be represented by a CONSUMER span? And thor jobs may create multiple consumer spans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this is where the producer span would be. I'm not sure that I agree it makes sense to differentiate a consumer span from a server span though. In general the service does not know whether the parent span is waiting for it or not. It may make sense with an internal producer/consumer model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghalliday just wanted to make a note to retain this thought.
The reality could be different then the theory. BUT I think the reason for a consumer span rather than a server span is that a server span should show up as a sequential child within the timeframe of the span of its parent.
If a THOR job itermitantly / asynchronously works on a workunit the spans would not correlate to the original parent at all which isn't what I think is expected of server spans.
Web app example
WEB server span
---------------------------------------------------------------------------------
ROXIE server span1 ROXIE server span2 ROXIE server span3
------------------- ------------------------- ---------------------------
SOAPCALL span1
------------------
THOR job example
JOB kickoff PRODUCER span
(finished before any work even really begins)
--
THOR graph CONSUMER1 THOR graph CONSUMER2
------------------------------ -------------------------------
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I am not sure about in this model is that the consumer does not really know whether the caller is waiting for it or not. Does a webservice know if the caller is waiting for it? On reflection maybe it does. What happens if a server span sends an acknowledgement to the client before it finishes processing? Is it require to create an internal producer/consumer span if that happens?
In which case anything that pulls something off a queue should be a consumer span, but anything that is processing an http request is a server span. Definitely something to flesh out in a further discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghalliday, regarding "Does a webservice know if the caller is waiting for it?", I would think so in the case of simple synchronous HTTP. But what about the async or even asymmetric case? Such as where a web server does HTTP/2 Push, or Server-Sent Events? That seems to match the PRODUCER pattern.
[ Sorry if this is not a use case/pattern currently being considered or discussed. But as a discussion of span naming conventions that will extend beyond HPCC, I though the question is relevant ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WsECL specifically lets the caller decide if they want to have it return immediately giving them the WUID so they can retrieve the result, or will wait for the workunit to complete.
I'd consider:
- having a flag in the workunit that determines whether graph processors should create server or consumer spans. A use case that waits for the workunit to finish could set the flag to produce server spans, and batch like component could set it to create consumers.
- We aren't talking about HPCC behavior, rather about how this is recorded on the backend for external observability stacks to display. In some sense we are choosing a diagram we want used? or perhaps tools will provide different functionality. But in any case the reality of how those tools operate on the data will be the long term deciding factor for which representation is best. Since they probably modeled current PRODUCER behavior around message queues and databases we could see how well the representation of those in grafana and elastic match with async workunits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghalliday some comments.
@afishbeck please see commit to add the query set name, and responses to your other comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghalliday looks good.
Do like the idea of thinking through CONSUMER vs SERVER spans for subgraph processing in the future.
Signed-off-by: Gavin Halliday <[email protected]>
7349b29
to
27b54ba
Compare
Type of change:
Checklist:
Smoketest:
Testing: