HPCC-30397 Review existing server spans to match expected conventions #18056

ghalliday · 2023-11-20T16:51:29Z

Type of change:

This change is a bug fix (non-breaking change which fixes an issue).
This change is a new feature (non-breaking change which adds functionality).
This change improves the code (refactor or other change that does not change the functionality)
This change fixes warnings (the fix does not alter the functionality or the generated code)
This change is a breaking change (fix or feature that will cause existing behavior to change).
This change alters the query API (existing queries will have to be recompiled)

Checklist:

Smoketest:

Send notifications about my Pull Request position in Smoketest queue.
Test my draft Pull Request.

Testing:

ghalliday · 2023-11-20T16:54:53Z

I think these now match the intent for open telemetry - i.e. a service name which make sense to aggregate over several requests. A couple of comments added to the code to clarify.

github-actions · 2023-11-20T16:55:06Z

https://track.hpccsystems.com/browse/HPCC-30397
Jira updated

ghalliday · 2023-11-20T16:56:13Z

esp/bindings/http/platform/httptransport.cpp

 {
    //MORE: The previous code would be better off querying httpHeaders...
+    StringBuffer spanName;
+    spanName.append(serviceName);
+    if (!isEmptyString(methodName))


@afishbeck depending on how the service function is called, sometimes the methodName is not known at this point (part of the soap envelope?) How can it be cleanly extracted? May require some refactoring, so a different PR.

Yes this timing issue related to when we know the name comes up sometimes. You could open a JIRA to refactor to always get the name at the beginning. I actually did similar refactoring in roxie.

One idea would be to do a quick pull parser extraction of the name even if it came from the SOAP/XML/JSON content of the message body. Generally the pull parser would usually only have to walk a few tags to find what it needs. It could then abandon the parsing until the body is fully parsed later.

Another way would be to always parse the body much sooner. Early on we decided to avoid extra overhead in case we rejected the request for other reasons.. for example security. We save a little time and memory, or perhaps there may occasionally be huge requests. I think Roxie parses pretty early but isn't as close to the front lines as ESP.

Either way would be good in the sense that it would let us refactor some other code that has to wait until the method is known.

See QueryNameExtractor in roxie.

HPCC-31013 created.

ghalliday · 2023-11-20T16:57:28Z

esp/services/ws_workunits/ws_workunitsHelpers.cpp

@@ -3769,6 +3769,11 @@ void WsWuHelpers::submitWsWorkunit(IEspContext& context, IConstWorkUnit* cw, con
        }
    }

+    ISpan * activeSpan = context.queryActiveSpan();


@afishbeck if you submitted a workunit the globalids etc. where not preserved in the workunit. They are now. Should this code be commoned up with the code in ws_ecl_service.cpp?

Yes, when ws_ecl_service.cpp runs the job as a workunit (as opposed to calling out to roxie) it clones the original workunit. Because of that submitWsEclWorkunit is a bit different from submitWsWorkunit here, BUT it has context available so this bit of code should be able to be commoned up.

ghalliday · 2023-11-20T16:58:27Z

roxie/ccd/ccdlistener.cpp

@@ -1455,7 +1456,8 @@ class RoxieProtocolMsgContext : implements IHpccProtocolMsgContext, public CInte

        ensureContextLogger();

-        Owned<ISpan> requestSpan = queryTraceManager().createServerSpan("request", allHeaders, flags);
+        const char * spanName = !isEmptyString(queryName) ? queryName : "run_query";
+        Owned<ISpan> requestSpan = queryTraceManager().createServerSpan(spanName, allHeaders, flags);


The name of the deployed roxie query is used as the span name.

I do wonder if prefixing with the queryset name (aka target, and generally in AKS this is just the roxie name) would help find queries from the same roxie.. but knowing that for sure will come down to experience.

rpastrana

@ghalliday these changes look good. Annotating spans w/ associated WUID seems very important. I noticed a few span names were normalized to a common name which self documents the span (run_workunit for example). Is there an attribute we can add to identify the run_workunit code path? (wsecl/SubmitWorkunit, ccdlistener, eclagent)

ghalliday · 2023-11-21T13:51:46Z

@ghalliday these changes look good. Annotating spans w/ associated WUID seems very important. I noticed a few span names were normalized to a common name which self documents the span (run_workunit for example). Is there an attribute we can add to identify the run_workunit code path? (wsecl/SubmitWorkunit, ccdlistener, eclagent)

I am open to debate on any of these - they were a first stab.

run_workunit within esp is a client span - so the parent span should provide the context.

run_workunit within roxie/hthor should be self-explanatory because of the component. It is the same service provided by different components, so probably should use the same name.

rpastrana · 2023-11-21T14:36:25Z

@ghalliday these changes look good. Annotating spans w/ associated WUID seems very important. I noticed a few span names were normalized to a common name which self documents the span (run_workunit for example). Is there an attribute we can add to identify the run_workunit code path? (wsecl/SubmitWorkunit, ccdlistener, eclagent)

I am open to debate on any of these - they were a first stab.

run_workunit within esp is a client span - so the parent span should provide the context.

run_workunit within roxie/hthor should be self-explanatory because of the component. It is the same service provided by different components, so probably should use the same name.

No need to debate, again the names look right.
I feel we should annotate spans (specially those with overloaded names) with a distinguishing attribute (perhaps hpcc.component=xyz is enough). But if we feel the users should be able to disambiguate based on some sort of conditions, then ignore.

ghalliday · 2023-11-28T10:40:42Z

esp/services/ws_workunits/ws_workunitsHelpers.cpp

+    ISpan * activeSpan = context.queryActiveSpan();
+    Owned<ISpan> clientSpan(activeSpan->createClientSpan("run_workunit"));
+    Owned<IProperties> httpHeaders = ::getClientHeaders(clientSpan);
+    recordTraceDebugOptions(wu, httpHeaders);


Fixing the headers not being recorded in the workunit revealed another problem - see thorsoapcall for details.

ghalliday · 2023-11-28T10:42:07Z

system/jlib/jtrace.cpp

@@ -682,7 +682,7 @@ class CServerSpan : public CSpan
    {
        bool success = CSpan::getSpanContext(ctxProps, otelFormatted);

-        if (remoteParentSpanCtx.IsValid())
+        if (!otelFormatted && remoteParentSpanCtx.IsValid())


@afishbeck please check. remoteParentSpanId shouldn't have been being included in the client headers. I opened HPCC-30927 to split this function into two, because it is confusing as it stands.

Agreed. For what it's worth, I thought I had 2 independent versions of this function in at least one of my many PR commits.
One function should provide propagation headers the traceparent and tracestate, the other should provide the context as props.

Agree with both of you, these should be 2 functions it's very confusing.

So I guess if blocking remoteParentSpanID when otelFormatted is set prevents it from being sent as a header then this code is right?

ghalliday · 2023-11-28T10:43:55Z

common/thorhelper/thorsoapcall.cpp

-                    if (!isEmptyString(value))
-                        request.append(key).append(": ").append(value).append("\r\n");
+                    //If this header is http-global-id, check that global-id hasn't been explicitly added already
+                    const char * altHeader = queryAlternativeHeader(key);


If a query set "hpcc-global-id" but the headers line set "global-id" or vice-versa you would get both values in the headers potentially causing confusion. Only likely to affect the regression tests - but it did once the ids were preserved where they were previously dropped.

ghalliday · 2023-11-28T10:44:30Z

@rpastrana second commit contains fixes for regression test differences. Also rebased onto latest code.

rpastrana

@ghalliday this is fine as is, but I did leave a couple of comments for you to consider.

rpastrana · 2023-11-29T15:10:50Z

esp/bindings/http/platform/httptransport.cpp

+    StringBuffer spanName;
+    spanName.append(serviceName);
+    if (!isEmptyString(methodName))
+        spanName.append("/").append(methodName);


It's prob fine, but I wonder if the / delimiter could cause problems downstream. (prometheous, grafana, elastic will likely use this value in UI widgets). I didn't find specific rules on span nomenclature or even attribute naming rules which excludes the slash, but I'd consider another delimiter

One of the examples on the open telemetry recommendations for span name conventions (https://github.com/open-telemetry/opentelemetry-specification/blob/v1.26.0/specification/trace/api.md#span) uses a / : "get_account/{accountId}" - so I think it can be considered safe.

Fair enough.

rpastrana · 2023-11-29T15:17:06Z

system/jlib/jtrace.cpp

@@ -682,7 +682,7 @@ class CServerSpan : public CSpan
    {
        bool success = CSpan::getSpanContext(ctxProps, otelFormatted);

-        if (remoteParentSpanCtx.IsValid())
+        if (!otelFormatted && remoteParentSpanCtx.IsValid())


Agreed. For what it's worth, I thought I had 2 independent versions of this function in at least one of my many PR commits.
One function should provide propagation headers the traceparent and tracestate, the other should provide the context as props.

rpastrana · 2023-11-29T15:30:10Z

thorlcr/graph/thgraphmaster.cpp

@@ -1485,7 +1485,9 @@ CJobMaster::CJobMaster(IConstWorkUnit &_workunit, const char *graphName, ILoaded
    init();

    Owned<IProperties> traceHeaders = extractTraceDebugOptions(workunit);
-    Owned<ISpan> requestSpan = queryTraceManager().createServerSpan(workunit->queryWuid(), traceHeaders);
+    Owned<ISpan> requestSpan = queryTraceManager().createServerSpan("run_graph", traceHeaders);
+    requestSpan->setSpanAttribute("hpcc.wuid", workunit->queryWuid());


It feels like some of the most common attributes should be defined in JTrace

This is something I'm still not sure about. I can see the argument for having common strings - it catches typing mistakes, but I also find the code harder to read.

I suspect I could be persuaded for strings like this that are likely to be scattered through the code.

ghalliday · 2023-12-07T15:53:40Z

@afishbeck please also check the logic behind the last commit changes to globalid.ecl - that ids generated by esp should have precedence over any #options trying to set options. (You could still set debug options when you submit a query to control it.)

afishbeck · 2023-12-12T21:13:00Z

thorlcr/graph/thgraphmaster.cpp

@@ -1485,7 +1485,9 @@ CJobMaster::CJobMaster(IConstWorkUnit &_workunit, const char *graphName, ILoaded
    init();

    Owned<IProperties> traceHeaders = extractTraceDebugOptions(workunit);
-    Owned<ISpan> requestSpan = queryTraceManager().createServerSpan(workunit->queryWuid(), traceHeaders);
+    Owned<ISpan> requestSpan = queryTraceManager().createServerSpan("run_graph", traceHeaders);


@ghalliday I wonder if workunit spans should be created as PRODUCER spans and then when a component works on that workunit it should be CONSUMER spans. (consumer may sound weird for something like thor, but see the description below):

SERVER Indicates that the span covers server-side handling of a synchronous RPC or other remote request. This span is often the child of a remote CLIENT span that was expected to wait for a response.

CLIENT Indicates that the span describes a request to some remote service. This span is usually the parent of a remote SERVER span and does not end until the response is received.

PRODUCER Indicates that the span describes the initiators of an asynchronous request. This parent span will often end before the corresponding child CONSUMER span, possibly even before the child span starts. In messaging scenarios with batching, tracing individual messages requires a new PRODUCER span per message to be created.

CONSUMER Indicates that the span describes a child of an asynchronous PRODUCER request.

I think this should probably be a server span, but the code that adds the workunit to the execution queue should probably be a producer span. I have added a jira to introduce the new concept and we can revisit existing spans when that is available.

afishbeck · 2023-12-12T22:21:41Z

esp/services/ws_ecl/ws_ecl_service.cpp

@@ -2001,7 +2001,7 @@ int CWsEclBinding::submitWsEclWorkunit(IEspContext & context, WsEclWuInfo &wsinf

    Owned<ISpan> clientSpan;
    ISpan * activeSpan = context.queryActiveSpan();
-    clientSpan.setown(activeSpan->createClientSpan("wsecl/SubmitWorkunit"));
+    clientSpan.setown(activeSpan->createClientSpan("run_workunit"));


This is where the idea of PRODUCER spans gets interesting. I guess the producer span would be created here? or would we do it in workunit.dll when a workunit is created? Then even the most basic eclagent/roxie/hthor job would be represented by a CONSUMER span? And thor jobs may create multiple consumer spans.

I agree, this is where the producer span would be. I'm not sure that I agree it makes sense to differentiate a consumer span from a server span though. In general the service does not know whether the parent span is waiting for it or not. It may make sense with an internal producer/consumer model.

@ghalliday just wanted to make a note to retain this thought.

The reality could be different then the theory. BUT I think the reason for a consumer span rather than a server span is that a server span should show up as a sequential child within the timeframe of the span of its parent.

If a THOR job itermitantly / asynchronously works on a workunit the spans would not correlate to the original parent at all which isn't what I think is expected of server spans.

Web app example WEB server span --------------------------------------------------------------------------------- ROXIE server span1 ROXIE server span2 ROXIE server span3 ------------------- ------------------------- --------------------------- SOAPCALL span1 ------------------ THOR job example JOB kickoff PRODUCER span (finished before any work even really begins) -- THOR graph CONSUMER1 THOR graph CONSUMER2 ------------------------------ -------------------------------

What I am not sure about in this model is that the consumer does not really know whether the caller is waiting for it or not. Does a webservice know if the caller is waiting for it? On reflection maybe it does. What happens if a server span sends an acknowledgement to the client before it finishes processing? Is it require to create an internal producer/consumer span if that happens?
In which case anything that pulls something off a queue should be a consumer span, but anything that is processing an http request is a server span. Definitely something to flesh out in a further discussion.

@ghalliday, regarding "Does a webservice know if the caller is waiting for it?", I would think so in the case of simple synchronous HTTP. But what about the async or even asymmetric case? Such as where a web server does HTTP/2 Push, or Server-Sent Events? That seems to match the PRODUCER pattern.
[ Sorry if this is not a use case/pattern currently being considered or discussed. But as a discussion of span naming conventions that will extend beyond HPCC, I though the question is relevant ]

WsECL specifically lets the caller decide if they want to have it return immediately giving them the WUID so they can retrieve the result, or will wait for the workunit to complete.

I'd consider:

having a flag in the workunit that determines whether graph processors should create server or consumer spans. A use case that waits for the workunit to finish could set the flag to produce server spans, and batch like component could set it to create consumers.

We aren't talking about HPCC behavior, rather about how this is recorded on the backend for external observability stacks to display. In some sense we are choosing a diagram we want used? or perhaps tools will provide different functionality. But in any case the reality of how those tools operate on the data will be the long term deciding factor for which representation is best. Since they probably modeled current PRODUCER behavior around message queues and databases we could see how well the representation of those in grafana and elastic match with async workunits.

afishbeck

@ghalliday some comments.

ghalliday · 2023-12-13T13:17:36Z

@afishbeck please see commit to add the query set name, and responses to your other comments.

afishbeck

@ghalliday looks good.

Do like the idea of thinking through CONSUMER vs SERVER spans for subgraph processing in the future.

Signed-off-by: Gavin Halliday <[email protected]>

ghalliday requested review from rpastrana and afishbeck November 20, 2023 16:51

ghalliday commented Nov 20, 2023

View reviewed changes

rpastrana approved these changes Nov 20, 2023

View reviewed changes

ghalliday force-pushed the issue30397 branch from 61cc451 to 1d25238 Compare November 28, 2023 10:38

ghalliday requested a review from rpastrana November 28, 2023 10:39

ghalliday commented Nov 28, 2023

View reviewed changes

rpastrana approved these changes Nov 29, 2023

View reviewed changes

ghalliday force-pushed the issue30397 branch from 1d25238 to 749673f Compare December 7, 2023 15:51

afishbeck reviewed Dec 12, 2023

View reviewed changes

ghalliday requested a review from afishbeck December 13, 2023 13:17

afishbeck approved these changes Dec 14, 2023

View reviewed changes

HPCC-30397 Review existing server spans to match expected conventions

27b54ba

Signed-off-by: Gavin Halliday <[email protected]>

ghalliday force-pushed the issue30397 branch from 7349b29 to 27b54ba Compare December 15, 2023 11:12

ghalliday merged commit d6e08bb into hpcc-systems:candidate-9.4.x Dec 15, 2023
45 checks passed

HPCC-30397 Review existing server spans to match expected conventions #18056

HPCC-30397 Review existing server spans to match expected conventions #18056

Conversation

ghalliday commented Nov 20, 2023 • edited Loading

Type of change:

Checklist:

Smoketest:

Testing:

ghalliday commented Nov 20, 2023

github-actions bot commented Nov 20, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

afishbeck Dec 12, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

afishbeck Dec 12, 2023 • edited Loading

Choose a reason for hiding this comment

rpastrana left a comment

Choose a reason for hiding this comment

ghalliday commented Nov 21, 2023

rpastrana commented Nov 21, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghalliday commented Nov 28, 2023

rpastrana left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghalliday commented Dec 7, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

afishbeck Dec 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

afishbeck Dec 15, 2023 • edited Loading

Choose a reason for hiding this comment

afishbeck left a comment

Choose a reason for hiding this comment

ghalliday commented Dec 13, 2023

afishbeck left a comment

Choose a reason for hiding this comment

ghalliday commented Nov 20, 2023 •

edited

Loading

afishbeck Dec 12, 2023 •

edited

Loading

afishbeck Dec 12, 2023 •

edited

Loading

afishbeck Dec 14, 2023 •

edited

Loading

afishbeck Dec 15, 2023 •

edited

Loading