Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-30397 Review existing server spans to match expected conventions #18056

Merged
merged 1 commit into from
Dec 15, 2023

Conversation

ghalliday
Copy link
Member

@ghalliday ghalliday commented Nov 20, 2023

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@ghalliday
Copy link
Member Author

I think these now match the intent for open telemetry - i.e. a service name which make sense to aggregate over several requests. A couple of comments added to the code to clarify.

Copy link

{
//MORE: The previous code would be better off querying httpHeaders...
StringBuffer spanName;
spanName.append(serviceName);
if (!isEmptyString(methodName))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afishbeck depending on how the service function is called, sometimes the methodName is not known at this point (part of the soap envelope?) How can it be cleanly extracted? May require some refactoring, so a different PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this timing issue related to when we know the name comes up sometimes. You could open a JIRA to refactor to always get the name at the beginning. I actually did similar refactoring in roxie.

One idea would be to do a quick pull parser extraction of the name even if it came from the SOAP/XML/JSON content of the message body. Generally the pull parser would usually only have to walk a few tags to find what it needs. It could then abandon the parsing until the body is fully parsed later.

Another way would be to always parse the body much sooner. Early on we decided to avoid extra overhead in case we rejected the request for other reasons.. for example security. We save a little time and memory, or perhaps there may occasionally be huge requests. I think Roxie parses pretty early but isn't as close to the front lines as ESP.

Either way would be good in the sense that it would let us refactor some other code that has to wait until the method is known.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See QueryNameExtractor in roxie.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HPCC-31013 created.

@@ -3769,6 +3769,11 @@ void WsWuHelpers::submitWsWorkunit(IEspContext& context, IConstWorkUnit* cw, con
}
}

ISpan * activeSpan = context.queryActiveSpan();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afishbeck if you submitted a workunit the globalids etc. where not preserved in the workunit. They are now. Should this code be commoned up with the code in ws_ecl_service.cpp?

Copy link
Member

@afishbeck afishbeck Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, when ws_ecl_service.cpp runs the job as a workunit (as opposed to calling out to roxie) it clones the original workunit. Because of that submitWsEclWorkunit is a bit different from submitWsWorkunit here, BUT it has context available so this bit of code should be able to be commoned up.

@@ -1455,7 +1456,8 @@ class RoxieProtocolMsgContext : implements IHpccProtocolMsgContext, public CInte

ensureContextLogger();

Owned<ISpan> requestSpan = queryTraceManager().createServerSpan("request", allHeaders, flags);
const char * spanName = !isEmptyString(queryName) ? queryName : "run_query";
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan(spanName, allHeaders, flags);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the deployed roxie query is used as the span name.

Copy link
Member

@afishbeck afishbeck Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if prefixing with the queryset name (aka target, and generally in AKS this is just the roxie name) would help find queries from the same roxie.. but knowing that for sure will come down to experience.

Copy link
Member

@rpastrana rpastrana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday these changes look good. Annotating spans w/ associated WUID seems very important. I noticed a few span names were normalized to a common name which self documents the span (run_workunit for example). Is there an attribute we can add to identify the run_workunit code path? (wsecl/SubmitWorkunit, ccdlistener, eclagent)

@ghalliday
Copy link
Member Author

@ghalliday these changes look good. Annotating spans w/ associated WUID seems very important. I noticed a few span names were normalized to a common name which self documents the span (run_workunit for example). Is there an attribute we can add to identify the run_workunit code path? (wsecl/SubmitWorkunit, ccdlistener, eclagent)

I am open to debate on any of these - they were a first stab.

run_workunit within esp is a client span - so the parent span should provide the context.

run_workunit within roxie/hthor should be self-explanatory because of the component. It is the same service provided by different components, so probably should use the same name.

@rpastrana
Copy link
Member

@ghalliday these changes look good. Annotating spans w/ associated WUID seems very important. I noticed a few span names were normalized to a common name which self documents the span (run_workunit for example). Is there an attribute we can add to identify the run_workunit code path? (wsecl/SubmitWorkunit, ccdlistener, eclagent)

I am open to debate on any of these - they were a first stab.

run_workunit within esp is a client span - so the parent span should provide the context.

run_workunit within roxie/hthor should be self-explanatory because of the component. It is the same service provided by different components, so probably should use the same name.

No need to debate, again the names look right.
I feel we should annotate spans (specially those with overloaded names) with a distinguishing attribute (perhaps hpcc.component=xyz is enough). But if we feel the users should be able to disambiguate based on some sort of conditions, then ignore.

ISpan * activeSpan = context.queryActiveSpan();
Owned<ISpan> clientSpan(activeSpan->createClientSpan("run_workunit"));
Owned<IProperties> httpHeaders = ::getClientHeaders(clientSpan);
recordTraceDebugOptions(wu, httpHeaders);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing the headers not being recorded in the workunit revealed another problem - see thorsoapcall for details.

@@ -682,7 +682,7 @@ class CServerSpan : public CSpan
{
bool success = CSpan::getSpanContext(ctxProps, otelFormatted);

if (remoteParentSpanCtx.IsValid())
if (!otelFormatted && remoteParentSpanCtx.IsValid())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afishbeck please check. remoteParentSpanId shouldn't have been being included in the client headers. I opened HPCC-30927 to split this function into two, because it is confusing as it stands.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. For what it's worth, I thought I had 2 independent versions of this function in at least one of my many PR commits.
One function should provide propagation headers the traceparent and tracestate, the other should provide the context as props.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with both of you, these should be 2 functions it's very confusing.

So I guess if blocking remoteParentSpanID when otelFormatted is set prevents it from being sent as a header then this code is right?

if (!isEmptyString(value))
request.append(key).append(": ").append(value).append("\r\n");
//If this header is http-global-id, check that global-id hasn't been explicitly added already
const char * altHeader = queryAlternativeHeader(key);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a query set "hpcc-global-id" but the headers line set "global-id" or vice-versa you would get both values in the headers potentially causing confusion. Only likely to affect the regression tests - but it did once the ids were preserved where they were previously dropped.

@ghalliday
Copy link
Member Author

@rpastrana second commit contains fixes for regression test differences. Also rebased onto latest code.

Copy link
Member

@rpastrana rpastrana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday this is fine as is, but I did leave a couple of comments for you to consider.

StringBuffer spanName;
spanName.append(serviceName);
if (!isEmptyString(methodName))
spanName.append("/").append(methodName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's prob fine, but I wonder if the / delimiter could cause problems downstream. (prometheous, grafana, elastic will likely use this value in UI widgets). I didn't find specific rules on span nomenclature or even attribute naming rules which excludes the slash, but I'd consider another delimiter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the examples on the open telemetry recommendations for span name conventions (https://github.com/open-telemetry/opentelemetry-specification/blob/v1.26.0/specification/trace/api.md#span) uses a / : "get_account/{accountId}" - so I think it can be considered safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

@@ -682,7 +682,7 @@ class CServerSpan : public CSpan
{
bool success = CSpan::getSpanContext(ctxProps, otelFormatted);

if (remoteParentSpanCtx.IsValid())
if (!otelFormatted && remoteParentSpanCtx.IsValid())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. For what it's worth, I thought I had 2 independent versions of this function in at least one of my many PR commits.
One function should provide propagation headers the traceparent and tracestate, the other should provide the context as props.

@@ -1485,7 +1485,9 @@ CJobMaster::CJobMaster(IConstWorkUnit &_workunit, const char *graphName, ILoaded
init();

Owned<IProperties> traceHeaders = extractTraceDebugOptions(workunit);
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan(workunit->queryWuid(), traceHeaders);
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan("run_graph", traceHeaders);
requestSpan->setSpanAttribute("hpcc.wuid", workunit->queryWuid());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like some of the most common attributes should be defined in JTrace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something I'm still not sure about. I can see the argument for having common strings - it catches typing mistakes, but I also find the code harder to read.

I suspect I could be persuaded for strings like this that are likely to be scattered through the code.

@ghalliday
Copy link
Member Author

@afishbeck please also check the logic behind the last commit changes to globalid.ecl - that ids generated by esp should have precedence over any #options trying to set options. (You could still set debug options when you submit a query to control it.)

@@ -1485,7 +1485,9 @@ CJobMaster::CJobMaster(IConstWorkUnit &_workunit, const char *graphName, ILoaded
init();

Owned<IProperties> traceHeaders = extractTraceDebugOptions(workunit);
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan(workunit->queryWuid(), traceHeaders);
Owned<ISpan> requestSpan = queryTraceManager().createServerSpan("run_graph", traceHeaders);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday I wonder if workunit spans should be created as PRODUCER spans and then when a component works on that workunit it should be CONSUMER spans. (consumer may sound weird for something like thor, but see the description below):

  • SERVER Indicates that the span covers server-side handling of a synchronous RPC or other remote request. This span is often the child of a remote CLIENT span that was expected to wait for a response.
  • CLIENT Indicates that the span describes a request to some remote service. This span is usually the parent of a remote SERVER span and does not end until the response is received.
  • PRODUCER Indicates that the span describes the initiators of an asynchronous request. This parent span will often end before the corresponding child CONSUMER span, possibly even before the child span starts. In messaging scenarios with batching, tracing individual messages requires a new PRODUCER span per message to be created.
  • CONSUMER Indicates that the span describes a child of an asynchronous PRODUCER request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should probably be a server span, but the code that adds the workunit to the execution queue should probably be a producer span. I have added a jira to introduce the new concept and we can revisit existing spans when that is available.

@@ -2001,7 +2001,7 @@ int CWsEclBinding::submitWsEclWorkunit(IEspContext & context, WsEclWuInfo &wsinf

Owned<ISpan> clientSpan;
ISpan * activeSpan = context.queryActiveSpan();
clientSpan.setown(activeSpan->createClientSpan("wsecl/SubmitWorkunit"));
clientSpan.setown(activeSpan->createClientSpan("run_workunit"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the idea of PRODUCER spans gets interesting. I guess the producer span would be created here? or would we do it in workunit.dll when a workunit is created? Then even the most basic eclagent/roxie/hthor job would be represented by a CONSUMER span? And thor jobs may create multiple consumer spans.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this is where the producer span would be. I'm not sure that I agree it makes sense to differentiate a consumer span from a server span though. In general the service does not know whether the parent span is waiting for it or not. It may make sense with an internal producer/consumer model.

Copy link
Member

@afishbeck afishbeck Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday just wanted to make a note to retain this thought.

The reality could be different then the theory. BUT I think the reason for a consumer span rather than a server span is that a server span should show up as a sequential child within the timeframe of the span of its parent.

If a THOR job itermitantly / asynchronously works on a workunit the spans would not correlate to the original parent at all which isn't what I think is expected of server spans.

Web app example

WEB server span 
---------------------------------------------------------------------------------
ROXIE server span1        ROXIE server span2          ROXIE server span3
-------------------      -------------------------     ---------------------------    
                                           SOAPCALL span1 
                                           ------------------ 

THOR job example

JOB kickoff PRODUCER span
(finished before any work even really begins)
--

                                       THOR graph CONSUMER1                  THOR graph CONSUMER2
                                       ------------------------------           -------------------------------

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I am not sure about in this model is that the consumer does not really know whether the caller is waiting for it or not. Does a webservice know if the caller is waiting for it? On reflection maybe it does. What happens if a server span sends an acknowledgement to the client before it finishes processing? Is it require to create an internal producer/consumer span if that happens?
In which case anything that pulls something off a queue should be a consumer span, but anything that is processing an http request is a server span. Definitely something to flesh out in a further discussion.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday, regarding "Does a webservice know if the caller is waiting for it?", I would think so in the case of simple synchronous HTTP. But what about the async or even asymmetric case? Such as where a web server does HTTP/2 Push, or Server-Sent Events? That seems to match the PRODUCER pattern.
[ Sorry if this is not a use case/pattern currently being considered or discussed. But as a discussion of span naming conventions that will extend beyond HPCC, I though the question is relevant ]

Copy link
Member

@afishbeck afishbeck Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WsECL specifically lets the caller decide if they want to have it return immediately giving them the WUID so they can retrieve the result, or will wait for the workunit to complete.

I'd consider:

  1. having a flag in the workunit that determines whether graph processors should create server or consumer spans. A use case that waits for the workunit to finish could set the flag to produce server spans, and batch like component could set it to create consumers.
  2. We aren't talking about HPCC behavior, rather about how this is recorded on the backend for external observability stacks to display. In some sense we are choosing a diagram we want used? or perhaps tools will provide different functionality. But in any case the reality of how those tools operate on the data will be the long term deciding factor for which representation is best. Since they probably modeled current PRODUCER behavior around message queues and databases we could see how well the representation of those in grafana and elastic match with async workunits.

Copy link
Member

@afishbeck afishbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday some comments.

@ghalliday ghalliday requested a review from afishbeck December 13, 2023 13:17
@ghalliday
Copy link
Member Author

@afishbeck please see commit to add the query set name, and responses to your other comments.

Copy link
Member

@afishbeck afishbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday looks good.

Do like the idea of thinking through CONSUMER vs SERVER spans for subgraph processing in the future.

@ghalliday ghalliday merged commit d6e08bb into hpcc-systems:candidate-9.4.x Dec 15, 2023
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants