Add opentelemetry+elastic agent overhead benchmark on a weekly basis #3369

jackshirazi · 2023-10-18T12:53:46Z

The OpenTelemetry overhead benchmark can easily be configured to include the Elastic agent, so it is a good candidate to run on a weekly basis.

jackshirazi · 2023-10-18T12:56:50Z

Here's a script that runs all the current agents (baseline, otel release, otel snapshot, elastic release, elastic snapshot, elastic release async) on a new VM. The script requires GitHub token access, set in the variables gituser and gittoken.

sudo apt update
sudo apt install -y openjdk-17-jdk-headless
sudo apt install -y docker.io
sudo apt install -y jq
sudo apt install -y unzip
git clone https://github.com/open-telemetry/opentelemetry-java-instrumentation.git
cd opentelemetry-java-instrumentation/
./gradlew assemble
cd benchmark-overhead
ELASTIC_SNAPSHOT_URL=$(curl -s -u $gituser:$gittoken "https://api.github.com/repos/elastic/apm-agent-java/actions/workflows/49838992/runs?branch=main" | jq -c '.workflow_runs[] | {conclusion, updated_at, display_title, url}' | grep -v null  | grep -v pending | grep -v cancelled | grep success | head -1 | awk -F'":"' '{print $5}' | tr -d '"}')
ELASTIC_SNAPSHOT_ARTIFACTS=$(curl -s -u $gituser:$gittoken "$ELASTIC_SNAPSHOT_URL" | grep artifacts_url | awk -F'":' '{print $2}' | tr -d '"} ,')
ELASTIC_SNAPSHOT_ZIPFILE=$(curl -s -u $gituser:$gittoken "$ELASTIC_SNAPSHOT_ARTIFACTS" | jq -c ".artifacts[] | {name,archive_download_url}" | grep '"elastic-apm-agent"' | awk -F'":' '{print $3}' | tr -d '"}')
curl -s --output "elastic-agent.zip"  -L -H "Accept: application/vnd.github+json" -H "Authorization: Bearer $gittoken" -H "X-GitHub-Api-Version: 2022-11-28" -u $gituser:$gittoken "$ELASTIC_SNAPSHOT_ZIPFILE"
unzip elastic-agent.zip
ELASTIC_SNAPSHOT_JAR=$(ls -1 elastic-apm-agent-*.jar)
ELASTIC_SNAPSHOT_ENTRY="new Agent(\\\"elastic-snapshot\\\",\\\"latest available snapshot version from elastic main\\\",\\\"file://$PWD/$ELASTIC_SNAPSHOT_JAR\\\")"
ELASTIC_LATEST_VERSION=$(curl -s https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/ | perl -ne 's/<.*?>//g; if(s/^([\d\.]+).*$/$1/){print}' | sort -V | tail -1)
ELASTIC_LATEST_ENTRY="new Agent(\\\"elastic-latest\\\",\\\"latest available released version from elastic main\\\",\\\"https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/$ELASTIC_LATEST_VERSION/elastic-apm-agent-$ELASTIC_LATEST_VERSION.jar\\\")"
ELASTIC_LATEST_ENTRY2="new Agent(\\\"elastic-async\\\",\\\"latest available released version from elastic main\\\",\\\"https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/$ELASTIC_LATEST_VERSION/elastic-apm-agent-$ELASTIC_LATEST_VERSION.jar\\\", java.util.List.of(\\\"-Delastic.apm.delay_agent_premain_ms=15000\\\"))"
NEW_LINE="              .withAgents(Agent.NONE, Agent.LATEST_RELEASE, Agent.LATEST_SNAPSHOT, $ELASTIC_LATEST_ENTRY, $ELASTIC_LATEST_ENTRY2, $ELASTIC_SNAPSHOT_ENTRY)"
echo $NEW_LINE
perl -i -ne "if (/withAgents/) {print \"$NEW_LINE\n\"}else{print}" src/test/java/io/opentelemetry/config/Configs.java
sudo ./gradlew test
perl -ne '/Standard output/ && $on++; /\<\/pre\>/ && ($on=0);$on && s/\<.*\>//;$on && !/^\s*$/ && print' build/reports/tests/test/classes/io.opentelemetry.OverheadTests.html
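
Each variable in the script above comes out of a long curl/jq/grep/awk chain, and an empty result from any of them would only show up later as a confusing failure. A minimal sketch of a guard that could follow each assignment; `require_nonempty` is a hypothetical helper name and is not part of the original script:

```shell
# Hypothetical helper (not in the original script): fail early when a
# derived variable is empty instead of letting later steps run with it.
require_nonempty() {
  # $1 = variable name (used only in the message), $2 = its current value
  if [ -z "$2" ]; then
    echo "error: $1 is empty, aborting" >&2
    return 1
  fi
}

# Example use right after an assignment from the script:
#   require_nonempty ELASTIC_SNAPSHOT_URL "$ELASTIC_SNAPSHOT_URL" || exit 1
```

The same call could be repeated for ELASTIC_SNAPSHOT_ARTIFACTS, ELASTIC_SNAPSHOT_ZIPFILE, ELASTIC_SNAPSHOT_JAR and ELASTIC_LATEST_VERSION.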

jackshirazi · 2023-10-18T12:57:59Z

Here's the output from a run. The script will need adjusting to write the results somewhere in CSV or JSON format so that we can see trends. Note that this run was on a VM on a shared host, so the variability could be down to resource contention. The weekly run needs to be on an isolated, dedicated host.

----------------------------------------------------------
 Run at Wed Oct 18 12:23:13 UTC 2023
 release : compares no agent, latest stable, and latest snapshot agents
 5 users, 5000 iterations
----------------------------------------------------------
Agent               :              none           latest         snapshot   elastic-latest    elastic-async elastic-snapshot
Run duration        :          00:00:55         00:01:05         00:01:07         00:00:59         00:01:00         00:00:59
Avg. CPU (user) %   :        0.36701292       0.42148226       0.41992074        0.3964025        0.4192863       0.39579248
Max. CPU (user) %   :            0.5275       0.56296295         0.566416             0.56        0.5785536           0.5425
Avg. mch tot cpu %  :         0.9447133       0.95868945       0.94469744       0.94119316       0.95311606         0.947395
Startup time (ms)   :              9066            12809            12752            12841             8626            12790
Total allocated MB  :          15065.52         20208.01         20320.67         15992.55         15519.32         15927.61
Min heap used (MB)  :            180.06           119.93           114.68           188.48           110.91           141.44
Max heap used (MB)  :            549.36           429.93           374.98           639.65           446.14           532.69
Thread switch rate  :         56717.484        56390.805        54232.383        55348.195         56344.96          54675.2
GC time (ms)        :               820              571              786              933              425              844
GC pause time (ms)  :               820              571              786              933              425              844
Req. mean (ms)      :              4.05             4.84             5.01             4.37             4.46             4.37
Req. p95 (ms)       :             10.47            12.82            13.09            11.33            11.51            11.42
Iter. mean (ms)     :             53.19            63.49            65.44            57.30            58.35            57.31
Iter. p95 (ms)      :             81.21            96.39           100.21            86.74            91.41            86.21
Net read avg (bps)  :       14473952.00      12654705.00      12173681.00      12410859.00      11300186.00      12229296.00
Net write avg (bps) :       19364629.00      69109205.00      66806772.00      16584103.00      15095361.00      16333396.00
Peak threads        :                40               53               54               47               47               47
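
As one possible way to get the table above into CSV for trend tracking, here is a minimal awk sketch. It assumes the report keeps the `label : values` layout shown above, that labels contain no colons, and that values contain no spaces. `results_to_csv` is a hypothetical name; this is not the processing class that was later added for this task.

```shell
# Hypothetical helper: read the benchmark summary on stdin and print one
# CSV row per metric, starting at the "Agent" header row.
results_to_csv() {
  awk '
    /^-+$/          { in_table = 0 }   # a dashed rule ends any previous table
    /^Agent[ \t]*:/ { in_table = 1 }   # the agent-name row starts the table
    in_table && index($0, ":") > 0 {
      pos = index($0, ":")             # first colon separates label from values
      label = substr($0, 1, pos - 1)
      values = substr($0, pos + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", label)
      n = split(values, cols, " ")     # a single-space separator means: split on runs of whitespace
      line = "\"" label "\""
      for (i = 1; i <= n; i++) line = line "," cols[i]
      print line
    }
  '
}

# Example use:
#   results_to_csv < results.txt > results.csv
```

Durations such as 00:00:55 survive intact because only the first colon on each line is treated as the separator.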

jackshirazi · 2023-10-18T13:43:06Z

Note that the otel tests include a collector, so the elastic tests need to include a mock APM server (e.g. the one in this project's tests should work).

jackshirazi · 2023-10-18T14:24:50Z

@v1v this is the test we'd like to run on an isolated, specific hardware configuration. The suggested Ubuntu 20.04 with 6 CPU cores and 64117 MB of memory would be fine. We can either go with that in Buildkite or wait for runners, if those are likely to be available in the next couple of months.

v1v · 2023-10-18T20:42:05Z

A few questions:

> sudo ./gradlew test

What's the reason for sudo? I cannot see anything related to sudo in https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/benchmark-overhead#setup-and-usage

> the elastic tests need to include a mock apm server

Can you provide the set of steps to run the mock apm-server?

So far I managed to test the above-mentioned steps in Buildkite, see https://buildkite.com/elastic/apm-agent-java-load-testing/builds/147#018b446f-ba68-4049-9a8e-78b07b3d4eb3

Those steps have been coded in #3371

jackshirazi · 2023-10-18T21:58:24Z

> What's the reason for sudo

I didn't actually investigate why it failed without sudo, but the Docker images wouldn't run. It's probably something to do with the Docker install, which might not be the best choice of install method.
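
The thread does not establish why sudo was needed. One common cause with a fresh `docker.io` install on Ubuntu is that the current user is not in the `docker` group and so cannot reach the Docker socket; that is an assumption here, not something confirmed for this VM. A small sketch of a check, with `in_docker_group` as a hypothetical helper:

```shell
# Hypothetical helper: succeed if "docker" appears as a whole word in a
# space-separated group list such as the output of `id -nG`.
in_docker_group() {
  case " $1 " in
    *" docker "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use:
#   in_docker_group "$(id -nG)" || echo "not in the docker group; ./gradlew test may need sudo"
```

If the check fails, adding the user with `sudo usermod -aG docker "$USER"` and logging in again is the usual remedy, after which `./gradlew test` may work without sudo.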

> Can you provide the set of steps to run the mock apm-server?

Will do, I'll update the script when I get there

v1v · 2023-10-19T08:51:50Z

Status update

This Buildkite build produced:

In addition it archives the html report, see here

Next steps:

Store benchmarks in ES

jackshirazi · 2023-10-25T15:46:08Z

The APM mock server needs to be set up before the test (e.g. any time after Docker is installed but before the test is run):

git clone https://github.com/elastic/apm-mutating-webhook.git
cd apm-mutating-webhook/test/mock
docker build -t mock-apm-server .
docker run -dp 127.0.0.1:8027:8027 mock-apm-server

Then return to the root directory for the test script. At the end of the test, for cleanup, we want to stop and remove the container:

MOCK_APM_SERVER=$(docker ps | grep mock-apm-server | awk '{print $1}')
docker stop $MOCK_APM_SERVER
docker rm $MOCK_APM_SERVER
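
A possible variation on the cleanup above, sketched under assumptions: it keeps the container ID printed by `docker run -d` instead of grepping `docker ps`, and it can be registered with `trap` so the container is removed even when the tests fail. `DOCKER_BIN` and both function names are hypothetical, introduced only so the commands can be dry-run with a stand-in.

```shell
# Hypothetical helpers; DOCKER_BIN defaults to the real docker CLI.
start_mock_apm_server() {
  # Prints the new container ID on stdout (that is what `docker run -d` does).
  "${DOCKER_BIN:-docker}" run -d -p 127.0.0.1:8027:8027 mock-apm-server
}

stop_mock_apm_server() {
  # $1 = container ID; do nothing if the server was never started.
  [ -n "$1" ] || return 0
  "${DOCKER_BIN:-docker}" stop "$1" >/dev/null
  "${DOCKER_BIN:-docker}" rm "$1" >/dev/null
}

# Example use in the test script:
#   MOCK_APM_SERVER=$(start_mock_apm_server)
#   trap 'stop_mock_apm_server "$MOCK_APM_SERVER"' EXIT
```

Using the ID directly avoids matching an unrelated container whose `docker ps` line happens to contain the text mock-apm-server.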

jackshirazi · 2023-10-25T16:50:41Z

The class in #3384 will process the output and convert it for sending to ES, the same way that PostProcessBenchmarkResults in run-benchmarks does.

v1v · 2023-10-26T11:09:14Z

> For adding the APM mock server,

That's now done and working like a charm, see this build.

> The class in #3384 will process the output and convert it for sending to ES

I'm gonna work on this now

jackshirazi · 2023-10-26T11:51:37Z

There's one further change to the existing script: these 3 bash variables need to be changed to the following.

ELASTIC_SNAPSHOT_ENTRY="new Agent(\\\"elastic-snapshot\\\",\\\"latest available snapshot version from elastic main\\\",\\\"file://$PWD/$ELASTIC_SNAPSHOT_JAR\\\", java.util.List.of(\\\"-Delastic.apm.server_url=http://host.docker.internal:8027/\\\"))"
ELASTIC_LATEST_ENTRY="new Agent(\\\"elastic-latest\\\",\\\"latest available released version from elastic main\\\",\\\"https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/$ELASTIC_LATEST_VERSION/elastic-apm-agent-$ELASTIC_LATEST_VERSION.jar\\\", java.util.List.of(\\\"-Delastic.apm.server_url=http://host.docker.internal:8027/\\\"))"
ELASTIC_LATEST_ENTRY2="new Agent(\\\"elastic-async\\\",\\\"latest available released version from elastic main\\\",\\\"https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/$ELASTIC_LATEST_VERSION/elastic-apm-agent-$ELASTIC_LATEST_VERSION.jar\\\", java.util.List.of(\\\"-Delastic.apm.delay_agent_premain_ms=15000\\\",\\\"-Delastic.apm.server_url=http://host.docker.internal:8027/\\\"))"
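
The three assignments above repeat the same quote escaping by hand. As a sketch only, a small helper could build the same text; `agent_entry` is a hypothetical name, and the output is meant to match the `\"`-escaped form the script splices into its perl one-liner.

```shell
# Hypothetical helper: print a "new Agent(...)" entry with \" escaped quotes.
# $1 = agent name, $2 = description, $3 = jar URL, remaining args = JVM options.
agent_entry() {
  entry_name=$1
  entry_desc=$2
  entry_url=$3
  shift 3
  entry="new Agent(\\\"$entry_name\\\",\\\"$entry_desc\\\",\\\"$entry_url\\\""
  if [ "$#" -gt 0 ]; then
    entry_args=""
    for opt in "$@"; do
      # Join the options with commas, each wrapped in escaped quotes.
      entry_args="${entry_args:+$entry_args,}\\\"$opt\\\""
    done
    entry="$entry, java.util.List.of($entry_args)"
  fi
  printf '%s\n' "$entry)"
}

# Example use:
#   ELASTIC_SNAPSHOT_ENTRY=$(agent_entry elastic-snapshot \
#     "latest available snapshot version from elastic main" \
#     "file://$PWD/$ELASTIC_SNAPSHOT_JAR" \
#     "-Delastic.apm.server_url=http://host.docker.internal:8027/")
```

With a helper like this, adding another JVM option to one agent is an extra argument rather than another run of backslashes.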

jackshirazi · 2023-10-26T11:58:41Z

The final steps are to add this during setup:

git clone https://github.com/elastic/apm-agent-java.git
cd apm-agent-java
./mvnw clean install -DskipTests=true -Dmaven.javadoc.skip=true > mvn-log.log 2> mvn-err.log

and then after the benchmark is run

java -cp ~/apm-agent-java/apm-agent-benchmarks/target/benchmarks.jar co.elastic.apm.agent.benchmark.ProcessOtelBenchmarkResults build/reports/tests/test/classes/io.opentelemetry.OverheadTests.html output.json $ELASTIC_LATEST_VERSION opentelemetry-javaagent.jar

v1v · 2023-10-26T13:46:20Z

Status update

All the bits and pieces have been put in place and this build ran successfully and ingested the documents in the observability-benchmarks cluster.

Index name: otel-microbenchmarks

There is just one minor improvement left: using a prebuilt benchmarks.jar rather than building it from source. See #3386

The reason is that we already use the GitHub API/CLI to fetch elastic-apm-agent.jar, as described in #3369 (comment)

v1v · 2023-10-26T16:07:35Z

I've started to see some failures when integrating a couple of new changes:



Failed to map supported failure 'org.opentest4j.AssertionFailedError: Unhandled exception in release' with mapper 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@4277127c': Cannot invoke "Object.getClass()" because "obj" is null

> Task :test

OverheadTests > runAllTestConfigurations() > release FAILED
org.opentest4j.AssertionFailedError at OverheadTests.java:72
Caused by: com.github.dockerjava.api.exception.ConflictException at OverheadTests.java:147

1 test completed, 1 failed

> Task :test FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: file:///var/lib/buildkite-agent/.buildkite-agent/builds/worker-1799330-build-hel1-dc1-hetzner-elasticnet-co/elastic/apm-agent-java-load-testing/opentelemetry-java-instrumentation/benchmark-overhead/build/reports/tests/test/index.html

* Try:
> Run with --scan to get full insights.

Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

For more on this, please refer to https://docs.gradle.org/8.4/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.

BUILD FAILED in 8m 14s
3 actionable tasks: 3 executed

See https://buildkite.com/elastic/apm-agent-java-load-testing/builds/178#018b6cb0-53af-4835-a865-69b52c30ddc2/111-112

@jackshirazi, do you happen to know what the reason is?

v1v · 2023-10-26T16:23:30Z

> @jackshirazi, do you happen to know what's the reason?

It worked in the next run, https://buildkite.com/elastic/apm-agent-java-load-testing/builds/179, so maybe it was some weird environmental issue. To help with diagnosing this, I added archiving for the index.html that contains the test results.

jackshirazi · 2023-10-26T17:23:12Z

I don't know what the failure was, and that index.html file won't help. It's the result file, build/reports/tests/test/classes/io.opentelemetry.OverheadTests.html (the one that holds the results), that would have the details of the error. I think I saw a similar failure in one of my tests; it was caused by a failure to download the otel jar file from Maven, i.e. Maven flakiness.
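
If the failure really was a flaky download, which is only a suspicion here, one generic mitigation is to wrap network-dependent steps in a retry. The sketch below is an assumption about how that could look; `retry` is a hypothetical helper, and which command to wrap is left open.

```shell
# Hypothetical helper: run a command up to N times, pausing between attempts.
# $1 = max attempts, $2 = seconds to sleep between attempts, rest = the command.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_count=1
  while true; do
    "$@" && return 0
    if [ "$retry_count" -ge "$retry_max" ]; then
      echo "giving up after $retry_count attempts: $*" >&2
      return 1
    fi
    retry_count=$((retry_count + 1))
    sleep "$retry_delay"
  done
}

# Example use, pre-fetching a jar before the benchmark starts
# (JAR_URL is a placeholder, not a variable from the script):
#   retry 3 10 curl -sfL -o agent.jar "$JAR_URL"
```

Retrying only the download step keeps a transient network error from costing a full benchmark run, while a genuine test failure still fails the build.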

jackshirazi · 2023-10-27T11:11:45Z

Completed with #3371.

I'll spin out 2 subsequent tasks, the dashboard and adding continuous profiling

github-actions bot added the agent-java label Oct 18, 2023

v1v mentioned this issue Oct 18, 2023

buildkite: opentelemetry+elastic agent overhead benchmark on a weekly basis #3371

Merged

19 tasks

v1v assigned v1v and jackshirazi Oct 19, 2023

jackshirazi mentioned this issue Oct 25, 2023

Process otel benchmark #3384

Merged

19 tasks

jackshirazi closed this as completed Oct 27, 2023
