PS-9384: Sporadic crashes in Jenkins on start up due to race between … #5417

dlenev · 2024-09-11T09:36:46Z

…dict_stats_thread and cost model initialization

https://perconadev.atlassian.net/browse/PS-9384

Problem:

Both debug and release versions of the server crash sporadically while running various tests in Jenkins, with stack traces showing Cost_model_server::init() being called from InnoDB's dict_stats_thread().

Analysis:

Investigation has shown that there is a race condition between the code that auto-updates histograms from an InnoDB background thread and the main thread performing server start-up. The histogram update code, which Upstream introduced in 8.4.0, initializes a LEX structure to do its work and, as part of this, tries to use the global Optimizer cost model object. Meanwhile, the main thread performing server start-up initializes and destroys this global object several times after the background thread has been started, and only sets it to its final working state much later in the start-up process, before we start accepting user queries. Not surprisingly, using this global object concurrently with its init/deinit causes crashes.

In theory, the problem also exists in Upstream, but it is probably invisible there under normal conditions: triggering it requires some updates to tables, so that persistent stats recalculation and a histogram update are requested. In Upstream this can probably happen only after user requests start being processed, by which time the global cost model object is in a proper, stable state.

In Percona Server, however, the telemetry component is enabled by default, and there is code which, on the first start-up of the server, updates the mysql.component table. This triggers a stats/histogram update request, and as a result the race becomes visible. This specific scenario should only affect the first start of the server for an installation, not later restarts. But if other components update tables during initialization/start-up, the issue might become more prominent.

Solution:

Delay processing of requests to update stats/histograms in the background thread until the server is fully operational (and thus the global optimizer cost model is fully initialized and stable).
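The idea behind the solution can be illustrated with a minimal, standalone sketch. This is not the actual patch or InnoDB code: StatsWorker, enqueue(), mark_server_operational() and the demo function are hypothetical names chosen for illustration, and pushing an id into a vector stands in for the real stats/histogram recalculation. The sketch only shows the gating technique: requests may be queued at any time, but the background thread does not touch them until the "server is operational" flag has been set.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical background worker: requests can be queued during start-up,
// but none is processed before mark_server_operational() has been called.
class StatsWorker {
 public:
  StatsWorker() : thread_([this] { run(); }) {}
  ~StatsWorker() { shutdown(); }

  // Queue a request to recalculate stats for a (hypothetical) table id.
  void enqueue(int table_id) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_.push(table_id);
    }
    cv_.notify_one();
  }

  // Called by the start-up thread once shared global state is stable.
  void mark_server_operational() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      operational_ = true;
    }
    cv_.notify_one();
  }

  // Stop the worker; requests still pending at this point are dropped.
  void shutdown() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_one();
    if (thread_.joinable()) thread_.join();
  }

  std::vector<int> processed() {
    std::lock_guard<std::mutex> lock(mutex_);
    return processed_;
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      // Sleep until shutdown, or until the server is operational AND
      // there is work. Queued requests alone do not wake processing.
      cv_.wait(lock, [this] {
        return stop_ || (operational_ && !pending_.empty());
      });
      if (stop_) return;
      int table_id = pending_.front();
      pending_.pop();
      processed_.push_back(table_id);  // stand-in for the real update
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> pending_;
  std::vector<int> processed_;
  bool operational_ = false;
  bool stop_ = false;
  std::thread thread_;  // declared last so it starts after the other members
};

// Returns true if both requests were processed, and only after start-up.
bool demo_requests_wait_for_startup() {
  using namespace std::chrono_literals;
  StatsWorker worker;
  worker.enqueue(1);  // e.g. a table updated during start-up
  worker.enqueue(2);
  std::this_thread::sleep_for(50ms);
  assert(worker.processed().empty());  // nothing runs before the gate opens
  worker.mark_server_operational();
  for (int i = 0; i < 200 && worker.processed().size() < 2; ++i) {
    std::this_thread::sleep_for(10ms);
  }
  return worker.processed() == std::vector<int>{1, 2};
}
```

A design point worth noting in this sketch is that shutdown is part of the wait condition, so a worker that is still waiting for start-up to finish can be stopped cleanly if the server aborts early.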


dlenev · 2024-09-11T11:42:42Z

Jenkins results on different platforms, for a patch that differs from the above commit only in comments, seem to confirm that it solves the problem.
Look for the release-8.4.2-2-with-dst-race-fix tag in https://ps80.cd.percona.com/view/8.0%20parallel%20MTR/job/percona-server-8.x-pipeline-parallel-mtr/. For example: https://ps80.cd.percona.com/view/8.0%20parallel%20MTR/job/percona-server-8.x-pipeline-parallel-mtr/326/

dlenev requested a review from percona-ysorokin September 11, 2024 09:36

dlenev force-pushed the ps-8.4-9384 branch from 172a2e3 to 927bcde Compare September 11, 2024 11:22

satya-bodapati approved these changes Sep 11, 2024


dlenev merged commit 4412597 into percona:release-8.4.2-2 Sep 11, 2024
24 checks passed

dlenev deleted the ps-8.4-9384 branch September 11, 2024 12:07
