PS-9384: Sporadic crashes in Jenkins on start up due to race between dict_stats_thread and cost model initialization

https://perconadev.atlassian.net/browse/PS-9384

Problem:
--------
Both debug and release builds of the server crash sporadically while running
various tests in Jenkins, with stack traces showing
Cost_model_server::init() being called from InnoDB's dict_stats_thread().

Analysis:
---------
Investigation has shown that there is a race condition between the code that
auto-updates histograms from an InnoDB background thread and the main thread
performing server start-up. The code responsible for updating histograms,
which was introduced by Upstream in 8.4.0, initializes a LEX structure to
perform its duties and tries to use the global Optimizer cost model object as
part of this. OTOH the main thread performing server start-up concurrently
initializes and destroys this global object several times after the background
thread has been started, and sets it to its final working state much later in
the start-up process, before we start accepting user queries. Not surprisingly,
concurrent usage of this global object and its init/deinit causes crashes.

In theory, the problem also exists in Upstream but is probably normally
invisible there: to trigger it, some tables must be updated so that persistent
stats recalculation and a histogram update are requested. In Upstream this can
probably happen only after user requests start being processed (by which time
the global cost model object is in a proper, stable state).

In Percona Server, however, the telemetry component is enabled by default, and
there is code which, on the first start-up of the server, updates the
mysql.component table, which triggers a stats/histogram update request. As a
result this race becomes visible. OTOH this specific scenario should only
affect the first start of the server for an installation, and not later
restarts. But if there are other components which update tables during
initialization/start-up, the issue might become more prominent.

Solution:
---------
Delay processing of requests to update stats/histograms in the background
thread until the server is fully operational (and thus the global optimizer
cost model is fully initialized and stable).
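
The gating idea can be illustrated with a minimal, self-contained sketch. It is
not the actual server code: ServerState, g_server_state, g_recalc_pool and
stats_thread_iteration() are hypothetical stand-ins chosen for illustration,
and the "work" is reduced to popping a queue entry. The point it shows is that
a request arriving during start-up stays queued and is only handled on a later
iteration, once the state flag reports that the server is operating.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for the server state; not the real MySQL enum.
enum class ServerState { Initializing, Operating };

// Set to Operating by the (hypothetical) main thread once start-up is done.
std::atomic<ServerState> g_server_state{ServerState::Initializing};

// Hypothetical queue of tables awaiting stats recalculation. In this sketch
// it is only touched by one thread, so it is left unsynchronized.
std::deque<std::string> g_recalc_pool;

// One iteration of the background loop: handle a queued request only once
// the server is operating; otherwise leave it queued for a later wake-up.
// Returns the number of requests processed (0 or 1).
std::size_t stats_thread_iteration() {
  if (g_server_state.load(std::memory_order_acquire) !=
      ServerState::Operating) {
    return 0;  // start-up still in progress: do not touch shared global state
  }
  if (g_recalc_pool.empty()) {
    return 0;  // nothing to do
  }
  g_recalc_pool.pop_front();  // stand-in for the real recalculation work
  return 1;
}
```

In this sketch the check is a single atomic load per iteration, so it costs
almost nothing once the server is running; deferred requests are not dropped,
only postponed.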
dlenev committed Sep 11, 2024
1 parent 71a32e2 commit 172a2e3
Showing 1 changed file with 13 additions and 1 deletion.
14 changes: 13 additions & 1 deletion storage/innobase/dict/dict0stats_bg.cc
@@ -45,6 +45,7 @@ this program; if not, write to the Free Software Foundation, Inc.,
 #include "os0thread-create.h"
 #include "row0mysql.h"
 #include "sql/histograms/histogram.h"
+#include "sql/mysqld.h"  // get_server_state()
 #include "srv0start.h"
 #include "ut0new.h"

@@ -405,7 +406,18 @@ void dict_stats_thread() {
 break;
 }

-dict_stats_process_entry_from_recalc_pool(thd);
+/* Some steps of server start-up process which are performed after this
+thread starts (e.g., Percona Telemetry setup step) might update tables and
+thus trigger request to recalculate statistics and update histograms.
+However, the latter might be problematic before server becomes fully
+operational, as it involves usage of global optimizer cost model object,
+which at the same time is concurrently inited/destroyed/reloaded by the
+main thread performing start-up. Hence we delay handling of requests
+to update statistics/histogram until server is fully operational (and thus
+global optimizer cost model object is initialized and stable). */
+if (get_server_state() == SERVER_OPERATING) {
+  dict_stats_process_entry_from_recalc_pool(thd);
+}

 os_event_reset(dict_stats_event);
 }