Skip to content

Commit

Permalink
PS-9453: percona_telemetry causes a long wait on COND_thd_list due to…
Browse files Browse the repository at this point in the history
… the absence of the root user

https://perconadev.atlassian.net/browse/PS-9453

Problem:
If there is no 'root' user (it was renamed) and Percona Telemetry is
enabled, server shutdown stuck.
Additionally SHOW PROCESSLIST reports increasing number of processes
with root user in state 'login'.

Cause:
Percona Telemetry component used hard codded 'root' user used for
querying the server for metrics. If the user is not present,
mysql_command_factory service connect() method fails, however it leaves
opened/orphaned internal MYSQL_SESSION. It is because after opening
MYSQL_SESSION we check if the user exists. It doesn't exist, so the
method returns with STATE_MACHINE_FAILED error.
Caller does not know anything about underlying session, does cleanup,
frees memory. Moreover, the same problem potentially exists even if
the user exists, but cssm_begin_connect() exits with error for any
reason.
Creation of session registers THD object in Global_THD_manager. That's
why increasing number of processes is visible in SHOW PROCESSLIST.

Why it hangs during shutdown:
During shutdown, the server waits for signal handler thread
(signal_hand()) to join the main thread. Then the component
infrastructure is deinitialized. At the same time signal_hand() waits
for all threads to finish Global_THD_manager::wait_till_no_thd().

The above described bug related to orphaned mysql sessions cause that
there are orphaned THDs that never end. This causes server to wait
infinitively.

Solution:
1. Handle the error in cssm_begin_connect() and close opened
MYSQL_SESSION. This part fixes hangs caused by nonexistent user.
2. Percona Telemetry Component uses mysql.session user. When Percona
Telemetry Component is enabled, the user is granted a few more
permissions during the server startup. Note that even without this
permissions, the component is able to work, but not able to report some
metrics.

Additionally fixed potential memory leak if connection setup in Percona
Telemetry Component fails.
  • Loading branch information
kamil-holubicki committed Oct 11, 2024
1 parent a634d87 commit 27468f8
Show file tree
Hide file tree
Showing 6 changed files with 167 additions and 14 deletions.
44 changes: 31 additions & 13 deletions components/percona_telemetry/data_provider.cc
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,20 @@
namespace {
inline const char *b2s(bool val) { return val ? "1" : "0"; }

/*
mysql.session user is mostly enough, but it lacks the following privileges:
1. REPLICATION SLAVE
2. REPLICATION CLIENT
3. SELECT on mysql.component
4. SELECT on performance_schema.replication_group_members
These privileges are added at server startup in setup_percona_telemetry()
if Percona telemetry is enabled.
*/
constexpr const char default_command_user_name[] = "mysql.session";
constexpr const char default_command_host_name[] = "localhost";

namespace JSONKey {
const char *pillar_version = "pillar_version";
const char *db_instance_id = "db_instance_id";
Expand Down Expand Up @@ -120,25 +134,29 @@ bool DataProvider::do_query(const std::string &query, QueryResult *result,
}
result->clear();

/* command_factory_service_.init() allocates memory for mysql_h
We need to call close() always.
Even if init() fails, becaues it doesn't allocate anything, calling close()
is safe, because internally it checks if provided pointer is valid
*/
std::shared_ptr<MYSQL_H> mysql_h_close_guard(
&mysql_h, [&srv = command_factory_service_](MYSQL_H *ptr) {
srv.close(*ptr);
});

mysql_service_status_t sstatus = command_factory_service_.init(&mysql_h);

if (!sstatus)
sstatus |=
command_options_service_.set(mysql_h, MYSQL_COMMAND_PROTOCOL, nullptr);
if (!sstatus)
sstatus |=
command_options_service_.set(mysql_h, MYSQL_COMMAND_USER_NAME, "root");
sstatus |= command_options_service_.set(mysql_h, MYSQL_COMMAND_USER_NAME,
default_command_user_name);
if (!sstatus)
sstatus |=
command_options_service_.set(mysql_h, MYSQL_COMMAND_HOST_NAME, nullptr);
sstatus |= command_options_service_.set(mysql_h, MYSQL_COMMAND_HOST_NAME,
default_command_host_name);
if (!sstatus) sstatus |= command_factory_service_.connect(mysql_h);

// starting from this point, if the above succeeded we need to close mysql_h.
std::shared_ptr<void> mysql_h_close_guard(
mysql_h,
[&srv = command_factory_service_, do_close = !sstatus](void *ptr) {
if (do_close && ptr) srv.close(static_cast<MYSQL_H>(ptr));
});

// if any of the above failed, just exit
if (sstatus) {
goto err;
Expand Down Expand Up @@ -212,7 +230,7 @@ bool DataProvider::collect_db_instance_id_info(rapidjson::Document *document) {
so the SQL query failed. It will recover next time.
2. Some other reason that caused selecting server_id to fail. */
if (id.length() == 0) {
logger_.warning(
logger_.info(
"Collecting db_instance_id failed. It may be caused by server still "
"initializing.");
return true;
Expand Down Expand Up @@ -346,7 +364,7 @@ bool DataProvider::collect_se_usage_info(rapidjson::Document *document) {
QueryResult result;
if (do_query("SELECT DISTINCT ENGINE FROM information_schema.tables WHERE "
"table_schema NOT IN('mysql', 'information_schema', "
"'performance_schema', 'sys');",
"'performance_schema', 'sys')",
&result)) {
return true;
}
Expand Down
31 changes: 31 additions & 0 deletions mysql-test/suite/component_percona_telemetry/r/no_user.result
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
SELECT * FROM information_schema.table_privileges WHERE grantee = "'mysql.session'@'localhost'" ORDER BY table_schema, table_name;
GRANTEE TABLE_CATALOG TABLE_SCHEMA TABLE_NAME PRIVILEGE_TYPE IS_GRANTABLE
'mysql.session'@'localhost' def mysql component SELECT NO
'mysql.session'@'localhost' def mysql user SELECT NO
'mysql.session'@'localhost' def performance_schema replication_group_members SELECT NO
SHOW GRANTS FOR 'mysql.session'@'localhost';
Grants for mysql.session@localhost
GRANT SHUTDOWN, SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO `mysql.session`@`localhost`
GRANT AUDIT_ABORT_EXEMPT,AUTHENTICATION_POLICY_ADMIN,BACKUP_ADMIN,CLONE_ADMIN,CONNECTION_ADMIN,FIREWALL_EXEMPT,PERSIST_RO_VARIABLES_ADMIN,SESSION_VARIABLES_ADMIN,SYSTEM_USER,SYSTEM_VARIABLES_ADMIN ON *.* TO `mysql.session`@`localhost`
GRANT SELECT ON `performance_schema`.* TO `mysql.session`@`localhost`
GRANT SELECT ON `mysql`.`component` TO `mysql.session`@`localhost`
GRANT SELECT ON `mysql`.`user` TO `mysql.session`@`localhost`
GRANT SELECT ON `performance_schema`.`replication_group_members` TO `mysql.session`@`localhost`
# restart:--percona_telemetry.grace_interval=30 --percona_telemetry.scrape_interval=30 --percona_telemetry.telemetry_root_dir=<telemetry_root_dir>
RENAME USER 'root'@'localhost' to 'root.tmp'@'localhost';
Warnings:
Warning 4005 User 'root'@'localhost' is referenced as a definer account in a stored routine.
Warning 4005 User 'root'@'localhost' is referenced as a definer account in a trigger.
'root' user used by component's 1st verison does not exist. Telemetry dir should contain 1 file.
1
RENAME USER 'root.tmp'@'localhost' to 'root'@'localhost';
Warnings:
Warning 4005 User 'root'@'localhost' is referenced as a definer account in a stored routine.
Warning 4005 User 'root'@'localhost' is referenced as a definer account in a trigger.
# restart:--percona_telemetry.grace_interval=30 --percona_telemetry.scrape_interval=30 --percona_telemetry.telemetry_root_dir=<telemetry_root_dir>
RENAME USER 'mysql.session'@'localhost' to 'mysql.session.tmp'@'localhost';
include/assert.inc [No orphaned sessions expected in processlist]
'mysql.session' user used by component does not exist. Telemetry dir should still contain 1 file.
1
RENAME USER 'mysql.session.tmp'@'localhost' to 'mysql.session'@'localhost';
# restart:--percona_telemetry.grace_interval=30 --percona_telemetry.scrape_interval=30 --percona_telemetry.telemetry_root_dir=<telemetry_root_dir>
70 changes: 70 additions & 0 deletions mysql-test/suite/component_percona_telemetry/t/no_user.test
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
# Test that lack of the user used by Percona Telemetry Component
# doesn't cause hangs during server restart and no orphaned sessions are created.

--source include/have_percona_telemetry.inc

--let $telemetry_root_dir = $MYSQL_TMP_DIR/telemetry_dir
--let $grace_interval = 30
--let $scrape_interval = 30

--mkdir $telemetry_root_dir

# Record mysql.session user privileges
SELECT * FROM information_schema.table_privileges WHERE grantee = "'mysql.session'@'localhost'" ORDER BY table_schema, table_name;
SHOW GRANTS FOR 'mysql.session'@'localhost';

# restart the server with custom telemetry file path and timeouts
--let $restart_parameters = "restart:--percona_telemetry.grace_interval=$grace_interval --percona_telemetry.scrape_interval=$scrape_interval --percona_telemetry.telemetry_root_dir=$telemetry_root_dir"
--replace_regex /telemetry_root_dir=.*telemetry_dir/telemetry_root_dir=<telemetry_root_dir>/
--source include/restart_mysqld.inc

# Rename 'root' user (1st version of Percona Telemetry Component used 'root' user)
# 1st version will not collect any data and will not create telemetry file and the restart will hang.
# Fixed version will work properly as it doesn't use 'root' user.
RENAME USER 'root'@'localhost' to 'root.tmp'@'localhost';

# sleep more than grace_interval and check that telemetry file was created
--let $timeout = `select $grace_interval + 10`
--sleep $timeout

--echo 'root' user used by component's 1st verison does not exist. Telemetry dir should contain 1 file.
--exec ls -1 $telemetry_root_dir | wc -l

#
# It should be possible to restart the server.
#
RENAME USER 'root.tmp'@'localhost' to 'root'@'localhost';
--let $restart_parameters = "restart:--percona_telemetry.grace_interval=$grace_interval --percona_telemetry.scrape_interval=$scrape_interval --percona_telemetry.telemetry_root_dir=$telemetry_root_dir"
--replace_regex /telemetry_root_dir=.*telemetry_dir/telemetry_root_dir=<telemetry_root_dir>/
--source include/restart_mysqld.inc


#
# Now rename the user used by component
#
RENAME USER 'mysql.session'@'localhost' to 'mysql.session.tmp'@'localhost';

# Wait a few cycles and ensure that SHOW PROCESSLIST does not contain rows related to orphaned sessions.
--let $timeout = `select $grace_interval + 3 * $scrape_interval`
--sleep $timeout

--let $assert_text = No orphaned sessions expected in processlist
--let $assert_cond = [SELECT COUNT(*) as Result FROM performance_schema.processlist WHERE user = "mysql.session";, Result, 1] = 0
--source include/assert.inc

# Check that no new telemetry file was created
--echo 'mysql.session' user used by component does not exist. Telemetry dir should still contain 1 file.
--exec ls -1 $telemetry_root_dir | wc -l


#
# It should be still possible to restart the server.
#
RENAME USER 'mysql.session.tmp'@'localhost' to 'mysql.session'@'localhost';
--let $restart_parameters = "restart:--percona_telemetry.grace_interval=$grace_interval --percona_telemetry.scrape_interval=$scrape_interval --percona_telemetry.telemetry_root_dir=$telemetry_root_dir"
--replace_regex /telemetry_root_dir=.*telemetry_dir/telemetry_root_dir=<telemetry_root_dir>/
--source include/restart_mysqld.inc

# cleanup
--force-rmdir $telemetry_root_dir

18 changes: 18 additions & 0 deletions sql/dd/impl/upgrade/server.cc
Original file line number Diff line number Diff line change
Expand Up @@ -959,6 +959,15 @@ bool upgrade_system_schemas(THD *thd) {
/* 1. We INSERT INTO, because prepared statements do not support
INSTALL COMPONENT
2. We use stored procedure to be able to do conditional action.
3. Percona Telemetry Component uses mysql.session user. For the component to
be fully functional, mysql.session user lacks the following privileges:
1. REPLICATION SLAVE
2. REPLICATION CLIENT
3. SELECT on mysql.component
4. SELECT on performance_schema.replication_group_members
GRANT does not work yet as ACL is not initialized yet. Use UPDATES.
For 3 and 4 we could set Select_priv = 'Y' in mysql.user table for
mysql.session user, but let's allow only minimal required privileges.
*/
static const char *percona_telemetry_install[] = {
"USE mysql;\n",
Expand All @@ -977,6 +986,15 @@ static const char *percona_telemetry_install[] = {
"PREPARE stmt FROM @str;\n",
"EXECUTE stmt;\n",
"DROP PREPARE stmt;\n",
"UPDATE mysql.user SET Repl_slave_priv = 'Y' WHERE User = 'mysql.session' "
"AND Host = 'localhost';\n",
"UPDATE mysql.user SET Repl_client_priv = 'Y' WHERE User = 'mysql.session' "
"AND Host = 'localhost';\n",
"INSERT IGNORE INTO mysql.tables_priv VALUES ('localhost', 'mysql', "
"'mysql.session', 'component', 'root@localhost', NOW(), 'Select', '');\n",
"INSERT IGNORE INTO mysql.tables_priv VALUES ('localhost', "
"'performance_schema', 'mysql.session', 'replication_group_members', "
"'root@localhost', NOW(), 'Select', '');\n",
NULL};

static const char *percona_telemetry_uninstall[] = {
Expand Down
16 changes: 16 additions & 0 deletions sql/server_component/mysql_command_backend.cc
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,21 @@ mysql_state_machine_status cssm_begin_connect(mysql_async_connect *ctx) {
my_h_service h_command_consumer_srv = nullptr;

MYSQL_SESSION mysql_session = nullptr;

/* We need to handle the failure in this function.
Setting mcs_extn->session_svc right after session open is not enough
to handle user lookup errors (and following errors as well)
because in case of this function returns error, mysql->extension is
cleaned up immediately by the caller. The caller does not take care of
session_svc, because it is not aware of this structure.
*/
std::shared_ptr<MYSQL_SESSION> mysql_session_close_guard(
&mysql_session, [mcs_extn](MYSQL_SESSION *mysql_session_ptr) {
if (*mysql_session_ptr == nullptr) return;
mcs_extn->session_svc = nullptr;
srv_session_close(*mysql_session_ptr);
});

if (mcs_extn->mcs_thd == nullptr || mcs_extn->session_svc == nullptr) {
/*
Avoid possibility of nested txn in the current thd.
Expand Down Expand Up @@ -239,6 +254,7 @@ mysql_state_machine_status cssm_begin_connect(mysql_async_connect *ctx) {
}
mysql->client_flag = 0; /* For handshake */
mysql->server_status = SERVER_STATUS_AUTOCOMMIT;
mysql_session = nullptr; // disable delete quard
return STATE_MACHINE_DONE;
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -287,7 +287,7 @@ TEST_F(DataProviderTest, collect_se_usage_info_test) {
const std::string query(
std::string("SELECT DISTINCT ENGINE FROM information_schema.tables WHERE "
"table_schema NOT IN('mysql', 'information_schema', "
"'performance_schema', 'sys');"));
"'performance_schema', 'sys')"));
const std::string expected_json_key("se_engines_in_use");
collect_array_info_common(query, expected_json_key,
&MockDataProvider::collect_se_usage_info);
Expand Down

0 comments on commit 27468f8

Please sign in to comment.