Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(meta): meta server failed and could not be started normally while setting environment variables after dropping table #2148

Merged
merged 15 commits into from
Nov 25, 2024

Conversation

empiredan
Copy link
Contributor

@empiredan empiredan commented Nov 19, 2024

#2149.

There are two problems that should be solved:

  1. why the primary meta server failed with segfault while dropping tables ?
  2. why all meta servers were never be restarted normally after the primary
    meta server failed ?

A pegasus cluster would flush security policies to remote meta storage
periodically (by update_ranger_policy_interval_sec) in the form of environment
variables. We do this by server_state::set_app_envs(). However, after updating
the meta data on the remote storage (namely ZooKeeper), the table is not checked
that if it still exists while updating environment variables of local memory:

void server_state::set_app_envs(const app_env_rpc &env_rpc)
{

...

    do_update_app_info(app_path, ainfo, [this, app_name, keys, values, env_rpc](error_code ec) {
        CHECK_EQ_MSG(ec, ERR_OK, "update app info to remote storage failed");

        zauto_write_lock l(_lock);
        std::shared_ptr<app_state> app = get_app(app_name);
        std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        for (int idx = 0; idx < keys.size(); idx++) {
            app->envs[keys[idx]] = values[idx];
        }
        std::string new_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        LOG_INFO("app envs changed: old_envs = {}, new_envs = {}", old_envs, new_envs);
    });
}

In std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');, since
app is nullptr, app->envs would point an invalid address, leading to segfault
in libdsn_utils.so where dsn::utils::kv_map_to_string is.

Therefore, the reason for the 1st problem is very clear: the callback for updating meta
data on remote storage is called immediately after the table is removed, and an invalid
address is accessed due to null pointer.

Then, the meta server would load meta data from remote storage after it is restart.
However, the intermediate status AS_DROPPING is also flushed to remote storage
with security policies since all meta data for a table is an unitary json object: the whole
json would be set to remote storage once any property is updated.

However AS_DROPPING is invalid, and cannot pass the assertion which would make
meta server fail again and again, which is the reason of the 2nd problem:

server_state::sync_apps_from_remote_storage()
{

...

                    std::shared_ptr<app_state> app = app_state::create(info);
                    {
                        zauto_write_lock l(_lock);
                        _all_apps.emplace(app->app_id, app);
                        if (app->status == app_status::AS_AVAILABLE) {
                            app->status = app_status::AS_CREATING;
                            _exist_apps.emplace(app->app_name, app);
                            _table_metric_entities.create_entity(app->app_id, app->partition_count);
                        } else if (app->status == app_status::AS_DROPPED) {
                            app->status = app_status::AS_DROPPING;
                        } else {
                            CHECK(false,
                                  "invalid status({}) for app({}) in remote storage",
                                  enum_to_string(app->status),
                                  app->get_logname());
                        }
                    }

...

}

To fix the 1st problem, we just check if the table still exists after meta data is updated
on the remote storage. To fix the 2nd problem, we prevent meta data with intermediate
status AS_DROPPING from being flushed to remote storage.

@empiredan empiredan added the type/bug-fix This PR fixes a bug. label Nov 19, 2024
@github-actions github-actions bot added the cpp label Nov 19, 2024
@empiredan empiredan changed the title fix(meta): meta server failed while setting environment variables to a table that has just been dropped fix(meta): meta server failed and could not be started normally after dropping table Nov 21, 2024
@empiredan empiredan marked this pull request as ready for review November 21, 2024 06:36
@empiredan empiredan changed the title fix(meta): meta server failed and could not be started normally after dropping table fix(meta): meta server failed and could not be started normally while setting environment variables after dropping table Nov 21, 2024
src/meta/server_state.cpp Outdated Show resolved Hide resolved
src/meta/server_state.h Outdated Show resolved Hide resolved
@empiredan empiredan merged commit 9daa73f into apache:master Nov 25, 2024
84 of 85 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp type/bug-fix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants