Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock when using Family::get_or_create #231

Open
sfleen opened this issue Oct 8, 2024 · 7 comments · May be fixed by #244
Open

Deadlock when using Family::get_or_create #231

sfleen opened this issue Oct 8, 2024 · 7 comments · May be fixed by #244

Comments

@sfleen
Copy link

sfleen commented Oct 8, 2024

It's fairly common practice for us to create a set of metrics from a metric family given a set of labels, and store them for later use. However, this can cause a deadlock when Family::get_or_create is called multiple times on the same instance within the same statement:

struct Metrics {
  inbound: Counter,
  outbound: Counter,
}

let family = Family::<Vec<(String, String)>, Counter>::default();

Metrics {
  inbound: family.get_or_create(&vec!["dir", "inbound"]).clone(),
  outbound: family.get_or_create(&vec!["dir", "outbound"]).clone(), // deadlocks
}

This is due to the RwLockReadGuard returned by each call to Family::get_or_create being temporaries that are not dropped until the end of the statement.

The user-facing solution is to create separate bindings for each metric:

// ...

let inbound - family.get_or_create(&vec!["dir", "inbound"]).clone();
let outbound = family.get_or_create(&vec!["dir", "outbound"]).clone(); // no deadlock
Metrics { inbound, outbound }

However, this is a footgun that can be pretty tricky to debug. Is there an alternative API here that would work, perhaps just returning the metric directly instead of a mutex guard?

@mxinden
Copy link
Member

mxinden commented Oct 10, 2024

Agreed that this is not intuitive (i.e. a footgun).

That said, I am not sure what to do about it. Cloning a metric might be expensive. Thus internally on every get_or_create might have a performance impact.

@sfleen
Copy link
Author

sfleen commented Oct 10, 2024

Some of the metrics are "cheap" to clone (Arc/etc.). There's a few options I can think of to accommodate those but most would require specialization in some form.

I wonder if, alternatively, there's a way to move the mutex up one level? Something (very roughly) like this:

let guard = family.lock();
Metrics {
  inbound: guard.get_or_create(&vec!["dir", "inbound"]).clone(),
  outbound: guard.get_or_create(&vec!["dir", "outbound"]).clone()
}

It does have some downsides of its own where it holds a write lock slightly longer than it strictly needs to, but this feels like an API that's harder to misuse when multiple metrics are being extracted.

I imagine this could either replace the existing get_or_create or live alongside it for cases where just a single metric is needed.

@cratelyn
Copy link
Contributor

cratelyn commented Oct 15, 2024

Cloning a metric might be expensive. Thus internally on every get_or_create
might have a performance impact.

I've also run into this footgun myself. I am sympathetic to the idea that calling get_or_create should not have a performance impact, but can you think of a case cloning a metric would ever be prohibitively expensive @mxinden?

It's worth noting that Family::get_or_create() already performs implicit cloning on behalf of the caller, by cloning the label set S. In my experience, these are collections of key-value strings that would be more expensive to clone than any metric type M, which often tends to be backed by an Arc<T> wrapping some atomic integer, e.g. AtomicU64.

Taking a look inside of the library, we have metrics definitions that look like this:

These types do already offer Clone implementations, and by virtue of the Arc<A> wrapping the inner atomic value, will have negligible performance impact.

The standard library's implementation of Clone can be found here, https://doc.rust-lang.org/1.81.0/src/alloc/sync.rs.html#2087, it increments an AtomicUsize, and performs a comparison. For our purposes, it feels safe to say that cloning an Arc<T> is very cheap.

The remaining metric type is Info<S>, which is currently defined like so:

// info.rs
#[derive(Debug)]
pub struct Info<S>(pub(crate) S);

The OpenMetrics specification states:

Info metrics are used to expose textual information which SHOULD NOT change
during process lifetime. Common examples are an application's version,
revision control commit, and the version of a compiler.

If these SHOULD NOT change, then they won't be set more than once, and implicit cloning is thus out of the question. They won't be the metric type M that would be used in a Family<S, M, C> like @sfleen's example above.

As for the remaining metric types, Summary is not currently implemented, and Unknown should not be used.

My understanding is that this library is intentionally designed to allow consumers to provide their own implementations of counters, histograms, and gauges, so long as they implement encoding::EncodeMetric and metrics::TypedMetric. Is there any kind of alternate implementation that would be expensive to clone?

I agree with @sfleen's suggestion that we directly return the M as well, and would propose a set of interfaces like this:

// family.rs
impl<S: Clone + std::hash::Hash + Eq, M, C: MetricConstructor<M>> Family<S, M, C> {
    pub fn get(&self, label_set: &S) -> Option<M> {
        // ...
    }

    pub fn get_or_create(&self, label_set: &S) -> M {
        // ...
    }

    pub fn get_ref(&self, label_set: &S) -> MappedRwLockReadGuard<M> {
        // ...
    }
}

get_ref() could provide callers a way to avoid any implicit metric cloning, while simultaneously providing get(..) and get_or_create(..) that don't present an interface through which callers can accidentally introduce deadlocks into their programs.

I'll cross-reference #130 and #230, which both propose a Family::get(&self, labels: &S) method.

Would this be a welcome change @mxinden? I'll note that this would be a breaking change, but preventing deadlocks feels like a worthwhile design goal for this library.

@cratelyn
Copy link
Contributor

Hello @mxinden! I wanted to ask if you had thoughts on the proposal above. I would be happy to drive this work myself, but don't want to spend time writing that if it would not be a welcome PR.

@mxinden
Copy link
Member

mxinden commented Oct 27, 2024

Thank you @sfleen and @cratelyn for the thorough input! Much appreciated.

It's worth noting that Family::get_or_create() already performs implicit cloning on behalf of the caller, by cloning the label set S.

For the sake of completeness, only on first call, not on the return-early get path.

write_guard
.entry(label_set.clone())
.or_insert_with(|| self.constructor.new_metric());

When writing the implementation we have today, I only had the ad-hoc access in mind, e.g. calling get_or_create on each Counter inc(). As one can see, I failed to design the API for the get_or_create long-lived instance in mind.

but can you think of a case cloning a metric would ever be prohibitively expensive @mxinden?

Not off the top of my head.

Would this be a welcome change @mxinden?

Definitely interested fixing this footgun.

I'll note that this would be a breaking change, but preventing deadlocks feels like a worthwhile design goal for this library.

Instead of a breaking change, can we start off with a non-breaking change, e.g. by adding get_or_create_clone that would return an M instead of a MappedRwLockReadGuard<M>? All public documentation can use and point to get_or_create_clone. get_or_create_clone can point to the potentially more efficient legacy get_or_create with a big warning sign that it can deadlock with an example.

Thouhgts? @sfleen @cratelyn would either of you be willing to write a prototype pull request?

(We could also explore reentrace-mutexes. That said, I find the current MappedRwLockReadGuard<M> already to add too much complexity to the implementation.)

@cratelyn
Copy link
Contributor

@mxinden i'd be happy to write a prototype pull request! i'll likely get to this early next week.

cratelyn added a commit to cratelyn/prometheus-client that referenced this issue Nov 18, 2024
fixes prometheus#231.

this introduces a new method to `Family<S, M, C>` to help alleviate the
risk of deadlocks when accessing multiple series within a given metrics
family.

this returns a plain `M`, rather than the `MappedRwLockReadGuard<M>`
RAII guard returned by `get_or_create()`.

a test case is introduced in this commit to demonstrate that structures
accessing multiple series within a single expression will not
accidentally create a deadlock.

Signed-off-by: katelyn martin <[email protected]>
@cratelyn
Copy link
Contributor

thanks for your patience! i have opened #244 to provide an accessor that will clone the metric M rather than returning a RAII guard.

cratelyn added a commit to cratelyn/prometheus-client that referenced this issue Nov 18, 2024
fixes prometheus#231.

this introduces a new method to `Family<S, M, C>` to help alleviate the
risk of deadlocks when accessing multiple series within a given metrics
family.

this returns a plain `M`, rather than the `MappedRwLockReadGuard<M>`
RAII guard returned by `get_or_create()`.

a test case is introduced in this commit to demonstrate that structures
accessing multiple series within a single expression will not
accidentally create a deadlock.

Signed-off-by: katelyn martin <[email protected]>
cratelyn added a commit to cratelyn/prometheus-client that referenced this issue Nov 18, 2024
fixes prometheus#231.

this introduces a new method to `Family<S, M, C>` to help alleviate the
risk of deadlocks when accessing multiple series within a given metrics
family.

this returns a plain `M`, rather than the `MappedRwLockReadGuard<M>`
RAII guard returned by `get_or_create()`.

a test case is introduced in this commit to demonstrate that structures
accessing multiple series within a single expression will not
accidentally create a deadlock.

Signed-off-by: katelyn martin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants