To reproduce, create a `Cargo.toml` with this:
```toml
[package]
name = "regex-contention-repro-work"
version = "0.1.0"
edition = "2021"

[[bin]]
name = "repro"
path = "main.rs"

[dependencies]
anyhow = "1.0.66"
regex = "1.7.0"
```
And in the same directory, create a `main.rs` containing:
```rust
use std::{
    io::Write,
    sync::Arc,
    time::{Duration, Instant},
};

use regex::{Regex, RegexBuilder};

const ITERS: usize = 100_000;
const PATTERN: &str = "";
const HAYSTACK: &str = "ZQZQZQZQ";

#[derive(Debug)]
struct Benchmark {
    re: Regex,
    threads: u32,
}

impl Benchmark {
    fn cloned(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        for _ in 0..self.threads {
            // When we clone the regex like this, it does NOT make a complete
            // copy of all of its internal state, but it does create an
            // entirely fresh pool from which to get mutable scratch space
            // for each search. Basically, a 'Regex' internally looks like
            // this:
            //
            //     struct Regex {
            //         // Among other things, this contains the literal
            //         // prefilters and the Thompson VM bytecode
            //         // instructions.
            //         read_only: Arc<ReadOnly>,
            //         // Contains space used by the regex matcher
            //         // during search time. e.g., The DFA transition
            //         // table for the lazy DFA or the set of active
            //         // threads for the Thompson NFA simulation.
            //         pool: Pool<ScratchSpace>,
            //     }
            //
            // That is, a regex already internally uses reference counting,
            // so cloning it does not create an entirely separate copy of the
            // data. It's effectively free. However, cloning it does create
            // an entirely fresh 'Pool'. It specifically does not reuse pools
            // across cloned regexes, and it does this specifically so that
            // callers have a path that permits them to opt out of contention
            // on the pool.
            //
            // Namely, when a fresh pool is created, it activates a special
            // optimization for the first thread that accesses the pool. That
            // thread gets access to a special value ONLY accessible to that
            // thread, whereas all other threads accessing the pool get sent
            // through the "slow" path via a mutex. When a lot of threads
            // share the same regex **with the same pool**, this mutex comes
            // under very heavy contention.
            //
            // It is worth pointing out that the mutex is NOT held for the
            // duration of the search. Effectively what happens is:
            //
            //     is "first" thread optimization active?
            //     NO: mutex lock
            //         pop pointer out of the pool
            //         mutex unlock
            //     do a search
            //     is "first" thread optimization active?
            //     NO: mutex lock
            //         push pointer back into pool
            //         mutex unlock
            //
            // So in the case where "do a search" is extremely fast, i.e.,
            // when the haystack is tiny, as in this case, the mutex
            // contention ends up dominating the runtime. As the number of
            // threads increases, the contention gets worse and worse and
            // thus runtime blows up.
            //
            // But, all of that contention can be avoided by giving each
            // thread a fresh regex: each one gets its own pool and each
            // thread gets the "first" thread optimization applied. So the
            // internal access for the mutable scratch space now looks like
            // this:
            //
            //     is "first" thread optimization active?
            //     YES: return pointer to special mutable scratch space
            //     do a search
            //     is "first" thread optimization active?
            //     YES: do nothing
            //
            // So how to fix this? Well, it's kind of hard. The regex crate
            // used to use the 'thread_local' crate that optimized this
            // particular access pattern and essentially kept a hash table
            // keyed on thread ID. But this led to other issues.
            // Specifically, its memory usage scaled with the number of
            // active threads using a regex, whereas the current approach
            // scales with the number of active threads *simultaneously*
            // using a regex.
            //
            // I am not an expert on concurrent data structures though, so
            // there is likely a better approach. But the idea here is indeed
            // to make it possible to opt out of contention by being able to
            // clone the regex. Once you do that, there are **zero**
            // competing resources between the threads.
            //
            // Why not just do this in all cases? Well, I guess I would if I
            // could, but I don't know how. The reason why explicit cloning
            // permits one to opt out is that each thread is handed its own
            // copy of the regex and its own pool, and that is specifically
            // controlled by the caller. I'm not sure how to do that from
            // within the regex library itself, since it isn't really aware
            // of threads per se.
            let re = self.re.clone();
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }

    fn shared(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        // We clone the regex into an Arc and then share it across all
        // threads. Each thread in turn competes with the single regex's
        // shared memory pool for mutable scratch space to use during a
        // search. This is what ultimately caused this 'shared' benchmark to
        // be much slower than the 'cloned' benchmark when run with many
        // threads. Indeed, profiling it reveals that most of the time is
        // spent in the regex internal 'Pool' type's 'get' and 'get_slow'
        // methods.
        let re = Arc::new(self.re.clone());
        for _ in 0..self.threads {
            let re = Arc::clone(&re);
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }
}

fn main() -> anyhow::Result<()> {
    let threads: u32 = std::env::var("REGEX_BENCH_THREADS")?.parse()?;
    let re = RegexBuilder::new(PATTERN)
        .unicode(false)
        .dfa_size_limit(50 * (1 << 20))
        .build()?;
    let benchmark = Benchmark { re, threads };
    let which = std::env::var("REGEX_BENCH_WHICH")?;
    let duration = match &*which {
        "cloned" => benchmark.cloned(),
        "shared" => benchmark.shared(),
        unknown => anyhow::bail!("unrecognized REGEX_BENCH_WHICH={}", unknown),
    };
    writeln!(std::io::stdout(), "{:?}", duration)?;
    Ok(())
}
```
Now build and run the benchmark:
```
$ cargo build --release
$ hyperfine \
    "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" \
    "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       6.5 ms ±   1.2 ms    [User: 55.1 ms, System: 3.8 ms]
  Range (min … max):     0.1 ms …  10.5 ms    254 runs

  Warning: Command took less than 5 ms to complete. Note that the results might
  be inaccurate because hyperfine can not calibrate the shell startup time much
  more precise than this limit. You can try to use the `-N`/`--shell=none`
  option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     530.5 ms ±  12.6 ms    [User: 1886.3 ms, System: 4994.7 ms]
  Range (min … max):   514.2 ms … 552.4 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   81.66 ± 15.52 times faster than
  'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```
As noted in the comments in the code above, the only difference between these two benchmarks is that `cloned` creates a fresh `Regex` for each thread, whereas `shared` uses the same `Regex` across multiple threads. This leads to contention on a single `Mutex` in the latter case and overall very poor performance. (See the code comments above for more info.)
We can confirm this by looking at a profile. Here is a screenshot from `perf` for the `cloned` benchmark:

(screenshot: `perf` profile of the `cloned` benchmark)

And now for the `shared` benchmark:

(screenshot: `perf` profile of the `shared` benchmark)
As we can see, in the `shared` benchmark, virtually all of the time is being spent locking and unlocking the mutex.