CPU Core Utilization: How GraphBit Maximizes Parallel Processing Power
GraphBit’s runtime is designed to “fill the cores” responsibly, making it the only parallel-processing agentic AI framework. It sizes worker pools from host topology, isolates blocking I/O, and exposes knobs to match provider limits and workload mix, so parallelism translates into useful work, not thrash.
Core Principles
- Proportionate scaling: Sizing worker threads at 2× cores recognizes that mixed workloads interleave compute with I/O waits; the extra threads keep cores busy during stalls (see the sizing sketch after this list).
- Hard cap: A practical ceiling of 32 worker threads limits context-switch churn and memory waste on general-purpose instances.
- I/O segregation: A separate blocking pool sized at 4× cores shields compute from socket waits and slow disks.
- Lean stacks: 1 MB thread stacks keep RSS predictable as concurrency rises.
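A minimal sketch of that sizing math in Rust, using only std (the constants mirror the principles above; GraphBit’s actual defaults may differ):

```rust
use std::thread;

/// Derive pool sizes from host topology, mirroring the principles above.
/// (Illustrative constants, not GraphBit's confirmed implementation.)
fn pool_sizes() -> (usize, usize) {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // 2x cores keeps CPUs busy through I/O stalls, capped at 32
    // to limit context-switch churn on large hosts.
    let workers = (cores * 2).min(32);
    // A separate 4x-cores blocking pool absorbs socket and disk waits.
    let blocking = cores * 4;
    (workers, blocking)
}

fn main() {
    let (workers, blocking) = pool_sizes();
    println!("workers = {workers}, blocking = {blocking}");
}
```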
Implementation Highlights
- Runtime: Tokio multi-thread runtime, built with Builder::new_multi_thread() and an explicit worker_threads(..) setting (see the combined sketch after this list).
- Allocator choice: On Unix (non-Python builds), jemalloc is used to reduce fragmentation under parallel allocations.
- Python interop: With PyO3, the GIL is released during Rust execution, so Python callers don’t serialize parallel sections.
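A hedged sketch of how these highlights could fit together. Builder::new_multi_thread, max_blocking_threads, thread_stack_size, the #[global_allocator] attribute, and PyO3’s allow_threads are real APIs; the python feature name and the function names are assumptions for illustration:

```rust
// Cargo.toml (illustrative): tokio = { version = "1", features = ["full"] },
// tikv-jemallocator = "0.5", pyo3 = "0.21"
use tokio::runtime::Builder;

// jemalloc on Unix, non-Python builds only; the "python" feature
// name here is an assumption for illustration.
#[cfg(all(unix, not(feature = "python")))]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn build_runtime(workers: usize, blocking: usize) -> std::io::Result<tokio::runtime::Runtime> {
    Builder::new_multi_thread()
        .worker_threads(workers)          // explicit sizing from topology
        .max_blocking_threads(blocking)   // isolated blocking I/O pool
        .thread_stack_size(1024 * 1024)   // lean 1 MB stacks
        .enable_all()
        .build()
}

// Releasing the GIL around Rust execution (PyO3), so Python callers
// don't serialize parallel sections. Hypothetical entry point.
#[cfg(feature = "python")]
fn run_from_python(py: pyo3::Python<'_>) {
    py.allow_threads(|| {
        // long-running Rust work executes here without holding the GIL
    });
}
```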
Scheduling & Affinity
- Affinity: CPU affinity alignment (optional) constrains threads to the host’s pinned core set, which helps under container CPU quotas and in NUMA-sensitive environments (a sketch follows this list).
- Fairness: Async tasks are short and cooperative; heavy CPU functions should remain bounded or be moved into dedicated pools.
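One plausible way to implement the optional affinity alignment, sketched with the community core_affinity crate and a round-robin pinning policy; neither is confirmed as GraphBit’s actual approach:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::runtime::Builder;

fn build_pinned_runtime(workers: usize) -> std::io::Result<tokio::runtime::Runtime> {
    // Cores visible to this process (reflects pinning in many container setups).
    let cores = core_affinity::get_core_ids().unwrap_or_default();
    let next = AtomicUsize::new(0);
    Builder::new_multi_thread()
        .worker_threads(workers)
        .on_thread_start(move || {
            if cores.is_empty() {
                return; // no affinity info; let the OS schedule freely
            }
            // Pin each new worker to the next core, round-robin.
            let i = next.fetch_add(1, Ordering::Relaxed) % cores.len();
            core_affinity::set_for_current(cores[i]);
        })
        .enable_all()
        .build()
}
```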
Tuning Cheatsheet (safe defaults first)
- worker_threads: Leave the default; raise it only when CPU-bound stages dominate and the host has spare cores (see the builder sketch after this list).
- max_blocking_threads: Increase for heavy network fan-out; decrease if downstreams are rate-limited, to avoid self-induced pressure.
- thread_stack_size_mb: Keep at 1 unless you have deep recursion or large stack objects (prefer the heap for large buffers).
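For concreteness, here is where each knob lands on Tokio’s builder; the method names are real Tokio APIs, while the values simply echo the defaults above:

```rust
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    let rt = Builder::new_multi_thread()
        // worker_threads: omit to accept the default; shown here
        // only to illustrate where the override goes.
        .worker_threads(8)
        // max_blocking_threads: raise for heavy fan-out, lower when
        // downstream providers rate-limit you anyway.
        .max_blocking_threads(64)
        // thread_stack_size_mb = 1, translated to bytes.
        .thread_stack_size(1024 * 1024)
        .enable_all()
        .build()?;
    rt.block_on(async { /* workload */ });
    Ok(())
}
```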
Failure & Resilience
- Circuit breakers and connection pooling absorb flakiness while keeping threads productive.
- Graceful degradation: When rate-limited, concurrency is reduced and retries are jittered to avoid thundering herds (see the retry sketch after this list).
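A minimal sketch of the jittered-retry idea using full jitter; the backoff constants and the is_rate_limited helper are illustrative, not GraphBit’s API:

```rust
use rand::Rng;
use std::time::Duration;

// Hypothetical error probe; stands in for a real provider error check.
fn is_rate_limited(err: &str) -> bool {
    err.contains("429")
}

async fn call_with_retries<F, Fut>(mut op: F) -> Result<String, String>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<String, String>>,
{
    let mut backoff_ms: u64 = 100;
    for _ in 0..5 {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if is_rate_limited(&e) => {
                // Full jitter: sleep a random duration in [0, backoff),
                // so concurrent retries don't stampede in lockstep.
                let jittered = rand::thread_rng().gen_range(0..backoff_ms);
                tokio::time::sleep(Duration::from_millis(jittered)).await;
                backoff_ms = (backoff_ms * 2).min(5_000); // capped exponential
            }
            Err(e) => return Err(e),
        }
    }
    Err("retries exhausted".into())
}
```

Full jitter trades a little average latency for far fewer synchronized retry bursts than fixed backoff.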
Trade-offs
- Cap may underuse very large hosts: If you routinely deploy on 64+ core machines, consider exposing a higher cap behind a config flag (one possible shape is sketched after this list).
- jemalloc is disabled for Python builds and on Windows due to TLS and platform constraints; expect allocator behavior to vary slightly across targets.
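If you do expose a higher cap, one possible shape is an environment override; the variable name GRAPHBIT_MAX_WORKER_THREADS below is hypothetical:

```rust
use std::thread;

// Hypothetical env var; GraphBit does not necessarily expose this today.
const CAP_ENV: &str = "GRAPHBIT_MAX_WORKER_THREADS";

fn worker_cap() -> usize {
    std::env::var(CAP_ENV)
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(32) // the documented default ceiling
}

fn worker_threads() -> usize {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    (cores * 2).min(worker_cap())
}
```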