Benchmarks
Criterion benchmark results for the rsigma detection engine.
Hardware: Apple M4 Pro, macOS
Profile: bench (optimized, release)
Date captured: 2026-05-07
Captured on version: 0.9.0
Freshness note. These figures were captured on rsigma 0.9.0; the workspace has since shipped through 0.13.0 and parts of the hot path (matcher dispatch, dot-path traversal, modifier validation) have changed. Treat the numbers as a baseline shape, not a per-release benchmark. To refresh for a specific release, check out the matching tag and run cargo bench per the commands listed in Running below; update the hardware/date/version block above when you do.
Running
# All benchmarks across all crates
cargo bench
# Individual suites
cargo bench -p rsigma-parser --bench parse
cargo bench -p rsigma-eval --bench eval
cargo bench -p rsigma-eval --bench correlation
cargo bench -p rsigma-eval --bench correlation_memory # peak-heap stress (not Criterion)
cargo bench -p rsigma-runtime --bench runtime_throughput
cargo bench -p rsigma-runtime --bench dynamic_pipelines
# Specific benchmark group
cargo bench -p rsigma-eval --bench eval -- eval_throughput
# Quick mode (fewer samples, useful for CI smoke tests)
cargo bench -- --quick
Parser (rsigma-parser)
Rule Parsing
| Benchmark | Time (median) | Throughput |
|---|---|---|
| Single rule | 10.7 us | - |
| 10 rules | 110.0 us | - |
| 100 rules | 1.15 ms | 2.42 MiB/s |
| 500 rules | 5.79 ms | 5.10 MiB/s |
| 1000 rules | 11.7 ms | 12.7 MiB/s |
| Complex condition (single) | 23.8 us | - |
Wildcard and Regex Rules (pre-compiled pattern cache)
| Benchmark | Time (median) |
|---|---|
| Wildcard rules (100) | 84.1 us |
| Wildcard rules (500) | 84.0 us |
| Wildcard rules (1000) | 84.5 us |
| Regex rules (100) | 50.6 us |
| Regex rules (500) | 50.2 us |
| Regex rules (1000) | 50.6 us |
Wildcard/regex rule parsing is O(1) due to pattern caching (only the first parse compiles the patterns).
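The compile-once behaviour described above can be sketched as a cache keyed by the raw pattern text. This is an illustration only: `PatternCache`, `CompiledPattern`, and the split-on-`*` "compilation" are hypothetical stand-ins, not rsigma types or its actual compilation step.

```rust
use std::collections::HashMap;

/// Stand-in for a compiled wildcard pattern: the literal segments between `*`.
/// (Hypothetical; a real engine would build a matcher here.)
#[derive(Debug, Clone, PartialEq)]
struct CompiledPattern {
    segments: Vec<String>,
}

/// Hypothetical cache keyed by the raw pattern text.
#[derive(Default)]
struct PatternCache {
    compiled: HashMap<String, CompiledPattern>,
    compile_calls: usize,
}

impl PatternCache {
    /// Returns the cached pattern, compiling it only the first time it is seen.
    fn get_or_compile(&mut self, raw: &str) -> &CompiledPattern {
        if !self.compiled.contains_key(raw) {
            self.compile_calls += 1;
            let segments = raw.split('*').map(str::to_owned).collect();
            self.compiled
                .insert(raw.to_owned(), CompiledPattern { segments });
        }
        &self.compiled[raw]
    }
}

fn main() {
    let mut cache = PatternCache::default();
    // Parsing the same pattern in 1,000 rules triggers a single compilation.
    for _ in 0..1000 {
        cache.get_or_compile("*\\powershell*.exe");
    }
    println!("compile calls: {}", cache.compile_calls);
}
```

With a cache like this, the cost of the repeated patterns is paid once, which is consistent with the flat timings in the table.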
Evaluation Engine (rsigma-eval)
Rule Compilation
| Rules | Time (median) |
|---|---|
| 100 | 98.9 us |
| 500 | 478.3 us |
| 1,000 | 961.9 us |
| 5,000 | 4.90 ms |
Rule Load Paths (0.11.x)
Apple M4 Pro, macOS, release build, 2026-05-16. Compares the three engine entry points for loading rules at large N. add_collection and add_rules rebuild the inverted and bloom indexes once at the end of the batch; add_rule in a loop folds each rule incrementally with an amortized-doubling bloom rebuild (64-rule floor, 2x ratchet).
| Rules | add_collection | add_rules | add_rule loop |
|---|---|---|---|
| 1,000 | 1.15 ms (1.15 us/rule) | 1.17 ms (1.17 us/rule) | 1.64 ms (1.64 us/rule) |
| 10,000 | 11.82 ms (1.18 us/rule) | 11.85 ms (1.18 us/rule) | 17.23 ms (1.72 us/rule) |
| 100,000 | 121.65 ms (1.22 us/rule) | 122.13 ms (1.22 us/rule) | 166.07 ms (1.66 us/rule) |
All three paths scale linearly in the rule count. Per-rule cost is essentially constant from 1K to 100K, confirming the O(N) total complexity:
- add_collection and add_rules cost roughly 1.2 us/rule. The fixed per-batch cost is dominated by the final inverted index + bloom build over the aggregate.
- add_rule in a loop costs roughly 1.65 us/rule, about 40% more than the batched paths. The overhead is the per-rule incremental insert plus the ~11 doubling-watermark rebuilds the bloom triggers between 1 and 100K rules. There is no quadratic blowup; the constant factor pays for the incremental contract.
The takeaway is that add_rule is no longer a foot-gun for bulk loads. Batched APIs are still slightly faster and remain the recommended path for cold-load scenarios; the single-rule path exists for cases where the caller wants per-rule error reporting (rsigma rule validate) or per-rule mutation semantics.
Run with cargo bench -p rsigma-eval --bench eval -- rule_load.
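The "~11 rebuilds" figure follows from the 64-rule floor and 2x ratchet. A minimal sketch of that arithmetic, assuming a rebuild fires when the loaded rule count reaches the current watermark and the watermark then doubles (`count_rebuilds` is a hypothetical helper, not engine code):

```rust
/// Counts bloom rebuilds while loading `total_rules` one at a time.
/// Assumption: a rebuild fires when the rule count reaches the watermark,
/// after which the watermark doubles.
fn count_rebuilds(total_rules: usize, floor: usize) -> usize {
    let mut watermark = floor;
    let mut rebuilds = 0;
    for loaded in 1..=total_rules {
        if loaded == watermark {
            rebuilds += 1;
            watermark *= 2;
        }
    }
    rebuilds
}

fn main() {
    // Watermarks hit on the way to 100K: 64, 128, ..., 65,536.
    for n in [1_000, 10_000, 100_000] {
        println!("{n} rules -> {} rebuilds", count_rebuilds(n, 64));
    }
}
```

Because the number of rebuilds grows logarithmically with the rule count, the total rebuild work stays proportional to N, which matches the near-constant per-rule cost in the table.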
Single Event Evaluation
Time to evaluate one event against N compiled rules.
| Rules | Time (median) | Per-rule |
|---|---|---|
| 100 | 2.25 us | 22.5 ns |
| 500 | 12.3 us | 24.5 ns |
| 1,000 | 30.9 us | 30.9 ns |
| 5,000 | 162.9 us | 32.6 ns |
Detection Throughput (100 rules)
| Events | Time (median) | Throughput |
|---|---|---|
| 1,000 | 2.50 ms | 401 Kelem/s |
| 10,000 | 24.8 ms | 403 Kelem/s |
| 100,000 | 248.1 ms | 403 Kelem/s |
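The Throughput and Per-rule columns are plain ratios of the median time. A small sketch of the arithmetic, with hypothetical helper names; results differ from the tables in the last digit because the tables round the displayed times:

```rust
use std::time::Duration;

/// Events per second for a batch that took `median` to process.
fn throughput_per_sec(events: u64, median: Duration) -> f64 {
    events as f64 / median.as_secs_f64()
}

/// Nanoseconds spent per rule when one event is checked against `rules` rules.
fn per_rule_nanos(median: Duration, rules: u64) -> f64 {
    median.as_secs_f64() * 1e9 / rules as f64
}

fn main() {
    // 1,000 events in 2.50 ms, the first row of the throughput table.
    let t = throughput_per_sec(1_000, Duration::from_micros(2_500));
    println!("{:.0} elem/s", t);

    // One event against 100 rules in 2.25 us, the first row of the
    // single-event table.
    let r = per_rule_nanos(Duration::from_nanos(2_250), 100);
    println!("{:.1} ns/rule", r);
}
```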
Batch Mode (Sequential vs Parallel)
| Configuration | Time (median) | Throughput |
|---|---|---|
| 100 rules, sequential | 2.48 ms | 404 Kelem/s |
| 100 rules, batch | 2.52 ms | 397 Kelem/s |
| 1000 rules, sequential | 31.2 ms | 32.0 Kelem/s |
| 1000 rules, batch | 31.3 ms | 32.0 Kelem/s |
| 5000 rules, sequential | 162.0 ms | 6.17 Kelem/s |
| 5000 rules, batch | 162.3 ms | 6.16 Kelem/s |
Wildcard and Regex Matching
| Benchmark | Time (median) |
|---|---|
| Wildcard (100 rules) | 19.1 us |
| Wildcard (500 rules) | 19.2 us |
| Wildcard (1000 rules) | 19.1 us |
| Regex (100 rules) | 5.17 us |
| Regex (500 rules) | 5.21 us |
| Regex (1000 rules) | 5.15 us |
Wildcard/regex matching scales O(1) with rule count thanks to compiled pattern sets.
Aho-Corasick Threshold Sweep (0.10.0)
Single rule with N |contains patterns evaluated against 50 randomly generated events at varying haystack lengths. Drove the choice of AHO_CORASICK_THRESHOLD = 8 in compiler/optimizer.rs. Throughput is per event.
| Patterns | h=100 B | h=1 KB | h=8 KB | h=64 KB |
|---|---|---|---|---|
| 1 | 13.0 Melem/s (3.84 us / batch) | 7.77 Melem/s (6.43 us) | 1.85 Melem/s (27.1 us) | 248 Kelem/s (201.4 us) |
| 2 | 10.5 Melem/s (4.77 us) | 2.33 Melem/s (21.5 us) | 345 Kelem/s (144.8 us) | 42.3 Kelem/s (1.18 ms) |
| 4 | 9.08 Melem/s (5.51 us) | 2.03 Melem/s (24.6 us) | 293 Kelem/s (170.8 us) | 35.6 Kelem/s (1.40 ms) |
| 8 | 5.17 Melem/s (9.68 us) | 620 Kelem/s (80.6 us) | 79.0 Kelem/s (633.1 us) | 9.76 Kelem/s (5.12 ms) |
| 16 | 5.19 Melem/s (9.63 us) | 628 Kelem/s (79.6 us) | 78.6 Kelem/s (636.4 us) | 9.67 Kelem/s (5.17 ms) |
| 32 | 4.99 Melem/s (10.0 us) | 607 Kelem/s (82.3 us) | 76.4 Kelem/s (654.4 us) | 8.88 Kelem/s (5.63 ms) |
Throughput flattens at p=8: p16 and p32 perform within ~3% of p8 because the AC automaton scans the haystack once regardless of pattern count. Below 8 patterns, the sequential str::contains path with SIMD acceleration (memchr / Two-Way) wins. The crossover is clearly at 8.
Run with cargo bench -p rsigma-eval --bench eval -- eval_ac_threshold_sweep.
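The dispatch this sweep informs can be sketched as follows. Only the constant name and value come from the text above; the enum, the function names, and the choice of `>=` as the comparison are assumptions for illustration, and the Aho-Corasick path itself is not implemented here.

```rust
/// Value quoted above for compiler/optimizer.rs.
const AHO_CORASICK_THRESHOLD: usize = 8;

/// Hypothetical strategy tag; not an rsigma type.
#[derive(Debug, PartialEq)]
enum SubstringStrategy {
    Sequential,
    AhoCorasick,
}

/// Assumption: the switch to Aho-Corasick happens at `>=` the threshold.
fn choose_strategy(pattern_count: usize) -> SubstringStrategy {
    if pattern_count >= AHO_CORASICK_THRESHOLD {
        SubstringStrategy::AhoCorasick
    } else {
        SubstringStrategy::Sequential
    }
}

/// Sequential path: one `str::contains` scan of the haystack per pattern,
/// stopping at the first hit.
fn matches_any_sequential(haystack: &str, patterns: &[&str]) -> bool {
    patterns.iter().any(|p| haystack.contains(*p))
}

fn main() {
    for n in [1, 2, 4, 8, 16, 32] {
        println!("{n} patterns -> {:?}", choose_strategy(n));
    }
    let hit = matches_any_sequential("cmd.exe /c whoami", &["whoami", "net user"]);
    println!("sequential match: {hit}");
}
```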
Cross-Rule Aho-Corasick Pre-Filter, daachorse-index feature (0.10.0)
200 non-matching events evaluated against N pure-substring rules. Best-case workload for the cross-rule index: every rule is AC-prunable (every detection consists exclusively of positive substring matchers, no negation in conditions), and every event has zero pattern hits across all fields.
| Rules | Off (default) | On (set_cross_rule_ac(true)) | Speedup |
|---|---|---|---|
| 1,000 | 17.34 ms (11.5 Kelem/s) | 253.0 us (790 Kelem/s) | ~68x |
| 5,000 | 85.51 ms (2.34 Kelem/s) | 883.0 us (226 Kelem/s) | ~97x |
| 10,000 | 173.37 ms (1.15 Kelem/s) | 1.71 ms (117 Kelem/s) | ~101x |
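The cross-rule index replaces O(rules × patterns) matching per event with a single AC scan that is O(haystack_length). In the table above the indexed path still slows as rules are added (253 us at 1K rules, 1.71 ms at 10K), but it stays roughly 70-100x cheaper than the unindexed path at every size.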
For typical mixed workloads (substring + exact + regex rules, events that hit multiple fields, smaller rule sets) the index adds build-time and lookup overhead with smaller wins or none, and can even cause a slowdown. Off by default. Enable via Engine::set_cross_rule_ac(true) programmatically, or --cross-rule-ac on rsigma engine daemon / rsigma engine eval (requires the daachorse-index Cargo feature). Always benchmark against representative data before flipping it on.
Run with cargo bench -p rsigma-eval --features daachorse-index --bench eval -- eval_cross_rule_ac.
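The pruning idea behind the pre-filter can be sketched with the standard library alone. `CrossRuleIndex` and its methods are hypothetical names, and the per-pattern `str::contains` loop is a deliberate stand-in: a real index would find all pattern hits in one automaton pass over the haystack instead of one scan per pattern.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical cross-rule index: maps each substring pattern to the ids of
/// the rules that use it.
struct CrossRuleIndex {
    owners: HashMap<String, Vec<usize>>,
}

impl CrossRuleIndex {
    /// `rules[i]` holds the substring patterns of rule `i`.
    fn build(rules: &[Vec<&str>]) -> Self {
        let mut owners: HashMap<String, Vec<usize>> = HashMap::new();
        for (rule_id, patterns) in rules.iter().enumerate() {
            for p in patterns {
                owners.entry((*p).to_owned()).or_default().push(rule_id);
            }
        }
        Self { owners }
    }

    /// Rules with at least one pattern hit; every other rule can be skipped.
    /// Stand-in scan: one `contains` per unique pattern, not a real AC pass.
    fn candidates(&self, haystack: &str) -> BTreeSet<usize> {
        let mut hit = BTreeSet::new();
        for (pattern, rule_ids) in &self.owners {
            if haystack.contains(pattern.as_str()) {
                hit.extend(rule_ids.iter().copied());
            }
        }
        hit
    }
}

fn main() {
    let rules = vec![
        vec!["mimikatz", "sekurlsa"],
        vec!["powershell -enc"],
        vec!["mimikatz"],
    ];
    let index = CrossRuleIndex::build(&rules);
    // A non-matching event yields no candidates, so no rule is evaluated.
    println!("{:?}", index.candidates("cmd.exe /c whoami"));
    println!("{:?}", index.candidates("run mimikatz.exe now"));
}
```

The benchmark's best case corresponds to the first call: zero hits means the whole rule set is skipped after the scan.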
Correlation Engine (rsigma-eval)
Event Count Correlation
1000 events evaluated against N correlation rules.
| Corr rules | Time (median) | Throughput |
|---|---|---|
| 5 | 944.9 us | 1.06 Melem/s |
| 10 | 953.7 us | 1.05 Melem/s |
| 20 | 974.7 us | 1.03 Melem/s |
Temporal Correlation
1000 events evaluated with temporal ordering constraints.
| Corr rules | Time (median) | Throughput |
|---|---|---|
| 3 | 475.6 us | 2.10 Melem/s |
| 5 | 478.5 us | 2.09 Melem/s |
| 10 | 483.5 us | 2.07 Melem/s |
Correlation Throughput
| Events | Time (median) | Throughput |
|---|---|---|
| 10,000 | 17.6 ms | 568 Kelem/s |
| 100,000 | 175.7 ms | 569 Kelem/s |
Sequential vs Batch (10,000 events)
| Mode | Time (median) | Throughput |
|---|---|---|
| Sequential | 17.7 ms | 565 Kelem/s |
| Batch | 18.7 ms | 534 Kelem/s |
State Pressure (unique group-by keys)
| Unique keys | Time (median) | Throughput |
|---|---|---|
| 1,000 | 764.0 us | 1.31 Melem/s |
| 10,000 | 7.97 ms | 1.25 Melem/s |
| 50,000 | 41.5 ms | 1.20 Melem/s |
Window Modes: sliding vs tumbling vs session (0.15.0)
Apple M4 Pro, macOS, release build, 2026-06-12. Identical event_count workload for all three modes: 10,000 events across 1,000 group keys, one event per group per 10s tick, 1h window, 10m session gap. The window decision in apply_window_open is O(1) (deque front/back inspection), so the three modes cost the same per event.
| Window mode | Time (median) | Throughput |
|---|---|---|
| sliding (default) | 7.25 ms | 1.38 Melem/s |
| tumbling | 6.69 ms | 1.49 Melem/s |
| session | 6.74 ms | 1.48 Melem/s |
Run with cargo bench -p rsigma-eval --bench correlation -- correlation_window_modes.
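A minimal sketch of how a per-group deque of timestamps can serve all three modes with only front/back inspection. The type names and the exact boundary rules (half-open windows, a tumbling window that starts at its first event, a session capped by the timespan) are assumptions for illustration, not the logic of apply_window_open.

```rust
use std::collections::VecDeque;

/// Hypothetical mode tag; times are in seconds.
#[derive(Clone, Copy)]
enum WindowMode {
    Sliding,
    Tumbling,
    Session { gap_secs: i64 },
}

/// Per-group state: timestamps of in-window events, oldest at the front.
#[derive(Default)]
struct GroupWindow {
    timestamps: VecDeque<i64>,
}

impl GroupWindow {
    /// Records an event at `ts` and returns the in-window event count.
    fn observe(&mut self, mode: WindowMode, timespan_secs: i64, ts: i64) -> usize {
        match mode {
            // Sliding: drop events that have aged out of the window.
            WindowMode::Sliding => {
                while let Some(&oldest) = self.timestamps.front() {
                    if ts - oldest >= timespan_secs {
                        self.timestamps.pop_front();
                    } else {
                        break;
                    }
                }
            }
            // Tumbling: start a fresh window once the current one has elapsed.
            WindowMode::Tumbling => {
                if let Some(&start) = self.timestamps.front() {
                    if ts - start >= timespan_secs {
                        self.timestamps.clear();
                    }
                }
            }
            // Session: reset after an idle gap or when the cap is reached.
            WindowMode::Session { gap_secs } => {
                let idle = self
                    .timestamps
                    .back()
                    .map_or(false, |&last| ts - last > gap_secs);
                let over_cap = self
                    .timestamps
                    .front()
                    .map_or(false, |&first| ts - first >= timespan_secs);
                if idle || over_cap {
                    self.timestamps.clear();
                }
            }
        }
        self.timestamps.push_back(ts);
        self.timestamps.len()
    }
}

fn main() {
    let mut w = GroupWindow::default();
    for ts in [0, 30, 59, 60, 200] {
        let n = w.observe(WindowMode::Sliding, 60, ts);
        println!("sliding ts={ts} count={n}");
    }
}
```

In this sketch each stored event is one `i64`, which would be consistent with the 8 B per in-window event reported for event_count below, though the engine's actual layout is not shown in this document.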
Window-Mode Memory Stress (0.15.0)
Apple M4 Pro, macOS, release build, 2026-06-12. The correlation_memory bench is not a Criterion suite: it installs a counting global allocator and reports peak and settled heap deltas over the engine baseline, isolating window-state maintenance (alert thresholds are unreachable; event construction is excluded from the deltas). It reproduces the two scenarios from the SEP #214 discussion on memory becoming the bottleneck in stateful window correlation.
A. High-cardinality session windows (one event per unique key, event_count, gap 5m, cap 2h) — exercises the max_state_entries hard cap and stalest-first eviction:
| Configuration | Throughput | Peak heap | Settled | Live groups |
|---|---|---|---|---|
| 100k keys, default cap (100k) | 789 Kelem/s | 20.5 MiB | 17.7 MiB | 100,000 |
| 1M keys, default cap (100k) | 898 Kelem/s | 39.8 MiB | 22.4 MiB | capped |
| 1M keys, cap raised to 2M | 805 Kelem/s | 327.4 MiB | 243.8 MiB | 1,000,000 |
A live session group costs ~256 B settled, dominated by the GroupKey heap strings rather than the timestamps. Throughput under active eviction is the highest of the three runs because the state map stays small; the eviction sort is amortized over the 10% headroom the cap reclaims.
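The cap-and-evict behaviour in scenario A can be sketched as below. Only the max_state_entries name, the stalest-first order, and the 10% headroom come from the text; `StateMap`, its fields, and the point at which eviction triggers are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical state map: group key -> timestamp of the last event seen.
struct StateMap {
    max_state_entries: usize,
    last_seen: HashMap<String, i64>,
}

impl StateMap {
    fn new(max_state_entries: usize) -> Self {
        Self {
            max_state_entries,
            last_seen: HashMap::new(),
        }
    }

    /// Records an event for `key`, evicting first if a new key would
    /// push the map past the cap.
    fn touch(&mut self, key: &str, ts: i64) {
        let is_new = !self.last_seen.contains_key(key);
        if is_new && self.last_seen.len() >= self.max_state_entries {
            self.evict_stalest();
        }
        self.last_seen.insert(key.to_owned(), ts);
    }

    /// Stalest-first eviction down to 90% of the cap, so the sort cost is
    /// spread over the inserts that refill the reclaimed headroom.
    fn evict_stalest(&mut self) {
        let headroom = (self.max_state_entries / 10).max(1);
        let target = self.max_state_entries.saturating_sub(headroom);
        let mut by_age: Vec<(i64, String)> = self
            .last_seen
            .iter()
            .map(|(key, &ts)| (ts, key.clone()))
            .collect();
        by_age.sort();
        let excess = self.last_seen.len().saturating_sub(target);
        for (_, key) in by_age.into_iter().take(excess) {
            self.last_seen.remove(&key);
        }
    }
}

fn main() {
    // One event per unique key, far more keys than the cap allows.
    let mut state = StateMap::new(100);
    for i in 0..1_000 {
        state.touch(&format!("host-{i}"), i);
    }
    println!("live groups: {}", state.last_seen.len());
}
```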
B. Long-lived chatty sessions (groups emitting continuously inside an open session; gap never exceeded, so the per-group deque grows to timespan/interval entries):
| Workload | Throughput | Peak heap | Bytes/in-window event |
|---|---|---|---|
| event_count session, 1k groups @ 30s (240 ev/window) | 1.16 Melem/s | 2.2 MiB | 8 B |
| event_count sliding, 1k groups @ 30s (240 ev/window) | 1.14 Melem/s | 2.2 MiB | 8 B |
| value_count session, 1k groups @ 30s, distinct strings | 324 Kelem/s | 21.1 MiB | ~92 B |
| event_count session, 100 groups @ 1 ev/s (7,200 ev/window) | 1.13 Melem/s | 6.3 MiB | 9 B |
| value_count session, 100 groups @ 1 ev/s, distinct (1,800 ev/window) | 63 Kelem/s | 16.0 MiB | ~93 B |
C. Mode comparison (10k groups, 1M events, 1h window): sliding 1.10 Melem/s, tumbling 1.06 Melem/s, session 1.06 Melem/s — all 6.6 MiB peak. Memory differences between modes come only from how long entries are retained, not from per-event overhead.
Run with cargo bench -p rsigma-eval --bench correlation_memory (about half a minute total).
Runtime (rsigma-runtime)
LogProcessor Pipeline Throughput
End-to-end processing: format parsing, detection, and result collection (100 rules).
| Format | Events | Time (median) | Throughput |
|---|---|---|---|
| JSON | 1,000 | 1.15 ms | 868 Kelem/s |
| JSON | 10,000 | 9.45 ms | 1.06 Melem/s |
| Syslog | 1,000 | 849.4 us | 1.18 Melem/s |
| Syslog | 10,000 | 7.20 ms | 1.39 Melem/s |
| Plain | 1,000 | 192.4 us | 5.20 Melem/s |
| Plain | 10,000 | 1.06 ms | 9.40 Melem/s |
| Auto-detect | 1,000 | 1.11 ms | 903 Kelem/s |
| Auto-detect | 10,000 | 9.38 ms | 1.07 Melem/s |
Raw Engine vs LogProcessor (10,000 events, 100 rules)
| Mode | Time (median) | Throughput |
|---|---|---|
| Raw Engine (pre-parsed) | 11.6 ms | 865 Kelem/s |
| LogProcessor (JSON) | 9.24 ms | 1.08 Melem/s |
| LogProcessor (auto-detect) | 9.14 ms | 1.09 Melem/s |
Rule Scaling (1,000 JSON events)
| Rules | Time (median) | Throughput |
|---|---|---|
| 100 | 1.11 ms | 904 Kelem/s |
| 500 | 1.11 ms | 903 Kelem/s |
| 1,000 | 1.10 ms | 909 Kelem/s |
Rule count has minimal impact on runtime throughput due to the engine's indexed matching.
Dynamic Pipelines (rsigma-runtime)
Source Resolution (File I/O + JSON Parse)
| Items | Time (median) |
|---|---|
| 10 | 18.9 us |
| 100 | 20.9 us |
| 1,000 | 64.3 us |
| 10,000 | 467.1 us |
Data Parsing (No I/O)
| Format | Items | Time (median) |
|---|---|---|
| JSON | 10 | 388 ns |
| JSON | 100 | 2.89 us |
| JSON | 1,000 | 25.4 us |
| JSON | 10,000 | 255.4 us |
| YAML | 10 | 3.38 us |
| Lines | 100 | 3.05 us |
Extract Expressions
Expression evaluation on a 100-item dataset with nested objects.
| Language | Expression type | Time (median) |
|---|---|---|
| JQ | Identity (.items) | 60.8 us |
| JQ | Filter (select(.active)) | 96.2 us |
| JQ | Nested path (.a.b.c) | 34.8 us |
| JSONPath | Simple ($.items[*].name) | 25.2 us |
| JSONPath | Filter ([?@.active==true]) | 27.1 us |
| CEL | Field access (data.metadata.count) | 59.8 us |
| CEL | List filter (.filter(x, x.active)) | 827.6 us |
Template Expansion
TemplateExpander::expand substituting ${source.*} references in pipeline vars.
| Vars | Values/source | Time (median) |
|---|---|---|
| 1 | 10 | 500 ns |
| 5 | 10 | 2.24 us |
| 10 | 10 | 4.37 us |
| 20 | 10 | 9.00 us |
| 5 | 100 | 11.3 us |
| 5 | 1,000 | 101.6 us |
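For readers unfamiliar with the substitution being timed, here is a minimal sketch of `${source.*}` expansion. It is not the signature or behaviour of TemplateExpander::expand: `expand_template` is a hypothetical function, it handles only scalar string values, and leaving unknown references untouched is an assumption.

```rust
use std::collections::HashMap;

/// Replaces every `${source.NAME}` in `input` with `sources[NAME]`.
/// Unknown or unterminated references are copied through unchanged.
fn expand_template(input: &str, sources: &HashMap<&str, &str>) -> String {
    const PREFIX: &str = "${source.";
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find(PREFIX) {
        // Copy the literal text before the reference.
        out.push_str(&rest[..start]);
        let after = &rest[start + PREFIX.len()..];
        match after.find('}') {
            Some(end) => {
                let name = &after[..end];
                match sources.get(name) {
                    Some(value) => out.push_str(value),
                    // Unknown source: keep the reference verbatim.
                    None => out.push_str(&rest[start..start + PREFIX.len() + end + 1]),
                }
                rest = &after[end + 1..];
            }
            None => {
                // No closing brace: keep the remainder as-is and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let mut sources = HashMap::new();
    sources.insert("ioc_ips", "10.0.0.1");
    println!("{}", expand_template("dst_ip: ${source.ioc_ips}", &sources));
}
```

A single left-to-right pass like this is linear in the template and value sizes, which fits the roughly proportional growth in the table as vars and values per source increase.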
Resolve with Extract (File + Filter, 500 IOC entries)
| Language | Time (median) |
|---|---|
| JQ (.indicators[] \| select(.active) \| .value) | 527.8 us |
| JSONPath ($.indicators[?@.active==true].value) | 272.0 us |
| CEL (data.indicators.filter(x, x.active).map(x, x.value)) | 43.2 ms |
Dynamic Detection End-to-End
Full pipeline: resolve source, expand templates, apply value_placeholders, evaluate events.
| Scenario | Time (median) | Throughput |
|---|---|---|
| Detect 1000 events (50 IOCs) | 369.5 us | 2.71 Melem/s |
| Reload with resolve | 42.4 us | 23.6 Melem/s |
Key Observations
- AC threshold is empirically 8: substring-list throughput flattens at 8 patterns once Aho-Corasick takes over. p16/p32 perform within ~3% of p8; below 8, the sequential str::contains SIMD path (memchr / Two-Way) is faster.
- Cross-rule AC is order-of-magnitude on substring-only rule sets: with the daachorse-index feature enabled, 200 non-matching events against 10K pure-substring rules dropped from 173 ms to 1.71 ms (~101x). Off by default; only worth enabling for substring-heavy rule sets where most events don't match (e.g., threat-intel feeds against high-volume telemetry).
- Detection is fast: ~400K events/sec with 100 rules in pure evaluation mode, scaling linearly with event count.
- Runtime overhead is negative: LogProcessor with JSON batching is actually faster than raw Engine evaluation due to batch-level optimizations and format-aware parsing.
- Rule count scales well: Going from 100 to 1,000 rules raises the per-rule evaluation cost by only ~37% (22.5 ns to 30.9 ns per rule in the single-event table), thanks to indexed field matching.
- Correlation is efficient: Temporal correlations (2.1M elem/s) are 2x faster than event-count correlations (1.05M elem/s), and both scale linearly with events.
- Window modes cost the same per event: sliding, tumbling, and session all run at ~1.4-1.5 Melem/s on an identical workload. The window decision is O(1); choosing session over sliding is free at evaluation time.
- Correlation memory is bounded by entry count, not bytes: the max_state_entries cap (default 100k) held 1M unique session keys to a 39.8 MiB peak via stalest-first eviction. Within the cap, per-group deques grow with timespan x event rate: 8 B per in-window event for event_count, ~92 B for value_count with distinct string values.
- value_count distinct counting is the correlation bottleneck: the distinct count is recomputed per event over the whole window (O(W) per event), so throughput drops from ~1.1 Melem/s to 63 Kelem/s at 1,800 distinct values per window; CPU collapses before memory does. Prefer shorter windows or event_count where distinctness is not required.
- Template expansion is negligible: Even with 20 vars, expansion adds < 10 us. Not a bottleneck.
- JSONPath is the fastest extraction language: Roughly 2x faster than JQ for comparable filter operations on dynamic source data.
- CEL has high overhead: ~160x slower than JSONPath for list filtering due to interpretation overhead. Best suited for simple field access or small datasets.
- Dynamic pipelines add no per-event cost: Once the engine is built, detection throughput with dynamic pipelines (2.71M elem/s) is comparable to static pipeline performance.
- Reload is cheap: Engine rebuild with source re-resolution takes ~42 us (excluding network/file I/O). In production, reload latency is dominated by source fetch time.
For operator-facing performance guidance (when to enable --bloom-prefilter and --cross-rule-ac, how to tune --batch-size and --buffer-size), see Performance Tuning. For the metrics that surface throughput and back-pressure at runtime, see Prometheus metrics and Observability.