Prometheus Metrics
The engine daemon exposes Prometheus metrics on GET /metrics on the same --api-addr as the REST API. The full definition catalogue is 27 metric names across three concerns; the runtime exposes the ones that have ever fired in a given process. A startup scrape shows 21 names by default (one of the per-rule counters surfaces immediately because the registry pre-creates it for documentation); the remaining six lazy metrics register on first use of dynamic pipelines or OTLP.
The exact source of truth is the daemon/metrics module.
Engine core (17 metrics)
These always show up. They cover ingest, matches, queue depth, back-pressure, reloads, and resource usage.
| Metric | Type | Labels | Description |
|---|---|---|---|
| rsigma_events_processed_total | counter | - | Total events processed by the engine. |
| rsigma_events_parse_errors_total | counter | - | JSON or log-format parse errors at the source. |
| rsigma_detection_matches_total | counter | - | Total detection matches emitted. |
| rsigma_correlation_matches_total | counter | - | Total correlation matches emitted. |
| rsigma_detection_rules_loaded | gauge | - | Number of detection rules currently loaded. |
| rsigma_correlation_rules_loaded | gauge | - | Number of correlation rules currently loaded. |
| rsigma_correlation_state_entries | gauge | - | Active entries in the correlation state. Watch versus the max_state_entries cap (default 100000). |
| rsigma_reloads_total | counter | - | Total reload attempts (file watcher, SIGHUP, POST /api/v1/reload). |
| rsigma_reloads_failed_total | counter | - | Reload attempts that produced parse or compile errors. |
| rsigma_uptime_seconds | gauge | - | Daemon uptime in seconds. |
| rsigma_input_queue_depth | gauge | - | Events currently buffered in the source-to-engine channel. |
| rsigma_output_queue_depth | gauge | - | Results currently buffered in the engine-to-sink channel. |
| rsigma_back_pressure_events_total | counter | - | Times a source was blocked on a full event channel. |
| rsigma_event_processing_seconds | histogram | - | Per-event processing latency. |
| rsigma_pipeline_latency_seconds | histogram | - | End-to-end latency from event dequeue to sink send. |
| rsigma_batch_size | histogram | - | Number of events processed per batch. |
| rsigma_dlq_events_total | counter | - | Events routed to the dead-letter queue. |
Per-rule labels (2 metrics)
These counters carry labels that identify which rule fired. They surface on /metrics only after the first match for that kind.
| Metric | Type | Labels | Description |
|---|---|---|---|
| rsigma_detection_matches_by_rule_total | counter | rule_title, level | Detection matches per rule. |
| rsigma_correlation_matches_by_rule_total | counter | rule_title, level, correlation_type | Correlation matches per rule (correlation_type is event_count, value_count, temporal, temporal_ordered, value_sum, value_avg, value_percentile, or value_median). |
rule_title is not guaranteed to be unique in a rule set. If two rules share a title, their counters add together. For collision-free per-rule analytics, scrape rsigma_detection_matches_total and join against your detection NDJSON stream by rule_id outside Prometheus.
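As a rough illustration of how the per-rule counters might be queried, the recording rules below aggregate match rates by rule title and by correlation type. This is a sketch only: the group name and the record names are hypothetical and not something rsigma ships. Because of the title collision described above, a single rule_title series may cover more than one rule.

```yaml
groups:
  - name: rsigma-per-rule   # hypothetical group name
    rules:
      # Detection match rate per rule title and level over 5 minutes.
      # Rules that share a title are summed into one series.
      - record: rsigma:detection_matches_by_rule:rate5m   # hypothetical record name
        expr: sum by (rule_title, level) (rate(rsigma_detection_matches_by_rule_total[5m]))
      # Correlation match rate broken down by correlation type.
      - record: rsigma:correlation_matches_by_type:rate5m   # hypothetical record name
        expr: sum by (correlation_type) (rate(rsigma_correlation_matches_by_rule_total[5m]))
```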
Dynamic pipeline sources (5 metrics)
Exposed when one or more pipelines declare dynamic sources. The labelled counters surface after the first resolve attempt for the relevant source; source_cache_hits_total and source_resolve_seconds are global (no source_id label).
| Metric | Type | Labels | Description |
|---|---|---|---|
| rsigma_source_resolves_total | counter | source_id, source_type (file, http, command, nats) | Total dynamic source resolution attempts. Counts every attempt, successful or not. |
| rsigma_source_resolve_errors_total | counter | source_id, error_kind (Fetch, Parse, Extract, Timeout, ResourceLimit) | Failed dynamic source resolutions. |
| rsigma_source_resolve_seconds | histogram | - | Dynamic source resolution latency. Aggregated across all sources. |
| rsigma_source_cache_hits_total | counter | - | Times cached source data was served on resolution failure. Aggregated across all sources. |
| rsigma_source_last_resolved_timestamp | gauge | source_id | Unix timestamp of the last successful resolution per source. Alert on staleness. |
The error_kind values come from rsigma_runtime::sources::SourceErrorKind. Fetch covers HTTP / file / command / NATS connect-or-read failures (per-protocol details land in the error_message log field, not the label). ResourceLimit covers the 10 MiB body cap, 30 s command exec cap, and similar.
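One possible way to use these counters together is a per-source error-ratio alert. The sketch below assumes that each failed attempt increments both rsigma_source_resolves_total and rsigma_source_resolve_errors_total exactly once. The alert name, group name, 15-minute window, and 0.5 threshold are illustrative choices rather than recommended values.

```yaml
groups:
  - name: rsigma-sources   # hypothetical group name
    rules:
      # More than half of the resolve attempts for a source failed.
      # Both sides are summed by source_id because the two counters
      # carry different secondary labels (source_type vs error_kind).
      - alert: RsigmaSourceResolveErrorRatio   # hypothetical alert name
        expr: |
          sum by (source_id) (rate(rsigma_source_resolve_errors_total[15m]))
            /
          sum by (source_id) (rate(rsigma_source_resolves_total[15m]))
            > 0.5
        for: 15m
        labels: {severity: warning}
```

A source that has never failed has no error series, so the expression returns nothing for it and the alert stays silent.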
Enrichment (6 metrics)
Exposed when the daemon is built with daemon and --enrichers is passed. Every (enricher_id, kind, status) triple and every HTTP-cache enricher_id row is pre-registered at startup, so all six families render with their # HELP / # TYPE lines and zeroed counters on the first scrape, even before any event has fired. Filtered (kind- or scope-mismatched) enricher calls do not increment any counter, so cardinality stays bounded by the number of configured enrichers.
| Metric | Type | Labels | Description |
|---|---|---|---|
| rsigma_enrichment_total | counter | enricher_id, kind (detection, correlation), status (success, skip, error, timeout, drop) | Per-call outcome counter. kind is the enricher's declared kind (the YAML kind: field), not a per-result discriminator. |
| rsigma_enrichment_duration_seconds | histogram | enricher_id, kind | Per-enricher latency. Buckets target both fast template calls and slower http/command invocations. |
| rsigma_enrichment_queue_depth | gauge | - | Pending enrichment calls (sum across both kinds). Watch this versus max_concurrent_enrichments. |
| rsigma_enrichment_http_cache_hits_total | counter | enricher_id | HTTP enricher response-cache hits. Mandatory signal for any rate-limited API recipe. |
| rsigma_enrichment_http_cache_misses_total | counter | enricher_id | HTTP enricher response-cache misses. |
| rsigma_enrichment_http_cache_expirations_total | counter | enricher_id | HTTP enricher response-cache entries evicted on expiry. |
The kind label is carried even though enricher_id typically already encodes it (asset_lookup_det vs asset_lookup_corr), so dashboards can compute sum by (kind) without depending on a naming convention.
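The following recording rules sketch how the kind label and the cache counters could feed a dashboard. The group and record names are hypothetical. The hit ratio is simply hits divided by hits plus misses, and evaluates to NaN for an enricher with no cache traffic in the window.

```yaml
groups:
  - name: rsigma-enrichment   # hypothetical group name
    rules:
      # Call rate per declared kind and outcome, independent of how
      # enricher_id values happen to be named.
      - record: rsigma:enrichment_calls:rate5m   # hypothetical record name
        expr: sum by (kind, status) (rate(rsigma_enrichment_total[5m]))
      # HTTP response-cache hit ratio per enricher, between 0 and 1.
      - record: rsigma:enrichment_http_cache:hit_ratio5m   # hypothetical record name
        expr: |
          sum by (enricher_id) (rate(rsigma_enrichment_http_cache_hits_total[5m]))
            /
          (
            sum by (enricher_id) (rate(rsigma_enrichment_http_cache_hits_total[5m]))
            +
            sum by (enricher_id) (rate(rsigma_enrichment_http_cache_misses_total[5m]))
          )
```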
OTLP (3 metrics)
Exposed when the daemon is built with daemon-otlp and an OTLP receiver is active. The labelled counters surface after the first request of that kind.
| Metric | Type | Labels | Description |
|---|---|---|---|
| rsigma_otlp_requests_total | counter | transport (http, grpc), encoding (e.g. json, protobuf, protobuf+gzip) | OTLP export requests received. |
| rsigma_otlp_log_records_total | counter | - | Log records ingested via OTLP. |
| rsigma_otlp_errors_total | counter | transport, reason (unsupported_content_type, decompression, decode, channel_closed) | OTLP request errors. |
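
A minimal sketch of an alert on the OTLP error counter, broken down by the transport and reason labels listed above. The alert name, group name, window, and duration are hypothetical. The rule fires on any sustained error rate, so a noisy client may call for a higher threshold.

```yaml
groups:
  - name: rsigma-otlp   # hypothetical group name
    rules:
      # Sustained OTLP request errors, one alert instance per
      # (transport, reason) pair that is currently failing.
      - alert: RsigmaOtlpErrors   # hypothetical alert name
        expr: sum by (transport, reason) (rate(rsigma_otlp_errors_total[5m])) > 0
        for: 10m
        labels: {severity: warning}
```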
TLS (2 metrics)
Exposed when the daemon is built with daemon-tls. Both metrics render with their # HELP and # TYPE lines as soon as TLS is configured, even before the first handshake.
| Metric | Type | Labels | Description |
|---|---|---|---|
| rsigma_tls_certificate_expiry_seconds | gauge | - | Seconds until the active TLS server certificate's not_after. Signed: negative once expired. Updated at startup and after every successful SIGHUP-triggered reload. |
| rsigma_tls_active_connections | gauge | - | Currently active TLS-terminated connections on the API listener. Decrements on connection close (including handshake failure). |
Scrape configuration
Minimum Prometheus scrape config:
```yaml
scrape_configs:
  - job_name: rsigma
    scrape_interval: 15s
    static_configs:
      - targets: ['rsigma.internal:9090']
```
15-30 s intervals are reasonable. The histograms use the default prometheus bucket boundaries; alert on the _bucket{le="..."} quantiles you care about rather than the average, which becomes meaningless under bimodal latency.
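To make the quantile advice concrete, the recording rules below estimate p99 from the histogram buckets with the standard histogram_quantile function. Treat this as a sketch: the group and record names are hypothetical, and the 5-minute window is an arbitrary choice. The estimate is interpolated between bucket boundaries, so it can be no finer than the buckets themselves.

```yaml
groups:
  - name: rsigma-latency   # hypothetical group name
    rules:
      # p99 per-event processing latency over the last 5 minutes.
      # sum by (le) merges the buckets of all scraped daemons.
      - record: rsigma:event_processing_seconds:p99_5m   # hypothetical record name
        expr: histogram_quantile(0.99, sum by (le) (rate(rsigma_event_processing_seconds_bucket[5m])))
      # p99 end-to-end pipeline latency over the last 5 minutes.
      - record: rsigma:pipeline_latency_seconds:p99_5m   # hypothetical record name
        expr: histogram_quantile(0.99, sum by (le) (rate(rsigma_pipeline_latency_seconds_bucket[5m])))
```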
Useful alerts
```yaml
groups:
  - name: rsigma
    rules:
      # Engine cannot keep up.
      - alert: RsigmaBackPressure
        expr: rate(rsigma_back_pressure_events_total[5m]) > 0
        for: 10m
        labels: {severity: warning}
      # Correlation state above 80% of the default 100000 cap.
      - alert: RsigmaCorrelationStatePressure
        expr: rsigma_correlation_state_entries > 80000
        for: 10m
        labels: {severity: warning}
      # DLQ taking traffic.
      - alert: RsigmaDlqVolume
        expr: rate(rsigma_dlq_events_total[5m]) > 1
        for: 15m
        labels: {severity: warning}
      # Reloads failing means rules are broken on disk.
      - alert: RsigmaReloadsFailing
        expr: rate(rsigma_reloads_failed_total[5m]) > 0
        for: 10m
        labels: {severity: critical}
      # Dynamic source went stale (no successful resolve in 10 minutes).
      - alert: RsigmaSourceStale
        expr: time() - rsigma_source_last_resolved_timestamp > 600
        for: 5m
        labels: {severity: warning}
      # Enricher consistently failing (timeouts or fetch errors).
      - alert: RsigmaEnrichmentFailing
        expr: |
          sum by (enricher_id) (
            rate(rsigma_enrichment_total{status=~"error|timeout"}[5m])
          ) > 1
        for: 10m
        labels: {severity: warning}
      # TLS certificate expires within 14 days.
      - alert: RsigmaTlsCertExpiring
        expr: rsigma_tls_certificate_expiry_seconds < 14 * 86400
        for: 5m
        labels: {severity: warning}
      # TLS certificate has already expired.
      - alert: RsigmaTlsCertExpired
        expr: rsigma_tls_certificate_expiry_seconds < 0
        for: 1m
        labels: {severity: critical}
```
Histograms: bucket guidance
| Metric | Typical p50 | Typical p99 | Notes |
|---|---|---|---|
| rsigma_event_processing_seconds | 1-30 µs | < 1 ms | Per-event evaluation against the loaded rule set. Spikes correlate with reload events. |
| rsigma_pipeline_latency_seconds | 1-100 µs | < 5 ms | End-to-end from event dequeue to sink send. Dominated by sink latency (file vs NATS). |
| rsigma_batch_size | 1 | 1 | Default --batch-size 1. With --batch-size 64 and load, p50 trends toward 64. |
event_processing_seconds p99 above 5 ms is usually a sign of misuse (regex-heavy rules without --cross-rule-ac, or many |all modifiers).
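The 5 ms guideline above could be turned into an alert along the following lines. This is a sketch under assumptions: the alert name, group name, window, and `for` duration are hypothetical, and the result is only as precise as the histogram's bucket boundaries allow.

```yaml
groups:
  - name: rsigma-latency-alerts   # hypothetical group name
    rules:
      # Estimated p99 per-event processing latency above 5 ms (0.005 s).
      - alert: RsigmaEventProcessingSlow   # hypothetical alert name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(rsigma_event_processing_seconds_bucket[5m]))
          ) > 0.005
        for: 10m
        labels: {severity: warning}
```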
See also
- Observability for the broader observability story, including the tracing event targets that complement these metrics.
- Performance Tuning for which metric to watch when sizing --buffer-size, --batch-size, or correlation max_state_entries.
- Streaming Detection for how the /metrics endpoint fits into the broader daemon API.
- daemon/metrics source for the registry implementation.