Prometheus Metricsπ
The engine daemon exposes Prometheus metrics on GET /metrics on the same --api-addr as the REST API. The full definition catalogue under --all-features (which is how the prebuilt release archives and the GHCR Docker image are built) is 44 metric names across nine concerns: 39 are always registered, and the OTLP (3) and TLS (2) families are feature-gated on daemon-otlp and daemon-tls respectively. The runtime exposes the ones that have ever fired in a given process. The three field-observer surfaces always render their # HELP/# TYPE lines (and stay at zero unless --observe-fields was passed); the others follow the lazy-registration pattern documented per section below.
The exact source of truth is the daemon/metrics module.
Engine core (17 metrics)π
These always show up. They cover ingest, matches, queue depth, back-pressure, reloads, and resource usage.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_events_processed_total | counter | β | Total events processed by the engine. |
rsigma_events_parse_errors_total | counter | β | JSON or log-format parse errors at the source. |
rsigma_detection_matches_total | counter | β | Total detection matches emitted. |
rsigma_correlation_matches_total | counter | β | Total correlation matches emitted. |
rsigma_detection_rules_loaded | gauge | β | Number of detection rules currently loaded. |
rsigma_correlation_rules_loaded | gauge | β | Number of correlation rules currently loaded. |
rsigma_correlation_state_entries | gauge | β | Active entries in the correlation state. Watch versus the max_state_entries cap (default 100000). |
rsigma_reloads_total | counter | β | Total reload attempts (file watcher, SIGHUP, POST /api/v1/reload). |
rsigma_reloads_failed_total | counter | β | Reload attempts that produced parse or compile errors. |
rsigma_uptime_seconds | gauge | β | Daemon uptime in seconds. |
rsigma_input_queue_depth | gauge | β | Events currently buffered in the sourceβengine channel. Tracked for every input, including the HTTP and OTLP push receivers. |
rsigma_output_queue_depth | gauge | β | Results currently buffered in the engineβsink channel. |
rsigma_back_pressure_events_total | counter | β | Times a source was blocked on a full event channel. |
rsigma_event_processing_seconds | histogram | β | Per-event processing latency. |
rsigma_pipeline_latency_seconds | histogram | β | End-to-end latency from event dequeue to sink send. |
rsigma_batch_size | histogram | β | Number of events processed per batch. |
rsigma_dlq_events_total | counter | β | Events routed to the dead-letter queue. |
rsigma_sink_queue_depth | gauge | sink | Results buffered in each sink's delivery queue. |
rsigma_sink_retries_total | counter | sink | Sink delivery retries after a retryable failure. |
rsigma_sink_dropped_total | counter | sink | Results dropped because a lossy sink's queue was full (?on_full=drop). |
rsigma_sink_delivery_failures_total | counter | sink | Sink deliveries that exhausted retries and were routed to the DLQ. |
rsigma_webhook_requests_total | counter | webhook_id, outcome (success, permanent_failure, rate_limited_wait) | Webhook requests by outcome. Queue depth, retries, drops, and DLQ routing are read from the shared per-sink series above, keyed by sink=<webhook id> (one-to-one with webhook_id). |
rsigma_webhook_request_duration_seconds | histogram | webhook_id | Per-webhook HTTP request latency. |
Per-rule labels (2 metrics)π
These counters carry labels that identify which rule fired. They surface on /metrics only after the first match for that kind.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_detection_matches_by_rule_total | counter | rule_title, level | Detection matches per rule. |
rsigma_correlation_matches_by_rule_total | counter | rule_title, level, correlation_type | Correlation matches per rule (correlation_type is event_count, value_count, temporal, temporal_ordered, value_sum, value_avg, value_percentile, or value_median). |
rule_title is not guaranteed to be unique in a rule set. If two rules share a title, their counters add together. For collision-free per-rule analytics, scrape rsigma_detection_matches_total and join against your detection NDJSON stream by rule_id outside Prometheus.
These two families feed the silence and noisy signals of rule hygiene (and the production-volume column of rule scorecard): pass a saved scrape or a live endpoint as --metrics and they join per-rule by rule_title.
Dynamic pipeline sources (5 metrics)π
Exposed when one or more pipelines declare dynamic sources. The labelled counters surface after the first resolve attempt for the relevant source; source_cache_hits_total and source_resolve_seconds are global (no source_id label).
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_source_resolves_total | counter | source_id, source_type (file, http, command, nats) | Total dynamic source resolution attempts. Counts every attempt, successful or not. |
rsigma_source_resolve_errors_total | counter | source_id, error_kind (Fetch, Parse, Extract, Timeout, ResourceLimit) | Failed dynamic source resolutions. |
rsigma_source_resolve_seconds | histogram | β | Dynamic source resolution latency. Aggregated across all sources. |
rsigma_source_cache_hits_total | counter | β | Times cached source data was served on resolution failure. Aggregated across all sources. |
rsigma_source_last_resolved_timestamp | gauge | source_id | Unix timestamp of the last successful resolution per source. Alert on staleness. |
The error_kind values come from rsigma_runtime::sources::SourceErrorKind. Fetch covers HTTP / file / command / NATS connect-or-read failures (per-protocol details land in the error_message log field, not the label). ResourceLimit covers the 10 MiB body cap, 30 s command exec cap, and similar.
Enrichment (6 metrics)π
Exposed when the daemon is built with daemon and --enrichers is passed. Every (enricher_id, kind, status) triple and every HTTP-cache enricher_id row is pre-registered at startup, so all six families render with their # HELP / # TYPE lines and zeroed counters on the first scrape, even before any event has fired. Filtered (kind- or scope-mismatched) enricher calls do not increment any counter, so cardinality stays bounded by the number of configured enrichers.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_enrichment_total | counter | enricher_id, kind (detection, correlation), status (success, skip, error, timeout, drop) | Per-call outcome counter. kind is the enricher's declared kind (the YAML kind: field), not a per-result discriminator. |
rsigma_enrichment_duration_seconds | histogram | enricher_id, kind | Per-enricher latency. Buckets target both fast template calls and slower http/command invocations. |
rsigma_enrichment_queue_depth | gauge | β | Pending enrichment calls (sum across both kinds). Watch this versus max_concurrent_enrichments. |
rsigma_enrichment_http_cache_hits_total | counter | enricher_id | HTTP enricher response-cache hits. Mandatory signal for any rate-limited API recipe. |
rsigma_enrichment_http_cache_misses_total | counter | enricher_id | HTTP enricher response-cache misses. |
rsigma_enrichment_http_cache_expirations_total | counter | enricher_id | HTTP enricher response-cache entries evicted on expiry. |
The kind label is carried even though enricher_id typically already encodes it (asset_lookup_det vs asset_lookup_corr), so dashboards can compute sum by (kind) without depending on a naming convention.
Alert pipeline (13 metrics)π
Exposed when the daemon is built with daemon. The fixed label sets on rsigma_dedup_results_total, rsigma_incidents_emitted_total, and rsigma_incident_overmerge_total, plus the rsigma_dedup_store_entries, rsigma_incidents_open, rsigma_silences_active, and rsigma_inhibit_sources_active gauges, are pre-registered at startup, so they render with their # HELP / # TYPE lines and zeroed series on the first scrape, even before --alert-pipeline is passed or any event fires. The rsigma_inhibited_total{rule} series appears once a rule first inhibits. See the Alert Pipeline guide.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_dedup_results_total | counter | action (emitted, folded, repeat, resolved) | Dedup outcomes: first fires emitted, duplicates folded, repeat re-emits, and resolved records. |
rsigma_dedup_store_entries | gauge | β | Active dedup alerts currently tracked. |
rsigma_dedup_evictions_total | counter | β | Active alerts evicted after resolving. |
rsigma_dedup_summaries_emitted_total | counter | β | Dedup summary records emitted (repeat re-emits plus resolved records). |
rsigma_incidents_open | gauge | β | Open incidents currently tracked by the grouping stage. |
rsigma_incidents_emitted_total | counter | trigger (group_wait, group_interval, repeat, resolved) | Incident emissions by trigger. |
rsigma_incident_results_total | counter | β | Total incident records emitted. |
rsigma_incident_overmerge_total | counter | guard (stop_value, cardinality_ceiling) | Entity-graph guard hits that suppressed a join. |
rsigma_silenced_total | counter | β | Results muted by an active silence. |
rsigma_silences_active | gauge | β | Currently-active silences. |
rsigma_inhibited_total | counter | rule | Results muted by an inhibition rule, by rule name. |
rsigma_inhibit_sources_active | gauge | β | Currently-active inhibition sources. |
rsigma_alert_pipeline_duration_seconds | histogram | β | Alert-pipeline stage duration in seconds. |
Risk-based alerting (9 metrics)π
Exposed when the daemon is built with daemon. The fixed label set on rsigma_risk_annotations_total and rsigma_risk_incidents_emitted_total, plus the rsigma_risk_entities_open and rsigma_risk_state_entries gauges, are pre-registered at startup, so they render with their # HELP / # TYPE lines and zeroed series on the first scrape, even before --risk is passed or any event fires. See the Risk-Based Alerting guide.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_risk_annotations_total | counter | action (scored, no_entity, skipped) | Risk-annotation outcomes: scored with entities, scored with no entity, or skipped (out of scope). |
rsigma_risk_annotation_score | histogram | β | Distribution of the per-detection resolved risk score. |
rsigma_risk_objects_total | counter | β | Risk objects extracted from firing detections. |
rsigma_risk_entities_open | gauge | β | Entities currently tracked by the risk accumulator. |
rsigma_risk_state_entries | gauge | β | Risk contributions currently retained across all entities. |
rsigma_risk_evictions_total | counter | β | Entities dropped from the accumulator (store full or aged out). |
rsigma_risk_incidents_emitted_total | counter | trigger (score, tactic_count) | Risk incidents emitted by trigger. |
rsigma_risk_incident_results_total | counter | β | Total risk incident records emitted. |
rsigma_risk_layer_duration_seconds | histogram | β | Risk-layer stage duration in seconds. |
OTLP (3 metrics)π
Exposed when the daemon is built with daemon-otlp and an OTLP receiver is active. The labelled counters surface after the first request of that kind.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_otlp_requests_total | counter | transport (http, grpc), encoding (e.g. json, protobuf, protobuf+gzip) | OTLP export requests received. |
rsigma_otlp_log_records_total | counter | β | Log records ingested via OTLP. |
rsigma_otlp_errors_total | counter | transport, reason (unsupported_content_type, decompression, decode, channel_closed) | OTLP request errors. |
TLS (2 metrics)π
Exposed when the daemon is built with daemon-tls. Both metrics render with their # HELP and # TYPE lines as soon as TLS is configured, even before the first handshake.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_tls_certificate_expiry_seconds | gauge | β | Seconds until the active TLS server certificate's not_after. Signed: negative once expired. Updated at startup and after every successful SIGHUP-triggered reload. |
rsigma_tls_active_connections | gauge | β | Currently active TLS-terminated connections on the API listener. Decrements on connection close (including handshake failure). |
Field observability (3 metrics)π
Exposed unconditionally; values stay at zero unless the daemon was started with --observe-fields. All three refresh on every /metrics scrape and after every successful /api/v1/fields/* call. See HTTP API: Field observability for the matching endpoints.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_fields_observed_total | counter | β | Total events scanned by the opt-in field observer. Advances regardless of whether the event had structured fields. |
rsigma_fields_observer_unique_keys | gauge | β | Distinct field names currently tracked. Saturates at --observe-fields-max-keys (default 10000). |
rsigma_fields_observer_overflow_dropped_total | counter | β | New-key insert attempts dropped because the observer was at capacity. A persistent positive rate signals that --observe-fields-max-keys is too low for the deployment. |
Schema observability (4 metrics)π
Exposed unconditionally; values stay at zero unless the daemon was started with --observe-schemas (or --discover-schemas, which implies it). All refresh on every /metrics scrape and on every GET /api/v1/schemas call. See HTTP API: Schema observability for the matching endpoint.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_events_by_schema_total | counter | schema | Events classified into each recognized schema (ecs, sysmon, windows_eventlog, cef, ocsf, generic_json, or a user-defined name). |
rsigma_events_unknown_schema_total | counter | β | Events that matched no schema signature. A rising rate signals a source whose schema RSigma does not recognize; add a signature with --schema-config. |
rsigma_events_ambiguous_schema_total | counter | β | Events where two different-name signatures tied at the winning specificity, so the name tie-break decided routing. Resolve by giving one signature a distinguishing predicate or a higher specificity. |
rsigma_unknown_schema_clusters | gauge | β | Distinct clusters of unrecognized event shapes that schema discovery would propose a signature for. Zero unless the daemon was started with --discover-schemas; drives the GET /api/v1/schemas/suggestions endpoint. |
Logsource-aware evaluation (4 metrics)π
Exposed unconditionally; values stay at zero unless the daemon was started with --logsource-routing. All refresh on every /metrics scrape. See Logsource-Aware Evaluation.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_rules_pruned_by_logsource_total | counter | β | Always-evaluated rules skipped because their product conflicts with the event's logsource. The scaling signal: it rises with the conflicting fraction of a mixed-product ruleset. |
rsigma_events_without_logsource_total | counter | β | Events with no extractable logsource, evaluated against every rule (fail-open). A high rate means events are not carrying a logsource tag and no static override or field map is set. |
rsigma_schema_rules_eligible | gauge | schema | Rules a schema's events evaluate after logsource pruning. Set when both schema routing and logsource routing are active; refreshed on scrape and reload. |
rsigma_schema_rules_pruned | gauge | schema | Rules pruned for a schema by its implied logsource. The higher this is relative to eligible, the less of the ruleset that schema exercises. |
Live event tap (4 metrics)π
Exposed unconditionally; values stay at zero unless the tap is enabled (daemon.tap.enabled: true) and an operator opens a session. See HTTP API: Live event tap and rsigma engine tap.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_tap_sessions_total | counter | β | Total tap sessions opened over the daemon's lifetime. |
rsigma_tap_active_sessions | gauge | β | Currently active tap sessions. Bounded by daemon.tap.max_sessions. |
rsigma_tap_events_streamed_total | counter | β | Events streamed to tap clients. |
rsigma_tap_events_dropped_total | counter | β | Events dropped from a tap (a full per-session buffer, or an unparseable line in a redacting raw capture). A positive rate means captured fixtures have gaps. |
Live detection tail (2 metrics)π
Exposed unconditionally; values stay at zero unless the tail is enabled (daemon.tail.enabled: true) and an operator opens a session. See HTTP API: Live detection tail and rsigma engine tail.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_tail_active_sessions | gauge | β | Currently active tail sessions. Bounded by daemon.tail.max_sessions. |
rsigma_tail_detections_dropped_total | counter | β | Detections dropped from a tail because a session buffer was full. A positive rate means a tail client could not keep up. |
Triage feedback loop (4 metrics)π
Exposed when the triage feedback loop is enabled (daemon.dispositions.enabled: true or --enable-dispositions). The ingest counters pre-register their fixed label sets so they render with zeroed series on the first scrape; rsigma_rule_false_positive_ratio is absent for a rule until it reaches daemon.dispositions.min_sample. See the Triage Feedback Loop guide and HTTP API: Dispositions.
| Metric | Type | Labels | Description |
|---|---|---|---|
rsigma_rule_false_positive_ratio | gauge | rule_title | Per-rule false-positive ratio over the rolling window. Absent until the rule reaches min_sample dispositions. |
rsigma_dispositions_total | counter | rule_title, verdict | Analyst dispositions counted, by rule and verdict (true_positive, false_positive, benign_true_positive). |
rsigma_disposition_ingest_total | counter | source, result | Ingest outcomes by source (api, file, http, nats) and result (accepted, duplicate, rejected). |
rsigma_disposition_ingest_errors_total | counter | reason | Ingest errors by reason (parse, validation). |
Scrape configurationπ
Minimum Prometheus scrape config:
scrape_configs:
- job_name: rsigma
scrape_interval: 15s
static_configs:
- targets: ['rsigma.internal:9090']
15-30 s intervals are reasonable. The histograms use the default prometheus bucket boundaries; alert on the _bucket{le="..."} quantiles you care about rather than the average, which becomes meaningless under bimodal latency.
Useful alertsπ
groups:
- name: rsigma
rules:
# Engine cannot keep up.
- alert: RsigmaBackPressure
expr: rate(rsigma_back_pressure_events_total[5m]) > 0
for: 10m
labels: {severity: warning}
# Correlation state above 80% of the default 100000 cap.
- alert: RsigmaCorrelationStatePressure
expr: rsigma_correlation_state_entries > 80000
for: 10m
labels: {severity: warning}
# DLQ taking traffic.
- alert: RsigmaDlqVolume
expr: rate(rsigma_dlq_events_total[5m]) > 1
for: 15m
labels: {severity: warning}
# Reloads failing means rules are broken on disk.
- alert: RsigmaReloadsFailing
expr: rate(rsigma_reloads_failed_total[5m]) > 0
for: 10m
labels: {severity: critical}
# Dynamic source went stale (no successful resolve in 10 minutes).
- alert: RsigmaSourceStale
expr: time() - rsigma_source_last_resolved_timestamp > 600
for: 5m
labels: {severity: warning}
# Enricher consistently failing (timeouts or fetch errors).
- alert: RsigmaEnrichmentFailing
expr: |
sum by (enricher_id) (
rate(rsigma_enrichment_total{status=~"error|timeout"}[5m])
) > 1
for: 10m
labels: {severity: warning}
# TLS certificate expires within 14 days.
- alert: RsigmaTlsCertExpiring
expr: rsigma_tls_certificate_expiry_seconds < 14 * 86400
for: 5m
labels: {severity: warning}
# TLS certificate has already expired.
- alert: RsigmaTlsCertExpired
expr: rsigma_tls_certificate_expiry_seconds < 0
for: 1m
labels: {severity: critical}
Histograms: bucket guidanceπ
| Metric | Typical p50 | Typical p99 | Notes |
|---|---|---|---|
rsigma_event_processing_seconds | 1-30 Β΅s | < 1 ms | Per-event evaluation against the loaded rule set. Spikes correlate with reload events. |
rsigma_pipeline_latency_seconds | 1-100 Β΅s | < 5 ms | End-to-end from event dequeue to sink send. Dominated by sink latency (file vs NATS). |
rsigma_batch_size | 1 | 1 | Default --batch-size 1. With --batch-size 64 and load, p50 trends toward 64. |
event_processing_seconds p99 above 5 ms is usually a sign of misuse (regex-heavy rules without --cross-rule-ac, or many |all modifiers).
See alsoπ
- Observability for the broader observability story, including the
tracingevent targets that complement these metrics. - Performance Tuning for which metric to watch when sizing
--buffer-size,--batch-size, or correlationmax_state_entries. - Streaming Detection for how the
/metricsendpoint fits into the broader daemon API. daemon/metricssource for the registry implementation.