Prometheus Metrics

The engine daemon exposes Prometheus metrics on GET /metrics, served on the same --api-addr as the REST API. Under --all-features (which is how the prebuilt release archives and the GHCR Docker image are built), the catalogue documented on this page is 84 metric names across fourteen sections, of which the OTLP (3) and TLS (2) families are feature-gated on daemon-otlp and daemon-tls respectively. At runtime, lazily registered series appear only once they have fired in the current process. Other families, such as the three field-observer metrics, always render their # HELP/# TYPE lines (and stay at zero unless --observe-fields was passed). The registration behaviour of each family is documented per section below.

The exact source of truth is the daemon/metrics module.

Engine core (23 metrics)

These always show up. They cover ingest, matches, queue depth, back-pressure, reloads, resource usage, sink delivery, and webhook requests.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_events_processed_total | counter | - | Total events processed by the engine. |
| rsigma_events_parse_errors_total | counter | - | JSON or log-format parse errors at the source. |
| rsigma_detection_matches_total | counter | - | Total detection matches emitted. |
| rsigma_correlation_matches_total | counter | - | Total correlation matches emitted. |
| rsigma_detection_rules_loaded | gauge | - | Number of detection rules currently loaded. |
| rsigma_correlation_rules_loaded | gauge | - | Number of correlation rules currently loaded. |
| rsigma_correlation_state_entries | gauge | - | Active entries in the correlation state. Watch versus the max_state_entries cap (default 100000). |
| rsigma_reloads_total | counter | - | Total reload attempts (file watcher, SIGHUP, POST /api/v1/reload). |
| rsigma_reloads_failed_total | counter | - | Reload attempts that produced parse or compile errors. |
| rsigma_uptime_seconds | gauge | - | Daemon uptime in seconds. |
| rsigma_input_queue_depth | gauge | - | Events currently buffered in the source→engine channel. Tracked for every input, including the HTTP and OTLP push receivers. |
| rsigma_output_queue_depth | gauge | - | Results currently buffered in the engine→sink channel. |
| rsigma_back_pressure_events_total | counter | - | Times a source was blocked on a full event channel. |
| rsigma_event_processing_seconds | histogram | - | Per-event processing latency. |
| rsigma_pipeline_latency_seconds | histogram | - | End-to-end latency from event dequeue to sink send. |
| rsigma_batch_size | histogram | - | Number of events processed per batch. |
| rsigma_dlq_events_total | counter | - | Events routed to the dead-letter queue. |
| rsigma_sink_queue_depth | gauge | sink | Results buffered in each sink's delivery queue. |
| rsigma_sink_retries_total | counter | sink | Sink delivery retries after a retryable failure. |
| rsigma_sink_dropped_total | counter | sink | Results dropped because a lossy sink's queue was full (?on_full=drop). |
| rsigma_sink_delivery_failures_total | counter | sink | Sink deliveries that exhausted retries and were routed to the DLQ. |
| rsigma_webhook_requests_total | counter | webhook_id, outcome (success, permanent_failure, rate_limited_wait) | Webhook requests by outcome. Queue depth, retries, drops, and DLQ routing are read from the shared per-sink series above, keyed by sink=<webhook id> (one-to-one with webhook_id). |
| rsigma_webhook_request_duration_seconds | histogram | webhook_id | Per-webhook HTTP request latency. |
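
The per-sink and webhook series lend themselves to delivery alerts. The following is a minimal sketch rather than a recommended rule set: the alert names, the `for` windows, and the "any non-zero rate" thresholds are assumptions to adapt to your deployment.

```yaml
groups:
  - name: rsigma-sinks
    rules:
      # A lossy sink (?on_full=drop) is discarding results.
      - alert: RsigmaSinkDropping
        expr: sum by (sink) (rate(rsigma_sink_dropped_total[5m])) > 0
        for: 10m  # assumed window
        labels: {severity: warning}

      # Deliveries exhausted their retries and were routed to the DLQ.
      - alert: RsigmaSinkDeliveryFailures
        expr: sum by (sink) (rate(rsigma_sink_delivery_failures_total[5m])) > 0
        for: 10m  # assumed window
        labels: {severity: warning}

      # A webhook endpoint is failing permanently.
      - alert: RsigmaWebhookPermanentFailures
        expr: |
          sum by (webhook_id) (
            rate(rsigma_webhook_requests_total{outcome="permanent_failure"}[5m])
          ) > 0
        for: 10m  # assumed window
        labels: {severity: warning}
```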

Per-rule labels (2 metrics)

These counters carry labels that identify which rule fired. They surface on /metrics only after the first match for that kind.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_detection_matches_by_rule_total | counter | rule_title, level | Detection matches per rule. |
| rsigma_correlation_matches_by_rule_total | counter | rule_title, level, correlation_type | Correlation matches per rule (correlation_type is event_count, value_count, temporal, temporal_ordered, value_sum, value_avg, value_percentile, or value_median). |

rule_title is not guaranteed to be unique in a rule set. If two rules share a title, their counters add together. For collision-free per-rule analytics, scrape rsigma_detection_matches_total and join against your detection NDJSON stream by rule_id outside Prometheus.

These two families feed the silence and noisy signals of rule hygiene (and the production-volume column of rule scorecard): pass a saved scrape or a live endpoint as --metrics and they join per-rule by rule_title.
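
Inside Prometheus itself, the same per-rule counters can be pre-aggregated into match rates. This is a sketch only: the recording-rule names and the one-hour window are assumptions, and rules that share a rule_title still collapse into one series, as noted above.

```yaml
groups:
  - name: rsigma-rule-volume
    rules:
      # Per-rule detection match rate over the last hour.
      - record: rule_title:rsigma_detection_matches:rate1h
        expr: |
          sum by (rule_title, level) (
            rate(rsigma_detection_matches_by_rule_total[1h])
          )

      # Per-rule correlation match rate, keeping the correlation type.
      - record: rule_title:rsigma_correlation_matches:rate1h
        expr: |
          sum by (rule_title, level, correlation_type) (
            rate(rsigma_correlation_matches_by_rule_total[1h])
          )
```

An ad-hoc query such as topk(10, rule_title:rsigma_detection_matches:rate1h) then lists the noisiest rules.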

Dynamic pipeline sources (5 metrics)

Exposed when one or more pipelines declare dynamic sources. The labelled counters surface after the first resolve attempt for the relevant source; rsigma_source_cache_hits_total and rsigma_source_resolve_seconds are global (no source_id label).

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_source_resolves_total | counter | source_id, source_type (file, http, command, nats) | Total dynamic source resolution attempts. Counts every attempt, successful or not. |
| rsigma_source_resolve_errors_total | counter | source_id, error_kind (Fetch, Parse, Extract, Timeout, ResourceLimit) | Failed dynamic source resolutions. |
| rsigma_source_resolve_seconds | histogram | - | Dynamic source resolution latency. Aggregated across all sources. |
| rsigma_source_cache_hits_total | counter | - | Times cached source data was served on resolution failure. Aggregated across all sources. |
| rsigma_source_last_resolved_timestamp | gauge | source_id | Unix timestamp of the last successful resolution per source. Alert on staleness. |

The error_kind values come from rsigma_runtime::sources::SourceErrorKind. Fetch covers HTTP / file / command / NATS connect-or-read failures (per-protocol details land in the error_message log field, not the label). ResourceLimit covers the 10 MiB body cap, 30 s command exec cap, and similar.
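
Because rsigma_source_resolves_total counts every attempt, dividing the error counter by it gives a per-source failure ratio. A minimal sketch, assuming a 50% threshold and 15-minute windows (both illustrative):

```yaml
groups:
  - name: rsigma-sources
    rules:
      # More than half of the resolve attempts for a source are failing.
      - alert: RsigmaSourceResolveErrorRatio
        expr: |
          sum by (source_id) (rate(rsigma_source_resolve_errors_total[15m]))
            /
          sum by (source_id) (rate(rsigma_source_resolves_total[15m]))
            > 0.5  # assumed threshold
        for: 15m
        labels: {severity: warning}

      # Cached data is being served because fresh resolution failed.
      - alert: RsigmaSourceServingCachedData
        expr: rate(rsigma_source_cache_hits_total[15m]) > 0
        for: 30m  # assumed window
        labels: {severity: info}
```

A source with no attempts in the window yields no ratio sample, so the first alert stays silent for idle sources.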

Enrichment (6 metrics)

Exposed when the daemon is built with the daemon feature and --enrichers is passed. Every (enricher_id, kind, status) triple and every HTTP-cache enricher_id row is pre-registered at startup, so all six families render with their # HELP / # TYPE lines and zeroed counters on the first scrape, even before any event has fired. Filtered (kind- or scope-mismatched) enricher calls do not increment any counter, so cardinality stays bounded by the number of configured enrichers.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_enrichment_total | counter | enricher_id, kind (detection, correlation), status (success, skip, error, timeout, drop) | Per-call outcome counter. kind is the enricher's declared kind (the YAML kind: field), not a per-result discriminator. |
| rsigma_enrichment_duration_seconds | histogram | enricher_id, kind | Per-enricher latency. Buckets target both fast template calls and slower http/command invocations. |
| rsigma_enrichment_queue_depth | gauge | - | Pending enrichment calls (sum across both kinds). Watch this versus max_concurrent_enrichments. |
| rsigma_enrichment_http_cache_hits_total | counter | enricher_id | HTTP enricher response-cache hits. Mandatory signal for any rate-limited API recipe. |
| rsigma_enrichment_http_cache_misses_total | counter | enricher_id | HTTP enricher response-cache misses. |
| rsigma_enrichment_http_cache_expirations_total | counter | enricher_id | HTTP enricher response-cache entries evicted on expiry. |

The kind label is carried even though enricher_id typically already encodes it (asset_lookup_det vs asset_lookup_corr), so dashboards can compute sum by (kind) without depending on a naming convention.
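
As an illustration of that sum by (kind) aggregation, plus a cache hit ratio built from the hit and miss counters, consider the sketch below. The recording-rule names and five-minute windows are assumptions.

```yaml
groups:
  - name: rsigma-enrichment
    rules:
      # Enrichment outcomes per declared kind, independent of enricher naming.
      - record: kind:rsigma_enrichment:rate5m
        expr: sum by (kind, status) (rate(rsigma_enrichment_total[5m]))

      # HTTP response-cache hit ratio per enricher: hits / (hits + misses).
      - record: enricher_id:rsigma_enrichment_http_cache_hit_ratio:rate5m
        expr: |
          sum by (enricher_id) (rate(rsigma_enrichment_http_cache_hits_total[5m]))
            /
          (
            sum by (enricher_id) (rate(rsigma_enrichment_http_cache_hits_total[5m]))
            +
            sum by (enricher_id) (rate(rsigma_enrichment_http_cache_misses_total[5m]))
          )
```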

Alert pipeline (13 metrics)

Exposed when the daemon is built with the daemon feature. The following are pre-registered at startup, so they render with their # HELP / # TYPE lines and zeroed series on the first scrape, even before --alert-pipeline is passed or any event fires: the fixed label sets on rsigma_dedup_results_total, rsigma_incidents_emitted_total, and rsigma_incident_overmerge_total, plus the rsigma_dedup_store_entries, rsigma_incidents_open, rsigma_silences_active, and rsigma_inhibit_sources_active gauges. The rsigma_inhibited_total{rule} series appears once a rule first inhibits. See the Alert Pipeline guide.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_dedup_results_total | counter | action (emitted, folded, repeat, resolved) | Dedup outcomes: first fires emitted, duplicates folded, repeat re-emits, and resolved records. |
| rsigma_dedup_store_entries | gauge | - | Active dedup alerts currently tracked. |
| rsigma_dedup_evictions_total | counter | - | Active alerts evicted after resolving. |
| rsigma_dedup_summaries_emitted_total | counter | - | Dedup summary records emitted (repeat re-emits plus resolved records). |
| rsigma_incidents_open | gauge | - | Open incidents currently tracked by the grouping stage. |
| rsigma_incidents_emitted_total | counter | trigger (group_wait, group_interval, repeat, resolved) | Incident emissions by trigger. |
| rsigma_incident_results_total | counter | - | Total incident records emitted. |
| rsigma_incident_overmerge_total | counter | guard (stop_value, cardinality_ceiling) | Entity-graph guard hits that suppressed a join. |
| rsigma_silenced_total | counter | - | Results muted by an active silence. |
| rsigma_silences_active | gauge | - | Currently-active silences. |
| rsigma_inhibited_total | counter | rule | Results muted by an inhibition rule, by rule name. |
| rsigma_inhibit_sources_active | gauge | - | Currently-active inhibition sources. |
| rsigma_alert_pipeline_duration_seconds | histogram | - | Alert-pipeline stage duration in seconds. |
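
Two signals that can be derived from this table are the share of results folded by dedup and whether the over-merge guards are firing. A minimal sketch, with assumed rule names, windows, and thresholds:

```yaml
groups:
  - name: rsigma-alert-pipeline
    rules:
      # Fraction of dedup outcomes that were folded as duplicates.
      - record: rsigma:dedup_folded_ratio:rate5m
        expr: |
          sum(rate(rsigma_dedup_results_total{action="folded"}[5m]))
            /
          sum(rate(rsigma_dedup_results_total[5m]))

      # An entity-graph guard keeps suppressing joins.
      - alert: RsigmaIncidentOvermergeGuard
        expr: sum by (guard) (rate(rsigma_incident_overmerge_total[15m])) > 0
        for: 30m  # assumed window
        labels: {severity: info}
```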

Risk-based alerting (9 metrics)

Exposed when the daemon is built with the daemon feature. The fixed label sets on rsigma_risk_annotations_total and rsigma_risk_incidents_emitted_total, plus the rsigma_risk_entities_open and rsigma_risk_state_entries gauges, are pre-registered at startup, so they render with their # HELP / # TYPE lines and zeroed series on the first scrape, even before --risk is passed or any event fires. See the Risk-Based Alerting guide.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_risk_annotations_total | counter | action (scored, no_entity, skipped) | Risk-annotation outcomes: scored with entities, scored with no entity, or skipped (out of scope). |
| rsigma_risk_annotation_score | histogram | - | Distribution of the per-detection resolved risk score. |
| rsigma_risk_objects_total | counter | - | Risk objects extracted from firing detections. |
| rsigma_risk_entities_open | gauge | - | Entities currently tracked by the risk accumulator. |
| rsigma_risk_state_entries | gauge | - | Risk contributions currently retained across all entities. |
| rsigma_risk_evictions_total | counter | - | Entities dropped from the accumulator (store full or aged out). |
| rsigma_risk_incidents_emitted_total | counter | trigger (score, tactic_count) | Risk incidents emitted by trigger. |
| rsigma_risk_incident_results_total | counter | - | Total risk incident records emitted. |
| rsigma_risk_layer_duration_seconds | histogram | - | Risk-layer stage duration in seconds. |
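
To see how annotation outcomes split between scored, no_entity, and skipped, the action label can be turned into per-action shares. A sketch, assuming the rule name and a 15-minute window:

```yaml
groups:
  - name: rsigma-risk
    rules:
      # Share of each annotation outcome among all risk annotations.
      - record: action:rsigma_risk_annotations_share:rate15m
        expr: |
          sum by (action) (rate(rsigma_risk_annotations_total[15m]))
            / ignoring (action) group_left
          sum(rate(rsigma_risk_annotations_total[15m]))
```

The shares sum to 1 across the three action values whenever any annotation occurred in the window.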

OTLP (3 metrics)

Exposed when the daemon is built with daemon-otlp and an OTLP receiver is active. The labelled counters surface after the first request of that kind.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_otlp_requests_total | counter | transport (http, grpc), encoding (e.g. json, protobuf, protobuf+gzip) | OTLP export requests received. |
| rsigma_otlp_log_records_total | counter | - | Log records ingested via OTLP. |
| rsigma_otlp_errors_total | counter | transport, reason (unsupported_content_type, decompression, decode, channel_closed) | OTLP request errors. |
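
A sustained OTLP error rate usually deserves attention, and the reason label says where to look. A minimal sketch, where the alert name, window, and "any non-zero rate" threshold are assumptions:

```yaml
groups:
  - name: rsigma-otlp
    rules:
      # OTLP requests are being rejected; the reason label identifies the cause.
      - alert: RsigmaOtlpErrors
        expr: sum by (transport, reason) (rate(rsigma_otlp_errors_total[5m])) > 0
        for: 10m  # assumed window
        labels: {severity: warning}
```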

TLS (2 metrics)

Exposed when the daemon is built with daemon-tls. Both metrics render with their # HELP and # TYPE lines as soon as TLS is configured, even before the first handshake.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_tls_certificate_expiry_seconds | gauge | - | Seconds until the active TLS server certificate's not_after. Signed: negative once expired. Updated at startup and after every successful SIGHUP-triggered reload. |
| rsigma_tls_active_connections | gauge | - | Currently active TLS-terminated connections on the API listener. Decrements on connection close (including handshake failure). |

Field observability (3 metrics)

Exposed unconditionally; values stay at zero unless the daemon was started with --observe-fields. All three refresh on every /metrics scrape and after every successful /api/v1/fields/* call. See HTTP API: Field observability for the matching endpoints.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_fields_observed_total | counter | - | Total events scanned by the opt-in field observer. Advances regardless of whether the event had structured fields. |
| rsigma_fields_observer_unique_keys | gauge | - | Distinct field names currently tracked. Saturates at --observe-fields-max-keys (default 10000). |
| rsigma_fields_observer_overflow_dropped_total | counter | - | New-key insert attempts dropped because the observer was at capacity. A persistent positive rate signals that --observe-fields-max-keys is too low for the deployment. |
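
The "persistent positive rate" condition on the overflow counter translates directly into an alert. A sketch, with an assumed alert name and windows:

```yaml
groups:
  - name: rsigma-field-observer
    rules:
      # The observer is at capacity and keeps rejecting new keys.
      - alert: RsigmaFieldObserverOverflow
        expr: rate(rsigma_fields_observer_overflow_dropped_total[15m]) > 0
        for: 30m  # assumed definition of "persistent"
        labels: {severity: info}
```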

Schema observability (4 metrics)

Exposed unconditionally; values stay at zero unless the daemon was started with --observe-schemas (or --discover-schemas, which implies it). All refresh on every /metrics scrape and on every GET /api/v1/schemas call. See HTTP API: Schema observability for the matching endpoint.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_events_by_schema_total | counter | schema | Events classified into each recognized schema (ecs, sysmon, windows_eventlog, cef, ocsf, generic_json, or a user-defined name). |
| rsigma_events_unknown_schema_total | counter | - | Events that matched no schema signature. A rising rate signals a source whose schema RSigma does not recognize; add a signature with --schema-config. |
| rsigma_events_ambiguous_schema_total | counter | - | Events where two different-name signatures tied at the winning specificity, so the name tie-break decided routing. Resolve by giving one signature a distinguishing predicate or a higher specificity. |
| rsigma_unknown_schema_clusters | gauge | - | Distinct clusters of unrecognized event shapes that schema discovery would propose a signature for. Zero unless the daemon was started with --discover-schemas; drives the GET /api/v1/schemas/suggestions endpoint. |
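
One way to watch for unrecognized sources is to compare the unknown-schema rate with overall throughput. The sketch below assumes that every event counted in rsigma_events_unknown_schema_total is also counted in rsigma_events_processed_total, which this page does not state; the 10% threshold and windows are likewise illustrative.

```yaml
groups:
  - name: rsigma-schemas
    rules:
      # A sizeable share of events match no schema signature.
      # Assumption: unknown-schema events are a subset of processed events.
      - alert: RsigmaUnknownSchemaShare
        expr: |
          rate(rsigma_events_unknown_schema_total[15m])
            /
          rate(rsigma_events_processed_total[15m])
            > 0.1  # assumed threshold
        for: 30m
        labels: {severity: info}
```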

Logsource-aware evaluation (4 metrics)

Exposed unconditionally; values stay at zero unless the daemon was started with --logsource-routing. All refresh on every /metrics scrape. See Logsource-Aware Evaluation.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_rules_pruned_by_logsource_total | counter | - | Always-evaluated rules skipped because their product conflicts with the event's logsource. The scaling signal: it rises with the conflicting fraction of a mixed-product ruleset. |
| rsigma_events_without_logsource_total | counter | - | Events with no extractable logsource, evaluated against every rule (fail-open). A high rate means events are not carrying a logsource tag and no static override or field map is set. |
| rsigma_schema_rules_eligible | gauge | schema | Rules a schema's events evaluate after logsource pruning. Set when both schema routing and logsource routing are active; refreshed on scrape and reload. |
| rsigma_schema_rules_pruned | gauge | schema | Rules pruned for a schema by its implied logsource. The higher this is relative to eligible, the less of the ruleset that schema exercises. |
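
The "pruned relative to eligible" comparison can be expressed as a per-schema fraction. A sketch, assuming the recording-rule name shown:

```yaml
groups:
  - name: rsigma-logsource
    rules:
      # Fraction of the ruleset pruned for each schema: pruned / (pruned + eligible).
      - record: schema:rsigma_rules_pruned_fraction
        expr: |
          rsigma_schema_rules_pruned
            /
          (rsigma_schema_rules_pruned + rsigma_schema_rules_eligible)
```

A value near 1 means the schema's events exercise only a small slice of the loaded rules.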

Live event tap (4 metrics)

Exposed unconditionally; values stay at zero unless the tap is enabled (daemon.tap.enabled: true) and an operator opens a session. See HTTP API: Live event tap and rsigma engine tap.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_tap_sessions_total | counter | - | Total tap sessions opened over the daemon's lifetime. |
| rsigma_tap_active_sessions | gauge | - | Currently active tap sessions. Bounded by daemon.tap.max_sessions. |
| rsigma_tap_events_streamed_total | counter | - | Events streamed to tap clients. |
| rsigma_tap_events_dropped_total | counter | - | Events dropped from a tap (a full per-session buffer, or an unparseable line in a redacting raw capture). A positive rate means captured fixtures have gaps. |

Live detection tail (2 metrics)

Exposed unconditionally; values stay at zero unless the tail is enabled (daemon.tail.enabled: true) and an operator opens a session. See HTTP API: Live detection tail and rsigma engine tail.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_tail_active_sessions | gauge | - | Currently active tail sessions. Bounded by daemon.tail.max_sessions. |
| rsigma_tail_detections_dropped_total | counter | - | Detections dropped from a tail because a session buffer was full. A positive rate means a tail client could not keep up. |
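
Both live-session features (tap and tail) expose a drop counter whose positive rate means a session is losing data. A minimal sketch of matching alerts, with assumed names and windows:

```yaml
groups:
  - name: rsigma-live-sessions
    rules:
      # Tap captures have gaps.
      - alert: RsigmaTapDropping
        expr: rate(rsigma_tap_events_dropped_total[5m]) > 0
        for: 5m  # assumed window
        labels: {severity: info}

      # A tail client cannot keep up with the detection stream.
      - alert: RsigmaTailDropping
        expr: rate(rsigma_tail_detections_dropped_total[5m]) > 0
        for: 5m  # assumed window
        labels: {severity: info}
```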

Triage feedback loop (4 metrics)

Exposed when the triage feedback loop is enabled (daemon.dispositions.enabled: true or --enable-dispositions). The ingest counters pre-register their fixed label sets so they render with zeroed series on the first scrape; rsigma_rule_false_positive_ratio is absent for a rule until it reaches daemon.dispositions.min_sample. See the Triage Feedback Loop guide and HTTP API: Dispositions.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_rule_false_positive_ratio | gauge | rule_title | Per-rule false-positive ratio over the rolling window. Absent until the rule reaches min_sample dispositions. |
| rsigma_dispositions_total | counter | rule_title, verdict | Analyst dispositions counted, by rule and verdict (true_positive, false_positive, benign_true_positive). |
| rsigma_disposition_ingest_total | counter | source, result | Ingest outcomes by source (api, file, http, nats) and result (accepted, duplicate, rejected). |
| rsigma_disposition_ingest_errors_total | counter | reason | Ingest errors by reason (parse, validation). |
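
The false-positive gauge can flag rules that analysts keep dismissing. The sketch below assumes the ratio is reported on a 0 to 1 scale (not a percentage), and the 0.8 threshold and one-hour window are illustrative.

```yaml
groups:
  - name: rsigma-triage
    rules:
      # Most recent dispositions for this rule were false positives.
      # Assumption: the gauge ranges from 0 to 1.
      - alert: RsigmaRuleMostlyFalsePositive
        expr: rsigma_rule_false_positive_ratio > 0.8
        for: 1h
        labels: {severity: info}

      # Disposition ingest is hitting parse or validation errors.
      - alert: RsigmaDispositionIngestErrors
        expr: sum by (reason) (rate(rsigma_disposition_ingest_errors_total[15m])) > 0
        for: 15m
        labels: {severity: warning}
```

Because the gauge is absent below min_sample, the first alert cannot fire for rules with too few dispositions.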

Scrape configuration

Minimum Prometheus scrape config:

scrape_configs:
  - job_name: rsigma
    scrape_interval: 15s
    static_configs:
      - targets: ['rsigma.internal:9090']

Scrape intervals of 15-30 s are reasonable. The histograms use the default Prometheus bucket boundaries; alert on quantiles derived from the _bucket{le="..."} series rather than on the average, which becomes meaningless under bimodal latency.
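
Quantiles are computed from the _bucket series with histogram_quantile. A sketch of recording rules for the two latency histograms, assuming the rule names and five-minute rate window shown:

```yaml
groups:
  - name: rsigma-latency
    rules:
      # p99 per-event processing latency over the last 5 minutes.
      - record: rsigma:event_processing_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(rsigma_event_processing_seconds_bucket[5m]))
          )

      # p99 end-to-end pipeline latency over the last 5 minutes.
      - record: rsigma:pipeline_latency_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(rsigma_pipeline_latency_seconds_bucket[5m]))
          )
```

The estimate is only as fine as the bucket boundaries: a quantile that falls inside one wide bucket is interpolated linearly within it.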

Useful alerts

groups:
  - name: rsigma
    rules:
      # Engine cannot keep up.
      - alert: RsigmaBackPressure
        expr: rate(rsigma_back_pressure_events_total[5m]) > 0
        for: 10m
        labels: {severity: warning}

      # Correlation state above 80% of the default 100000 cap.
      - alert: RsigmaCorrelationStatePressure
        expr: rsigma_correlation_state_entries > 80000
        for: 10m
        labels: {severity: warning}

      # DLQ taking traffic.
      - alert: RsigmaDlqVolume
        expr: rate(rsigma_dlq_events_total[5m]) > 1
        for: 15m
        labels: {severity: warning}

      # Reloads failing means rules are broken on disk.
      - alert: RsigmaReloadsFailing
        expr: rate(rsigma_reloads_failed_total[5m]) > 0
        for: 10m
        labels: {severity: critical}

      # Dynamic source went stale (no successful resolve in 10 minutes).
      - alert: RsigmaSourceStale
        expr: time() - rsigma_source_last_resolved_timestamp > 600
        for: 5m
        labels: {severity: warning}

      # Enricher consistently failing (timeouts or fetch errors).
      - alert: RsigmaEnrichmentFailing
        expr: |
          sum by (enricher_id) (
            rate(rsigma_enrichment_total{status=~"error|timeout"}[5m])
          ) > 1
        for: 10m
        labels: {severity: warning}

      # TLS certificate expires within 14 days.
      - alert: RsigmaTlsCertExpiring
        expr: rsigma_tls_certificate_expiry_seconds < 14 * 86400
        for: 5m
        labels: {severity: warning}

      # TLS certificate has already expired.
      - alert: RsigmaTlsCertExpired
        expr: rsigma_tls_certificate_expiry_seconds < 0
        for: 1m
        labels: {severity: critical}

Histograms: bucket guidance

| Metric | Typical p50 | Typical p99 | Notes |
| --- | --- | --- | --- |
| rsigma_event_processing_seconds | 1-30 µs | < 1 ms | Per-event evaluation against the loaded rule set. Spikes correlate with reload events. |
| rsigma_pipeline_latency_seconds | 1-100 µs | < 5 ms | End-to-end from event dequeue to sink send. Dominated by sink latency (file vs NATS). |
| rsigma_batch_size | 1 | 1 | Default --batch-size 1. With --batch-size 64 and load, p50 trends toward 64. |

An rsigma_event_processing_seconds p99 above 5 ms is usually a sign of misuse (regex-heavy rules without --cross-rule-ac, or many |all modifiers).
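
That 5 ms guideline can be turned into an alert on the bucket series. A minimal sketch, with an assumed alert name and windows:

```yaml
groups:
  - name: rsigma-latency-alerts
    rules:
      # p99 per-event processing latency above 5 ms (0.005 s).
      - alert: RsigmaEventProcessingSlow
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(rsigma_event_processing_seconds_bucket[5m]))
          ) > 0.005
        for: 15m  # assumed window
        labels: {severity: warning}
```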

See also

  • Observability for the broader observability story, including the tracing event targets that complement these metrics.
  • Performance Tuning for which metric to watch when sizing --buffer-size, --batch-size, or correlation max_state_entries.
  • Streaming Detection for how the /metrics endpoint fits into the broader daemon API.
  • daemon/metrics source for the registry implementation.