Prometheus Metrics

The engine daemon exposes Prometheus metrics on GET /metrics on the same --api-addr as the REST API. The full definition catalogue is 27 metric names across three concerns; the runtime exposes the ones that have ever fired in a given process. A startup scrape shows 21 names by default (one of the per-rule counters surfaces immediately because the registry pre-creates it for documentation); the remaining six lazy metrics register on first use of dynamic pipelines or OTLP.

The exact source of truth is the daemon/metrics module.
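To check which metric families a given process currently exposes, one option is to count the `# TYPE` lines in a scrape. The sketch below is a minimal illustration, not part of the daemon: the function name `metric_family_names` and the sample payload are made up for the example, and it assumes the standard Prometheus text exposition format.

```python
def metric_family_names(exposition_text: str) -> set[str]:
    """Return the metric family names declared by `# TYPE` lines."""
    names = set()
    for line in exposition_text.splitlines():
        parts = line.split()
        # A TYPE line has the shape: "# TYPE <name> <type>"
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return names


# Hypothetical scrape payload, trimmed to two families for illustration.
sample = """\
# HELP rsigma_events_processed_total Total events processed by the engine.
# TYPE rsigma_events_processed_total counter
rsigma_events_processed_total 42
# HELP rsigma_uptime_seconds Daemon uptime in seconds.
# TYPE rsigma_uptime_seconds gauge
rsigma_uptime_seconds 12.5
"""

names = metric_family_names(sample)
print(sorted(names))
# ['rsigma_events_processed_total', 'rsigma_uptime_seconds']
```

Against a live daemon, the same function can be fed the body of a GET /metrics response (for example via `urllib.request.urlopen`), and `len(names)` compared with the counts quoted above.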

Engine core (17 metrics)

These always show up. They cover ingest, matches, queue depth, back-pressure, reloads, and resource usage.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_events_processed_total | counter | — | Total events processed by the engine. |
| rsigma_events_parse_errors_total | counter | — | JSON or log-format parse errors at the source. |
| rsigma_detection_matches_total | counter | — | Total detection matches emitted. |
| rsigma_correlation_matches_total | counter | — | Total correlation matches emitted. |
| rsigma_detection_rules_loaded | gauge | — | Number of detection rules currently loaded. |
| rsigma_correlation_rules_loaded | gauge | — | Number of correlation rules currently loaded. |
| rsigma_correlation_state_entries | gauge | — | Active entries in the correlation state. Watch versus the max_state_entries cap (default 100000). |
| rsigma_reloads_total | counter | — | Total reload attempts (file watcher, SIGHUP, POST /api/v1/reload). |
| rsigma_reloads_failed_total | counter | — | Reload attempts that produced parse or compile errors. |
| rsigma_uptime_seconds | gauge | — | Daemon uptime in seconds. |
| rsigma_input_queue_depth | gauge | — | Events currently buffered in the source→engine channel. |
| rsigma_output_queue_depth | gauge | — | Results currently buffered in the engine→sink channel. |
| rsigma_back_pressure_events_total | counter | — | Times a source was blocked on a full event channel. |
| rsigma_event_processing_seconds | histogram | — | Per-event processing latency. |
| rsigma_pipeline_latency_seconds | histogram | — | End-to-end latency from event dequeue to sink send. |
| rsigma_batch_size | histogram | — | Number of events processed per batch. |
| rsigma_dlq_events_total | counter | — | Events routed to the dead-letter queue. |

Per-rule labels (2 metrics)

These counters carry labels that identify which rule fired. They surface on /metrics only after the first match for that kind.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_detection_matches_by_rule_total | counter | rule_title, level | Detection matches per rule. |
| rsigma_correlation_matches_by_rule_total | counter | rule_title, level, correlation_type | Correlation matches per rule (correlation_type is event_count, value_count, temporal, temporal_ordered, value_sum, value_avg, value_percentile, or value_median). |

rule_title is not guaranteed to be unique in a rule set. If two rules share a title, their counters add together. For collision-free per-rule analytics, scrape rsigma_detection_matches_total and join against your detection NDJSON stream by rule_id outside Prometheus.
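The out-of-Prometheus join can be as simple as counting NDJSON records by rule ID. The sketch below assumes each detection record is one JSON object per line carrying a `rule_id` key; that key name, the function `matches_by_rule_id`, and the sample records are assumptions for illustration, so adjust them to the actual output schema.

```python
import json
from collections import Counter


def matches_by_rule_id(ndjson_lines, id_field="rule_id"):
    """Count detection records per rule ID from an iterable of NDJSON lines."""
    counts = Counter()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        rule_id = record.get(id_field)
        if rule_id is not None:
            counts[rule_id] += 1
    return counts


# Two hypothetical rules that share a title but have distinct IDs.
sample_lines = [
    '{"rule_id": "a1", "rule_title": "Suspicious Login"}',
    '{"rule_id": "b2", "rule_title": "Suspicious Login"}',
    '{"rule_id": "a1", "rule_title": "Suspicious Login"}',
]

counts = matches_by_rule_id(sample_lines)
print(dict(counts))
# {'a1': 2, 'b2': 1}
```

The per-title counter would report 3 for "Suspicious Login" here, while the per-ID count keeps the two rules apart.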

Dynamic pipeline sources (5 metrics)

Exposed when one or more pipelines declare dynamic sources. The labelled counters surface after the first resolve attempt for the relevant source; rsigma_source_cache_hits_total and rsigma_source_resolve_seconds are global (no source_id label).

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_source_resolves_total | counter | source_id, source_type (file, http, command, nats) | Total dynamic source resolution attempts. Counts every attempt, successful or not. |
| rsigma_source_resolve_errors_total | counter | source_id, error_kind (Fetch, Parse, Extract, Timeout, ResourceLimit) | Failed dynamic source resolutions. |
| rsigma_source_resolve_seconds | histogram | — | Dynamic source resolution latency. Aggregated across all sources. |
| rsigma_source_cache_hits_total | counter | — | Times cached source data was served on resolution failure. Aggregated across all sources. |
| rsigma_source_last_resolved_timestamp | gauge | source_id | Unix timestamp of the last successful resolution per source. Alert on staleness. |

The error_kind values come from rsigma_runtime::sources::SourceErrorKind. Fetch covers HTTP / file / command / NATS connect-or-read failures (per-protocol details land in the error_message log field, not the label). ResourceLimit covers the 10 MiB body cap, 30 s command exec cap, and similar.

Enrichment (6 metrics)

Exposed when the daemon is built with daemon and --enrichers is passed. Every (enricher_id, kind, status) triple and every HTTP-cache enricher_id row is pre-registered at startup, so all six families render with their # HELP / # TYPE lines and zeroed counters on the first scrape, even before any event has fired. Filtered (kind- or scope-mismatched) enricher calls do not increment any counter, so cardinality stays bounded by the number of configured enrichers.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_enrichment_total | counter | enricher_id, kind (detection, correlation), status (success, skip, error, timeout, drop) | Per-call outcome counter. kind is the enricher's declared kind (the YAML kind: field), not a per-result discriminator. |
| rsigma_enrichment_duration_seconds | histogram | enricher_id, kind | Per-enricher latency. Buckets target both fast template calls and slower http/command invocations. |
| rsigma_enrichment_queue_depth | gauge | — | Pending enrichment calls (sum across both kinds). Watch this versus max_concurrent_enrichments. |
| rsigma_enrichment_http_cache_hits_total | counter | enricher_id | HTTP enricher response-cache hits. Mandatory signal for any rate-limited API recipe. |
| rsigma_enrichment_http_cache_misses_total | counter | enricher_id | HTTP enricher response-cache misses. |
| rsigma_enrichment_http_cache_expirations_total | counter | enricher_id | HTTP enricher response-cache entries evicted on expiry. |

The kind label is carried even though enricher_id typically already encodes it (asset_lookup_det vs asset_lookup_corr), so dashboards can compute sum by (kind) without depending on a naming convention.
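To make the aggregation concrete, the sketch below reproduces what a `sum by (kind)` does over raw counter values from a single scrape. It is a deliberately simple parser written for this example (the helper name `sum_by_label` and the sample values are hypothetical), and it does not handle label values that contain `}` or escaped quotes.

```python
import re
from collections import defaultdict

# Matches lines of the form: name{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_by_label(exposition_text, metric, label):
    """Sum the sample values of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        m = SAMPLE_RE.match(line.strip())
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        if label in labels:
            totals[labels[label]] += float(m.group("value"))
    return dict(totals)


# Hypothetical scrape excerpt with made-up counter values.
sample = """\
rsigma_enrichment_total{enricher_id="asset_lookup_det",kind="detection",status="success"} 10
rsigma_enrichment_total{enricher_id="asset_lookup_det",kind="detection",status="error"} 2
rsigma_enrichment_total{enricher_id="asset_lookup_corr",kind="correlation",status="success"} 5
"""

by_kind = sum_by_label(sample, "rsigma_enrichment_total", "kind")
print(by_kind)
# {'detection': 12.0, 'correlation': 5.0}
```

Grouping on the label rather than on a suffix of enricher_id keeps the result correct even if an enricher is named without the _det / _corr convention.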

OTLP (3 metrics)

Exposed when the daemon is built with daemon-otlp and an OTLP receiver is active. The labelled counters surface after the first request of that kind.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_otlp_requests_total | counter | transport (http, grpc), encoding (e.g. json, protobuf, protobuf+gzip) | OTLP export requests received. |
| rsigma_otlp_log_records_total | counter | — | Log records ingested via OTLP. |
| rsigma_otlp_errors_total | counter | transport, reason (unsupported_content_type, decompression, decode, channel_closed) | OTLP request errors. |

TLS (2 metrics)

Exposed when the daemon is built with daemon-tls. Both metrics render with their # HELP and # TYPE lines as soon as TLS is configured, even before the first handshake.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| rsigma_tls_certificate_expiry_seconds | gauge | — | Seconds until the active TLS server certificate's not_after. Signed: negative once expired. Updated at startup and after every successful SIGHUP-triggered reload. |
| rsigma_tls_active_connections | gauge | — | Currently active TLS-terminated connections on the API listener. Decrements on connection close (including handshake failure). |

Scrape configuration

A minimal Prometheus scrape config:

scrape_configs:
  - job_name: rsigma
    scrape_interval: 15s
    static_configs:
      - targets: ['rsigma.internal:9090']

15-30 s intervals are reasonable. The histograms use the default prometheus bucket boundaries; alert on the _bucket{le="..."} quantiles you care about rather than the average, which becomes meaningless under bimodal latency.
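For intuition on what a bucket-based quantile actually returns, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation within the matching bucket, which is the general approach PromQL's histogram_quantile takes for classic histograms. The function name `estimate_quantile` and the bucket counts are made up for illustration; the bounds are a subset of the usual client-library defaults.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Cannot interpolate into +Inf: report the last finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts: 90 observations <= 5 ms, 98 <= 10 ms, all <= 25 ms.
buckets = [(0.005, 90), (0.01, 98), (0.025, 100), (math.inf, 100)]

p50 = estimate_quantile(0.50, buckets)
p99 = estimate_quantile(0.99, buckets)
print(f"p50 ~ {p50:.4f} s, p99 ~ {p99:.4f} s")
```

Because the estimate can never be finer than the bucket boundaries, latencies far below the smallest bound all land in the first bucket and are interpolated from zero.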

Useful alerts

groups:
  - name: rsigma
    rules:
      # Engine cannot keep up.
      - alert: RsigmaBackPressure
        expr: rate(rsigma_back_pressure_events_total[5m]) > 0
        for: 10m
        labels: {severity: warning}

      # Correlation state above 80% of the default 100000 cap.
      - alert: RsigmaCorrelationStatePressure
        expr: rsigma_correlation_state_entries > 80000
        for: 10m
        labels: {severity: warning}

      # DLQ taking traffic.
      - alert: RsigmaDlqVolume
        expr: rate(rsigma_dlq_events_total[5m]) > 1
        for: 15m
        labels: {severity: warning}

      # Reloads failing means rules are broken on disk.
      - alert: RsigmaReloadsFailing
        expr: rate(rsigma_reloads_failed_total[5m]) > 0
        for: 10m
        labels: {severity: critical}

      # Dynamic source went stale (no successful resolve in 10 minutes).
      - alert: RsigmaSourceStale
        expr: time() - rsigma_source_last_resolved_timestamp > 600
        for: 5m
        labels: {severity: warning}

      # Enricher consistently failing (timeouts or fetch errors).
      - alert: RsigmaEnrichmentFailing
        expr: |
          sum by (enricher_id) (
            rate(rsigma_enrichment_total{status=~"error|timeout"}[5m])
          ) > 1
        for: 10m
        labels: {severity: warning}

      # TLS certificate expires within 14 days.
      - alert: RsigmaTlsCertExpiring
        expr: rsigma_tls_certificate_expiry_seconds < 14 * 86400
        for: 5m
        labels: {severity: warning}

      # TLS certificate has already expired.
      - alert: RsigmaTlsCertExpired
        expr: rsigma_tls_certificate_expiry_seconds < 0
        for: 1m
        labels: {severity: critical}

Histograms: bucket guidance

| Metric | Typical p50 | Typical p99 | Notes |
| --- | --- | --- | --- |
| rsigma_event_processing_seconds | 1-30 µs | < 1 ms | Per-event evaluation against the loaded rule set. Spikes correlate with reload events. |
| rsigma_pipeline_latency_seconds | 1-100 µs | < 5 ms | End-to-end from event dequeue to sink send. Dominated by sink latency (file vs NATS). |
| rsigma_batch_size | 1 | 1 | Default --batch-size 1. With --batch-size 64 and load, p50 trends toward 64. |

A p99 above 5 ms for rsigma_event_processing_seconds is usually a sign of misuse (regex-heavy rules without --cross-rule-ac, or many |all modifiers).

See also

  • Observability for the broader observability story, including the tracing event targets that complement these metrics.
  • Performance Tuning for which metric to watch when sizing --buffer-size, --batch-size, or correlation max_state_entries.
  • Streaming Detection for how the /metrics endpoint fits into the broader daemon API.
  • daemon/metrics source for the registry implementation.