Observability

RSigma is built on tracing plus the prometheus crate. Every meaningful event in the daemon and CLI lands on one of:

  • A tracing event (info, warning, error, debug, or trace) on stderr.
  • A Prometheus counter, gauge, or histogram exposed on /metrics (daemon only).

This page covers the four observability surfaces you actually operate on: the log subscriber and its format, the RUST_LOG filter targets that surface specific concerns (NATS lifecycle, dynamic sources, hot-reload, HTTP requests, correlation memory pressure), what to scrape with Prometheus, and how to recognize the most useful tracing spans.

Log format and verbosity

The daemon always emits structured JSON to stderr. The other commands (engine eval, rule lint, rule validate, backend convert, pipeline resolve) default to human-readable stdout/stderr output with no structured logs. Opt into a tracing subscriber for them with --log-format:

rsigma --log-format json rule validate rules/ -p pipelines/ecs.yml
rsigma --log-format text engine eval -r rules/ -e @events.ndjson

| Value | What it does |
|-------|--------------|
| json | Structured JSON, one object per line. Same shape the daemon always emits. |
| text | Human-readable text with ANSI colors when stderr is a TTY. |

--log-format adds the diagnostic-log stream alongside the existing stdout/stderr output; it never replaces them. So rsigma --log-format json engine eval ... still prints the MatchResult lines to stdout exactly as before; the JSON log lines arrive on stderr.
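
Because the two streams stay separate, they can be captured independently with ordinary shell redirection. The sketch below is one way to do that; run_split and the file names are hypothetical helpers for illustration, not part of rsigma.

```shell
# run_split OUT_FILE LOG_FILE COMMAND [ARGS...]
# Runs COMMAND, saving its stdout (match results) to OUT_FILE and its
# stderr (diagnostic logs) to LOG_FILE. Returns COMMAND's exit status.
# Hypothetical helper, not shipped with rsigma.
run_split() {
    out_file="$1"
    log_file="$2"
    shift 2
    "$@" >"$out_file" 2>"$log_file"
}

# Usage with the invocation shown earlier (file names are illustrative):
# run_split matches.ndjson logs.ndjson \
#     rsigma --log-format json engine eval -r rules/ -e @events.ndjson
```

Keeping the logs in their own file means a downstream consumer of the match results never has to skip over diagnostic lines.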

Verbosity is controlled by the standard RUST_LOG environment variable (tracing_subscriber::EnvFilter). The default is info. The --log-format flag has no effect on engine daemon, which always emits JSON.

RUST_LOG filter targets

Every emitted event carries a target field naming the module that produced it. Use that to narrow RUST_LOG to the area you care about:

| Target | What it surfaces | When to enable above info |
|--------|------------------|---------------------------|
| rsigma::daemon::server | Daemon lifecycle: rule load, API server bind, source start, sink start, shutdown drain. | Always at info. Drop to debug only when chasing startup ordering bugs. |
| rsigma::daemon::reload | File watcher for rules and pipelines, reload triggers, atomic engine swap. | debug when investigating "my rule change is not picked up". |
| rsigma::daemon::health | Readiness state transitions (/readyz flipping 200 ↔ 503). | debug if readiness probes flap. |
| rsigma_runtime::engine | Rules + pipeline load, swap, recompile timing. | debug to confirm the engine swap path during hot-reload. |
| rsigma_runtime::sources | Per-source fetches (HTTP, file, command, NATS), cache hits and misses, parse failures. | debug when a dynamic source is misbehaving. |
| rsigma_runtime::sources::refresh | Scheduled refresh ticks for interval-based sources. | debug to see refresh cadence; usually noisy. |
| rsigma_eval::correlation_engine | Correlation state pressure (max_state_entries evictions), correlation matches. | warn is enough in practice; the eviction message is what you actually want to alert on. |
| rsigma_eval::engine | Cross-rule AC index limits, bloom-filter sizing. Static one-shot warnings. | warn. |
| async_nats::connector | NATS connect/disconnect/reconnect lifecycle. Appears with daemon-nats enabled. | debug to trace transient connection drops; info is enough for steady-state. |
| async_nats | NATS event-stream messages (event: connected, event: closed). | Same as async_nats::connector. |
| tower_http::trace::on_request and tower_http::trace::on_response | Per-request HTTP access logs for the /api/v1/*, /metrics, /v1/logs endpoints. | debug for an access log; off in production unless debugging. |

A few combinations that come up often in practice:

# Quiet production daemon: only warnings and above, but keep INFO for the
# daemon's own lifecycle messages so the boot sequence stays readable.
RUST_LOG="warn,rsigma::daemon=info" \
    rsigma engine daemon -r rules/

# Trace a hot-reload that is not picking up a rule change.
RUST_LOG="info,rsigma::daemon::reload=debug,rsigma_runtime::engine=debug" \
    rsigma engine daemon -r rules/

# Investigate a dynamic source that is timing out.
RUST_LOG="info,rsigma_runtime::sources=debug" \
    rsigma engine daemon -r rules/ -p pipelines/dynamic.yml

# HTTP access log on the daemon API.
RUST_LOG="info,tower_http=debug" \
    rsigma engine daemon -r rules/ --input http

Spans

A tracing span is a structured scope around a unit of work. When the daemon resolves dynamic sources during a rule load, each event emitted inside the load_rules span carries that span's context in the JSON output (pretty-printed here for readability):

{
  "timestamp": "...",
  "level": "DEBUG",
  "fields": {"message": "Source fetched successfully", "source_id": "ips"},
  "target": "rsigma_runtime::sources",
  "span": {"rules_path": "/tmp/obs-test/rules", "name": "load_rules"},
  "spans": [{"rules_path": "/tmp/obs-test/rules", "name": "load_rules"}]
}

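The span field names the innermost span that was active when the event fired, and spans lists the full stack of active spans from outermost to innermost. Together they show the context that produced the event without needing distributed tracing infrastructure. The named spans currently emitted: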

| Span | Where | Useful for |
|------|-------|------------|
| load_rules | Engine swap path during startup and hot-reload. | Correlating per-source fetches with the engine reload that triggered them. |
| evaluate_batch (debug only) | Per-batch processing in LogProcessor. Includes batch_size, matches, elapsed_ms. | Profiling batch latency vs throughput. Off at info. |
| otlp_logs_request | One per OTLP /v1/logs POST or gRPC Export. Includes content encoding and record count. | Detecting agents that send malformed OTLP or overly-large batches. Off at info. |

Spans are emitted alongside events. To capture them in a structured aggregator (Loki, Datadog Logs, ClickHouse, etc.), index on the span.name field as well as target and level.
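
When no aggregator is at hand, the same span-based selection can be approximated on the command line. The filter below is a rough sketch: events_in_span is a hypothetical helper, and it assumes one JSON object per line with a flat span object (no nested braces), as in the sample event above. A real JSON parser would be more robust.

```shell
# events_in_span NAME
# Reads JSON log lines on stdin and prints those whose innermost span
# (the "span" object) has the given name. Assumes one JSON object per
# line and a flat span object; NAME is used as-is inside a regex.
# Hypothetical helper, not shipped with rsigma.
events_in_span() {
    grep -E "\"span\":[[:space:]]*\{[^}]*\"name\":[[:space:]]*\"$1\""
}

# Usage (RUST_LOG value taken from the combinations listed earlier):
# RUST_LOG="info,rsigma_runtime::sources=debug" \
#     rsigma engine daemon -r rules/ 2>&1 | events_in_span load_rules
```

The pattern anchors on the "span" key rather than "spans", so an event is selected only when the named span is the innermost one.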

Prometheus metrics

The daemon serves /metrics on the same --api-addr as the REST API. It exposes 27 metric definitions across six concerns:

| Concern | Metrics | What they answer |
|---------|---------|------------------|
| Engine throughput | rsigma_events_processed_total, rsigma_events_parse_errors_total, rsigma_detection_matches_total, rsigma_correlation_matches_total, rsigma_event_processing_seconds, rsigma_pipeline_latency_seconds, rsigma_batch_size, rsigma_uptime_seconds | How fast are we ingesting, how often are rules firing, how long does each batch take? |
| Queue and back-pressure | rsigma_input_queue_depth, rsigma_output_queue_depth, rsigma_back_pressure_events_total | Is the engine keeping up with the source? Is the source faster than the sink? |
| Rule and state load | rsigma_detection_rules_loaded, rsigma_correlation_rules_loaded, rsigma_correlation_state_entries, rsigma_reloads_total, rsigma_reloads_failed_total, rsigma_dlq_events_total | How many rules are loaded, how full is the correlation state, are reloads succeeding? |
| Per-rule labels (appear after first match) | rsigma_detection_matches_by_rule_total{rule_id="..."}, rsigma_correlation_matches_by_rule_total{rule_id="..."} | Which specific rules are firing? |
| Dynamic sources (with -p pipelines that declare sources) | rsigma_source_resolves_total, rsigma_source_resolve_errors_total, rsigma_source_resolve_seconds, rsigma_source_cache_hits_total, rsigma_source_last_resolved_timestamp | Are HTTP/file/command sources reachable and timely? |
| OTLP (with daemon-otlp feature) | rsigma_otlp_requests_total, rsigma_otlp_log_records_total, rsigma_otlp_errors_total | How are upstream OTLP agents behaving? |

Some metrics only appear after their first relevant event (per-rule labels, OTLP counters). A startup /metrics scrape shows about 20 distinct metric names; the full 27 emerge as the daemon does real work.
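
To see how many metric names a given scrape actually contains, the exposition text can be reduced to its distinct names. This is a small sketch: metric_names is a hypothetical helper, and it assumes the endpoint emits a "# TYPE <name> <type>" line for every metric family, as the Prometheus text format normally does.

```shell
# metric_names
# Reads Prometheus text exposition on stdin and prints each distinct
# metric name once, taken from the "# TYPE <name> <type>" lines.
# Hypothetical helper, not shipped with rsigma.
metric_names() {
    awk '$1 == "#" && $2 == "TYPE" { print $3 }' | sort -u
}

# Usage against a running daemon (address is illustrative):
# curl -s http://127.0.0.1:9090/metrics | metric_names | wc -l
```

Counting from the TYPE lines keeps a histogram's _bucket, _sum, and _count series from being counted as separate metrics.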

Scrape /metrics at 15-30 s intervals. The histograms (rsigma_event_processing_seconds, rsigma_pipeline_latency_seconds, rsigma_batch_size) use the default Prometheus bucket boundaries; alert on quantiles computed from the _bucket{le="..."} series (for example with PromQL's histogram_quantile) rather than on the raw average.

A minimal scrape config:

scrape_configs:
  - job_name: rsigma
    scrape_interval: 15s
    static_configs:
      - targets: ['rsigma.internal:9090']

The full table with every label and source-of-truth pointer lives in the Prometheus metrics reference.

Useful alerting recipes

These four alerts catch most operational regressions for free.

groups:
  - name: rsigma
    rules:
      # Engine is unable to keep up with the source.
      - alert: RsigmaBackPressure
        expr: rate(rsigma_back_pressure_events_total[5m]) > 0
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: rsigma is back-pressuring the input

      # Correlation state heading toward the hard cap (default 100k).
      - alert: RsigmaCorrelationStatePressure
        expr: rsigma_correlation_state_entries > 80000
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: rsigma correlation state above 80% of the hard cap

      # DLQ getting events means upstream is sending unparseable data.
      - alert: RsigmaDlqVolume
        expr: rate(rsigma_dlq_events_total[5m]) > 1
        for: 15m
        labels: {severity: warning}
        annotations:
          summary: rsigma is routing events to the dead-letter queue

      # Reload-failure rate. Rules path issues, pipeline parse errors.
      - alert: RsigmaReloadsFailing
        expr: rate(rsigma_reloads_failed_total[5m]) > 0
        for: 10m
        labels: {severity: critical}
        annotations:
          summary: rsigma rule reload is failing

Health probes

For Kubernetes-style orchestrators:

| Endpoint | Returns | Wire to |
|----------|---------|---------|
| /healthz | 200 once the listener is up. | livenessProbe. Restart the container if this stops responding. |
| /readyz | 200 once rules and pipelines are loaded; 503 during startup or after a failed reload. | readinessProbe. Drain traffic when this returns 503. |

/healthz is intentionally cheap and side-effect-free; do not rely on it to detect "the engine is silently dropping events". Use rsigma_back_pressure_events_total, rsigma_dlq_events_total, and rsigma_reloads_failed_total for that.
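
Outside an orchestrator, the same readiness semantics can be checked by hand. The sketch below maps the /readyz status codes from the table to a readable word; readiness_from_status and check_readyz are hypothetical helpers, and the address in the usage line is illustrative.

```shell
# readiness_from_status HTTP_STATUS
# Maps a /readyz HTTP status code to a word: 200 -> ready,
# 503 -> not-ready, anything else (including curl's 000 when no
# response arrives) -> unknown. Hypothetical helper.
readiness_from_status() {
    case "$1" in
        200) echo ready ;;
        503) echo not-ready ;;
        *)   echo unknown ;;
    esac
}

# check_readyz BASE_URL
# Fetches BASE_URL/readyz, discards the body, and prints the
# classification of the returned status code.
check_readyz() {
    code=$(curl -s -o /dev/null -w '%{http_code}' "$1/readyz")
    readiness_from_status "$code"
}

# Usage (address is illustrative):
# check_readyz http://127.0.0.1:9090
```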

OpenTelemetry receivers

OTLP is one of the supported input formats for the daemon (with the daemon-otlp feature). RSigma does NOT export traces of its own internal work over OTLP; the OTLP wiring is one-way and is for receiving log records from agents.

If you want to ship the daemon's tracing events into a tracing backend, the standard tracing-opentelemetry Rust bridge would be the path, but no first-party integration ships today. The structured JSON log stream is the canonical observability surface; pipe it into Loki, Vector → ClickHouse, Datadog Logs, or any equivalent.

Quick verification

# Confirm the metrics endpoint is alive.
curl -s http://127.0.0.1:9090/metrics | head -20

# Confirm structured-log emission with the daemon target.
rsigma engine daemon -r rules/ 2>&1 | head -3

The /metrics output should start with the # HELP rsigma_back_pressure_events_total ... block. The first daemon log line should be a Rules loaded event with target=rsigma::daemon::server. If either is missing, the binary was probably built without the daemon feature, or --api-addr is misconfigured.

See also