Observability🔗
RSigma is built on tracing plus the prometheus crate. Every meaningful event in the daemon and CLI lands on one of:
- A
tracingevent (info, warning, error, debug, or trace) on stderr. - A Prometheus counter, gauge, or histogram exposed on
/metrics(daemon only).
This page covers the four observability surfaces you actually operate on: the log subscriber and its format, the RUST_LOG filter targets that surface specific concerns (NATS lifecycle, dynamic sources, hot-reload, HTTP requests, correlation memory pressure), what to scrape with Prometheus, and how to recognize the most useful tracing spans.
Log format and verbosity🔗
The daemon always emits structured JSON to stderr. The other commands (engine eval, rule lint, rule validate, backend convert, pipeline resolve) default to human-readable stdout/stderr output with no structured logs. Opt into a tracing subscriber for them with --log-format:
rsigma --log-format json rule validate rules/ -p pipelines/ecs.yml
rsigma --log-format text engine eval -r rules/ -e @events.ndjson
| Value | What it does |
|---|---|
json | Structured JSON, one object per line. Same shape the daemon always emits. |
text | Human-readable text with ANSI colors when stderr is a TTY. |
--log-format adds the diagnostic-log stream alongside the existing stdout/stderr output; it never replaces them. So rsigma --log-format json engine eval ... still prints the MatchResult lines to stdout exactly as before; the JSON log lines arrive on stderr.
Verbosity is controlled by the standard RUST_LOG environment variable (tracing_subscriber::EnvFilter). The default is info. The flag has no effect on engine daemon, which is always JSON.
RUST_LOG filter targets🔗
Every emitted event carries a target field naming the module that produced it. Use that to narrow RUST_LOG to the area you care about:
| Target | What it surfaces | When to enable above info |
|---|---|---|
rsigma::daemon::server | Daemon lifecycle: rule load, API server bind, source start, sink start, shutdown drain. | Always at info. Drop to debug only when chasing startup ordering bugs. |
rsigma::daemon::reload | File watcher for rules and pipelines, reload triggers, atomic engine swap. | debug when investigating "my rule change is not picked up". |
rsigma::daemon::health | Readiness state transitions (/readyz flipping 200 ↔ 503). | debug if liveness probes flap. |
rsigma_runtime::engine | Rules + pipeline load, swap, recompile timing. | debug to confirm the engine swap path during hot-reload. |
rsigma_runtime::sources | Per-source fetches (HTTP, file, command, NATS), cache hits and misses, parse failures. | debug when a dynamic source is misbehaving. |
rsigma_runtime::sources::refresh | Scheduled refresh ticks for interval-based sources. | debug to see refresh cadence; usually noisy. |
rsigma_eval::correlation_engine | Correlation state pressure (max_state_entries evictions), correlation matches. | warn is enough in practice; the eviction message is what you actually want to alert on. |
rsigma_eval::engine | Cross-rule AC index limits, bloom-filter sizing. Static one-shot warnings. | warn. |
async_nats::connector | NATS connect/disconnect/reconnect lifecycle. Appears with daemon-nats enabled. | debug to trace transient connection drops; info is enough for steady-state. |
async_nats | NATS event-stream messages (event: connected, event: closed). | Same. |
tower_http::trace::on_request and tower_http::trace::on_response | Per-request HTTP access logs for the /api/v1/*, /metrics, /v1/logs endpoints. | debug for an access log; off in production unless debugging. |
A few combinations that come up often in practice:
# Quiet production daemon: only warnings and above, but keep INFO for the
# daemon's own lifecycle messages so the boot sequence stays readable.
RUST_LOG="warn,rsigma::daemon=info" \
rsigma engine daemon -r rules/
# Trace a hot-reload that is not picking up a rule change.
RUST_LOG="info,rsigma::daemon::reload=debug,rsigma_runtime::engine=debug" \
rsigma engine daemon -r rules/
# Investigate a dynamic source that is timing out.
RUST_LOG="info,rsigma_runtime::sources=debug" \
rsigma engine daemon -r rules/ -p pipelines/dynamic.yml
# HTTP access log on the daemon API.
RUST_LOG="info,tower_http=debug" \
rsigma engine daemon -r rules/ --input http
Spans🔗
A tracing span is a structured scope around a unit of work. When the daemon resolves dynamic sources during a rule load, the span tree looks like this in the JSON output:
{
"timestamp": "...",
"level": "DEBUG",
"fields": {"message": "Source fetched successfully", "source_id": "ips"},
"target": "rsigma_runtime::sources",
"span": {"rules_path": "/tmp/obs-test/rules", "name": "load_rules"},
"spans": [{"rules_path": "/tmp/obs-test/rules", "name": "load_rules"}]
}
The span and spans fields tell you the call stack that produced the event without needing distributed tracing infrastructure. The named spans currently emitted:
| Span | Where | Useful for |
|---|---|---|
load_rules | Engine swap path during startup and hot-reload. | Correlating per-source fetches with the engine reload that triggered them. |
evaluate_batch (debug only) | Per-batch processing in LogProcessor. Includes batch_size, matches, elapsed_ms. | Profiling batch latency vs throughput. Off at info. |
otlp_logs_request | One per OTLP /v1/logs POST or gRPC Export. Includes content encoding and record count. | Detecting agents that send malformed OTLP or overly-large batches. Off at info. |
Spans are emitted alongside events. To capture them in a structured aggregator (Loki, Datadog Logs, ClickHouse, etc.), index on the span.name field as well as target and level.
Prometheus metrics🔗
The daemon binds /metrics on the same --api-addr as the REST API. It exposes 27 metric definitions across three concerns:
| Concern | Metrics | What they answer |
|---|---|---|
| Engine throughput | rsigma_events_processed_total, rsigma_events_parse_errors_total, rsigma_detection_matches_total, rsigma_correlation_matches_total, rsigma_event_processing_seconds, rsigma_pipeline_latency_seconds, rsigma_batch_size, rsigma_uptime_seconds | How fast are we ingesting, how often are rules firing, how long does each batch take? |
| Queue and back-pressure | rsigma_input_queue_depth, rsigma_output_queue_depth, rsigma_back_pressure_events_total | Is the engine keeping up with the source? Is the source faster than the sink? |
| Rule and state load | rsigma_detection_rules_loaded, rsigma_correlation_rules_loaded, rsigma_correlation_state_entries, rsigma_reloads_total, rsigma_reloads_failed_total, rsigma_dlq_events_total | How many rules are loaded, how full is the correlation state, are reloads succeeding? |
| Per-rule labels (appear after first match) | rsigma_detection_matches_by_rule_total{rule_id="..."}, rsigma_correlation_matches_by_rule_total{rule_id="..."} | Which specific rules are firing? |
Dynamic sources (with -p pipelines that declare sources) | rsigma_source_resolves_total, rsigma_source_resolve_errors_total, rsigma_source_resolve_seconds, rsigma_source_cache_hits_total, rsigma_source_last_resolved_timestamp | Are HTTP/file/command sources reachable and timely? |
OTLP (with daemon-otlp feature) | rsigma_otlp_requests_total, rsigma_otlp_log_records_total, rsigma_otlp_errors_total | How are upstream OTLP agents behaving? |
Some metrics only appear after their first relevant event (per-rule labels, OTLP counters). A startup /metrics scrape shows about 20 distinct metric names; the full 27 emerge as the daemon does real work.
Scrape /metrics at 15-30 s intervals. The histograms (event_processing_seconds, pipeline_latency_seconds, batch_size) use the default Prometheus bucket boundaries; alert on the _bucket{le="..."} quantiles you care about rather than on the raw average.
A minimal scrape config:
scrape_configs:
- job_name: rsigma
scrape_interval: 15s
static_configs:
- targets: ['rsigma.internal:9090']
The full table with every label and source-of-truth pointer lives in the Prometheus metrics reference.
Useful alerting recipes🔗
These four alerts catch most operational regressions for free.
groups:
- name: rsigma
rules:
# Engine is unable to keep up with the source.
- alert: RsigmaBackPressure
expr: rate(rsigma_back_pressure_events_total[5m]) > 0
for: 10m
labels: {severity: warning}
annotations:
summary: rsigma is back-pressuring the input
# Correlation state heading toward the hard cap (default 100k).
- alert: RsigmaCorrelationStatePressure
expr: rsigma_correlation_state_entries > 80000
for: 10m
labels: {severity: warning}
annotations:
summary: rsigma correlation state above 80% of the hard cap
# DLQ getting events means upstream is sending unparseable data.
- alert: RsigmaDlqVolume
expr: rate(rsigma_dlq_events_total[5m]) > 1
for: 15m
labels: {severity: warning}
annotations:
summary: rsigma is routing events to the dead-letter queue
# Reload-failure rate. Rules path issues, pipeline parse errors.
- alert: RsigmaReloadsFailing
expr: rate(rsigma_reloads_failed_total[5m]) > 0
for: 10m
labels: {severity: critical}
annotations:
summary: rsigma rule reload is failing
Health probes🔗
For Kubernetes-style orchestrators:
| Endpoint | Returns | Wire to |
|---|---|---|
/healthz | 200 once the listener is up. | livenessProbe. Restart the container if this stops responding. |
/readyz | 200 once rules and pipelines are loaded; 503 during startup or after a failed reload. | readinessProbe. Drain traffic when this returns 503. |
/healthz is intentionally cheap and side-effect-free; do not rely on it to detect "the engine is silently dropping events". Use rsigma_back_pressure_events_total, rsigma_dlq_events_total, and rsigma_reloads_failed_total for that.
OpenTelemetry receivers🔗
OTLP is one of the supported input formats for the daemon (with the daemon-otlp feature). RSigma does NOT export traces of its own internal work over OTLP; the OTLP wiring is one-way and is for receiving log records from agents.
If you want to ship the daemon's tracing events into a tracing backend, the standard tracing-opentelemetry Rust bridge would be the path, but no first-party integration ships today. The structured JSON log stream is the canonical observability surface; pipe it into Loki, Vector → ClickHouse, Datadog Logs, or any equivalent.
Quick verification🔗
# Confirm the metrics endpoint is alive.
curl -s http://127.0.0.1:9090/metrics | head -20
# Confirm structured-log emission with the daemon target.
rsigma engine daemon -r rules/ 2>&1 | head -3
The first line of /metrics should be a # HELP rsigma_back_pressure_events_total ... block. The first daemon log line should be a Rules loaded event with target=rsigma::daemon::server. If either is missing, the build is probably without the daemon feature or with a broken --api-addr.
See also🔗
- Streaming Detection for the daemon's HTTP API surface that complements the metrics endpoint.
- Performance Tuning for which metric to watch when sizing
--buffer-size,--batch-size, or correlationmax_state_entries. - NATS Streaming for the NATS-specific log targets (
async_nats::connector). - Prometheus metrics reference for the full 27-metric catalog.
- HTTP API reference for every endpoint exposed alongside
/metrics. tracingfilter syntax for the exactRUST_LOGdirective grammar.