`rsigma engine discover-schemas`🔗

Mine unrecognized events into candidate schema signatures you can review and commit.

Synopsis🔗

rsigma engine discover-schemas [OPTIONS]

Description🔗

Reads a JSON/NDJSON corpus, runs a pure-Rust, glass-box mining pass over the events no built-in or --schema-config signature already recognizes (the unknown and generic_json events), and prints ranked candidate declarative signatures plus a paste-ready schemas: block. It clusters events by their field-key shape, picks the fields (and low-cardinality values) that best discriminate each cluster, and emits the same signature YAML that engine classify and schema routing consume via --schema-config.

This closes the authoring loop the schema tooling otherwise leaves to you: engine classify shows you have unknowns, discover-schemas proposes signatures, and classifying again with the pasted config verifies them. It never loads rules, evaluates detections, or applies a discovered signature on its own; a human always reviews and renames the proposals first (the emitted names are placeholders like discovered_alert).

For the live equivalent on a running daemon, see GET /api/v1/schemas/suggestions and the --discover-schemas flag on engine daemon.

Flags🔗

Flag	Default	Description
`-e, --event <EVENT>`	stdin	A single event as a JSON string, or `@path` to read NDJSON from a file. Without this flag, reads NDJSON from stdin.
`--schema-config <PATH>`	unset	YAML file of user-defined schema signatures. Events these already recognize are excluded from mining (alias-aware), so a defined schema is never re-proposed.
`--min-support <N>`	`3`	Minimum events a cluster must contain to yield a candidate. Filters out one-off shapes.
`--similarity <F>`	`0.6`	Jaccard similarity (0.0-1.0) at or above which shapes merge into one cluster. Higher is stricter (more, tighter clusters).
`--max-candidates <N>`	`20`	Maximum candidates to emit, highest support first.
`--max-predicates <N>`	`3`	Maximum predicates per candidate signature.
`--no-value-markers`	off	Propose presence predicates only; never emit `equals`/`in` value markers.
`--dry-run`	off	Reclassify the corpus with the proposed signatures loaded and report the before/after per-schema counts.
`--emit <report\\|yaml>`	`report`	`report` prints candidates with stats in the global output format; `yaml` prints only the paste-ready `schemas:` block.

The global --output-format flag selects json, ndjson, table, csv, or tsv for the report.

How proposals are built🔗

Exclusion. Events already recognized by a built-in or a --schema-config signature are skipped; only unknown and generic_json events are mined.
Clustering. Events collapse to distinct field-key shapes, which merge by Jaccard similarity. A diversity guard keeps two shapes that disagree on a single constant marker (for example vendor: foo vs vendor: bar) in separate clusters rather than fusing distinct schemas.
Selection. Each cluster's most discriminative fields become a minimal 1-3 predicate conjunction, weighted so a rarer marker beats a near-ubiquitous field. A low-cardinality, non-sensitive constant value becomes an equals/in marker; everything else becomes field_present. High-cardinality or free-form values (command lines, paths, IPs) are never emitted as value markers.
Suggested specificity sits above generic_json and below the strong built-ins, and each candidate is checked against the built-ins so a shadowed proposal is dropped.

Redaction🔗

The offline command reads a corpus you already hold and derives low-cardinality value markers in-process, emitting them only into the candidate YAML for your review. The online suggestions endpoint instead mines a keys-only sample (values are never retained), so its proposals are presence-only.

Examples🔗

Discover signatures from an NDJSON corpus and print the report:

cat events.ndjson | rsigma engine discover-schemas --output-format table

Emit only the paste-ready block and append it to a schema config:

rsigma engine discover-schemas -e @events.ndjson --emit yaml >> schemas.yml

Preview the classification impact before committing:

rsigma engine discover-schemas -e @events.ndjson --dry-run --output-format table

Skip anything an existing config already recognizes:

rsigma engine discover-schemas -e @events.ndjson --schema-config schemas.yml

Output🔗

The structured report carries a summary (events_mined, shapes, clusters, candidates, parse_errors), a candidates array (each with name, specificity, source of corpus or keys-only, support, coverage_of_unknown, predicates, sample_field_sets, and any overlap_warnings), and a signatures_yaml string identical to --emit yaml. With --dry-run a dry_run object carries the before/after per-schema counts. In table format the paste-ready block prints below the table.

Exit codes🔗

Code	Meaning
`0`	Success
`2`	Bad input (invalid inline JSON, unreadable file)
`3`	Bad schema config

See Exit Codes for the full scheme.

rsigma engine discover-schemas🔗