Scanning & Detection Overview
Scanning is the core capability of Slim.io. The platform discovers and classifies sensitive data in cloud storage by processing files through a multi-layered detection pipeline. This section covers how scans work, the detection engine architecture, and the configuration options available.
Scan Types
| Type | Trigger | Scope | Use Case |
|---|---|---|---|
| Full Scan | Manual or scheduled | All files in the connector scope | Initial baseline, periodic re-scan |
| Incremental Scan | Manual or scheduled | Files modified since the last scan | Ongoing monitoring with lower cost |
| Event-Driven Scan | Cloud storage event | Single file or batch | Real-time detection on new uploads |
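The scope column above can be read as a filter over the connector's file listing. The following is a minimal sketch of that idea, not Slim.io's implementation: `FileEntry`, `select_scope`, and the string scan-type names are hypothetical, and the fallback to a full listing when no earlier scan exists is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical listing record for one object in cloud storage."""
    path: str
    modified_at: datetime


def select_scope(
    scan_type: str,
    files: Iterable[FileEntry],
    last_scan_at: Optional[datetime] = None,
    event_paths: Iterable[str] = (),
) -> List[FileEntry]:
    """Return the files a scan of the given type would cover."""
    files = list(files)
    if scan_type == "full":
        # Full scan: everything in the connector scope.
        return files
    if scan_type == "incremental":
        # Assumption: with no earlier scan there is no baseline,
        # so the incremental scan covers everything.
        if last_scan_at is None:
            return files
        return [f for f in files if f.modified_at > last_scan_at]
    if scan_type == "event":
        # Event-driven scan: only the file or batch named by the storage event.
        wanted = set(event_paths)
        return [f for f in files if f.path in wanted]
    raise ValueError(f"unknown scan type: {scan_type!r}")
```

An incremental scan touches fewer files than a full scan whenever `last_scan_at` is recent, which is where the lower cost in the table comes from.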
Scan Lifecycle
Every scan progresses through a defined lifecycle:
Created → Queued → Running → Completed
                      ↓
              Failed / Cancelled

- Created — Scan job is initialized
- Queued — Waiting for available workers within tier limits
- Running — Workers are actively processing files
- Completed — All files processed, findings stored
- Failed — Unrecoverable error (e.g., credential issue, infrastructure failure)
- Cancelled — Manually stopped by the user
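The lifecycle can be modeled as a small state machine. The sketch below is illustrative only: the `ScanState` enum and helper names are hypothetical, and it assumes that any scan that has not yet finished may move to Failed or Cancelled, which the lifecycle description does not spell out.

```python
from enum import Enum


class ScanState(Enum):
    CREATED = "Created"
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# States from which no further transition is possible.
TERMINAL = frozenset(
    {ScanState.COMPLETED, ScanState.FAILED, ScanState.CANCELLED}
)

# Forward path: Created -> Queued -> Running -> Completed.
_NEXT = {
    ScanState.CREATED: ScanState.QUEUED,
    ScanState.QUEUED: ScanState.RUNNING,
    ScanState.RUNNING: ScanState.COMPLETED,
}


def can_transition(current: ScanState, target: ScanState) -> bool:
    """Return True if moving from current to target is allowed."""
    if current in TERMINAL:
        return False
    # Assumption: any in-flight scan may fail or be cancelled.
    if target in (ScanState.FAILED, ScanState.CANCELLED):
        return True
    return _NEXT[current] is target


def advance(current: ScanState, target: ScanState) -> ScanState:
    """Return the new state, or raise if the transition is not allowed."""
    if not can_transition(current, target):
        raise ValueError(
            f"illegal transition: {current.value} -> {target.value}"
        )
    return target
```

Keeping the forward path and the terminal set as data makes it easy to reject out-of-order updates, such as a worker reporting Running for a scan that was already cancelled.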
Detection Pipeline
Each file passes through a multi-stage detection pipeline:
- Probabilistic Pre-Screen — A fast statistical check that eliminates files unlikely to contain sensitive data, so that scan effort concentrates on the files most likely to matter.
- Classifier Execution — Files that pass the pre-screen are analyzed by all active classifiers (pattern, dictionary, proximity, checksum, and ML-assisted).
- Confidence Scoring — Each match receives a confidence score based on classifier type, pattern specificity, and contextual signals from surrounding fields.
- AI Disambiguation (Optional) — Findings that fall in a configurable ambiguous range are escalated to a multi-provider AI pipeline with automatic failover, which adjudicates the final classification.
- Deduplication — Overlapping findings from multiple classifiers are merged into a single canonical finding.
- Finding Storage — Final findings are persisted with full provenance metadata, including detection method and classifier version.
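Two of the stages above, routing by confidence and deduplication, lend themselves to a short sketch. Everything here is hypothetical: the `Finding` fields, the default ambiguous range of 0.4 to 0.8, the decision to reject findings below the range and accept those above it, and the rule that a merged finding keeps the higher-confidence classifier are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Finding:
    start: int          # offset of the match in the file
    end: int            # exclusive end offset
    label: str          # e.g. "ssn", "email"
    confidence: float   # 0.0 to 1.0
    classifier: str     # which classifier produced the match


def route(finding: Finding, ambiguous: Tuple[float, float] = (0.4, 0.8)) -> str:
    """Decide what happens to a finding based on its confidence score."""
    low, high = ambiguous
    if finding.confidence < low:
        return "reject"      # assumption: below the range is dropped
    if finding.confidence < high:
        return "escalate"    # inside the range: send to AI disambiguation
    return "accept"          # assumption: above the range is kept as-is


def deduplicate(findings: Iterable[Finding]) -> List[Finding]:
    """Merge overlapping findings that share a label into one finding."""
    merged: List[Finding] = []
    for f in sorted(findings, key=lambda x: (x.label, x.start, x.end)):
        last = merged[-1] if merged else None
        if last is not None and last.label == f.label and f.start < last.end:
            # Overlap: widen the span and keep the more confident source.
            best = last if last.confidence >= f.confidence else f
            merged[-1] = Finding(
                last.start, max(last.end, f.end), f.label,
                best.confidence, best.classifier,
            )
        else:
            merged.append(f)
    return merged
```

Sorting by label and start offset first means each finding only has to be compared against the most recent merged entry, so the merge is a single pass after the sort.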
The probabilistic pre-screen significantly reduces processing time and cost by skipping low-value files before invoking the full classifier stack — without compromising recall on files that are likely to contain sensitive data.
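To make the idea of a cheap pre-screen concrete, here is a deliberately crude stand-in. It is not the statistical check Slim.io uses, which this page does not describe. It assumes that a sample of the file is already in memory and treats the density of digits and `@` characters as the only signal. The function names and the 2% threshold are hypothetical.

```python
def prescreen_score(sample: bytes) -> float:
    """Fraction of sampled bytes that are ASCII digits or '@'."""
    if not sample:
        return 0.0
    # 48..57 are the ASCII codes for '0'..'9'; 64 is '@'.
    hits = sum(1 for b in sample if 48 <= b <= 57 or b == 64)
    return hits / len(sample)


def should_scan(sample: bytes, threshold: float = 0.02) -> bool:
    """Return True if the sample looks worth sending to the classifiers."""
    return prescreen_score(sample) >= threshold
```

The point of the sketch is the shape of the trade-off: the check costs one pass over a small sample, and the threshold controls how many files skip the full classifier stack. A real pre-screen would need a signal and threshold chosen so that files with sensitive data are rarely skipped.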
Supported File Formats
Slim.io can process a wide range of file formats:
| Category | Formats |
|---|---|
| Structured Data | CSV, TSV, JSON, JSONL, Parquet, Avro, ORC |
| Modern Office | DOCX, DOCM, DOTX, DOTM, XLSX, XLSM, XLTX, XLTM, XLSB, PPTX, PPTM, POTX, POTM, PPSX, PPSM |
| OpenDocument | ODT, ODS, ODP, ODG, ODF, OTT, OTS, OTP |
| Legacy Office | DOC, DOT, XLS, XLT, XLA, PPT, POT, PPS |
| Documents | PDF, TXT, RTF |
| Email Archives | EML, MBOX, MSG (Outlook) |
| Configuration | YAML, TOML, XML, INI, ENV |
| Logs | Plain text, structured JSON logs |
| Archives | ZIP, GZIP, TAR, TAR.GZ (contents extracted and scanned, with recursion safety limits) |
Large files are scanned via streaming. Multi-GB Parquet and ORC files are processed with bounded memory usage: the platform reads only the file footer and the column ranges needed for sensitive-data detection. Large compressed archives spill to disk transparently, so a multi-GB ZIP does not exceed scanner memory. The platform imposes no hard “file too large” cap; per-tenant size policies are configurable in Settings → Scanner.
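The archive behavior described above, extraction with a recursion limit and spilling to disk past a memory budget, can be sketched with the Python standard library. This is an illustration under stated assumptions rather than the scanner's code: `MAX_DEPTH`, `SPOOL_LIMIT`, `spool`, and `iter_members` are hypothetical names and values, only ZIP is handled, and over-deep archives are silently skipped.

```python
import io
import shutil
import tempfile
import zipfile
from typing import BinaryIO, Iterator, Tuple

MAX_DEPTH = 3                    # assumed recursion safety limit
SPOOL_LIMIT = 8 * 1024 * 1024    # assumed bytes kept in memory per member


def spool(src: BinaryIO, limit: int = SPOOL_LIMIT) -> BinaryIO:
    """Copy src into memory if it is small, else into an anonymous temp file."""
    head = src.read(limit + 1)
    if len(head) <= limit:
        return io.BytesIO(head)
    tmp = tempfile.TemporaryFile()
    tmp.write(head)
    shutil.copyfileobj(src, tmp)  # stream the remainder to disk in chunks
    tmp.seek(0)
    return tmp


def iter_members(
    archive: BinaryIO, depth: int = 0, prefix: str = ""
) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (path, stream) for each file, descending into nested ZIPs."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            with zf.open(info) as src:
                buf = spool(src)
            nested = zipfile.is_zipfile(buf)
            buf.seek(0)
            if nested:
                # Stop descending once the depth limit is reached.
                if depth + 1 < MAX_DEPTH:
                    yield from iter_members(buf, depth + 1, path + "!/")
                continue
            yield path, buf
```

Note that `zipfile.is_zipfile` also returns true for ZIP-based formats such as DOCX and XLSX, so a real scanner would check the file type before deciding to recurse.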
Scan Capacity
Scan capacity is provisioned per tenant based on your subscription agreement. Specific limits (concurrent scans, file counts, and worker parallelism) are configured by your customer success representative during onboarding and are visible in Settings → Scanner → Limits in the Customer Dashboard. Contact your representative if you need additional capacity for a specific workload.
Learn More
- Scan Management — Starting, controlling, and monitoring scans
- Parallel Scanning Engine — Distributed scan architecture
- Scanner Fleet — Agentless and connector-based scanning infrastructure
- Classifiers — Detection rule types and configuration (170 built-in rules across 50+ countries)
- Smart Scanning Modes — Full, Incremental, Smart, and Bootstrap scan modes
- PII Detection Engine — Multi-stage detection pipeline architecture
- Detection-as-Code — YAML-based classifier definitions
- LLM Assist — AI-powered false positive reduction
- Event-Driven Scanning — Real-time scan triggers