Scanning & Detection Overview

Scanning is the core capability of Slim.io. The platform discovers and classifies sensitive data in cloud storage by processing files through a multi-layered detection pipeline. This section covers how scans work, the detection engine architecture, and the configuration options available.

Scan Types

Type	Trigger	Scope	Use Case
Full Scan	Manual or scheduled	All files in the connector scope	Initial baseline, periodic re-scan
Incremental Scan	Manual or scheduled	Files modified since the last scan	Ongoing monitoring with lower cost
Event-Driven Scan	Cloud storage event	Single file or batch	Real-time detection on new uploads
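
The difference in scope between a full and an incremental scan can be pictured as a filter on file modification time. The sketch below is illustrative only: `FileEntry`, `select_files`, and the `last_scan_at` parameter are hypothetical names rather than Slim.io APIs, and falling back to a full selection when no previous scan exists is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical record for one object in the connector scope."""
    path: str
    modified_at: datetime


def select_files(
    files: Iterable[FileEntry],
    scan_type: str,
    last_scan_at: Optional[datetime] = None,
) -> List[FileEntry]:
    """Return the files a scan of the given type would process."""
    if scan_type == "full":
        return list(files)
    if scan_type == "incremental":
        if last_scan_at is None:
            # Assumption: with no earlier scan to compare against, scan everything.
            return list(files)
        # Keep only files modified strictly after the last scan started.
        return [f for f in files if f.modified_at > last_scan_at]
    raise ValueError(f"unknown scan type: {scan_type}")
```

An event-driven scan would skip this selection step, because the storage event already names the file or batch to process.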

Scan Lifecycle

Every scan progresses through a defined lifecycle:


Created → Queued → Running → Completed
                       ↓
                  Failed / Cancelled

Created — Scan job is initialized
Queued — Waiting for available workers within tier limits
Running — Workers are actively processing files
Completed — All files processed, findings stored
Failed — Unrecoverable error (e.g., credential issue, infrastructure failure)
Cancelled — Manually stopped by the user
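
The lifecycle above behaves like a small state machine. The following is a minimal sketch that encodes only the transitions drawn in the diagram; the names `ScanState`, `ALLOWED`, `advance`, and `is_terminal` are hypothetical, and whether a scan can also be cancelled while still queued is not stated here, so that transition is left out.

```python
from enum import Enum


class ScanState(Enum):
    CREATED = "Created"
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Transitions taken from the lifecycle diagram; terminal states have no successors.
ALLOWED = {
    ScanState.CREATED: {ScanState.QUEUED},
    ScanState.QUEUED: {ScanState.RUNNING},
    ScanState.RUNNING: {ScanState.COMPLETED, ScanState.FAILED, ScanState.CANCELLED},
    ScanState.COMPLETED: set(),
    ScanState.FAILED: set(),
    ScanState.CANCELLED: set(),
}


def advance(current: ScanState, target: ScanState) -> ScanState:
    """Return the new state, or raise if the diagram does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_terminal(state: ScanState) -> bool:
    """A state is terminal when no further transition leaves it."""
    return not ALLOWED[state]
```

When polling scan status, a client can use a check like `is_terminal` to decide when to stop polling.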

Detection Pipeline

Each file passes through a multi-stage detection pipeline:

1. Probabilistic Pre-Screen: A fast statistical check that eliminates files unlikely to contain sensitive data, so that scan effort goes to the highest-value targets.
2. Classifier Execution: Files that pass the pre-screen are analyzed by all active classifiers (pattern, dictionary, proximity, checksum, and ML-assisted).
3. Confidence Scoring: Each match receives a confidence score based on classifier type, pattern specificity, and contextual signals from surrounding fields.
4. AI Disambiguation (Optional): Findings whose confidence falls in a configurable ambiguous range are escalated to a multi-provider AI pipeline with automatic failover, which adjudicates the final classification.
5. Deduplication: Overlapping findings from multiple classifiers are merged into a single canonical finding.
6. Finding Storage: Final findings are persisted with full provenance metadata, including detection method and classifier version.
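
The routing into AI disambiguation and the deduplication stage can be sketched as follows. This is an illustration under stated assumptions, not Slim.io's implementation: the `Finding` fields, the 0.4 to 0.7 ambiguous range, and the merge rule (same label, overlapping span, keep the highest confidence) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding: a labeled character span with a confidence score."""
    label: str
    start: int
    end: int  # exclusive
    confidence: float
    classifiers: Tuple[str, ...]


def needs_disambiguation(f: Finding, low: float = 0.4, high: float = 0.7) -> bool:
    """True when the score falls inside the (assumed) ambiguous range [low, high)."""
    return low <= f.confidence < high


def deduplicate(findings: List[Finding]) -> List[Finding]:
    """Merge overlapping findings that share a label into one canonical finding."""
    merged: List[Finding] = []
    for f in sorted(findings, key=lambda x: (x.label, x.start, x.end)):
        prev = merged[-1] if merged else None
        if prev is not None and prev.label == f.label and f.start < prev.end:
            # Overlap: widen the span, keep the best score, record every classifier.
            merged[-1] = Finding(
                label=prev.label,
                start=prev.start,
                end=max(prev.end, f.end),
                confidence=max(prev.confidence, f.confidence),
                classifiers=tuple(sorted(set(prev.classifiers + f.classifiers))),
            )
        else:
            merged.append(f)
    return merged
```

Keeping the contributing classifier names on the merged finding is one way to preserve the provenance metadata that the storage stage records.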

The probabilistic pre-screen reduces processing time and cost by skipping low-value files before the full classifier stack is invoked, without compromising recall on files that are likely to contain sensitive data.
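
This page does not say which statistic the pre-screen uses. Purely to illustrate the skip-or-scan decision, the sketch below substitutes a simple heuristic: it counts PII-like hints in a leading sample of the file and compares their density to a threshold. The hint pattern, the `should_scan` name, and the `min_hits_per_kb` threshold are all hypothetical.

```python
import re

# Hypothetical hint pattern: a run of at least 7 characters made of digits,
# dashes, or spaces that starts and ends with a digit, or an "@" sign.
_HINT = re.compile(rb"\d[\d\- ]{5,}\d|@")


def should_scan(sample: bytes, min_hits_per_kb: float = 0.5) -> bool:
    """Decide from a leading sample whether to run the full classifier stack."""
    if not sample:
        return False
    hits = len(_HINT.findall(sample))
    # Normalize by sample size so short and long samples are comparable.
    return hits / (len(sample) / 1024) >= min_hits_per_kb
```

A real pre-screen would need to be tuned so that files likely to contain sensitive data are not skipped; a heuristic this crude is only a stand-in for that idea.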

Supported File Formats

Slim.io can process a wide range of file formats:

Category	Formats
Structured Data	CSV, TSV, JSON, JSONL, Parquet, Avro, ORC
Modern Office	DOCX, DOCM, DOTX, DOTM, XLSX, XLSM, XLTX, XLTM, XLSB, PPTX, PPTM, POTX, POTM, PPSX, PPSM
OpenDocument	ODT, ODS, ODP, ODG, ODF, OTT, OTS, OTP
Legacy Office	DOC, DOT, XLS, XLT, XLA, PPT, POT, PPS
Documents	PDF, TXT, RTF
Email Archives	EML, MBOX, MSG (Outlook)
Configuration	YAML, TOML, XML, INI, ENV
Logs	Plain text, structured JSON logs
Archives	ZIP, GZIP, TAR, TAR.GZ (contents extracted and scanned, with recursion safety limits)

Large files are scanned by streaming. Multi-GB Parquet and ORC files are processed with bounded memory usage: the platform reads only the file footer and the column ranges needed for sensitive-data detection. Large compressed archives spill to disk transparently, so a multi-GB ZIP does not exceed scanner memory. The platform imposes no hard “file too large” cap; per-tenant size policies are configurable in Settings → Scanner.
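
Archive extraction with a recursion limit and spill-to-disk buffering could look roughly like the sketch below. It is an illustration using the Python standard library, not the platform's code: `MAX_DEPTH`, `SPILL_THRESHOLD`, and `walk_zip` are hypothetical, and only ZIP is handled.

```python
import io
import shutil
import tempfile
import zipfile
from typing import BinaryIO, Iterator, Tuple

MAX_DEPTH = 3                       # hypothetical recursion safety limit
SPILL_THRESHOLD = 8 * 1024 * 1024   # hypothetical: buffer up to 8 MiB in memory


def _buffer_for(size: int) -> BinaryIO:
    """Small members stay in memory; larger ones go to a temp file on disk."""
    return io.BytesIO() if size <= SPILL_THRESHOLD else tempfile.TemporaryFile()


def walk_zip(
    stream: BinaryIO, depth: int = 0, prefix: str = ""
) -> Iterator[Tuple[str, int]]:
    """Yield (path, size) for each non-archive member, descending into nested ZIPs."""
    if depth > MAX_DEPTH:
        raise ValueError(f"archive nesting exceeds {MAX_DEPTH} levels at {prefix!r}")
    with zipfile.ZipFile(stream) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = f"{prefix}{info.filename}"
            if info.filename.lower().endswith(".zip"):
                # This sketch trusts the declared size; a hardened scanner
                # would also cap the number of bytes actually copied.
                with zf.open(info) as member, _buffer_for(info.file_size) as buf:
                    shutil.copyfileobj(member, buf)
                    buf.seek(0)
                    yield from walk_zip(buf, depth + 1, prefix=path + "!/")
            else:
                yield path, info.file_size
```

Because `walk_zip` is a generator, members are visited one at a time and each nested buffer is released as soon as its contents have been walked.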

Scan Capacity

Scan capacity is provisioned per tenant based on your subscription agreement. Specific limits (concurrent scans, file counts, and worker parallelism) are configured by your customer success representative during onboarding and are visible in Settings → Scanner → Limits in the Customer Dashboard. Contact your representative if you need additional capacity for a specific workload.

Learn More

Scan Management — Starting, controlling, and monitoring scans
Parallel Scanning Engine — Distributed scan architecture
Scanner Fleet — Agentless and connector-based scanning infrastructure
Classifiers — Detection rule types and configuration (170 built-in rules across 50+ countries)
Smart Scanning Modes — Full, Incremental, Smart, and Bootstrap scan modes
PII Detection Engine — Multi-stage detection pipeline architecture
Detection-as-Code — YAML-based classifier definitions
LLM Assist — AI-powered false positive reduction
Event-Driven Scanning — Real-time scan triggers