Scanner Release Notes

Release history for the slim.io scanner agent. Each entry includes the version number, release date, change type, and a summary of changes.

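slim.io-hosted scanners can be updated from the Scanner Fleet page in your dashboard. BYOC scanners: pull the image for the latest version from Docker Hub (slimio/scanner:{version}-{profile}). Images are published with pinned version tags only, so there is no :latest tag to pull.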
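As an illustration of the tag format, the sketch below composes a pinned image reference from a version and a profile. The helper name and its validation are hypothetical; the profile names are the four listed in the 0.2.0 entry.

```python
# Hypothetical helper, not part of the scanner: builds a Docker Hub reference
# in the documented slimio/scanner:{version}-{profile} format.
PROFILES = {"full", "cloud-storage", "database", "saas"}  # from the 0.2.0 entry


def image_ref(version: str, profile: str) -> str:
    """Return the pinned image reference for a version and profile."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return f"slimio/scanner:{version}-{profile}"


print(image_ref("0.5.0", "cloud-storage"))  # slimio/scanner:0.5.0-cloud-storage
```

The resulting string is what you would pass to docker pull or reference in a Compose file or Kubernetes manifest.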

0.5.0 — May 17, 2026

Type: Minor
Availability: Available on request. Contact your customer success representative for image availability and per-tenant rollout.

Added — File format coverage

Email archives — Scanner now opens .eml, .mbox, and Outlook .msg files end-to-end. Headers, body parts (text + HTML), and attachments are all scanned for sensitive data. Each attachment is unpacked and routed to the appropriate scan path automatically (a PDF attached to an .eml is scanned the same way as a standalone PDF).
Legacy Microsoft Office — Native support for the pre-2007 binary formats: .doc, .dot (Word 97-2003), .xls, .xlt, .xla (Excel 97-2003), and .ppt, .pot, .pps (PowerPoint 97-2003). Customers with archived business documents in these formats no longer need to pre-convert them before scanning for PII.
Expanded Office family — Full Office Open XML (macro-enabled DOCM/XLSM/PPTM, templates DOTX/DOTM/XLTX/XLTM/POTX/POTM, slideshows PPSX/PPSM, binary workbook XLSB) and full OpenDocument family (ODT/ODS/ODP/ODG plus their templates) are now first-class. The previous “treated as opaque ZIP” fallback for these formats is gone — they get the same structured parsing as DOCX/XLSX/PPTX.
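To make the email handling above concrete, here is a minimal sketch of walking an .eml message with Python's standard email package and separating headers, body parts, and attachments. It is not the scanner's implementation: the function name and the (kind, name, payload) tuple shape are assumptions for illustration.

```python
from email import message_from_bytes, policy


def iter_scannable_parts(raw_eml: bytes):
    """Yield (kind, name, payload_bytes) for headers, body parts, attachments."""
    msg = message_from_bytes(raw_eml, policy=policy.default)

    # Headers are scanned as one text blob.
    headers = "\n".join(f"{name}: {value}" for name, value in msg.items())
    yield ("headers", None, headers.encode("utf-8"))

    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        if part.is_attachment():
            # The filename lets a caller route the bytes to the matching scan
            # path, for example a PDF parser for "report.pdf".
            yield ("attachment", part.get_filename(), payload)
        else:
            yield ("body", part.get_content_type(), payload)
```

A caller would dispatch each attachment by its filename or detected type, which is how a PDF inside an .eml can end up on the same path as a standalone PDF.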

Added — Large-file scanning

Streaming structured-file reads — Multi-GB Parquet and ORC files are now scanned with bounded memory usage. The scanner reads file footers and only the column ranges needed for sensitive-data detection — it never holds the whole file in memory. A 100 GB Parquet table no longer requires 100 GB of scanner RAM.
Disk-spill for large archives — Compressed archives larger than a configurable threshold automatically spill to a local temporary file and open from disk via random-access I/O. Multi-GB ZIP / TAR / TAR.GZ archives scan without exhausting scanner memory.
No platform “file too large” cap — The previous hardcoded 1 GiB cap has been removed. Per-tenant size policies remain configurable in Settings → Scanner (Limits → Per-file cap); the default is “no policy cap” and your customer success representative can adjust this per your environment.
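The disk-spill behavior can be sketched as follows: buffer an incoming archive in memory until it crosses a threshold, then continue in a temporary file that supports random access. The function name, the threshold value, and the chunk size are assumptions; the notes do not state the scanner's actual defaults.

```python
import io
import tempfile


def spill_if_large(stream, threshold, chunk_size=1024 * 1024):
    """Copy `stream` into a seekable buffer, moving to disk past `threshold`.

    Returns (buffer, spilled), where `spilled` is True if a temp file was used.
    """
    buf = io.BytesIO()
    spilled = False
    while chunk := stream.read(chunk_size):
        if not spilled and buf.tell() + len(chunk) > threshold:
            disk = tempfile.TemporaryFile()  # removed automatically on close
            disk.write(buf.getvalue())       # carry over what was buffered
            buf, spilled = disk, True
        buf.write(chunk)
    buf.seek(0)
    return buf, spilled
```

Because the returned object is seekable either way, it can be handed directly to zipfile.ZipFile, which reads the central directory and individual members without loading the whole archive into memory.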

Added — Self-serve config (Settings → Scanner)

Format allowlist — Customers can choose exactly which file types the scanner should process. Useful for cost control (e.g., select only Parquet to skip OOXML / archive / email parsing on a tenant where those formats aren’t sensitive-data-bearing). Empty list = scan everything (default).
Per-format size caps + tuning knobs — New inputs for per-file size threshold, archive disk-spill threshold, Parquet row-group batch size, and minimum string length for byte-level extraction. Each input renders with a clear “what it does” hint and a sensible default so you can leave them alone unless you have a specific reason to tune.
Bespoke forms replace JSON editing — Previously the binary-handling and recursion policies were edited as raw JSON. They now have dedicated form UIs with type-safe inputs, dropdowns, checkbox grids for the format allowlist, and “Unlimited” toggles for nullable caps.
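The allowlist semantics (an empty list means scan everything) are easy to get backwards, so here is a small sketch. It matches on file extension purely for illustration; the function name is hypothetical, and the scanner may well match on detected format instead.

```python
from pathlib import PurePosixPath


def should_scan(path: str, allowlist: list[str]) -> bool:
    """Return True if `path` passes the format allowlist."""
    if not allowlist:
        return True  # empty allowlist is the default: scan everything
    ext = PurePosixPath(path).suffix.lower().lstrip(".")
    return ext in {entry.lower() for entry in allowlist}


print(should_scan("warehouse/events.parquet", ["parquet"]))  # True
print(should_scan("mail/inbox.mbox", ["parquet"]))           # False
print(should_scan("mail/inbox.mbox", []))                    # True
```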

Added — Scan-result visibility

Scan-path filter chips on the Scan Detail page — Findings now carry a tag indicating which type of file produced them (structured columnar / archive / email / legacy office). Filter chips at the top of the findings list let you focus on one type at a time. Each chip shows a count badge so you can see the distribution at a glance.
Archive-truncation banner — If a deeply nested archive (or one with a very high file count or uncompressed size) trips the platform’s safety limits, the Scan Detail page now shows a high-visibility banner explaining exactly what was cut off, why, and how to re-scan with higher limits. Previously this information was visible only in the audit logs — now it’s surfaced where you need it most.

Improved

Encoding detection — UTF-16 files (Windows event logs, .ini exports, .reg files) and non-ASCII content are detected more reliably. False positives on Windows-1252 / ISO-8859-1 content have been substantially reduced.
Archive recursion safety — The recursion engine that unwraps nested archives (a ZIP holding a TAR holding a DOCX) is now iterative rather than recursive, so deeply nested archives don’t risk scanner crashes from stack growth.
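The difference between recursive and iterative unwrapping can be sketched with an explicit stack. In this illustration a nested dict stands in for an archive and bytes stand in for a leaf file; the function name and the max_depth default are assumptions, not the scanner's actual limits.

```python
def unwrap(root, max_depth=8):
    """Walk nested containers iteratively; return (leaves, truncated_paths)."""
    leaves, truncated = [], []
    stack = [("", root, 0)]  # (path, node, depth); replaces the call stack
    while stack:
        path, node, depth = stack.pop()
        if not isinstance(node, dict):
            leaves.append((path, node))   # a regular file: hand it to a scanner
        elif depth >= max_depth:
            truncated.append(path)        # safety limit hit: record, do not open
        else:
            for name, child in node.items():
                stack.append((f"{path}/{name}", child, depth + 1))
    return leaves, truncated


nested = {"outer.zip": {"inner.tar": {"report.docx": b"doc"}}}
print(unwrap(nested))
# ([('/outer.zip/inner.tar/report.docx', b'doc')], [])
```

Since the pending work lives in a heap-allocated list, nesting depth is limited by the configured cap rather than by the interpreter's call stack.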

Changed

STANDARD and DEEP profile defaults — Recursion-depth, uncompressed-byte, and file-count defaults have been raised for the STANDARD and DEEP profiles to handle enterprise-scale archive workloads. LIGHT profile defaults are unchanged.

Migration Notes

No customer action required. All v0.5.0 features are backward-compatible — existing scans continue to work without configuration changes.
The new file format support is automatic. The next time you scan a connector that contains email or legacy Office files, they will be processed.
The new Settings → Scanner controls have safe defaults. You can leave them alone unless you specifically want to restrict format coverage or tune for unusual file profiles.
BYOC customers: pull the latest scanner image to receive the new formats. v0.4.0 scanners continue to work but won’t scan the new format types.

0.4.0 — April 27, 2026

Type: Minor
Availability: Available on request. Contact your customer success representative for image availability and per-tenant rollout.

Added

Real-time work delivery — Scanners now receive scan tasks over a long-lived connection rather than polling for work. New scans dispatch in seconds rather than waiting for the next poll cycle. Throughput on large multi-connector fleets is bounded by the slim.io control plane’s capacity, not the scanner’s poll interval.
Automatic reconnect with backoff — If the connection to the control plane drops (transient network blip, slim.io platform maintenance), the scanner reconnects automatically with exponential backoff. The retry budget is unbounded — a network outage in your VPC will not permanently disconnect the scanner. A reconnect counter is exposed for monitoring.
Duplicate-task prevention — On a reconnect, the control plane may re-deliver tasks that were in flight at disconnect time. The scanner now drops duplicates locally so the same task never executes twice, eliminating duplicated work and double-counted findings during platform updates.
Graceful drain coordination — When slim.io rolls out a platform update, scanners receive a clean reconnect signal with random jitter (1–5 seconds) so a fleet doesn’t reconnect at the same instant. Active scans continue uninterrupted.
Hosting topology declared at registration — The scanner now registers with its hosting topology (BYOC, In-Customer-Cloud Agentless), so the topology shown in your Customer Dashboard reflects where the scanner is actually running.
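The reconnect behavior described in this list can be sketched as below: exponential backoff with an unbounded retry budget, a reconnect counter, and a separate 1-5 second jitter for drain signals. The base delay, the cap, the use of full jitter, and all function names are assumptions; only the 1-5 second drain jitter and the unbounded budget come from the notes.

```python
import random
import time

BASE_S, CAP_S = 1.0, 60.0  # assumed values; the notes do not state them


def backoff_delay(attempt: int) -> float:
    """Exponential backoff with full jitter, capped at CAP_S seconds."""
    ceiling = min(CAP_S, BASE_S * 2 ** min(attempt, 16))  # bound the exponent
    return random.uniform(0, ceiling)


def drain_jitter() -> float:
    """Random 1-5 s pause so a fleet does not reconnect at the same instant."""
    return random.uniform(1.0, 5.0)


def connect_with_retry(connect, sleep=time.sleep):
    """Retry forever; return (connection, reconnect_count) on success."""
    reconnects = 0
    while True:
        try:
            return connect(), reconnects
        except ConnectionError:
            sleep(backoff_delay(reconnects))
            reconnects += 1  # the counter a monitoring endpoint could expose
```

Passing sleep as a parameter keeps the loop testable without real waiting.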

Changed

Per-tenant feature flag — The new work-delivery path is gated by an opt-in tenant flag during the v0.4.0 rollout window. The legacy poll-based path remains the default until your tenant is enabled. Once all tenants are on the new path, the legacy code will be removed in a future release.

Migration Notes

v0.3.0 scanners continue to work without changes.
v0.4.0 binaries support both the legacy poll-based path (the default, used as the fallback until your tenant is enabled) and the new work-delivery path (enabled per tenant). Switching between them takes a single environment variable.
No customer action required for in-flight scans during the upgrade — the scanner finishes any active jobs before reconnecting on the new path.

0.3.0 — April 16, 2026

Type: Minor

Added

Parallel scanning engine — Scans are distributed across multiple workers for faster processing of large environments. Load-balanced work partitioning keeps workers evenly utilized, and skew detection automatically adjusts parallelism when one resource dominates the data volume.
Pre-scan cost estimate — Before a scan starts, the platform estimates compute cost from resource count, total data volume, and connector type. Scans that exceed your configured ceiling are blocked with a clear message and remediation options (reduce scope, switch to Light profile, or raise the ceiling).
Graceful cancellation — Cancelling a running scan now preserves all findings discovered up to the cancel point. Workers finish their current resource, persist findings, and exit cleanly. The coverage report shows exactly which resources were scanned and which were not reached.
Credential refresh for long-running scans — Connector session tokens are refreshed automatically in the background, so long scans no longer fail when cloud provider credentials expire mid-run.
Per-resource coverage truth — The scan completeness score is computed from actual per-resource outcomes, not estimates. Every resource is accounted for in the coverage report with one of: scanned, skipped (format / access / size), failed, or cancelled.
Decision audit trail — Every decision the platform makes during a scan (resource included, skipped, budget exceeded, coverage target met, circuit isolation triggered, resource truncated) is recorded in an immutable audit trail. Accessible from the Scan Detail page with filters by decision type, or exportable for compliance reporting.
Scanner-at-rest encryption — Scanner identity, credentials, and the local write buffer are encrypted with AES-256 on disk. Applies to both slim.io-hosted and BYOC deployments.
Log scrubbing — All scanner log output is scrubbed before it leaves the process. Removes PII (SSN, credit card, phone, email), cloud provider access keys, bearer tokens, and connection strings containing passwords. Applies to stdout, stderr, exception tracebacks, and structured JSON logs — with no bypass, even in debug mode.
Worker-level recovery — If a worker crashes or is terminated, the platform detects the failure via periodic health checks and retries the assignment. Idempotent finding writes ensure no duplicates on retry.
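One common way to get idempotent finding writes is to derive each finding's key deterministically from its content, so a retried assignment overwrites rather than duplicates. The sketch below assumes a key built from scan ID, resource, PII type, and offset; those fields, the class, and the in-memory store are all hypothetical.

```python
import hashlib


def finding_key(scan_id: str, resource: str, pii_type: str, offset: int) -> str:
    """Deterministic key: the same finding always hashes to the same ID."""
    raw = "\x1f".join([scan_id, resource, pii_type, str(offset)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class FindingStore:
    """In-memory stand-in for a findings table keyed by finding_key()."""

    def __init__(self):
        self._rows = {}

    def write(self, scan_id, resource, pii_type, offset):
        key = finding_key(scan_id, resource, pii_type, offset)
        # Upsert: a retry that rediscovers the finding replaces the same row.
        self._rows[key] = {"resource": resource, "pii_type": pii_type,
                           "offset": offset}
        return key

    def __len__(self):
        return len(self._rows)
```

If a worker crashes after writing some findings and the assignment is retried, replaying the same writes leaves the row count unchanged.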

Changed

Worker auto-scaling is now based on file volume and tier limits rather than a fixed worker count. Small scans use fewer workers to minimize overhead; large scans scale up to the maximum available for your plan.
Scans with successful enumeration but incomplete coverage transition to Partial (not Failed). Partial scans carry a first-class coverage report and their findings are fully usable in the Data Catalog.

Fixed

Scans on sources dominated by a single multi-GB file no longer idle additional workers while one worker processes the dominant resource.
Coverage score calculation no longer over-counts resources that were enumerated but failed during read.
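As a sketch of computing coverage from actual per-resource outcomes, the example below counts a resource as covered only if it was scanned, so a resource that was enumerated but failed during read does not inflate the score. The formula (scanned divided by total) and the function name are assumptions; the notes name the outcome categories but not the exact calculation.

```python
from collections import Counter

OUTCOMES = {"scanned", "skipped", "failed", "cancelled"}  # from the 0.3.0 entry


def coverage_report(outcomes: dict) -> dict:
    """Map {resource: outcome} to a score and per-outcome counts."""
    unknown = set(outcomes.values()) - OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    counts = Counter(outcomes.values())
    total = len(outcomes)
    # Only "scanned" counts toward coverage; an empty scan scores 0.0 here.
    score = counts["scanned"] / total if total else 0.0
    return {"score": score, "counts": dict(counts)}


report = coverage_report({
    "a.csv": "scanned", "b.parquet": "failed",
    "c.zip": "skipped", "d.json": "scanned",
})
print(report["score"])  # 0.5
```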

0.2.0 — April 3, 2026

Type: Minor

Added

Smart Scan mode — Risk-prioritized scanning with change detection. Fairness decay prevents low-risk resource starvation. Priority bucketing stabilizes ordering across runs.
Bootstrap Scan mode — Progressive onboarding for large environments. Three-stage flow: metadata enumeration, stratified priority sampling, and public exposure sweep. Memory-safe for millions of objects using heap-based selection.
Per-object scan metadata — Tracks last_scanned_at, last_verified_at, and scan_count per object for Smart mode reverification.
Reverification with severity weighting — Unchanged resources with prior findings are periodically re-scanned. High-severity findings (SSN, credit card) reverify faster than low-severity (email, phone). Jitter prevents thundering herd.
Scanner profiles — Four optimized image profiles: full, cloud-storage, database, saas. Each includes only the SDKs needed for its connector types.
Scan completeness score — 0.0 to 1.0 metric quantifying how thoroughly a scan covered the target data. Accounts for access denied, scan errors, and partial enumeration.
Scan Profiles (Light / Standard / Deep) — Controls scanning intensity and load on customer infrastructure.
Per-object findings drill-down — Click any scanned object to see paginated findings filtered by PII type, with masked evidence.
File type detection — Automatic format identification for CSV, JSON, Parquet, Excel, PDF, Avro, ORC, and plain text.
Scanner Fleet management — Deploy, monitor, and deregister scanners from the Customer Dashboard.
BYOC deployment — Docker Compose and Kubernetes manifests for self-hosted scanning. Scanner images on Docker Hub (slimio/scanner).
Connector health alerts — Configurable alerts for credential expiry, authentication failures, and health degradation.
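The heap-based selection mentioned for Bootstrap Scan can be illustrated with a bounded min-heap: stream over any number of objects while keeping only the k highest-priority ones in memory. The function name, the (priority, object) input shape, and the tie-breaking rule are assumptions for this sketch.

```python
import heapq


def top_k_by_priority(objects, k):
    """Return the k highest-priority objects from an iterable of (priority, obj).

    Memory stays O(k) regardless of how many objects are streamed through.
    """
    if k <= 0:
        return []
    heap = []  # min-heap: heap[0] is the weakest candidate currently kept
    for seq, (priority, obj) in enumerate(objects):
        item = (priority, -seq, obj)  # -seq: earlier objects win priority ties
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the weakest, keep the new one
    return [obj for _, _, obj in sorted(heap, reverse=True)]


sample = [(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.9, "d")]
print(top_k_by_priority(sample, 2))  # ['b', 'd']
```

Because the input can be a generator, the full object listing never needs to be materialized.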

Changed

Default scan type changed from Full to Smart (recommended for most use cases).
Scanner images now published with pinned version tags only. No :latest tag on Docker Hub.
Registration uses direct tenant subcollection queries instead of collection group queries for faster authentication.

Fixed

Fixed the scanner registration token TTL check for timezone-naive datetime values.
Tenant identifier validation no longer rejects uppercase characters during scanner registration.
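For context on the TTL fix: in Python, comparing a timezone-naive datetime with a timezone-aware one raises TypeError, which is a common source of this class of bug. The notes do not describe the actual defect or fix, so the sketch below is only one plausible approach. Treating naive values as UTC is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def token_expired(issued_at: datetime, ttl: timedelta, now=None) -> bool:
    """Return True once `ttl` has elapsed since `issued_at`."""
    if issued_at.tzinfo is None:
        # Assumption: naive timestamps were written in UTC.
        issued_at = issued_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now >= issued_at + ttl


fixed_now = datetime(2026, 4, 3, 12, 0, tzinfo=timezone.utc)
naive_issue = datetime(2026, 4, 3, 11, 0)  # no tzinfo attached
print(token_expired(naive_issue, timedelta(minutes=30), now=fixed_now))  # True
print(token_expired(naive_issue, timedelta(hours=2), now=fixed_now))     # False
```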

0.1.0 — March 15, 2026

Type: Initial Release

Added

Core scanning engine with 170 PII classifiers across 4 detection tiers.
Support for 17 connector types, including AWS S3, GCP Cloud Storage, Azure Blob, PostgreSQL, MySQL, MSSQL, Oracle, DB2, Snowflake, Databricks, Slack, Teams, Salesforce, Google Drive, OneDrive, and SharePoint.
Agentless cloud scanning with streaming downloads (4 MB default chunk size).
Database scanning with server-side cursors and adaptive sampling.
SaaS scanning via provider APIs with incremental change detection.
Budget enforcement: time, bytes, resources, and findings limits.
Finding feedback (confirm/reject) for classifier calibration.
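Streaming downloads and byte-budget enforcement fit together naturally, as this sketch shows: read a source in fixed-size chunks, pass each chunk to a scan callback, and stop once the byte budget is spent. The function name and callback shape are hypothetical, and interpreting "4 MB" as 4 * 1024 * 1024 bytes is an assumption.

```python
import io

CHUNK_SIZE = 4 * 1024 * 1024  # "4 MB default chunk size"; binary MB assumed


def scan_stream(stream, scan_chunk, max_bytes=None, chunk_size=CHUNK_SIZE):
    """Feed `stream` to `scan_chunk` chunk by chunk.

    Returns (bytes_read, budget_reached). budget_reached is True when reading
    stopped because max_bytes was hit, even if the stream had no more data.
    """
    total = 0
    while True:
        want = chunk_size if max_bytes is None else min(chunk_size, max_bytes - total)
        if want <= 0:
            return total, True
        chunk = stream.read(want)
        if not chunk:
            return total, False
        scan_chunk(chunk)
        total += len(chunk)


seen = []
print(scan_stream(io.BytesIO(b"0123456789"), seen.append, max_bytes=6, chunk_size=4))
# (6, True)
```

A production detector would also need to handle matches that straddle a chunk boundary, for example by carrying a small overlap between chunks; that is omitted here.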