
Parallel Scanning Engine

Slim.io’s parallel scanning engine distributes file processing across multiple workers to achieve high throughput on large storage volumes. This page covers the architecture, chunk partitioning strategy, and deployment model.

Architecture

The parallel scanning engine uses a coordinator-worker pattern:

Scan Request
  → Coordinator
  → Partition files into chunks
  → Spawn worker jobs
      → Worker 1: Process chunk 1 (files 0–999)
      → Worker 2: Process chunk 2 (files 1000–1999)
      → Worker N: Process chunk N
  → Aggregate results
  → Scan Complete

Coordinator

The coordinator is responsible for:

  • Listing all files in the connector scope
  • Applying file filters (prefix, extension, size)
  • Partitioning the file list into balanced chunks
  • Spawning worker jobs on the target compute platform
  • Tracking worker progress and handling failures
  • Aggregating findings into the final scan result
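The coordinator's responsibilities above can be sketched as a simple orchestration loop. This is a minimal illustration, not the Slim.io implementation: `run_worker` is a stand-in (real workers are isolated container jobs, not threads), and the helper names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def run_worker(chunk):
    """Stand-in worker: pretend any path containing 'pii' yields a finding."""
    return [path for path in chunk if "pii" in path]

def coordinate_scan(paths, max_workers=4, chunk_size=1000):
    # Partition the filtered file list into fixed-size chunks
    chunks = [paths[i:i + chunk_size] for i in range(0, len(paths), chunk_size)]
    findings = []
    # Spawn workers (threads here; real workers are container jobs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk_findings in pool.map(run_worker, chunks):
            findings.extend(chunk_findings)  # aggregate in chunk order
    return findings
```

`pool.map` yields results in submission order, so aggregation is deterministic even though chunks complete out of order.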

Workers

Each worker is an isolated container that:

  • Receives a chunk of file paths to process
  • Downloads files from cloud storage using the connector’s credentials
  • Runs each file through the detection pipeline
  • Persists findings to the data store
  • Reports progress back to the coordinator
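A worker's per-chunk loop might look like the following sketch. All four helpers (`download_file`, `detect_pii`, `persist_finding`, `report_progress`) are illustrative stand-ins, not Slim.io APIs; they are passed in as parameters to keep the sketch self-contained.

```python
def process_chunk(chunk, download_file, detect_pii, persist_finding, report_progress):
    """Process one chunk of file paths; returns the number of files processed."""
    processed = 0
    for path in chunk:
        content = download_file(path)           # fetch via connector credentials
        for finding in detect_pii(content):     # run the detection pipeline
            persist_finding(path, finding)      # write to the data store
        processed += 1
        report_progress(processed, len(chunk))  # report back to the coordinator
    return processed
```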

Work Partitioning

Resources are automatically partitioned into balanced assignments for parallel processing. The partitioning algorithm considers:

  • Resource count — Assignments are sized for balanced worker utilization
  • Aggregate data volume — Each worker receives a roughly equal share of total bytes to process
  • Size distribution — The platform uses a load-balanced assignment strategy that accounts for individual resource sizes. A mix of large and small files is distributed so no single worker gets disproportionately more data than others.
  • Skew detection — When one resource dominates the total data volume (e.g., a single multi-GB file among thousands of small files), the platform automatically adjusts parallelism to avoid leaving workers idle
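One common way to produce size-balanced assignments is a greedy longest-first strategy: sort resources by size descending and always assign the next resource to the least-loaded worker. The sketch below illustrates that idea only; the actual partitioner (and its skew-detection logic) is not specified here.

```python
import heapq

def partition_by_size(files, num_workers):
    """files: list of (path, size_bytes). Returns num_workers lists of paths,
    with roughly balanced total bytes per worker."""
    chunks = [[] for _ in range(num_workers)]
    # Min-heap of (total_bytes_assigned, worker_index)
    heap = [(0, i) for i in range(num_workers)]
    heapq.heapify(heap)
    # Assign largest files first so small files fill in the gaps
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        total, idx = heapq.heappop(heap)
        chunks[idx].append(path)
        heapq.heappush(heap, (total + size, idx))
    return chunks
```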

Compute Backends

Workers can execute on multiple compute platforms depending on the deployment model:

Platform              Managed By        Use Case
Cloud Run Jobs        Slim.io (SaaS)    Default for hosted deployments
AWS Lambda            Customer (BYOC)   AWS-native BYOC deployments
Azure Container Apps  Customer (BYOC)   Azure-native BYOC deployments
Kubernetes            Customer (BYOC)   Multi-cloud or on-premises BYOC

In SaaS mode, workers run on Slim.io’s infrastructure and scan results are stored securely in Slim.io’s data store. In BYOC mode, workers run inside the customer’s VPC and findings can be pushed to Slim.io or retained locally.

Pre-Screen Filter

Before running the full detection pipeline, each file undergoes a probabilistic pre-screen:

  1. A small sample of the file content is extracted from strategic positions
  2. The sample is checked against a probabilistic filter trained on PII-indicative patterns
  3. Files that fail the filter check (no matches) are skipped entirely
  4. Files that pass are sent to the full classifier pipeline
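The flow above can be sketched as follows. Note the hedges: the real pre-screen uses a trained probabilistic filter with a no-false-negative guarantee; the regex here is only a stand-in for the filter check and does not carry that guarantee, and the sample positions and sizes are illustrative.

```python
import re

# Stand-in for the trained probabilistic filter: SSN-like digit runs or '@'
PII_HINTS = re.compile(rb"\d{3}[- ]?\d{2}[- ]?\d{4}|@")

def pre_screen(data: bytes, sample_size: int = 4096) -> bool:
    """Return True if the file should go to the full classifier pipeline."""
    # 1. Sample content from strategic positions: head, middle, tail
    mid = max(0, len(data) // 2 - sample_size // 2)
    sample = data[:sample_size] + data[mid:mid + sample_size] + data[-sample_size:]
    # 2-4. No matches => skip the file; any match => full pipeline
    return PII_HINTS.search(sample) is not None
```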

This pre-screen eliminates the majority of non-sensitive files without false negatives — the filter guarantees that a file containing PII will always pass through to the detection pipeline.

Worker Scaling

Worker count is determined by:

  • Tier Limits — Maximum concurrent workers allowed by your subscription tier
  • File Volume — More files spawn more workers, up to the tier limit
  • Coordinator Heuristic — The coordinator estimates optimal worker count based on average file size and historical processing rates
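A heuristic combining these three inputs might look like the sketch below. Every constant and name here is an assumption for illustration; the coordinator's actual formula (and its use of historical processing rates) is not published.

```python
def estimate_worker_count(file_count, total_bytes, tier_limit,
                          files_per_worker=1000,
                          bytes_per_worker=5 * 2**30):
    """Pick a worker count from file volume, clamped to the tier limit."""
    by_count = -(-file_count // files_per_worker)   # ceil division
    by_bytes = -(-total_bytes // bytes_per_worker)  # ceil division
    # Take the larger demand, then clamp to [1, tier_limit]
    return max(1, min(tier_limit, max(by_count, by_bytes)))
```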

Auto-Scaling Behavior

The coordinator automatically adjusts worker count based on file volume and tier limits. Small scans run on fewer workers to minimize overhead; large scans scale up to the maximum available for your plan.

Pre-Scan Cost Estimation

Before dispatching workers, the coordinator estimates the total scan cost based on:

  • Resource count — Number of files or objects to scan
  • Total data volume — Aggregate size of all resources
  • Connector type — Different providers have different API pricing (e.g., object storage reads vs. data warehouse queries)

If the estimated cost exceeds a configurable ceiling, the scan is blocked before any compute is consumed. This prevents cost surprises on unexpectedly large data sources.
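A minimal sketch of that ceiling check follows. The per-provider rates and the `ScanCostExceeded` exception name are invented for illustration; real pricing varies by provider and is not specified here.

```python
# Illustrative per-provider rates (USD); not actual cloud pricing
RATE_PER_1K_READS = {"object_storage": 0.0004, "data_warehouse": 0.005}
RATE_PER_GB = {"object_storage": 0.00, "data_warehouse": 0.01}

class ScanCostExceeded(Exception):
    pass

def check_scan_cost(resource_count, total_bytes, connector_type, ceiling_usd):
    """Estimate scan cost; raise before any compute is consumed if over ceiling."""
    gb = total_bytes / 2**30
    cost = (resource_count / 1000) * RATE_PER_1K_READS[connector_type] \
         + gb * RATE_PER_GB[connector_type]
    if cost > ceiling_usd:
        raise ScanCostExceeded(f"estimated ${cost:.2f} exceeds ceiling ${ceiling_usd:.2f}")
    return cost
```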

Failure Handling

The parallel engine handles failures at multiple levels:

  • Resource-Level Retry — If a single file fails to download or parse, the worker retries with exponential backoff before marking it as failed. Findings written before the failure are preserved.
  • Worker-Level Recovery — If a worker crashes or is terminated, the platform detects the failure via periodic health checks. The failed worker’s assignment can be retried, and idempotent finding writes ensure no duplicates are created on retry.
  • Scan-Level Timeout — Scans have a configurable timeout. If exceeded, remaining work is cancelled and partial results are preserved with an accurate coverage report.
  • Credential Refresh — Long-running workers automatically refresh cloud provider credentials before they expire, preventing mid-scan authentication failures.
  • Graceful Cancellation — When a scan is cancelled, workers complete their current resource, persist findings and progress, then exit cleanly. The scan result includes all findings discovered before cancellation.
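The resource-level retry described above is a standard exponential-backoff pattern. The attempt count and delays below are illustrative, not Slim.io defaults; `sleep` is injectable so the backoff schedule can be observed without waiting.

```python
import time

def with_retries(process, path, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run process(path), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return process(path)
        except Exception:
            if attempt == attempts - 1:
                raise                          # out of retries: mark resource failed
            sleep(base_delay * 2 ** attempt)   # back off: 1s, 2s, 4s, ...
```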

Deployment Configuration

For BYOC deployments, the scan deployment configuration is available via the API:

GET /api/v1/scans/:id/deployment-config

This returns the worker container image, environment variables, and chunk assignments needed to run workers in your own infrastructure.
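A minimal client for this endpoint might look like the following. The base URL and the Bearer-token authentication scheme are assumptions about a typical setup, not documented Slim.io requirements.

```python
import json
import urllib.request

def get_deployment_config(base_url, scan_id, token):
    """Fetch the worker image, env vars, and chunk assignments for a scan."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/scans/{scan_id}/deployment-config",
        headers={"Authorization": f"Bearer {token}"},  # auth scheme is an assumption
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```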
