
PII Detection Engine

The PII detection engine is the core analysis pipeline that identifies sensitive data within scanned content. It uses a multi-stage architecture that balances speed, accuracy, and cost by progressively applying more expensive analysis only when needed.

Detection Pipeline Overview

Every piece of content passes through multiple stages in sequence. Each stage acts as a filter — only content that passes the previous stage is evaluated by the next, more expensive stage.

Content In
 → Pre-Screen (probabilistic filter — eliminates non-sensitive content)
 → Pattern Matching (all active classifier patterns)
 → Validation (format verification — checksums, length rules)
 → Context Analysis (surrounding text, support/negative keywords)
 → AI Review (ambiguous findings only)
 → Findings Out
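The stage-gating described above can be sketched as a chain of filter functions, where each stage either passes content along or stops processing so later, more expensive stages never run. The stage names come from this page; the function signatures and the toy stage bodies are illustrative, not the product's API:

```python
from typing import Callable, Optional

# Each stage returns the content (possibly annotated) to continue,
# or None to stop the pipeline early.
Stage = Callable[[str], Optional[str]]

def run_pipeline(content: str, stages: list[Stage]) -> Optional[str]:
    """Apply stages in order; bail out as soon as one rejects the content."""
    for stage in stages:
        result = stage(content)
        if result is None:
            return None  # filtered out; later stages never execute
        content = result
    return content

# Illustrative stages: only the cheap pre-screen runs on non-sensitive text.
def pre_screen(content: str) -> Optional[str]:
    return content if any(ch.isdigit() for ch in content) else None

def pattern_match(content: str) -> Optional[str]:
    return content  # placeholder for classifier execution
```

The early return is what keeps per-file cost proportional to how suspicious the content actually is.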

Pre-Screen

A probabilistic filter that quickly eliminates content unlikely to contain sensitive data. Content that passes the filter proceeds to pattern matching. Content that fails is skipped, avoiding the cost of running classifiers against non-sensitive files.

  • Zero false negatives — Content containing PII will always pass the filter
  • High true negative rate — Eliminates the majority of non-sensitive content in a typical scan
  • Cost — Negligible; runs in microseconds per file
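The zero-false-negative, high-true-negative profile described above is characteristic of probabilistic set structures such as Bloom filters. The docs do not name the structure actually used, but a minimal Bloom-style filter illustrates the guarantee:

```python
import hashlib

class BloomFilter:
    """Probabilistic set: a 'no' answer is always correct, a 'yes' may be
    a false positive. This mirrors the pre-screen guarantee: content that
    was added (i.e., contains PII signals) is never missed."""

    def __init__(self, size_bits: int = 1024, hashes: int = 3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # a plain int used as a bit array

    def _positions(self, item: str):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # All bits set => possibly present; any bit clear => definitely absent.
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```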

Pattern Matching

All active classifier patterns are executed against the content. Patterns are processed in a deterministic order — more specific patterns (e.g., SSN with context keywords) run before generic patterns (e.g., bare digit sequences) to ensure the most precise match wins.

The pattern library includes 170 built-in patterns covering personal identification, financial data, health records, government IDs across 50+ countries, credentials, and network identifiers.

Pattern execution order matters. Specific classifiers (SSN with context) are checked before generic classifiers (postal codes). This prevents misclassification when data patterns overlap.

Validation

Raw regex matches are validated using format-specific algorithms to eliminate structurally invalid matches:

  • Checksum validation — Mathematical verification (e.g., Luhn for credit cards, Mod-97 for IBANs)
  • Format rules — Structural checks (e.g., SSN area/group/serial constraints, state-specific driver’s license lengths)
  • Area code validation — Phone number area code plausibility checks

A match that fails validation is rejected and does not produce a finding, regardless of how well it matched the regex pattern. This stage eliminates the majority of false positives from structural patterns.
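The Luhn check named above is a standard mod-10 checksum; a straightforward implementation shows why structurally invalid digit runs are cheap to reject:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn mod-10 checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A regex match like `1234 5678 9012 3456` fails the checksum and produces no finding, while the well-known test number `4111 1111 1111 1111` passes (and would then be caught by the global exclusion list described later).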

Context Analysis

Validated matches are analyzed in their surrounding text context to further refine confidence scores. The engine examines a configurable window before and after each match.

Support terms — Keywords that increase confidence when found near a match (e.g., “social security” near a 9-digit number increases SSN confidence).

Negative terms — Keywords that decrease confidence when found near a match (e.g., “order number” or “SKU” near a digit sequence reduces credit card confidence).

When negative terms are found and the score drops below the configurable discard threshold, the finding is eliminated.
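A window-based adjustment like the one described can be sketched as follows; the window size, boost/penalty weights, and discard threshold here are illustrative, not the product's defaults:

```python
def adjust_confidence(text: str, start: int, end: int, base: float,
                      support: set[str], negative: set[str],
                      window: int = 40, boost: float = 0.2,
                      penalty: float = 0.3, discard: float = 0.3):
    """Boost or reduce a match's confidence based on nearby keywords.

    Returns the adjusted score, or None when negative context drives the
    score below the discard threshold (the finding is eliminated)."""
    context = text[max(0, start - window):end + window].lower()
    score = base
    if any(term in context for term in support):
        score = min(1.0, score + boost)
    if any(term in context for term in negative):
        score -= penalty
        if score < discard:
            return None
    return score
```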

AI Review

Findings with confidence scores in the ambiguous range are sent to an AI model for contextual analysis. The model examines the surrounding text and determines whether the match is genuinely sensitive data or a false positive.

See LLM Assist for details on deployment modes and configuration.

Confidence Scoring

The engine assigns findings a confidence score between 0.0 and 1.0. The score reflects how certain the detection is, based on the combination of pattern match quality, validation result, contextual signals, and AI review.

Confidence is affected by:

  • Pattern Specificity — More specific patterns with checksum validation yield higher confidence
  • Contextual Signals — Support terms boost confidence; negative terms reduce it
  • Multiple Classifiers — When multiple classifiers match the same data, confidence is combined
  • Suppression Rules — Known false positive patterns can reduce confidence to zero

Findings below the configured confidence threshold are discarded and not stored. The threshold is configurable per tenant to balance precision and recall.

Higher confidence thresholds reduce false positives but may miss some legitimate PII. Lower thresholds catch more PII but generate more noise. The default threshold balances these tradeoffs for most environments.
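The page says scores from multiple matching classifiers are "combined" without specifying how; one common way to combine independent scores is a noisy-OR, sketched here as an assumption rather than the documented formula:

```python
def combine(confidences: list[float]) -> float:
    """Noisy-OR combination: agreement between classifiers raises confidence
    above any single score, but never past 1.0."""
    result = 1.0
    for c in confidences:
        result *= (1.0 - c)
    return 1.0 - result

def keep_finding(confidences: list[float], threshold: float = 0.7) -> bool:
    """Findings below the configured threshold are discarded, not stored.
    The threshold value here is illustrative, not the product default."""
    return combine(confidences) >= threshold
```

Two classifiers at 0.6 each combine to 0.84, so corroborating matches survive a threshold that a single mediocre match would not.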

Exclusion Lists

Exclusion lists prevent known non-sensitive values from generating findings. Three levels of exclusion are supported:

Global Exclusions

Values excluded across all classifiers and all tenants. These are maintained by Slim.io and cover well-known test values (test SSNs, test credit cards, fictional phone numbers).

Pattern Exclusions

Regex patterns that exclude entire classes of values:

```yaml
exclusions:
  - pattern: '999-\d{2}-\d{4}'
    reason: "Individual Taxpayer Identification Numbers are not SSNs"
```

Tenant-Level Exclusions

Customer-specific exclusion lists for values that are known false positives in their environment. Managed through the Customer Dashboard under Classifiers > Suppression Rules.

Each suppression entry tracks:

  • Masked text — The excluded value (partially masked for display)
  • Added by — Who created the suppression
  • Timestamp — When it was added
  • Hit count — How many times this suppression has prevented a false positive

False Positive Management

The platform provides several mechanisms for managing false positives:

Suppression Rules

Create suppression rules to permanently exclude specific values or patterns. See the Suppression Rules section above.

Confidence Tuning

Adjust classifier confidence thresholds to reduce false positives for specific classifiers:

  • Increase the base confidence threshold on a classifier to make it more selective
  • Add proximity keywords to require contextual evidence
  • Pair regex classifiers with checksum validation

AI Review

Enable LLM Assist to automatically review borderline findings. The AI provides a second opinion on ambiguous matches, significantly reducing false positive rates without human intervention.

Bulk Actions

Use the Data Catalog bulk action feature to mark multiple findings as reviewed or to assign suppressions in batch.
