Data Classification

Slim.io classifies sensitive data using a layered detection system that combines regex pattern matching, dictionary lookups, and contextual analysis. This page covers the full set of built-in detection patterns, how to create custom classifiers using the Detection-as-Code system, and how confidence scoring works.

Built-In Detection Patterns

Slim.io ships with 170 built-in detection patterns covering PII, PHI, PCI, and credential categories across 50+ countries. Detection methods include regex matching, checksum validation, proximity keywords, and AI-assisted classification. All built-in patterns are active by default, and each can be disabled or tuned individually.

Personal Identifiers

Category	Detection Method	Example Pattern
US Social Security Number	Regex + Proximity + Structural validation	`XXX-XX-XXXX`
US Passport Number	Regex + Proximity	9-character alphanumeric
US Driver’s License	Regex (state-specific)	Varies by state
National ID (non-US)	Regex + Proximity	UK NINO, Canadian SIN, etc.
Date of Birth	Regex + Proximity	`MM/DD/YYYY`, `YYYY-MM-DD`
Full Name	ML (NER) + Proximity	Named entity recognition
Gender	Dictionary + Proximity	Keyword matching
Ethnicity / Race	Dictionary + Proximity	Keyword matching
Religion	Dictionary + Proximity	Keyword matching
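
As one illustration of validation layered on top of a regex, the sketch below applies the publicly documented structural rules for US Social Security Numbers (area 000, 666, and 900-999 are never issued; group 00 and serial 0000 are invalid). The names `SSN_RE` and `plausible_ssns` are hypothetical, and this is not a description of how Slim.io implements its built-in pattern.

```python
import re

# Hypothetical pattern: three digits, two digits, four digits, dash-separated.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def plausible_ssns(text: str) -> list[str]:
    """Return SSN-shaped substrings that pass basic structural rules."""
    hits = []
    for match in SSN_RE.finditer(text):
        area, group, serial = match.groups()
        # Area numbers 000, 666, and 900-999 are never issued.
        if area in ("000", "666") or area >= "900":
            continue
        # Group 00 and serial 0000 are never issued either.
        if group == "00" or serial == "0000":
            continue
        hits.append(match.group(0))
    return hits
```

A structural check like this only filters out values that cannot be real; it says nothing about whether a plausible value was actually issued, which is why proximity keywords remain useful.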

Financial Data

Category	Detection Method	Example Pattern
Credit Card Number	Regex + Luhn Checksum	13 to 19 digits with Luhn validation
Bank Account Number	Regex + Proximity	Varies by country
Routing Number (ABA)	Regex + Checksum	9-digit with ABA checksum
IBAN	Regex + Checksum	Country-specific format with mod-97
SWIFT/BIC Code	Regex	8- or 11-character bank identifier
Tax ID / EIN	Regex + Proximity	`XX-XXXXXXX`
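
The three checksums named in the table are standard public algorithms. The following is a minimal sketch of each, with hypothetical function names; it shows the arithmetic only and is not Slim.io's implementation.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check for a card number; spaces and dashes are ignored."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def aba_valid(routing: str) -> bool:
    """ABA routing number check: weights 3, 7, 1 repeated, sum divisible by 10."""
    if len(routing) != 9 or not routing.isdigit():
        return False
    weights = (3, 7, 1) * 3
    return sum(int(c) * w for c, w in zip(routing, weights)) % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and expect a remainder of 1."""
    s = iban.replace(" ", "").upper()
    if not 15 <= len(s) <= 34 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

For example, `luhn_valid("4111 1111 1111 1111")`, `aba_valid("021000021")`, and `iban_valid("GB82 WEST 1234 5698 7654 32")` all return `True`. The IBAN sketch checks only the generic length range and the mod-97 remainder, not the country-specific length and layout.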

Contact Information

Category	Detection Method	Example Pattern
Email Address	Regex	Standard RFC 5322 pattern
Phone Number	Regex (international)	US, UK, EU, APAC formats
Physical Address	ML (NER) + Regex	Street address extraction
IP Address	Regex	IPv4 and IPv6
MAC Address	Regex	`XX:XX:XX:XX:XX:XX`
URL with Auth	Regex	URLs containing credentials
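
To make two of these categories concrete, here is a small sketch that checks whether a token is an IPv4 or IPv6 address and whether a URL embeds credentials. It deliberately swaps the regex approach listed in the table for Python's standard-library parsers, which are easier to get right in a short example; the function names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_address(token: str) -> bool:
    """True if the token parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def url_has_credentials(url: str) -> bool:
    """True if the URL carries a username or password in its authority part."""
    try:
        parts = urlsplit(url)
    except ValueError:  # malformed authority, e.g. a broken IPv6 literal
        return False
    return bool(parts.username or parts.password)
```

A call such as `url_has_credentials("postgres://admin:s3cret@db.internal:5432/app")` returns `True`, while an `@` that appears only in the query string does not trigger a match.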

Health Data (PHI)

Category	Detection Method	Example Pattern
Medical Record Number	Regex + Proximity	Facility-specific patterns
Insurance ID	Regex + Proximity	Payer-specific formats
Drug / Prescription Name	Dictionary	FDA drug database
Medical Condition	Dictionary	ICD-10 code mapping
Lab Result	Regex + Proximity	Numeric values with medical units
Health Plan Number	Regex + Proximity	Payer-specific formats

Credentials & Secrets

Category	Detection Method	Example Pattern
API Key	Regex + Proximity	Provider-specific key prefixes
AWS Access Key	Regex	`AKIA` prefix, 20 characters
Private Key (RSA/EC)	Regex	PEM header detection
Password	Regex + Proximity	Keyword proximity with value
JWT Token	Regex	Three dot-separated Base64url-encoded segments
Database Connection String	Regex	Protocol-specific URI patterns
OAuth Token	Regex + Proximity	Bearer token patterns
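
The sketch below shows what detection for three of these categories might look like. The regular expressions and the names `AWS_KEY_RE`, `PEM_HEADER_RE`, and `looks_like_jwt` are assumptions made for illustration, not Slim.io's shipped patterns. The JWT check goes one step beyond shape matching by decoding the first segment and confirming it is a JSON object with an `alg` field.

```python
import base64
import json
import re

# Assumed pattern: "AKIA" followed by 16 uppercase letters or digits (20 total).
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

# Assumed pattern: PEM header for RSA, EC, or PKCS#8 private keys.
PEM_HEADER_RE = re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----")


def _decode_segment(segment: str):
    """Decode one Base64url segment as JSON; return None on any failure."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    try:
        return json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:  # covers base64, UTF-8, and JSON decoding errors
        return None


def looks_like_jwt(token: str) -> bool:
    """True if the token has three segments and a JSON header naming an algorithm."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    header = _decode_segment(parts[0])
    return isinstance(header, dict) and "alg" in header
```

Decoding the header is a cheap way to separate real tokens from arbitrary dotted strings such as version numbers or hostnames.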

Custom Classifiers

When built-in classifiers do not cover your specific data types, create custom classifiers using YAML definitions.

Classifier Types

Type	Best For	How It Works
Regex	Structured data with predictable formats	Pattern matching against regular expressions
Dictionary	Finite sets of known values	Lookup against curated word lists
Proximity	Reducing false positives on common patterns	Regex match + nearby keyword requirement
Checksum	Data with embedded validation digits	Pattern match + algorithmic verification
ML	Unstructured data, named entities	Machine learning model inference

Creating a Custom Classifier

Define your classifier in YAML and deploy via the dashboard, API, or Git sync:


```yaml
apiVersion: slim.io/v1
kind: Classifier
metadata:
  name: internal-employee-id
  description: "Company internal employee identifier"
spec:
  type: proximity
  pattern: '\bEMP-\d{6}\b'
  keywords: ["employee", "emp id", "staff", "worker"]
  window: 100
  category: Internal ID
  confidence: high  # high | medium | low, relative to your tuning
  enabled: true
```
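
To show how the `pattern`, `keywords`, and `window` fields could interact, here is a minimal sketch of proximity matching. It assumes `window` is measured in characters on either side of the match and that keyword comparison is case-insensitive; neither detail is confirmed by this page, and `proximity_match` is a hypothetical name.

```python
import re


def proximity_match(text: str, pattern: str, keywords: list[str], window: int) -> list[str]:
    """Return pattern matches that have at least one keyword within `window` characters."""
    lowered = text.lower()
    hits = []
    for match in re.finditer(pattern, text):
        # Context before and after the match, excluding the matched value itself.
        before = lowered[max(0, match.start() - window):match.start()]
        after = lowered[match.end():match.end() + window]
        context = before + " " + after
        if any(keyword.lower() in context for keyword in keywords):
            hits.append(match.group(0))
    return hits


keywords = ["employee", "emp id", "staff", "worker"]
print(proximity_match("Employee record: EMP-123456 approved", r"\bEMP-\d{6}\b", keywords, 100))
print(proximity_match("Order EMP-654321 shipped", r"\bEMP-\d{6}\b", keywords, 100))
```

The first call reports `EMP-123456` because "employee" appears nearby; the second returns an empty list because no keyword is in range.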

Detection-as-Code

Store classifier definitions in a Git repository and sync them automatically:


```
slim-io-config/
  classifiers/
    internal-employee-id.yaml
    vendor-contract-id.yaml
    custom-health-code.yaml
```

Enable Git sync under Settings > Integrations to deploy classifier changes on merge to your main branch. See Detection-as-Code for the full workflow.
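
Because classifier changes deploy on merge, it can help to lint definitions in CI first. The sketch below is one possible pre-merge check. It assumes each YAML file has already been parsed into a Python dict by a YAML library of your choice, and the set of lowercase type names is inferred from the Classifier Types table rather than taken from a published schema. Python's `re` syntax may also differ from the engine Slim.io uses, so a pattern that compiles here is only a first-pass signal.

```python
import re

# Assumed type identifiers (only "proximity" appears in the example definition).
KNOWN_TYPES = {"regex", "dictionary", "proximity", "checksum", "ml"}


def validate_classifier(defn: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    if defn.get("kind") != "Classifier":
        errors.append("kind must be 'Classifier'")
    if not (defn.get("metadata") or {}).get("name"):
        errors.append("metadata.name is required")

    spec = defn.get("spec") or {}
    ctype = spec.get("type")
    if ctype not in KNOWN_TYPES:
        errors.append(f"unknown spec.type: {ctype!r}")

    pattern = spec.get("pattern")
    if pattern is not None:
        try:
            re.compile(pattern)
        except re.error as exc:
            errors.append(f"spec.pattern does not compile: {exc}")

    if ctype == "proximity" and not spec.get("keywords"):
        errors.append("proximity classifiers need at least one keyword")
    return errors
```

A CI job could run this over every file under `classifiers/` and fail the build when any list is non-empty.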

Confidence Scoring

Every detection produces a confidence score between 0.0 and 1.0. The score reflects how certain the engine is that the detected value is truly sensitive data.

Score Components

Base confidence: Set by the classifier definition. A checksum-validated detection starts higher than a bare regex match.
Contextual boost: Proximity keyword matches increase confidence (e.g., “SSN:” near a 9-digit number).
Multi-classifier merge: When multiple classifiers match the same data, their confidences are combined using Bayesian merging.
Suppression rules: Matches against known false-positive patterns have their confidence reduced to zero.
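
The four components above can be sketched as a single scoring function. This page does not specify the merge formula, so the example assumes one common choice: treating each classifier's score as independent evidence under a uniform prior (a naive-Bayes style combination). The additive `boost` value of 0.15 and the function names are likewise hypothetical.

```python
def merge_independent(scores) -> float:
    """Combine probabilities as independent evidence with a uniform prior."""
    yes = no = 1.0
    for p in scores:
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # clamp to avoid division by zero
        yes *= p
        no *= 1.0 - p
    return yes / (yes + no)


def final_confidence(base: float, keyword_hit: bool, other_scores=(),
                     suppressed: bool = False, boost: float = 0.15) -> float:
    """Apply suppression, contextual boost, then the multi-classifier merge."""
    if suppressed:
        return 0.0  # suppression rules override everything else
    score = min(1.0, base + boost) if keyword_hit else base
    return merge_independent([score, *other_scores])
```

Under this assumption, two agreeing classifiers at 0.8 and 0.7 merge to roughly 0.90, a score of 0.5 is neutral, and a score below 0.5 pulls the result down. A different merge rule would behave differently at these points.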

Confidence Tiers

Slim.io groups findings into four confidence tiers. The numeric boundaries between tiers are tunable per environment under Settings > Detection; the tier labels themselves stay the same.

Tier	Recommended Action
High	Automate remediation (tokenize, mask, quarantine)
Medium	Review recommended; consider LLM Assist for disambiguation
Low	Log for awareness; likely requires manual review
Noise	Suppressed; not stored by default

The default discard threshold is configured per environment. Findings below it are not stored. Adjust this threshold per classifier or globally under Settings > Detection.
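
As a rough illustration of how scores map to tiers, the sketch below uses placeholder boundaries of 0.30, 0.60, and 0.85 and a placeholder discard threshold of 0.30. These numbers are invented for the example; the real values are whatever your environment is configured with.

```python
from bisect import bisect_right

# Placeholder boundaries between Noise/Low, Low/Medium, and Medium/High.
TIER_BOUNDS = (0.30, 0.60, 0.85)
TIER_NAMES = ("Noise", "Low", "Medium", "High")


def tier_for(score: float, bounds=TIER_BOUNDS) -> str:
    """Map a confidence score to a tier; a score equal to a boundary falls in the upper tier."""
    return TIER_NAMES[bisect_right(bounds, score)]


def should_store(score: float, discard_threshold: float = 0.30) -> bool:
    """Findings below the discard threshold are dropped rather than stored."""
    return score >= discard_threshold
```

Keeping the boundaries in one tuple makes it straightforward to swap in per-environment or per-classifier values.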

Tuning Detection Accuracy

Review false positives: Use the Investigation view to identify classifiers that generate noise, then adjust their confidence thresholds or add suppression rules.
Enable LLM Assist: For findings inside the configurable ambiguous range, LLM Assist sends the surrounding context to an AI model for disambiguation.
Add proximity keywords: Converting a regex classifier to a proximity classifier reduces false positives, because a pattern match counts only when a related keyword appears nearby.
Use suppression rules: Define patterns for known test data, placeholder values, or synthetic data that should not generate findings.
Monitor category distribution: Check the category breakdown in the Executive Scorecard to identify classifiers that may be over- or under-detecting.
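
To illustrate the suppression idea from the list above, here is a minimal sketch that flags well-known placeholder values and test-data context. The rule set and the name `is_suppressed` are hypothetical; in Slim.io, suppression rules are configured in the product rather than written in Python.

```python
import re

# Hypothetical value rules: a widely used test card number and
# repeated-digit SSN placeholders such as 111-11-1111.
VALUE_RULES = [
    re.compile(r"4111[ -]?1111[ -]?1111[ -]?1111"),
    re.compile(r"(\d)\1{2}-\1{2}-\1{4}"),
]

# Hypothetical context rule: words that usually mark synthetic data.
CONTEXT_RULE = re.compile(r"\b(?:test|dummy|sample|placeholder)\b", re.IGNORECASE)


def is_suppressed(value: str, context: str = "") -> bool:
    """True if the value is a known placeholder or its context marks it as test data."""
    if any(rule.fullmatch(value) for rule in VALUE_RULES):
        return True
    return bool(CONTEXT_RULE.search(context))
```

Value rules are the safer of the two, since a context word such as "test" can also appear next to real data (for example, "lab test results").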