Data Classification
Slim.io classifies sensitive data using a layered detection system that combines regex pattern matching, dictionary lookups, and contextual analysis. This page covers the full set of built-in detection patterns, how to create custom classifiers using the Detection-as-Code system, and how confidence scoring works.
Built-In Detection Patterns
Slim.io ships with 170 built-in detection patterns covering PII, PHI, PCI, and credential categories across 50+ countries. Detection methods include regex matching, checksum validation, proximity keywords, and AI-assisted classification. All built-in patterns are active by default, and each can be disabled or tuned individually.
Personal Identifiers
| Category | Detection Method | Example Pattern |
|---|---|---|
| US Social Security Number | Regex + Proximity + Checksum | XXX-XX-XXXX |
| US Passport Number | Regex + Proximity | 9-character alphanumeric |
| US Driver’s License | Regex (state-specific) | Varies by state |
| National ID (non-US) | Regex + Proximity | UK NIN, Canadian SIN, etc. |
| Date of Birth | Regex + Proximity | MM/DD/YYYY, YYYY-MM-DD |
| Full Name | ML (NER) + Proximity | Named entity recognition |
| Gender | Dictionary + Proximity | Keyword matching |
| Ethnicity / Race | Dictionary + Proximity | Keyword matching |
| Religion | Dictionary + Proximity | Keyword matching |
Financial Data
| Category | Detection Method | Example Pattern |
|---|---|---|
| Credit Card Number | Regex + Luhn Checksum | 13-19 digit with Luhn validation |
| Bank Account Number | Regex + Proximity | Varies by country |
| Routing Number (ABA) | Regex + Checksum | 9-digit with ABA checksum |
| IBAN | Regex + Checksum | Country-specific format with mod-97 |
| SWIFT/BIC Code | Regex | 8-11 character bank identifier |
| Tax ID / EIN | Regex + Proximity | XX-XXXXXXX |
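
The checksum methods named in the table are public algorithms, so they can be sketched independently of the product. The following is a minimal Python illustration of Luhn validation for card numbers and the ABA checksum for routing numbers. The function names `luhn_valid` and `aba_valid` are hypothetical and are not part of any Slim.io API.

```python
DIGITS = "0123456789"


def luhn_valid(number: str) -> bool:
    """Return True if the ASCII digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch in DIGITS]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def aba_valid(routing: str) -> bool:
    """Return True if a 9-digit routing number passes the ABA checksum."""
    if len(routing) != 9 or any(ch not in DIGITS for ch in routing):
        return False
    d = [int(ch) for ch in routing]
    total = (
        3 * (d[0] + d[3] + d[6])
        + 7 * (d[1] + d[4] + d[7])
        + (d[2] + d[5] + d[8])
    )
    return total % 10 == 0
```

A checksum pass only shows that a number is well formed. It does not show that the card or account exists, which is why the table pairs checksums with a regex rather than using them alone.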
Contact Information
| Category | Detection Method | Example Pattern |
|---|---|---|
| Email Address | Regex | Standard RFC 5322 pattern |
| Phone Number | Regex (international) | US, UK, EU, APAC formats |
| Physical Address | ML (NER) + Regex | Street address extraction |
| IP Address | Regex | IPv4 and IPv6 |
| MAC Address | Regex | XX:XX:XX:XX:XX:XX |
| URL with Auth | Regex | URLs containing credentials |
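
As a rough illustration of the "URL with Auth" row, the sketch below finds URL-like strings and flags those whose authority section carries a password. The candidate regex and the function name `urls_with_credentials` are assumptions made for this example, not the built-in pattern.

```python
import re
from urllib.parse import urlsplit

# Hypothetical candidate pattern: a scheme, "://", then characters up to
# whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://[^\s\"'<>]+")


def urls_with_credentials(text: str) -> list[str]:
    """Return URLs in `text` that embed a password (scheme://user:pass@host)."""
    hits = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # malformed URL (for example a bad IPv6 literal)
        if parts.password:
            hits.append(url)
    return hits
```

Parsing with `urlsplit` rather than a single regex keeps the credential check tied to the URL's authority component, so an `@` in a path or query string is not mistaken for embedded credentials.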
Health Data (PHI)
| Category | Detection Method | Example Pattern |
|---|---|---|
| Medical Record Number | Regex + Proximity | Facility-specific patterns |
| Insurance ID | Regex + Proximity | Payer-specific formats |
| Drug / Prescription Name | Dictionary | FDA drug database |
| Medical Condition | Dictionary | ICD-10 code mapping |
| Lab Result | Regex + Proximity | Numeric values with medical units |
| Health Plan Number | Regex + Proximity | Payer-specific formats |
Credentials & Secrets
| Category | Detection Method | Example Pattern |
|---|---|---|
| API Key | Regex + Proximity | Provider-specific key prefixes |
| AWS Access Key | Regex | AKIA prefix, 20 characters |
| Private Key (RSA/EC) | Regex | PEM header detection |
| Password | Regex + Proximity | Keyword proximity with value |
| JWT Token | Regex | Three Base64url-encoded segments separated by dots |
| Database Connection String | Regex | Protocol-specific URI patterns |
| OAuth Token | Regex + Proximity | Bearer token patterns |
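
To make a few of these rows concrete, here is a minimal sketch of regex-based secret detection in Python. The patterns are illustrative approximations written for this example (they are not Slim.io's built-in definitions), and `find_secrets` and `looks_like_jwt` are hypothetical names.

```python
import base64
import json
import re

# Illustrative patterns only.
AWS_ACCESS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
PEM_PRIVATE_KEY_RE = re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----")
JWT_RE = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]*")


def looks_like_jwt(token: str) -> bool:
    """Check that the first segment decodes to a JSON object with an "alg" key."""
    header = token.split(".")[0]
    padded = header + "=" * (-len(header) % 4)  # restore stripped padding
    try:
        decoded = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:  # covers base64, UTF-8, and JSON decoding errors
        return False
    return isinstance(decoded, dict) and "alg" in decoded


def find_secrets(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for each candidate secret."""
    findings = []
    for m in AWS_ACCESS_KEY_RE.finditer(text):
        findings.append(("aws_access_key", m.group(0)))
    for m in PEM_PRIVATE_KEY_RE.finditer(text):
        findings.append(("private_key", m.group(0)))
    for m in JWT_RE.finditer(text):
        if looks_like_jwt(m.group(0)):
            findings.append(("jwt", m.group(0)))
    return findings
```

Decoding the JWT header after the regex match is a cheap structural check that filters out strings that merely look like three dot-separated segments.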
Custom Classifiers
When built-in classifiers do not cover your specific data types, create custom classifiers using YAML definitions.
Classifier Types
| Type | Best For | How It Works |
|---|---|---|
| Regex | Structured data with predictable formats | Pattern matching against regular expressions |
| Dictionary | Finite sets of known values | Lookup against curated word lists |
| Proximity | Reducing false positives on common patterns | Regex match + nearby keyword requirement |
| Checksum | Data with embedded validation digits | Pattern match + algorithmic verification |
| ML | Unstructured data, named entities | Machine learning model inference |
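
The Proximity row can be sketched in a few lines of Python: a regex match counts only when a keyword appears close to it. This is a simplified illustration, not the engine's implementation. The name `proximity_matches` is hypothetical, and measuring the window in characters is an assumption made for the example.

```python
import re


def proximity_matches(text: str, pattern: str, keywords: list[str],
                      window: int = 100) -> list[str]:
    """Return regex matches that have a keyword within `window` characters.

    Assumption: the window is measured in characters on either side of the
    match, and keyword comparison is case-insensitive.
    """
    lowered = text.lower()
    hits = []
    for m in re.finditer(pattern, text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        # Exclude the match itself so it cannot satisfy its own keyword.
        context = lowered[start:m.start()] + " " + lowered[m.end():end]
        if any(kw.lower() in context for kw in keywords):
            hits.append(m.group(0))
    return hits
```

For example, with the hypothetical pattern `\bEMP-\d{6}\b` and the keyword `employee`, the text "Employee record: EMP-123456" produces a match, while "Part number EMP-654321" does not.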
Creating a Custom Classifier
Define your classifier in YAML and deploy via the dashboard, API, or Git sync:

    apiVersion: slim.io/v1
    kind: Classifier
    metadata:
      name: internal-employee-id
      description: "Company internal employee identifier"
    spec:
      type: proximity
      pattern: '\bEMP-\d{6}\b'
      keywords: ["employee", "emp id", "staff", "worker"]
      window: 100
      category: Internal ID
      confidence: high  # high | medium | low; relative to your tuning
      enabled: true

Detection-as-Code
Store classifier definitions in a Git repository and sync them automatically:

    slim-io-config/
      classifiers/
        internal-employee-id.yaml
        vendor-contract-id.yaml
        custom-health-code.yaml

Enable Git sync under Settings > Integrations to deploy classifier changes on merge to your main branch. See Detection-as-Code for the full workflow.
Confidence Scoring
Every detection produces a confidence score between 0.0 and 1.0. The score reflects how certain the engine is that the detected value is truly sensitive data.
Score Components
- Base confidence — Set by the classifier definition (a checksum-validated detection starts higher than a bare regex match)
- Contextual boost — Proximity keyword matches increase confidence (e.g., “SSN:” near a 9-digit number)
- Multi-classifier merge — When multiple classifiers match the same data, confidence is combined using Bayesian merging
- Suppression rules — Known false positive patterns reduce confidence to zero
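
The way these components interact can be sketched as follows. The source does not give the exact formulas, so everything here is an assumption for illustration: the additive boost of 0.15, the product-of-odds merge with a uniform prior (one common reading of "Bayesian merging"), and the function names.

```python
from math import prod


def boosted(base: float, keyword_hit: bool, boost: float = 0.15) -> float:
    """Apply a contextual boost to a base confidence, capped at 1.0.

    The additive form and the 0.15 default are hypothetical.
    """
    return min(1.0, base + boost) if keyword_hit else base


def bayesian_merge(scores: list[float]) -> float:
    """Combine scores as independent evidence under a uniform prior.

    Computes prod(p) / (prod(p) + prod(1 - p)), a naive-Bayes style merge.
    """
    if not scores:
        return 0.0
    p_yes = prod(scores)
    p_no = prod(1.0 - s for s in scores)
    if p_yes + p_no == 0.0:
        return 0.5  # contradictory certainties (a 0.0 and a 1.0)
    return p_yes / (p_yes + p_no)


def final_confidence(scores: list[float], suppressed: bool) -> float:
    """Suppression wins outright; otherwise merge the classifier scores."""
    return 0.0 if suppressed else bayesian_merge(scores)
```

Under this particular merge, two classifiers scoring 0.6 and 0.7 combine to roughly 0.78, higher than either alone, while a score below 0.5 pulls the result down. A different merge rule would give different numbers.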
Confidence Tiers
Slim.io groups findings into four confidence tiers. The numeric boundaries between tiers are tunable per environment via Settings > Detection; the tier labels themselves are stable.
| Tier | Recommended Action |
|---|---|
| High | Automate remediation (tokenize, mask, quarantine) |
| Medium | Review recommended; consider LLM Assist for disambiguation |
| Low | Log for awareness; likely requires manual review |
| Noise | Suppressed; not stored by default |
The default discard threshold is configured per environment. Findings below it are not stored. Adjust this threshold per classifier or globally under Settings > Detection.
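
A tier lookup with tunable boundaries might look like the sketch below. The boundary values (0.30, 0.60, 0.85), the default discard threshold, and the names `tier_for` and `should_store` are placeholders chosen for the example, since the real values are set per environment.

```python
from bisect import bisect_right

# Hypothetical boundaries: Noise < 0.30 <= Low < 0.60 <= Medium < 0.85 <= High.
DEFAULT_BOUNDARIES = (0.30, 0.60, 0.85)
TIERS = ("Noise", "Low", "Medium", "High")


def tier_for(score: float, boundaries: tuple = DEFAULT_BOUNDARIES) -> str:
    """Map a confidence score in [0.0, 1.0] to its tier label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    # bisect_right counts how many boundaries the score has reached.
    return TIERS[bisect_right(boundaries, score)]


def should_store(score: float, discard_threshold: float = 0.30) -> bool:
    """Findings below the discard threshold are dropped, not stored."""
    return score >= discard_threshold
```

Keeping the labels fixed and only moving the boundaries means downstream automation can key on the label ("High" triggers remediation) without changing when an environment is retuned.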
Tuning Detection Accuracy
- Review false positives — Use the Investigation view to identify classifiers that generate noise. Adjust confidence thresholds or add suppression rules.
- Enable LLM Assist — For findings inside the configurable ambiguous range, LLM Assist sends the surrounding context to an AI model for disambiguation.
- Add proximity keywords — Converting a regex classifier to a proximity classifier dramatically reduces false positives.
- Use suppression rules — Define patterns for known test data, placeholder values, or synthetic data that should not generate findings.
- Monitor category distribution — Check the category breakdown in the Executive Scorecard to identify classifiers that may be over- or under-detecting.
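
As an illustration of the suppression-rule item above, the sketch below drops values that match known placeholder or test patterns before they become findings. The three rules and the name `is_suppressed` are examples invented for this sketch, not rules that ship with the product.

```python
import re

# Example suppression rules for common placeholder and test values.
SUPPRESSION_RULES = [
    re.compile(r"^000-00-0000$"),                          # all-zero placeholder SSN
    re.compile(r"^4111[ -]?1111[ -]?1111[ -]?1111$"),      # widely used test card number
    re.compile(r"@example\.(?:com|org|net)$", re.IGNORECASE),  # reserved example domains
]


def is_suppressed(value: str) -> bool:
    """Return True if `value` matches any known false-positive pattern."""
    return any(rule.search(value) for rule in SUPPRESSION_RULES)
```

Anchoring each rule with `^` and `$` where possible keeps a suppression from hiding real data that merely contains a placeholder as a substring.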