Classifiers
Classifiers are detection rules that identify specific types of sensitive data within your files and databases. Slim.io ships with 170 built-in classifiers covering PII categories across 50+ countries and supports custom classifier definitions via YAML.
How Classifiers Work
When a scan runs, every file passes through the active set of classifiers. Each classifier examines the content for a specific pattern or signal. When a match is found, the classifier produces a finding with a confidence score indicating how certain the match is. Findings are then deduplicated, scored, and stored in the Data Catalog.
The classifier execution order is deterministic: specific patterns (e.g., credit card with Luhn validation) run before generic patterns (e.g., bare digit sequences) to ensure the most precise match wins.
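The "most precise match wins" rule can be sketched roughly as follows. This is only an illustration, not Slim.io's engine: the `Rule` type, the numeric `priority` field, and the idea that an earlier match claims its character span are all assumptions made for the example.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    name: str
    pattern: str
    priority: int  # hypothetical field: lower values run first


def run_in_order(text: str, rules: list[Rule]) -> list[tuple[str, str]]:
    """Run rules from most to least specific; earlier matches claim their span."""
    claimed: list[tuple[int, int]] = []
    findings: list[tuple[str, str]] = []
    for rule in sorted(rules, key=lambda r: r.priority):
        for m in re.finditer(rule.pattern, text):
            # Skip any match that overlaps a span already claimed by a more specific rule.
            overlaps = any(m.start() < end and start < m.end() for start, end in claimed)
            if not overlaps:
                claimed.append(m.span())
                findings.append((rule.name, m.group()))
    return findings


RULES = [
    Rule("digit-sequence", r"\b\d{9,19}\b", priority=90),  # generic
    Rule("visa-prefix", r"\b4\d{15}\b", priority=10),      # specific
]
```

With these two rules, the 16-digit number in `"card 4111111111111111 ref 987654321"` is reported once by the specific rule, and the generic rule only reports the remaining digit run.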
Classifier Types
Regex Classifiers
Pattern matching against regular expressions. Best suited for structured data with predictable formats such as SSNs, phone numbers, and email addresses.
name: us-ssn
type: regex
pattern: '\b\d{3}-\d{2}-\d{4}\b'
category: SSN
confidence: high # high | medium | low — relative to your tuning
description: "US Social Security Number (XXX-XX-XXXX format)"
ML Classifiers
Machine learning models trained on labeled datasets. Best for unstructured data where patterns are not easily expressed as regex, such as names, physical addresses, and free-text medical records.
name: email-body-pii
type: ml
model: slim-io/pii-ner-v2
category: auto # model determines the category
confidence_threshold: medium # high | medium | low
description: "NER model for PII detection in unstructured text"
ML classifiers run on Slim.io’s infrastructure. In BYOC mode, the model artifact is deployed alongside the scanning agent inside your VPC.
Dictionary Classifiers
Lookup against known value lists. Useful for matching against finite sets of known values such as medical terms, country names, or internal project codes.
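As a rough illustration of the lookup logic, the sketch below matches whole words or phrases from a term set and honors a case-sensitivity switch. The function name and the whole-word matching rule are assumptions for the example; the actual matching semantics may differ.

```python
import re


def dictionary_matches(text: str, terms: set[str], case_sensitive: bool = False) -> list[str]:
    """Return the dictionary terms that occur in text as whole words or phrases."""
    flags = 0 if case_sensitive else re.IGNORECASE
    hits = []
    for term in sorted(terms):
        # re.escape keeps terms with punctuation from being read as regex syntax.
        if re.search(rf"\b{re.escape(term)}\b", text, flags):
            hits.append(term)
    return hits
```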
name: medical-terms
type: dictionary
source: dictionaries/medical-conditions.txt
category: PHI
confidence: medium # high | medium | low
case_sensitive: false
description: "Medical condition terms from ICD-10 codebook"
Proximity Classifiers
Contextual detection that combines a pattern match with nearby keyword presence. Reduces false positives by requiring contextual evidence within a configurable character window.
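A minimal sketch of the idea, assuming the window is measured in characters on each side of the match and that keywords are compared case-insensitively. The function name is hypothetical.

```python
import re


def proximity_matches(text: str, pattern: str, keywords: list[str], window: int = 50) -> list[str]:
    """Return pattern matches that have at least one keyword within `window` characters."""
    hits = []
    for m in re.finditer(pattern, text):
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        # The separator keeps a keyword from being formed across the removed match.
        context = (before + " " + after).lower()
        if any(keyword in context for keyword in keywords):
            hits.append(m.group())
    return hits


SSN_PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"
SSN_KEYWORDS = ["ssn", "social security", "social sec", "taxpayer"]
```

A number that fits the pattern but sits far from any keyword (an order reference, for example) is ignored, which is how the keyword requirement cuts false positives.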
name: ssn-with-context
type: proximity
pattern: '\b\d{3}-\d{2}-\d{4}\b'
keywords: ["ssn", "social security", "social sec", "taxpayer"]
window: 50 # characters before/after the pattern match
category: SSN
confidence: high # proximity match is the highest-confidence tier
description: "SSN pattern with nearby contextual keywords"
Checksum Classifiers
Validation algorithms that verify data integrity. Used for data types with embedded check digits such as credit card numbers (Luhn), IBANs (Mod97), and Canadian SINs.
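For reference, the Luhn check itself is short. The sketch below is a standard implementation of the public algorithm; the function name and the handling of spaces and dashes are choices made for the example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A candidate that matches the digit pattern but fails this check can be discarded before it becomes a finding.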
name: credit-card-luhn
type: checksum
algorithm: luhn
pattern: '\b\d{13,19}\b'
category: Credit Card
confidence: high # checksum-validated detection is the highest-confidence tier
description: "Credit card number validated by Luhn algorithm"
Built-In Classifiers (170 Rules)
Slim.io ships with 170 classifiers organized into the following categories. International classifiers cover government IDs, tax identifiers, and national insurance numbers for 50+ countries.
Personal Identification
| Classifier | Type | Validation |
|---|---|---|
| US SSN (with context) | Proximity | SSA area/group/serial validation |
| US SSN (dashes) | Regex | Cannot start with 000, 666, 9XX |
| US SSN (spaces) | Regex | SSA format rules |
| Canadian SIN (with context) | Proximity | Luhn algorithm |
| Canadian SIN (dashes) | Regex | Luhn algorithm |
| Canadian SIN (spaces) | Regex | Luhn algorithm |
| US Passport | Proximity | 9-digit format |
| Canadian Passport | Proximity | 2-letter + 6-digit format |
| Generic Passport | Proximity | 1-2 letters + 6-9 digits |
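The SSN validation rules in the table can be sketched as a structural check. The area rule (no 000, 666, or 9XX) comes from the table; the group and serial rules (no 00 group, no 0000 serial) are the commonly cited SSA constraints and are assumed here to be part of the "area/group/serial validation". The function name is hypothetical.

```python
import re

SSN_RE = re.compile(r"(\d{3})-(\d{2})-(\d{4})")


def ssn_structurally_valid(candidate: str) -> bool:
    """Check the area, group, and serial parts of a dashed SSN candidate."""
    m = SSN_RE.fullmatch(candidate)
    if m is None:
        return False
    area, group, serial = m.groups()
    if area in {"000", "666"} or area.startswith("9"):
        return False  # areas that are never assigned
    if group == "00" or serial == "0000":
        return False  # all-zero group or serial is never assigned
    return True
```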
Financial
| Classifier | Type | Validation |
|---|---|---|
| Credit Card (with context) | Proximity | Luhn algorithm |
| Visa | Checksum | Starts with 4, Luhn |
| Mastercard | Checksum | Starts with 51-55, Luhn |
| American Express | Checksum | Starts with 34/37, Luhn |
| Discover | Checksum | Starts with 6011/65, Luhn |
| Diners Club | Checksum | Starts with 300-305/36/38, Luhn |
| JCB | Checksum | Starts with 2131/1800/35, Luhn |
US Driver’s Licenses (State-Specific)
| Classifier | Type | States Covered |
|---|---|---|
| State-specific patterns | Proximity | CA, TX, FL, NY, PA, IL, OH, GA, NC, MI, WA, AZ, MA, VA, NJ |
| Generic US DL | Proximity | Fallback for remaining states |
Canadian Driver’s Licenses (Provincial)
| Classifier | Type | Provinces Covered |
|---|---|---|
| Provincial patterns | Proximity | ON, QC, BC, AB, MB, SK, NS, NB, NL, PE |
Contact Information
| Classifier | Type | Coverage |
|---|---|---|
| Phone (with context) | Proximity | US and Canadian numbers |
| Phone (international) | Regex | +1 country code format |
| Phone (parentheses) | Regex | (XXX) XXX-XXXX |
| Phone (dashes, dots, spaces) | Regex | All separator formats |
| Phone (with extension) | Regex | ext/x/extension suffix |
| Email Address | Regex | Standard email format |
| Canadian Postal Code | Regex | A1A 1A1 format |
Health Information (PHI)
| Classifier | Type | Coverage |
|---|---|---|
| US Medicare | Proximity | MBI format |
| US Medicaid | Proximity | 10-12 digit format |
| US Health Insurance ID | Proximity | Prefix + 8-12 digits |
| Ontario OHIP | Proximity | 10-digit + 2-letter |
| Quebec RAMQ | Proximity | 4-letter + 8-digit |
| BC PHN | Proximity | 10-digit |
| Alberta Health Card | Proximity | 9-digit + 2-digit |
| Manitoba Health Card | Proximity | 9-digit |
| Saskatchewan Health Card | Proximity | 9-digit |
Network & Technical
| Classifier | Type | Description |
|---|---|---|
| IP Address | Regex | IPv4 addresses (excludes version strings) |
| MAC Address | Regex | Standard MAC address format |
International Government IDs
| Region | Classifiers | Validation |
|---|---|---|
| EU/EEA | German Tax ID, French NIR, Spanish DNI/NIE, Italian Codice Fiscale, Dutch BSN, Polish PESEL, Swedish Personal Number, UK NHS, Belgian National Number, Greek AFM | Checksum validation where applicable |
| Asia-Pacific | Indian Aadhaar, Indian PAN, Japanese My Number, South Korean RRN, Australian TFN, Singapore NRIC, Hong Kong ID, China National ID, Taiwan ID, Thailand ID, Malaysia MyKad, Philippines SSS, NZ IRD, Vietnam CCCD | Verhoeff (Aadhaar), Luhn, Mod-11 |
| Latin America | Brazil CPF, Brazil CNPJ, Chile RUT, Colombia Cedula, Peru DNI, Uruguay Cedula, Venezuela Cedula | Mod-11, Mod-97 |
| Middle East / Africa | Turkey TC Kimlik, Egypt National ID, Saudi National ID, Israel Teudat Zehut, South Africa ID | Mod-11, Luhn |
Financial (International)
| Classifier | Type | Coverage |
|---|---|---|
| IBAN | Checksum | Mod-97 validation, 70+ country formats |
| SWIFT/BIC | Regex | 8 or 11 character bank codes |
| Routing Number | Checksum | US ABA routing (9 digits, checksum) |
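Both checksums named in this table are public algorithms, sketched below. The function names are hypothetical, and the IBAN check covers only the generic Mod-97 step, not the per-country length and format rules a full validator would also apply.

```python
def iban_valid(iban: str) -> bool:
    """Generic IBAN Mod-97 check (ISO 13616); ignores country-specific formats."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]  # move country code and check digits to the end
    # int(ch, 36) maps 0-9 to themselves and A-Z to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def aba_routing_valid(routing: str) -> bool:
    """US ABA routing number check using the 3-7-1 weighting."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    d = [int(ch) for ch in routing]
    total = 3 * (d[0] + d[3] + d[6]) + 7 * (d[1] + d[4] + d[7]) + (d[2] + d[5] + d[8])
    return total % 10 == 0
```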
Credentials & Secrets
| Classifier | Type | Description |
|---|---|---|
| AWS Access Key | Regex | AKIA prefix + 16 alphanumeric |
| Private Key / Certificate | Regex | PEM header detection |
| Generic API Key | Regex | Common API key patterns |
Confidence Scoring
Each classifier produces a base confidence score between 0.0 and 1.0. The final score is adjusted by:
- Pattern Specificity — More specific patterns (e.g., with checksum validation) yield higher confidence
- Contextual Signals — Proximity matches and keyword presence boost confidence
- Multiple Classifiers — When multiple classifiers match the same data, confidence is combined using a Bayesian merge (not simple addition)
- Suppression Rules — Known false positive patterns can reduce confidence to zero
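The exact merge formula is not documented here. One common way to combine scores without simple addition is a naive-Bayes style merge in log-odds space, sketched below under the assumption that each classifier's score is independent evidence measured against a shared prior. The function name, the default prior of 0.5, and the clamping value are all illustrative.

```python
import math


def bayesian_merge(scores: list[float], prior: float = 0.5) -> float:
    """Combine per-classifier confidences as independent evidence in log-odds space."""
    eps = 1e-6  # clamp so scores of exactly 0.0 or 1.0 do not produce infinite odds
    prior_logit = math.log(prior / (1 - prior))
    logit = prior_logit
    for p in scores:
        p = min(max(p, eps), 1 - eps)
        # Each score contributes its log-odds relative to the prior.
        logit += math.log(p / (1 - p)) - prior_logit
    return 1 / (1 + math.exp(-logit))
```

Under these assumptions, two classifiers at 0.8 and 0.7 merge to roughly 0.90, whereas adding the scores would give 1.5, which is not a valid confidence.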
Confidence Tiers
Slim.io groups detections into four tiers. The numeric boundaries between tiers are tunable per environment under Settings > Detection; the tier labels themselves are stable.
| Tier | Interpretation |
|---|---|
| High | Strong match, high certainty |
| Medium | Probable match, review recommended |
| Low | Possible match, likely needs LLM Assist |
| Noise | Suppressed or extremely uncertain |
Findings below the configured discard threshold are not stored. The discard threshold is configurable per classifier and globally.
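The mapping from a score to a tier, including the discard step, might look like the sketch below. Every number in it is a placeholder, since the real boundaries are set per environment, and the function name is hypothetical.

```python
from typing import Optional

# Placeholder boundaries for illustration only; real values are configured per environment.
EXAMPLE_BOUNDS = {"high": 0.85, "medium": 0.60, "low": 0.30}


def tier_for(score: float, bounds: dict = EXAMPLE_BOUNDS, discard_below: float = 0.10) -> Optional[str]:
    """Return the tier label for a score, or None if the finding would be discarded."""
    if score < discard_below:
        return None  # below the discard threshold: not stored
    if score >= bounds["high"]:
        return "High"
    if score >= bounds["medium"]:
        return "Medium"
    if score >= bounds["low"]:
        return "Low"
    return "Noise"
```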
Risk Weights
Each classifier has a riskWeight between 0.0 and 1.0 that determines how much a finding contributes to the overall risk score of an asset. Financial and health data classifiers typically have higher risk weights than contact information classifiers.
Managing Classifiers
Viewing Classifiers
In the Customer Dashboard, navigate to Classifiers in the sidebar. The Active Rules tab displays all classifiers in a card grid showing:
- Type badge — Color-coded indicator (ML = violet, Regex = blue, Dictionary = orange)
- Category — The sensitive data category (e.g., SSN, Credit Card, Email)
- Confidence bar — Visual indicator of the classifier’s base confidence score
- Active findings — How many current findings this classifier has generated
- Enable/disable toggle — Instantly activate or deactivate the classifier
Filtering and Searching
Use the filter bar above the classifier grid to narrow results:
- Search — Free-text search across classifier names and categories
- Type filter — Filter by Regex, ML, or Dictionary
- Category filter — Filter by specific PII category
- Status filter — Show only Enabled or Disabled classifiers
The result count updates in real time as you apply filters.
Enabling and Disabling Classifiers
Click the toggle on any classifier card to enable or disable it. The toggle uses optimistic updates, so the change appears instantly in the UI. Disabled classifiers are not executed during scans.
Disabling a classifier does not remove existing findings generated by that classifier. To clear existing findings, run a new scan after disabling.
Low Confidence Warnings
Classifiers with a base confidence score below 0.6 (60%) display a yellow warning banner on their card. These classifiers may produce more false positives; review them or pair them with proximity keywords to improve accuracy.
Suppression Rules
Suppression rules prevent known false positives from generating findings:
name: suppress-test-ssns
type: suppression
target_classifier: us-ssn
patterns:
- '000-00-0000'
- '123-45-6789'
- '999-\d{2}-\d{4}'
description: "Suppress known test SSN values"
The suppression audit table (available in the Classifiers page) tracks all active suppressions with hit counts, showing how many times each rule has prevented a false positive.
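A minimal sketch of how such a rule and its hit counts could work, assuming each suppression pattern is a regular expression that must match the whole candidate value. The function name and the use of a `Counter` for hit counts are illustrative.

```python
import re
from collections import Counter
from typing import Optional

SUPPRESS_TEST_SSNS = ["000-00-0000", "123-45-6789", r"999-\d{2}-\d{4}"]


def apply_suppressions(values: list[str], patterns: list[str],
                       hits: Optional[Counter] = None) -> tuple[list[str], Counter]:
    """Drop values matched by a suppression pattern and count hits per pattern."""
    hits = Counter() if hits is None else hits
    compiled = [(raw, re.compile(raw)) for raw in patterns]
    kept = []
    for value in values:
        for raw, regex in compiled:
            if regex.fullmatch(value):
                hits[raw] += 1  # feeds the per-rule hit count
                break
        else:
            kept.append(value)  # no suppression pattern matched
    return kept, hits
```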
Classifier Templates
Group classifiers into reusable templates for different compliance requirements. See Classifier Templates for details.
Learn More
- Classifier Templates — Reusable classifier groups for scans
- Detection-as-Code — YAML-based classifier definitions managed through Git
- LLM Assist — AI-powered false positive reduction for borderline findings
- PII Detection Engine — The full detection pipeline architecture