Classifiers

Classifiers are detection rules that identify specific types of sensitive data within your files and databases. Slim.io ships with 170 built-in classifiers covering PII categories across 50+ countries and supports custom classifier definitions via YAML.

How Classifiers Work

When a scan runs, every file passes through the active set of classifiers. Each classifier examines the content for a specific pattern or signal. When a match is found, the classifier produces a finding with a confidence score indicating how certain the match is. Findings are then deduplicated, scored, and stored in the Data Catalog.

The classifier execution order is deterministic: specific patterns (e.g., credit card with Luhn validation) run before generic patterns (e.g., bare digit sequences) to ensure the most precise match wins.
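
One way to picture "the most precise match wins" is as overlap resolution over candidate matches. The sketch below is illustrative only: the `Match` type, the `specificity` rank, and `resolve_overlaps` are hypothetical names, not part of Slim.io's API, and the engine's real ordering logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    classifier: str
    start: int        # inclusive character offset
    end: int          # exclusive character offset
    specificity: int  # hypothetical rank: higher means more specific


def resolve_overlaps(matches):
    """Keep the most specific match wherever candidate spans overlap."""
    kept = []
    # Visit the most specific candidates first; the remaining sort keys
    # only make the result independent of input order.
    ordered = sorted(
        matches, key=lambda m: (-m.specificity, m.start, m.end, m.classifier)
    )
    for m in ordered:
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


candidates = [
    Match("digit-sequence", 10, 26, specificity=1),
    Match("credit-card-luhn", 10, 26, specificity=3),
    Match("email", 40, 55, specificity=2),
]
winners = resolve_overlaps(candidates)
```

Here the Luhn-validated match displaces the bare digit sequence on the same span, while the non-overlapping email match is kept.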

Classifier Types

Regex Classifiers

Pattern matching against regular expressions. Best suited for structured data with predictable formats such as SSNs, phone numbers, and email addresses.


name: us-ssn
type: regex
pattern: '\b\d{3}-\d{2}-\d{4}\b'
category: SSN
confidence: high  # high | medium | low — relative to your tuning
description: "US Social Security Number (XXX-XX-XXXX format)"
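
To make the definition above concrete, here is a minimal sketch of how a regex classifier could be evaluated against text. The dict mirrors the YAML fields; `run_regex_classifier` and the shape of the returned findings are assumptions for illustration, not Slim.io's internal API.

```python
import re

# Mirrors the us-ssn YAML definition above, already parsed into a dict.
US_SSN = {
    "name": "us-ssn",
    "type": "regex",
    "pattern": r"\b\d{3}-\d{2}-\d{4}\b",
    "category": "SSN",
    "confidence": "high",
}


def run_regex_classifier(classifier, text):
    """Return one finding per non-overlapping pattern match."""
    pattern = re.compile(classifier["pattern"])
    return [
        {
            "classifier": classifier["name"],
            "category": classifier["category"],
            "confidence": classifier["confidence"],
            "start": m.start(),
            "end": m.end(),
            "value": m.group(),
        }
        for m in pattern.finditer(text)
    ]


findings = run_regex_classifier(US_SSN, "Employee SSN: 123-45-6789, phone 555-1234.")
```

The phone fragment `555-1234` is not reported because it does not fit the 3-2-4 digit layout.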

ML Classifiers

Machine learning models trained on labeled datasets. Best for unstructured data where patterns are not easily expressed as regex, such as names, physical addresses, and free-text medical records.


name: email-body-pii
type: ml
model: slim-io/pii-ner-v2
category: auto  # model determines the category
confidence_threshold: medium  # high | medium | low
description: "NER model for PII detection in unstructured text"

By default, ML classifiers run on Slim.io’s infrastructure. In BYOC mode, the model artifact is instead deployed alongside the scanning agent inside your VPC.

Dictionary Classifiers

Lookup against known value lists. Useful for matching against finite sets of known values such as medical terms, country names, or internal project codes.


name: medical-terms
type: dictionary
source: dictionaries/medical-conditions.txt
category: PHI
confidence: medium  # high | medium | low
case_sensitive: false
description: "Medical condition terms from ICD-10 codebook"
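
The lookup itself can be sketched as a whole-word search over the term list, honouring the `case_sensitive` flag. This is an assumption about the matching semantics (whole words, longest term preferred); `compile_dictionary` and `find_terms` are hypothetical helper names.

```python
import re


def compile_dictionary(terms, case_sensitive=False):
    """Build one alternation regex from a list of dictionary terms."""
    cleaned = sorted(
        {t.strip() for t in terms if t.strip()},
        key=lambda t: (-len(t), t),  # longest first so multi-word terms win
    )
    if not cleaned:
        return None
    flags = 0 if case_sensitive else re.IGNORECASE
    body = "|".join(re.escape(t) for t in cleaned)
    return re.compile(r"\b(?:" + body + r")\b", flags)


def find_terms(text, matcher):
    """Return (offset, matched text) pairs in document order."""
    if matcher is None:
        return []
    return [(m.start(), m.group()) for m in matcher.finditer(text)]


# In practice the terms would be read from the `source` file, one per line.
terms = ["Diabetes", "asthma", "", "type 2 diabetes"]
hits = find_terms(
    "Patient has Type 2 Diabetes and ASTHMA.",
    compile_dictionary(terms, case_sensitive=False),
)
```

Compiling the list into a single pattern keeps the scan to one pass over the text, which matters more as the dictionary grows.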

Proximity Classifiers

Contextual detection that combines a pattern match with nearby keyword presence. Reduces false positives by requiring contextual evidence within a configurable character window.


name: ssn-with-context
type: proximity
pattern: '\b\d{3}-\d{2}-\d{4}\b'
keywords: ["ssn", "social security", "social sec", "taxpayer"]
window: 50  # characters before/after the pattern match
category: SSN
confidence: high  # proximity match is the highest-confidence tier
description: "SSN pattern with nearby contextual keywords"
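
The window check can be sketched as follows. This assumes keywords are matched case-insensitively as plain substrings within `window` characters on either side of the pattern match; the function name and those matching details are illustrative assumptions rather than documented behaviour.

```python
import re


def proximity_matches(text, pattern, keywords, window=50):
    """Return pattern matches that have at least one keyword nearby."""
    lowered_keywords = [k.lower() for k in keywords]
    hits = []
    for m in re.finditer(pattern, text):
        before = text[max(0, m.start() - window):m.start()].lower()
        after = text[m.end():m.end() + window].lower()
        # Check each side separately so a keyword cannot "span" the match.
        if any(k in before or k in after for k in lowered_keywords):
            hits.append(m.group())
    return hits


SSN_PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"
KEYWORDS = ["ssn", "social security", "social sec", "taxpayer"]
sample = "SSN: 123-45-6789. " + "x" * 80 + " order 987-65-4321"

near = proximity_matches(sample, SSN_PATTERN, KEYWORDS, window=50)
wide = proximity_matches(sample, SSN_PATTERN, KEYWORDS, window=200)
```

With the default 50-character window only the first number qualifies; widening the window to 200 pulls the distant "SSN" label into range of the second number, which illustrates why a large window weakens the false-positive reduction.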

Checksum Classifiers

Validation algorithms that confirm a candidate value is internally consistent. Used for data types with embedded check digits, such as credit card numbers (Luhn), IBANs (Mod-97), and Canadian SINs (Luhn).


name: credit-card-luhn
type: checksum
algorithm: luhn
pattern: '\b\d{13,19}\b'
category: Credit Card
confidence: high  # checksum-validated detection is the highest-confidence tier
description: "Credit card number validated by Luhn algorithm"
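
For reference, the Luhn check named in `algorithm: luhn` can be sketched in a few lines. The `find_card_numbers` wrapper is a hypothetical helper that pairs the check with the pattern from the definition above; it is not Slim.io's implementation.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if a string of digits passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:   # double every second digit, counting from the right
            d *= 2
            if d > 9:    # same as summing the two digits of the product
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text):
    """Candidates come from the regex; only Luhn-valid ones are kept."""
    return [
        m.group()
        for m in re.finditer(r"\b\d{13,19}\b", text)
        if luhn_valid(m.group())
    ]


cards = find_card_numbers("card 4111111111111111, ref 4111111111111112")
```

The second number differs only in its last digit and fails the check, which is how a checksum classifier filters out arbitrary digit runs of the right length.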

Built-In Classifiers (170 Rules)

Slim.io ships with 170 classifiers organized into the following categories. International classifiers cover government IDs, tax identifiers, and national insurance numbers for 50+ countries.

Personal Identification

Classifier	Type	Validation
US SSN (with context)	Proximity	SSA area/group/serial validation
US SSN (dashes)	Regex	Cannot start with 000, 666, 9XX
US SSN (spaces)	Regex	SSA format rules
Canadian SIN (with context)	Proximity	Luhn algorithm
Canadian SIN (dashes)	Regex	Luhn algorithm
Canadian SIN (spaces)	Regex	Luhn algorithm
US Passport	Proximity	9-digit format
Canadian Passport	Proximity	2-letter + 6-digit format
Generic Passport	Proximity	1-2 letters + 6-9 digits
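
As an illustration of the SSN rows above, the structural rules can be expressed as a small validator. The table states the area rule (no 000, 666, or 9XX); the group `00` and serial `0000` checks below are the standard SSA rules and are assumed to be what "area/group/serial validation" refers to. The function name is hypothetical.

```python
import re

SSN_RE = re.compile(r"(\d{3})-(\d{2})-(\d{4})")


def ssn_structurally_valid(candidate: str) -> bool:
    """Apply area/group/serial rules to an XXX-XX-XXXX string."""
    m = SSN_RE.fullmatch(candidate)
    if m is None:
        return False
    area, group, serial = m.groups()
    if area in ("000", "666") or area.startswith("9"):
        return False
    # Assumed from SSA rules: group 00 and serial 0000 are never issued.
    return group != "00" and serial != "0000"
```

A check like this rejects values that merely look like an SSN, but it cannot confirm that a number was actually issued.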

Financial

Classifier	Type	Validation
Credit Card (with context)	Proximity	Luhn algorithm
Visa	Checksum	Starts with 4, Luhn
Mastercard	Checksum	Starts with 51-55, Luhn
American Express	Checksum	Starts with 34/37, Luhn
Discover	Checksum	Starts with 6011/65, Luhn
Diners Club	Checksum	Starts with 300-305/36/38, Luhn
JCB	Checksum	Starts with 2131/1800/35, Luhn
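
The prefix column above can be read as a simple lookup that runs alongside the Luhn check. The sketch below encodes only the prefixes listed in the table; `BRAND_PREFIXES` and `card_brand` are illustrative names, and length rules per brand are deliberately left out.

```python
# Prefixes copied from the table above; nothing else is assumed.
BRAND_PREFIXES = [
    ("Visa", ("4",)),
    ("Mastercard", ("51", "52", "53", "54", "55")),
    ("American Express", ("34", "37")),
    ("Discover", ("6011", "65")),
    ("Diners Club", ("300", "301", "302", "303", "304", "305", "36", "38")),
    ("JCB", ("2131", "1800", "35")),
]


def card_brand(number: str):
    """Return the brand whose prefix matches, or None if no prefix applies."""
    for brand, prefixes in BRAND_PREFIXES:
        if number.startswith(prefixes):  # str.startswith accepts a tuple
            return brand
    return None
```

Because none of the listed prefixes overlap across brands, the order of the entries does not affect the result.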

US Driver’s Licenses (State-Specific)

Classifier	Type	States Covered
State-specific patterns	Proximity	CA, TX, FL, NY, PA, IL, OH, GA, NC, MI, WA, AZ, MA, VA, NJ
Generic US DL	Proximity	Fallback for remaining states

Canadian Driver’s Licenses (Provincial)

Classifier	Type	Provinces Covered
Provincial patterns	Proximity	ON, QC, BC, AB, MB, SK, NS, NB, NL, PE

Contact Information

Classifier	Type	Coverage
Phone (with context)	Proximity	US and Canadian numbers
Phone (international)	Regex	+1 country code format
Phone (parentheses)	Regex	(XXX) XXX-XXXX
Phone (dashes, dots, spaces)	Regex	All separator formats
Phone (with extension)	Regex	ext/x/extension suffix
Email Address	Regex	Standard email format
Canadian Postal Code	Regex	A1A 1A1 format

Health Information (PHI)

Classifier	Type	Coverage
US Medicare	Proximity	MBI format
US Medicaid	Proximity	10-12 digit format
US Health Insurance ID	Proximity	Prefix + 8-12 digits
Ontario OHIP	Proximity	10-digit + 2-letter
Quebec RAMQ	Proximity	4-letter + 8-digit
BC PHN	Proximity	10-digit
Alberta Health Card	Proximity	9-digit + 2-digit
Manitoba Health Card	Proximity	9-digit
Saskatchewan Health Card	Proximity	9-digit

Network & Technical

Classifier	Type	Description
IP Address	Regex	IPv4 addresses (excludes version strings)
MAC Address	Regex	Standard MAC address format

International Government IDs

Region	Classifiers	Validation
Europe (EU/EEA and UK)	German Tax ID, French NIR, Spanish DNI/NIE, Italian Codice Fiscale, Dutch BSN, Polish PESEL, Swedish Personal Number, UK NHS, Belgian National Number, Greek AFM	Checksum validation where applicable
Asia-Pacific	Indian Aadhaar, Indian PAN, Japanese My Number, South Korean RRN, Australian TFN, Singapore NRIC, Hong Kong ID, China National ID, Taiwan ID, Thailand ID, Malaysia MyKad, Philippines SSS, NZ IRD, Vietnam CCCD	Verhoeff (Aadhaar), Luhn, Mod-11
Latin America	Brazil CPF, Brazil CNPJ, Chile RUT, Colombia Cedula, Peru DNI, Uruguay Cedula, Venezuela Cedula	Mod-11, Mod-97
Middle East / Africa	Turkey TC Kimlik, Egypt National ID, Saudi National ID, Israel Teudat Zehut, South Africa ID	Mod-11, Luhn

Financial (International)

Classifier	Type	Coverage
IBAN	Checksum	Mod-97 validation, 70+ country formats
SWIFT/BIC	Regex	8 or 11 character bank codes
Routing Number	Checksum	US ABA routing (9 digits, checksum)
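
Both checksums in this table are short enough to sketch. The functions below implement the standard IBAN Mod-97 check and the ABA routing-number checksum; they are generic reference implementations with hypothetical names, and they skip the per-country length and format rules that the built-in classifier is described as covering.

```python
def iban_valid(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map
    letters to numbers (A=10 ... Z=35), and require remainder 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def aba_valid(routing: str) -> bool:
    """ABA checksum: digits weighted 3, 7, 1 repeating must sum to a
    multiple of 10."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    d = [int(c) for c in routing]
    total = (
        3 * (d[0] + d[3] + d[6])
        + 7 * (d[1] + d[4] + d[7])
        + (d[2] + d[5] + d[8])
    )
    return total % 10 == 0
```

A single changed digit makes either check fail, which is why checksum-validated findings can be reported with more confidence than a bare format match.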

Credentials & Secrets

Classifier	Type	Description
AWS Access Key	Regex	AKIA prefix + 16 alphanumeric
Private Key / Certificate	Regex	PEM header detection
Generic API Key	Regex	Common API key patterns
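
A rough sketch of the first two rows: the AWS pattern follows the table's description (the `AKIA` prefix plus 16 characters, assumed here to be uppercase letters and digits), and the PEM pattern matches common `BEGIN` headers. Both expressions and the `find_secrets` helper are illustrative assumptions, not the shipped rules.

```python
import re

SECRET_PATTERNS = [
    # AKIA prefix followed by 16 uppercase alphanumerics (20 characters total).
    ("aws-access-key", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    # PEM headers such as "BEGIN RSA PRIVATE KEY" or "BEGIN CERTIFICATE".
    (
        "pem-header",
        re.compile(r"-----BEGIN (?:(?:[A-Z0-9]+ )*PRIVATE KEY|CERTIFICATE)-----"),
    ),
]


def find_secrets(text):
    """Return (label, matched text) pairs, grouped by pattern order."""
    return [
        (label, m.group())
        for label, pattern in SECRET_PATTERNS
        for m in pattern.finditer(text)
    ]


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
found = find_secrets("key=AKIAIOSFODNN7EXAMPLE\n-----BEGIN RSA PRIVATE KEY-----\n")
```

Public-key headers are intentionally not matched, since they are not secrets.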

Confidence Scoring

Each classifier produces a base confidence score between 0.0 and 1.0. The final score is adjusted by:

Pattern Specificity — More specific patterns (e.g., with checksum validation) yield higher confidence
Contextual Signals — Proximity matches and keyword presence boost confidence
Multiple Classifiers — When multiple classifiers match the same data, confidence is combined using a Bayesian merge (not simple addition)
Suppression Rules — Known false positive patterns can reduce confidence to zero
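
The exact merge formula is not specified here. As one plausible reading of "Bayesian merge", the sketch below treats each classifier's score as independent evidence and combines them in log-odds space, with suppression forcing the result to zero. The function name, the `prior` parameter, and the independence assumption are all illustrative.

```python
import math

EPS = 1e-6  # keeps scores of exactly 0.0 or 1.0 finite in log-odds space


def _logit(p: float) -> float:
    p = min(max(p, EPS), 1 - EPS)
    return math.log(p / (1 - p))


def merge_confidence(scores, prior=0.5, suppressed=False):
    """Combine per-classifier scores as independent evidence."""
    if suppressed or not scores:
        return 0.0
    prior_logit = _logit(prior)
    # Each score shifts the log-odds by its own evidence relative to the prior.
    total = prior_logit + sum(_logit(s) - prior_logit for s in scores)
    return 1 / (1 + math.exp(-total))


merged = merge_confidence([0.8, 0.7])
```

Two agreeing classifiers at 0.8 and 0.7 merge to roughly 0.90: higher than either alone, yet still below 1.0, whereas simple addition would overshoot to 1.5.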

Confidence Tiers

Slim.io groups detections into four tiers. The exact numeric boundaries are tunable per environment via Settings > Detection; the tier labels themselves are stable.

Tier	Interpretation
High	Strong match, high certainty
Medium	Probable match, review recommended
Low	Possible match, likely needs LLM Assist
Noise	Suppressed or extremely uncertain

Findings below the configured discard threshold are not stored. The discard threshold is configurable per classifier and globally.
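
The tiering and discard logic can be pictured as below. The numeric boundaries and the default discard threshold are placeholders chosen for illustration, since the real values are environment-specific; `tier_for` and `should_store` are hypothetical names.

```python
from bisect import bisect_right

# Placeholder boundaries; real values come from Settings > Detection.
BOUNDARIES = [0.30, 0.60, 0.85]
TIERS = ["Noise", "Low", "Medium", "High"]


def tier_for(score: float) -> str:
    """Map a final confidence score to its tier label."""
    return TIERS[bisect_right(BOUNDARIES, score)]


def should_store(score, classifier, global_discard=0.10, per_classifier=None):
    """A per-classifier discard threshold overrides the global one."""
    threshold = (per_classifier or {}).get(classifier, global_discard)
    return score >= threshold
```

Keeping the labels separate from the boundaries is what lets the thresholds be retuned without changing how findings are reported.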

Risk Weights

Each classifier has a riskWeight between 0.0 and 1.0 that determines how much a finding contributes to the overall risk score of an asset. Financial and health data classifiers typically have higher risk weights than contact information classifiers.
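
How findings roll up into an asset score is not described here, so the following is only one possible aggregation: each finding contributes `riskWeight * confidence`, and contributions are combined so the result stays between 0.0 and 1.0. The weights, the default weight, and the `asset_risk` function are hypothetical.

```python
def asset_risk(findings, risk_weights, default_weight=0.5):
    """Combine findings into a single 0.0-1.0 asset risk score.

    findings: iterable of (classifier_name, confidence) pairs.
    risk_weights: mapping of classifier_name to riskWeight.
    """
    remaining = 1.0  # share of "no risk" left after each finding
    for classifier, confidence in findings:
        weight = risk_weights.get(classifier, default_weight)
        remaining *= 1.0 - weight * confidence
    return 1.0 - remaining


# Hypothetical weights: financial data outweighs contact information.
weights = {"credit-card-luhn": 0.9, "email": 0.3}
score = asset_risk([("credit-card-luhn", 0.9), ("email", 0.8)], weights)
```

With this scheme a single high-weight, high-confidence finding dominates the score, while additional low-weight findings raise it only slightly.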

Managing Classifiers

Viewing Classifiers

In the Customer Dashboard, navigate to Classifiers in the sidebar. The Active Rules tab displays all classifiers in a card grid showing:

Type badge — Color-coded indicator (ML = violet, Regex = blue, Dictionary = orange)
Category — The sensitive data category (e.g., SSN, Credit Card, Email)
Confidence bar — Visual indicator of the classifier’s base confidence score
Active findings — How many current findings this classifier has generated
Enable/disable toggle — Instantly activate or deactivate the classifier

Filtering and Searching

Use the filter bar above the classifier grid to narrow results:

Search — Free-text search across classifier names and categories
Type filter — Filter by Regex, ML, or Dictionary
Category filter — Filter by specific PII category
Status filter — Show only Enabled or Disabled classifiers

The result count updates in real time as you apply filters.

Enabling and Disabling Classifiers

Click the toggle on any classifier card to enable or disable it. The toggle uses optimistic updates, so the change appears instantly in the UI. Disabled classifiers are not executed during scans.

Disabling a classifier does not remove existing findings generated by that classifier. To clear existing findings, run a new scan after disabling.

Low Confidence Warnings

Classifiers with a base confidence score below 0.6 (60%) display a yellow warning banner on their card. These classifiers tend to produce more false positives and should be reviewed or paired with proximity keywords to improve accuracy.

Suppression Rules

Suppression rules prevent known false positives from generating findings:


name: suppress-test-ssns
type: suppression
target_classifier: us-ssn
patterns:
  - '000-00-0000'
  - '123-45-6789'
  - '999-\d{2}-\d{4}'
description: "Suppress known test SSN values"

The suppression audit table (available on the Classifiers page) tracks all active suppressions with hit counts, showing how many times each rule has prevented a false positive.
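
A suppression rule with a hit counter can be sketched as follows. It assumes each entry under `patterns` is a regular expression that must match the entire detected value; the class and method names are hypothetical.

```python
import re


class SuppressionRule:
    """Drops findings from one classifier when the value matches a pattern."""

    def __init__(self, name, target_classifier, patterns):
        self.name = name
        self.target_classifier = target_classifier
        self.patterns = [re.compile(p) for p in patterns]
        self.hits = 0  # what an audit table would report per rule

    def suppresses(self, classifier: str, value: str) -> bool:
        if classifier != self.target_classifier:
            return False
        if any(p.fullmatch(value) for p in self.patterns):
            self.hits += 1
            return True
        return False


rule = SuppressionRule(
    name="suppress-test-ssns",
    target_classifier="us-ssn",
    patterns=[r"000-00-0000", r"123-45-6789", r"999-\d{2}-\d{4}"],
)
```

Counting hits per rule makes it easy to spot suppressions that never fire and can be retired.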

Classifier Templates

Group classifiers into reusable templates for different compliance requirements. See Classifier Templates for details.

Learn More

Classifier Templates — Reusable classifier groups for scans
Detection-as-Code — YAML-based classifier definitions managed through Git
LLM Assist — AI-powered false positive reduction for borderline findings
PII Detection Engine — The full detection pipeline architecture