
Classifiers

Classifiers are detection rules that identify specific types of sensitive data within your files and databases. Slim.io ships with 170 built-in classifiers covering PII categories across 50+ countries and supports custom classifier definitions via YAML.

How Classifiers Work

When a scan runs, every file passes through the active set of classifiers. Each classifier examines the content for a specific pattern or signal. When a match is found, the classifier produces a finding with a confidence score indicating how certain the match is. Findings are then deduplicated, scored, and stored in the Data Catalog.

The classifier execution order is deterministic: specific patterns (e.g., credit card with Luhn validation) run before generic patterns (e.g., bare digit sequences) to ensure the most precise match wins.
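The ordering rule above can be sketched as follows. This is a minimal illustration, not Slim.io's engine: the `Rule` tuple, its `priority` field, and the `generic-digits` pattern are hypothetical, and the sketch assumes that a span already claimed by a more specific rule is skipped by later rules.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    name: str
    pattern: str
    priority: int  # hypothetical field: lower number = more specific, runs first


RULES = [
    Rule("generic-digits", r"\b\d[\d-]{7,}\d\b", priority=90),
    Rule("us-ssn", r"\b\d{3}-\d{2}-\d{4}\b", priority=10),
]


def classify(text: str) -> list[tuple[str, str]]:
    """Run rules in a fixed order; the first rule to claim a span keeps it."""
    claimed: list[tuple[int, int]] = []
    findings: list[tuple[str, str]] = []
    # Sorting by (priority, name) makes the order deterministic.
    for rule in sorted(RULES, key=lambda r: (r.priority, r.name)):
        for m in re.finditer(rule.pattern, text):
            overlaps = any(m.start() < end and start < m.end()
                           for start, end in claimed)
            if overlaps:
                continue  # a more specific rule already matched here
            claimed.append(m.span())
            findings.append((rule.name, m.group()))
    return findings
```

With this ordering, a value such as `123-45-6789` is reported once as `us-ssn` rather than a second time as a generic digit sequence.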

Classifier Types

Regex Classifiers

Pattern matching against regular expressions. Best suited for structured data with predictable formats such as SSNs, phone numbers, and email addresses.

    name: us-ssn
    type: regex
    pattern: '\b\d{3}-\d{2}-\d{4}\b'
    category: SSN
    confidence: high  # high | medium | low, relative to your tuning
    description: "US Social Security Number (XXX-XX-XXXX format)"

ML Classifiers

Machine learning models trained on labeled datasets. Best for unstructured data where patterns are not easily expressed as regex, such as names, physical addresses, and free-text medical records.

    name: email-body-pii
    type: ml
    model: slim-io/pii-ner-v2
    category: auto  # model determines the category
    confidence_threshold: medium  # high | medium | low
    description: "NER model for PII detection in unstructured text"

By default, ML classifiers run on Slim.io’s infrastructure. In BYOC (bring your own cloud) mode, the model artifact is instead deployed alongside the scanning agent inside your VPC.

Dictionary Classifiers

Lookup against known value lists. Useful for matching against finite sets of known values such as medical terms, country names, or internal project codes.

    name: medical-terms
    type: dictionary
    source: dictionaries/medical-conditions.txt
    category: PHI
    confidence: medium  # high | medium | low
    case_sensitive: false
    description: "Medical condition terms from ICD-10 codebook"
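A dictionary lookup of this kind can be sketched as below. The term list is inlined for illustration instead of being read from the `source` file, and `build_matcher` and `find_terms` are hypothetical helper names; the sketch assumes whole-word matching, which the configuration above does not specify.

```python
import re


def build_matcher(terms: list[str], case_sensitive: bool = False) -> re.Pattern:
    """Compile one alternation over all terms, longest term first."""
    cleaned = {t.strip() for t in terms if t.strip()}
    if not cleaned:
        raise ValueError("dictionary is empty")
    # Longest-first ordering lets multi-word terms win over their substrings.
    ordered = sorted(cleaned, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(rf"\b(?:{alternation})\b", flags)


def find_terms(text: str, matcher: re.Pattern) -> list[str]:
    """Return every dictionary term found in the text, in order."""
    return [m.group() for m in matcher.finditer(text)]


# Illustrative terms only; a real dictionary would be loaded from a file.
MATCHER = build_matcher(["diabetes", "Type 2 Diabetes", "asthma"])
```

Setting `case_sensitive=True` in the sketch mirrors the `case_sensitive` option in the YAML definition.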

Proximity Classifiers

Contextual detection that combines a pattern match with nearby keyword presence. Reduces false positives by requiring contextual evidence within a configurable character window.

    name: ssn-with-context
    type: proximity
    pattern: '\b\d{3}-\d{2}-\d{4}\b'
    keywords: ["ssn", "social security", "social sec", "taxpayer"]
    window: 50  # characters before/after the pattern match
    category: SSN
    confidence: high  # proximity match is the highest-confidence tier
    description: "SSN pattern with nearby contextual keywords"
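The character-window check can be sketched as follows, reusing the pattern, keywords, and window from the definition above. The function name `proximity_matches` is hypothetical, and the sketch assumes plain case-insensitive substring matching for keywords, which may differ from how the product tokenizes context.

```python
import re

PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
KEYWORDS = ["ssn", "social security", "social sec", "taxpayer"]
WINDOW = 50  # characters before/after the pattern match


def proximity_matches(text: str) -> list[str]:
    """Keep only pattern matches that have a keyword within the window."""
    hits = []
    for m in PATTERN.finditer(text):
        before = text[max(0, m.start() - WINDOW):m.start()]
        after = text[m.end():m.end() + WINDOW]
        # The separator stops a keyword from matching across the two halves.
        context = (before + " " + after).lower()
        if any(keyword in context for keyword in KEYWORDS):
            hits.append(m.group())
    return hits
```

A bare `123-45-6789` in an order log is dropped, while the same value next to "SSN" is kept.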

Checksum Classifiers

Validation algorithms that verify a candidate value against its embedded check digits. Used for data types such as credit card numbers (Luhn), IBANs (Mod-97), and Canadian SINs (Luhn).

    name: credit-card-luhn
    type: checksum
    algorithm: luhn
    pattern: '\b\d{13,19}\b'
    category: Credit Card
    confidence: high  # checksum-validated detection is the highest-confidence tier
    description: "Credit card number validated by Luhn algorithm"
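For reference, the Luhn check that such a classifier relies on can be sketched as below. The candidate pattern is taken from the definition above; `luhn_valid` and `find_card_numbers` are hypothetical names, and the sketch ignores separators such as spaces or dashes.

```python
import re

CANDIDATE = re.compile(r"\b\d{13,19}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk right to left, doubling every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit runs of 13 to 19 digits that pass the Luhn check."""
    return [m.group() for m in CANDIDATE.finditer(text)
            if luhn_valid(m.group())]
```

The checksum filters out most random digit runs of the right length, which is why checksum-validated matches can carry higher confidence than a bare pattern match.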

Built-In Classifiers (170 Rules)

Slim.io ships with 170 classifiers organized into the following categories. International classifiers cover government IDs, tax identifiers, and national insurance numbers for 50+ countries.

Personal Identification

| Classifier | Type | Validation |
| --- | --- | --- |
| US SSN (with context) | Proximity | SSA area/group/serial validation |
| US SSN (dashes) | Regex | Cannot start with 000, 666, 9XX |
| US SSN (spaces) | Regex | SSA format rules |
| Canadian SIN (with context) | Proximity | Luhn algorithm |
| Canadian SIN (dashes) | Regex | Luhn algorithm |
| Canadian SIN (spaces) | Regex | Luhn algorithm |
| US Passport | Proximity | 9-digit format |
| Canadian Passport | Proximity | 2-letter + 6-digit format |
| Generic Passport | Proximity | 1-2 letters + 6-9 digits |
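The SSN rules in the table can be illustrated with a short check. This sketch applies the publicly documented SSA constraints (area not 000, 666, or 900-999; group not 00; serial not 0000) to the dashed format; `is_plausible_ssn` is a hypothetical name, and the built-in classifiers may apply additional rules.

```python
import re

SSN_DASHED = re.compile(r"(\d{3})-(\d{2})-(\d{4})")


def is_plausible_ssn(candidate: str) -> bool:
    """Check the dashed format plus the area/group/serial constraints."""
    m = SSN_DASHED.fullmatch(candidate)
    if m is None:
        return False
    area, group, serial = m.groups()
    if area in ("000", "666") or area.startswith("9"):
        return False  # area numbers that are never assigned
    if group == "00" or serial == "0000":
        return False  # all-zero group or serial is never assigned
    return True
```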

Financial

| Classifier | Type | Validation |
| --- | --- | --- |
| Credit Card (with context) | Proximity | Luhn algorithm |
| Visa | Checksum | Starts with 4, Luhn |
| Mastercard | Checksum | Starts with 51-55, Luhn |
| American Express | Checksum | Starts with 34/37, Luhn |
| Discover | Checksum | Starts with 6011/65, Luhn |
| Diners Club | Checksum | Starts with 300-305/36/38, Luhn |
| JCB | Checksum | Starts with 2131/1800/35, Luhn |

US Driver’s Licenses (State-Specific)

| Classifier | Type | States Covered |
| --- | --- | --- |
| State-specific patterns | Proximity | CA, TX, FL, NY, PA, IL, OH, GA, NC, MI, WA, AZ, MA, VA, NJ |
| Generic US DL | Proximity | Fallback for remaining states |

Canadian Driver’s Licenses (Provincial)

| Classifier | Type | Provinces Covered |
| --- | --- | --- |
| Provincial patterns | Proximity | ON, QC, BC, AB, MB, SK, NS, NB, NL, PE |

Contact Information

| Classifier | Type | Coverage |
| --- | --- | --- |
| Phone (with context) | Proximity | US and Canadian numbers |
| Phone (international) | Regex | +1 country code format |
| Phone (parentheses) | Regex | (XXX) XXX-XXXX |
| Phone (dashes, dots, spaces) | Regex | All separator formats |
| Phone (with extension) | Regex | ext/x/extension suffix |
| Email Address | Regex | Standard email format |
| Canadian Postal Code | Regex | A1A 1A1 format |

Health Information (PHI)

| Classifier | Type | Coverage |
| --- | --- | --- |
| US Medicare | Proximity | MBI format |
| US Medicaid | Proximity | 10-12 digit format |
| US Health Insurance ID | Proximity | Prefix + 8-12 digits |
| Ontario OHIP | Proximity | 10-digit + 2-letter |
| Quebec RAMQ | Proximity | 4-letter + 8-digit |
| BC PHN | Proximity | 10-digit |
| Alberta Health Card | Proximity | 9-digit + 2-digit |
| Manitoba Health Card | Proximity | 9-digit |
| Saskatchewan Health Card | Proximity | 9-digit |

Network & Technical

| Classifier | Type | Description |
| --- | --- | --- |
| IP Address | Regex | IPv4 addresses (excludes version strings) |
| MAC Address | Regex | Standard MAC address format |

International Government IDs

| Region | Classifiers | Validation |
| --- | --- | --- |
| Europe (EU/EEA, UK) | German Tax ID, French NIR, Spanish DNI/NIE, Italian Codice Fiscale, Dutch BSN, Polish PESEL, Swedish Personal Number, UK NHS, Belgian National Number, Greek AFM | Checksum validation where applicable |
| Asia-Pacific | Indian Aadhaar, Indian PAN, Japanese My Number, South Korean RRN, Australian TFN, Singapore NRIC, Hong Kong ID, China National ID, Taiwan ID, Thailand ID, Malaysia MyKad, Philippines SSS, NZ IRD, Vietnam CCCD | Verhoeff (Aadhaar), Luhn, Mod-11 |
| Latin America | Brazil CPF, Brazil CNPJ, Chile RUT, Colombia Cedula, Peru DNI, Uruguay Cedula, Venezuela Cedula | Mod-11, Mod-97 |
| Middle East / Africa | Turkey TC Kimlik, Egypt National ID, Saudi National ID, Israel Teudat Zehut, South Africa ID | Mod-11, Luhn |

Financial (International)

| Classifier | Type | Coverage |
| --- | --- | --- |
| IBAN | Checksum | Mod-97 validation, 70+ country formats |
| SWIFT/BIC | Regex | 8 or 11 character bank codes |
| Routing Number | Checksum | US ABA routing (9 digits, checksum) |
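The two checksum schemes named in this table can be sketched as below: the standard IBAN Mod-97 check and the ABA routing number check with its 3-7-1 weights. The function names are hypothetical, and the IBAN sketch validates only the checksum and overall length, not the per-country formats the built-in classifier covers.

```python
def iban_valid(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and test remainder 1."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def aba_valid(routing: str) -> bool:
    """ABA check: digits weighted 3, 7, 1 (repeating) must sum to a multiple of 10."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    d = [int(ch) for ch in routing]
    total = (3 * (d[0] + d[3] + d[6])
             + 7 * (d[1] + d[4] + d[7])
             + (d[2] + d[5] + d[8]))
    return total % 10 == 0
```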

Credentials & Secrets

| Classifier | Type | Description |
| --- | --- | --- |
| AWS Access Key | Regex | AKIA prefix + 16 alphanumeric |
| Private Key / Certificate | Regex | PEM header detection |
| Generic API Key | Regex | Common API key patterns |

Confidence Scoring

Each classifier produces a base confidence score between 0.0 and 1.0. The final score is adjusted by:

  • Pattern Specificity — More specific patterns (e.g., with checksum validation) yield higher confidence
  • Contextual Signals — Proximity matches and keyword presence boost confidence
  • Multiple Classifiers — When multiple classifiers match the same data, confidence is combined using a Bayesian merge (not simple addition)
  • Suppression Rules — Known false positive patterns can reduce confidence to zero
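The exact merge formula is not given here. As a minimal sketch of why a probabilistic merge differs from simple addition, the example below uses a noisy-OR combination, which treats each classifier's score as independent evidence. This is an assumed stand-in, not necessarily the formula Slim.io applies, and `merge_confidences` is a hypothetical name.

```python
def merge_confidences(scores: list[float]) -> float:
    """Noisy-OR merge: probability that at least one classifier is right,
    assuming the classifiers err independently (an assumption of this sketch)."""
    remaining_doubt = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        remaining_doubt *= (1.0 - score)
    return 1.0 - remaining_doubt
```

Under this scheme, scores of 0.6 and 0.5 merge to 0.8, whereas simple addition would give 1.1 and exceed the valid range; the merged value always stays between 0.0 and 1.0.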

Confidence Tiers

Slim.io groups detections into four tiers. The numeric boundaries between tiers are tunable per environment via Settings > Detection; the tier labels themselves stay the same.

| Tier | Interpretation |
| --- | --- |
| High | Strong match, high certainty |
| Medium | Probable match, review recommended |
| Low | Possible match, likely needs LLM Assist |
| Noise | Suppressed or extremely uncertain |

Findings below the configured discard threshold are not stored. The discard threshold is configurable per classifier and globally.

Risk Weights

Each classifier has a riskWeight between 0.0 and 1.0 that determines how much a finding contributes to the overall risk score of an asset. Financial and health data classifiers typically have higher risk weights than contact information classifiers.

Managing Classifiers

Viewing Classifiers

In the Customer Dashboard, navigate to Classifiers in the sidebar. The Active Rules tab displays all classifiers in a card grid showing:

  • Type badge — Color-coded indicator (ML = violet, Regex = blue, Dictionary = orange)
  • Category — The sensitive data category (e.g., SSN, Credit Card, Email)
  • Confidence bar — Visual indicator of the classifier’s base confidence score
  • Active findings — How many current findings this classifier has generated
  • Enable/disable toggle — Instantly activate or deactivate the classifier

Filtering and Searching

Use the filter bar above the classifier grid to narrow results:

  • Search — Free-text search across classifier names and categories
  • Type filter — Filter by Regex, ML, or Dictionary
  • Category filter — Filter by specific PII category
  • Status filter — Show only Enabled or Disabled classifiers

The result count updates in real time as you apply filters.

Enabling and Disabling Classifiers

Click the toggle on any classifier card to enable or disable it. The toggle uses optimistic updates, so the change appears instantly in the UI. Disabled classifiers are not executed during scans.

Disabling a classifier does not remove existing findings generated by that classifier. To clear existing findings, run a new scan after disabling.

Low Confidence Warnings

Classifiers with a base confidence score below 60% (0.6 on the 0.0 to 1.0 scale) display a yellow warning banner on their card. These classifiers may produce a higher rate of false positives and should be reviewed or paired with proximity keywords to improve accuracy.

Suppression Rules

Suppression rules prevent known false positives from generating findings:

    name: suppress-test-ssns
    type: suppression
    target_classifier: us-ssn
    patterns:
      - '000-00-0000'
      - '123-45-6789'
      - '999-\d{2}-\d{4}'
    description: "Suppress known test SSN values"

The suppression audit table (available in the Classifiers page) tracks all active suppressions with hit counts, showing how many times each rule has prevented a false positive.
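The behavior of a suppression rule with hit counting can be sketched as follows, using the patterns from the rule above. The names `is_suppressed` and `hit_counts` are hypothetical, and the sketch assumes each pattern must match the whole detected value rather than a substring.

```python
import re
from collections import Counter

SUPPRESSION_PATTERNS = [r"000-00-0000", r"123-45-6789", r"999-\d{2}-\d{4}"]
COMPILED = [re.compile(p) for p in SUPPRESSION_PATTERNS]

# Per-pattern hit counts, similar in spirit to the audit table's counters.
hit_counts: Counter = Counter()


def is_suppressed(value: str) -> bool:
    """Return True and record a hit if any suppression pattern matches."""
    for regex in COMPILED:
        if regex.fullmatch(value):
            hit_counts[regex.pattern] += 1
            return True
    return False
```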

Classifier Templates

Group classifiers into reusable templates for different compliance requirements. See Classifier Templates for details.
