Data Classification

Slim.io classifies sensitive data using a layered detection system that combines regex pattern matching, dictionary lookups, and contextual analysis. This page covers the full set of built-in detection patterns, how to create custom classifiers using the Detection-as-Code system, and how confidence scoring works.

Built-In Detection Patterns

Slim.io ships with 170 built-in detection patterns covering PII, PHI, PCI, and credential categories across 50+ countries. Detection methods include regex matching, checksum validation, proximity keywords, and AI-assisted classification. All built-in patterns are active by default, and each can be disabled or tuned individually.

Personal Identifiers

| Category | Detection Method | Example Pattern |
| --- | --- | --- |
| US Social Security Number | Regex + Proximity + Checksum | XXX-XX-XXXX |
| US Passport Number | Regex + Proximity | 9-character alphanumeric |
| US Driver’s License | Regex (state-specific) | Varies by state |
| National ID (non-US) | Regex + Proximity | UK NINO, Canadian SIN, etc. |
| Date of Birth | Regex + Proximity | MM/DD/YYYY, YYYY-MM-DD |
| Full Name | ML (NER) + Proximity | Named entity recognition |
| Gender | Dictionary + Proximity | Keyword matching |
| Ethnicity / Race | Dictionary + Proximity | Keyword matching |
| Religion | Dictionary + Proximity | Keyword matching |
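
US Social Security Numbers carry no check digit, so the validation step for that row is presumably structural. The sketch below assumes that interpretation and rejects values the Social Security Administration never issues (area 000, 666, or 900-999; group 00; serial 0000). The function name `plausible_ssn` is hypothetical and is not part of Slim.io.

```python
import re

# Three groups of ASCII digits in the XXX-XX-XXXX layout.
SSN_PATTERN = re.compile(r"([0-9]{3})-([0-9]{2})-([0-9]{4})")


def plausible_ssn(candidate: str) -> bool:
    """Return True if `candidate` is formatted like an SSN and uses issuable ranges."""
    match = SSN_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    area, group, serial = match.groups()
    # Area numbers 000, 666, and 900-999 are never assigned.
    if area in ("000", "666") or area.startswith("9"):
        return False
    # Group 00 and serial 0000 are never assigned.
    if group == "00" or serial == "0000":
        return False
    return True
```

A structural check like this only filters out impossible values; proximity keywords are still needed to separate real SSNs from other nine-digit numbers.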

Financial Data

| Category | Detection Method | Example Pattern |
| --- | --- | --- |
| Credit Card Number | Regex + Luhn Checksum | 13 to 19 digits with Luhn validation |
| Bank Account Number | Regex + Proximity | Varies by country |
| Routing Number (ABA) | Regex + Checksum | 9-digit with ABA checksum |
| IBAN | Regex + Checksum | Country-specific format with mod-97 |
| SWIFT/BIC Code | Regex | 8-11 character bank identifier |
| Tax ID / EIN | Regex + Proximity | XX-XXXXXXX |
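
To illustrate what checksum validation adds on top of a regex match, here is a minimal sketch of the Luhn and ABA routing checks named in the table. These are the standard public algorithms; the function names are illustrative and do not describe Slim.io's internal implementation.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` (13 to 19 of them) pass the Luhn check."""
    digits = [int(ch) for ch in number if ch.isascii() and ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def aba_valid(routing: str) -> bool:
    """Return True if a 9-digit routing number passes the ABA weighted checksum."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    weights = (3, 7, 1, 3, 7, 1, 3, 7, 1)
    total = sum(int(ch) * weight for ch, weight in zip(routing, weights))
    return total % 10 == 0
```

A random 16-digit string passes the Luhn check only about one time in ten, which is why a checksum-validated match can start from a higher base confidence than a bare regex match.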

Contact Information

| Category | Detection Method | Example Pattern |
| --- | --- | --- |
| Email Address | Regex | Standard RFC 5322 pattern |
| Phone Number | Regex (international) | US, UK, EU, APAC formats |
| Physical Address | ML (NER) + Regex | Street address extraction |
| IP Address | Regex | IPv4 and IPv6 |
| MAC Address | Regex | XX:XX:XX:XX:XX:XX |
| URL with Auth | Regex | URLs containing credentials |
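
As a rough illustration of the "URL with Auth" category, the sketch below finds URL-shaped strings and keeps only those whose authority section embeds a username and password. It uses Python's standard `urllib.parse`; the candidate regex and the function name `urls_with_credentials` are assumptions made for this example, not Slim.io's actual pattern.

```python
import re
from urllib.parse import urlsplit

# Loose candidate pattern: a scheme, "://", then everything up to whitespace or a quote.
URL_CANDIDATE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://[^\s'\"<>]+")


def urls_with_credentials(text: str) -> list[str]:
    """Return URLs in `text` that carry both a username and a password."""
    hits = []
    for match in URL_CANDIDATE.finditer(text):
        url = match.group(0)
        try:
            parts = urlsplit(url)
        except ValueError:
            # Malformed authority (for example, a broken IPv6 literal); skip it.
            continue
        if parts.username and parts.password:
            hits.append(url)
    return hits
```

For example, `postgres://admin:s3cret@db.internal:5432/app` would be reported, while `https://example.com/docs` would not.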

Health Data (PHI)

| Category | Detection Method | Example Pattern |
| --- | --- | --- |
| Medical Record Number | Regex + Proximity | Facility-specific patterns |
| Insurance ID | Regex + Proximity | Payer-specific formats |
| Drug / Prescription Name | Dictionary | FDA drug database |
| Medical Condition | Dictionary | ICD-10 code mapping |
| Lab Result | Regex + Proximity | Numeric values with medical units |
| Health Plan Number | Regex + Proximity | Payer-specific formats |

Credentials & Secrets

| Category | Detection Method | Example Pattern |
| --- | --- | --- |
| API Key | Regex + Proximity | Provider-specific key prefixes |
| AWS Access Key | Regex | AKIA prefix, 20 characters |
| Private Key (RSA/EC) | Regex | PEM header detection |
| Password | Regex + Proximity | Keyword proximity with value |
| JWT Token | Regex | Base64url-encoded three-part structure |
| Database Connection String | Regex | Protocol-specific URI patterns |
| OAuth Token | Regex + Proximity | Bearer token patterns |
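
The following sketch shows how three of these patterns might be expressed with Python's `re` module. The regexes are approximations written for this example (Slim.io's built-in patterns are not published here), and `find_secrets` and `jwt_header_decodes` are hypothetical names. The JWT check relies on the fact that a base64url-encoded JSON header starts with `eyJ`.

```python
import base64
import json
import re

PATTERNS = {
    # "AKIA" followed by 16 uppercase letters or digits (20 characters total).
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM header for PKCS#1 RSA, SEC1 EC, or PKCS#8 private keys.
    "pem_private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    # Three dot-separated base64url segments, the first starting with "eyJ".
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
}


def jwt_header_decodes(token: str) -> bool:
    """Return True if the first segment decodes to a JSON object with an "alg" key."""
    header = token.split(".")[0]
    padded = header + "=" * (-len(header) % 4)  # restore stripped padding
    try:
        decoded = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        return False
    return isinstance(decoded, dict) and "alg" in decoded


def find_secrets(text: str) -> list[tuple[str, str]]:
    """Return (label, matched value) pairs for every pattern hit in `text`."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group(0)
            if label == "jwt" and not jwt_header_decodes(value):
                continue
            findings.append((label, value))
    return findings
```

Decoding the JWT header is an extra validation step beyond the regex; it plays the same role for tokens that a checksum plays for card numbers.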

Custom Classifiers

When built-in classifiers do not cover your specific data types, create custom classifiers using YAML definitions.

Classifier Types

| Type | Best For | How It Works |
| --- | --- | --- |
| Regex | Structured data with predictable formats | Pattern matching against regular expressions |
| Dictionary | Finite sets of known values | Lookup against curated word lists |
| Proximity | Reducing false positives on common patterns | Regex match + nearby keyword requirement |
| Checksum | Data with embedded validation digits | Pattern match + algorithmic verification |
| ML | Unstructured data, named entities | Machine learning model inference |

Creating a Custom Classifier

Define your classifier in YAML and deploy via the dashboard, API, or Git sync:

    apiVersion: slim.io/v1
    kind: Classifier
    metadata:
      name: internal-employee-id
      description: "Company internal employee identifier"
    spec:
      type: proximity
      pattern: '\bEMP-\d{6}\b'
      keywords: ["employee", "emp id", "staff", "worker"]
      window: 100
      category: Internal ID
      confidence: high  # high | medium | low, relative to your tuning
      enabled: true
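
To make the proximity behavior concrete, here is a minimal Python sketch of how a classifier with the `pattern`, `keywords`, and `window` fields above could be evaluated. It assumes `window` counts characters on each side of the match and that keywords are matched case-insensitively; neither detail is confirmed here, and the `ProximityClassifier` class is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class ProximityClassifier:
    pattern: str
    keywords: list[str]
    window: int  # assumption: characters of context on each side of the match

    def find(self, text: str) -> list[str]:
        """Return pattern matches that have at least one keyword within the window."""
        hits = []
        for match in re.finditer(self.pattern, text):
            start = max(0, match.start() - self.window)
            end = min(len(text), match.end() + self.window)
            context = text[start:end].lower()
            if any(keyword.lower() in context for keyword in self.keywords):
                hits.append(match.group(0))
        return hits
```

With the definition above, `EMP-004521` in "Employee record: EMP-004521 was updated" would be reported, while the same value in "Ticket EMP-004521 closed" would not, because no keyword appears nearby.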

Detection-as-Code

Store classifier definitions in a Git repository and sync them automatically:

    slim-io-config/
      classifiers/
        internal-employee-id.yaml
        vendor-contract-id.yaml
        custom-health-code.yaml

Enable Git sync under Settings > Integrations to deploy classifier changes on merge to your main branch. See Detection-as-Code for the full workflow.
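
Because classifier changes deploy on merge, it can help to validate definitions in CI before they reach the main branch. The sketch below is one possible pre-merge check, not a Slim.io tool: it takes a definition that has already been parsed into a Python dict (parsing the YAML itself would need a YAML library) and applies rules inferred from the example classifier on this page. The lowercase type names and the rule set are assumptions, and Python's regex dialect may differ from the one Slim.io uses.

```python
import re

# Assumed spellings, based on the classifier types table and the YAML example.
KNOWN_TYPES = {"regex", "dictionary", "proximity", "checksum", "ml"}
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def validate_classifier(definition: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    if definition.get("kind") != "Classifier":
        errors.append("kind must be 'Classifier'")
    if not (definition.get("metadata") or {}).get("name"):
        errors.append("metadata.name is required")

    spec = definition.get("spec") or {}
    if spec.get("type") not in KNOWN_TYPES:
        errors.append(f"spec.type must be one of {sorted(KNOWN_TYPES)}")
    if spec.get("confidence") not in CONFIDENCE_LEVELS:
        errors.append(f"spec.confidence must be one of {sorted(CONFIDENCE_LEVELS)}")

    pattern = spec.get("pattern")
    if pattern is not None:
        try:
            re.compile(pattern)
        except re.error as exc:
            errors.append(f"spec.pattern does not compile: {exc}")

    if spec.get("type") == "proximity" and not spec.get("keywords"):
        errors.append("proximity classifiers need at least one keyword")
    return errors
```

A CI job could run this over every file under `classifiers/` and fail the build if any list of errors is non-empty.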

Confidence Scoring

Every detection produces a confidence score between 0.0 and 1.0. The score reflects how certain the engine is that the detected value is truly sensitive data.

Score Components

  • Base confidence: Set by the classifier definition (a checksum-validated detection starts higher than a bare regex match)
  • Contextual boost: Proximity keyword matches increase confidence (e.g., “SSN:” near a 9-digit number)
  • Multi-classifier merge: When multiple classifiers match the same data, their confidences are combined using Bayesian merging
  • Suppression rules: Known false-positive patterns reduce confidence to zero
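
The components above can be sketched as a small scoring pipeline. The exact formulas Slim.io uses are not given on this page, so everything below is an assumption: the boost is modeled as a capped addition, and the merge treats each classifier's score as independent evidence with a uniform prior (a naive Bayes odds product). All function names are hypothetical.

```python
from math import prod


def boosted(base: float, boost: float) -> float:
    """Apply a contextual boost to a base confidence, capped at 1.0."""
    return min(1.0, base + boost)


def bayesian_merge(scores: list[float]) -> float:
    """Combine scores as independent evidence under a uniform prior."""
    agree = prod(scores)
    disagree = prod(1.0 - score for score in scores)
    if agree + disagree == 0:
        return 0.5  # fully contradictory inputs: no information either way
    return agree / (agree + disagree)


def final_score(scores: list[float], suppressed: bool) -> float:
    """Merge classifier scores, then let a suppression rule force the result to zero."""
    if suppressed:
        return 0.0
    return bayesian_merge(scores)
```

Under this model, two classifiers that each report 0.8 merge to roughly 0.94, so agreement between classifiers raises confidence above either individual score, while a score of 0.5 leaves the other score unchanged.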

Confidence Tiers

Slim.io groups findings into four confidence tiers. The numeric boundaries between tiers are tunable per environment under Settings > Detection; the tier labels themselves are stable.

| Tier | Recommended Action |
| --- | --- |
| High | Automate remediation (tokenize, mask, quarantine) |
| Medium | Review recommended; consider LLM Assist for disambiguation |
| Low | Log for awareness; likely requires manual review |
| Noise | Suppressed; not stored by default |

The default discard threshold is configured per environment. Findings below it are not stored. Adjust this threshold per classifier or globally under Settings > Detection.
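
As a sketch of how scores map to tiers, the example below uses placeholder boundaries. The values 0.85, 0.60, and 0.30 are invented for illustration and are not Slim.io defaults; the real boundaries come from Settings > Detection. The sketch also assumes the discard threshold coincides with the Low/Noise boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierBoundaries:
    # Placeholder values for illustration only, not product defaults.
    high: float = 0.85
    medium: float = 0.60
    low: float = 0.30  # assumed to double as the discard threshold


def tier_for(score: float, boundaries: TierBoundaries = TierBoundaries()) -> str:
    """Map a confidence score in [0.0, 1.0] to one of the four tier labels."""
    if score >= boundaries.high:
        return "High"
    if score >= boundaries.medium:
        return "Medium"
    if score >= boundaries.low:
        return "Low"
    return "Noise"


def should_store(score: float, boundaries: TierBoundaries = TierBoundaries()) -> bool:
    """Findings in the Noise tier fall below the discard threshold and are dropped."""
    return tier_for(score, boundaries) != "Noise"
```

Keeping the boundaries in one object mirrors the idea that the labels stay fixed while the numeric cutoffs vary per environment.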

Tuning Detection Accuracy

  1. Review false positives: Use the Investigation view to identify classifiers that generate noise. Adjust confidence thresholds or add suppression rules.
  2. Enable LLM Assist: For findings inside the configurable ambiguous range, LLM Assist sends the surrounding context to an AI model for disambiguation.
  3. Add proximity keywords: Converting a regex classifier to a proximity classifier reduces false positives, because a match counts only when a relevant keyword appears nearby.
  4. Use suppression rules: Define patterns for known test data, placeholder values, or synthetic data that should not generate findings.
  5. Monitor category distribution: Check the category breakdown in the Executive Scorecard to identify classifiers that may be over- or under-detecting.
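
Two of the steps above, suppression rules and LLM Assist routing, can be sketched in a few lines. The rule list, the ambiguous range of 0.4 to 0.7, and the function names are all hypothetical; in Slim.io these are configured in the product rather than in code like this.

```python
import re

# Example rules for well-known placeholder values (illustrative, not exhaustive).
SUPPRESSION_RULES = [
    re.compile(r"^123-45-6789$"),  # placeholder SSN common in samples
    re.compile(r"^4111[ -]?1111[ -]?1111[ -]?1111$"),  # widely used test card number
    re.compile(r"@example\.(?:com|org|net)$", re.IGNORECASE),  # reserved example domains
]


def apply_suppression(value: str, score: float) -> float:
    """Force the score to zero when the detected value matches a suppression rule."""
    if any(rule.search(value) for rule in SUPPRESSION_RULES):
        return 0.0
    return score


def needs_llm_assist(score: float, ambiguous_range: tuple[float, float] = (0.4, 0.7)) -> bool:
    """Return True when the score falls inside the (assumed) ambiguous range."""
    low, high = ambiguous_range
    return low <= score <= high
```

Running suppression before the ambiguity check keeps known test data from being sent for disambiguation at all.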