Merchant Normalization
Maps raw transaction text to canonical merchant names.
How It Works
graph LR
Raw["REWE SAGT DANKE 1234"] --> Normalize[Text Normalization]
Normalize --> Match[Alias Matching]
Match --> Merchant["REWE"]
- Build text source: Combine the `payee`, `description`, and `payer` fields
- Match aliases: Check each merchant's aliases against the text
- Select best match: Prioritize by mode rank, then match length
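
The first step can be sketched as follows. This is a minimal illustration, not the code in `normalize.py`: the function name `build_text_source` is hypothetical, and the normalization shown (lowercasing and collapsing whitespace) is an assumption about what "Text Normalization" involves.

```python
import re

def build_text_source(txn: dict, fields=("payee", "description", "payer")) -> str:
    """Join the configured fields into one string used for alias matching."""
    # Skip fields that are missing or empty so they do not add stray spaces.
    parts = [str(txn.get(f) or "") for f in fields]
    text = " ".join(p for p in parts if p)
    # Assumed normalization: collapse whitespace runs, trim, lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()

txn = {"payee": "REWE SAGT DANKE 1234", "description": "  Kartenzahlung ", "payer": None}
text = build_text_source(txn)
print(text)
```

Lowercasing the text up front lets aliases such as `{ word: "rewe" }` be written in lowercase without a case-insensitive flag on every match.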
Alias Modes
| Mode | Priority | Description | Example |
|---|---|---|---|
| `word` | 3 (best) | Exact word boundary | `{ word: "rewe" }` |
| `contains` | 2 | Substring match | `{ contains: "aldi sagt danke" }` |
| `regex` | 1 | Pattern match | `{ regex: "e\\.?(center\|deka)" }` |
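
To make the difference between the three modes concrete, here is a hedged sketch of a matcher. `match_alias` and `MODE_RANK` are hypothetical names, and the sketch assumes the text has already been lowercased; the real logic lives in `normalize.py` and may differ.

```python
import re

# Priority values taken from the table above.
MODE_RANK = {"word": 3, "contains": 2, "regex": 1}

def match_alias(alias: dict, text: str):
    """Return (mode, matched_length) if the alias matches the text, else None."""
    (mode, pattern), = alias.items()  # each alias is a single-key mapping
    if mode == "word":
        # Literal text that must sit on word boundaries.
        m = re.search(r"\b" + re.escape(pattern) + r"\b", text)
    elif mode == "contains":
        # Literal substring anywhere in the text.
        m = re.search(re.escape(pattern), text)
    elif mode == "regex":
        # The pattern is used as written.
        m = re.search(pattern, text)
    else:
        raise ValueError(f"unknown alias mode: {mode}")
    return (mode, len(m.group(0))) if m else None

print(match_alias({"word": "dm"}, "admin fee"))      # no match: "dm" is inside a word
print(match_alias({"contains": "dm"}, "admin fee"))  # matches as a substring
```

The `word` mode is what keeps a short alias like `dm` from firing inside unrelated words, which is why it carries the highest priority.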
Merchant File Format
slug: merchants.germany
version: 2
normalize:
fields: [payee, description, payer]
merchants:
- name: "REWE"
aliases:
- { word: "rewe" }
- name: "EDEKA"
aliases:
- { regex: "e\\.?(center|deka)" }
- name: "DM"
aliases:
- { word: "dm" }
- { contains: "dm drogerie" }
Catalog Files
data/rules/merchants/
├── base.yaml # Global merchants (IKEA, Amazon)
├── germany.yaml # German retailers (REWE, Aldi, DM)
├── international.yaml # International (Netflix, Spotify)
└── saas.yaml # SaaS services
Adding a Merchant
- Find the right file (or create a new one)
- Add an entry with a unique `name`
- Add aliases (most specific mode first)
- name: "My Store"
aliases:
- { word: "mystore" } # exact match first
- { contains: "my store gmbh" } # fallback
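
Before committing a new entry, a quick sanity check can catch the most common mistakes. The sketch below assumes the YAML file has already been parsed into a plain dict; `validate_merchants` is a hypothetical helper, not part of `rules_loader.py`, and the loader may enforce different or additional rules.

```python
ALLOWED_MODES = {"word", "contains", "regex"}

def validate_merchants(doc: dict) -> list[str]:
    """Return a list of problems found; an empty list means no issues."""
    problems = []
    seen = set()
    for entry in doc.get("merchants", []):
        name = entry.get("name")
        if not name:
            problems.append("merchant without a name")
            continue
        if name in seen:
            problems.append(f"duplicate merchant name: {name}")
        seen.add(name)
        for alias in entry.get("aliases", []):
            # Each alias must be a single-key mapping with a known mode.
            if len(alias) != 1 or next(iter(alias)) not in ALLOWED_MODES:
                problems.append(f"{name}: invalid alias {alias!r}")
    return problems

doc = {
    "merchants": [
        {"name": "My Store", "aliases": [{"word": "mystore"}]},
        {"name": "My Store", "aliases": [{"prefix": "x"}]},
    ]
}
print(validate_merchants(doc))
```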
Matching Priority
When multiple aliases match:
1. Higher mode rank wins (word > contains > regex)
2. Longer match length wins
3. Earlier alias order wins
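
The three tie-breaking rules above map naturally onto a single sort key. The following is a sketch under assumptions: the `Candidate` type and `select_best` function are hypothetical, and `order` is assumed to be the position at which the alias was loaded (smaller means earlier).

```python
from typing import NamedTuple, Optional

MODE_RANK = {"word": 3, "contains": 2, "regex": 1}

class Candidate(NamedTuple):
    merchant: str
    mode: str
    length: int  # number of characters matched
    order: int   # assumed: position of the alias in load order

def select_best(candidates: list[Candidate]) -> Optional[str]:
    """Pick the winning merchant: mode rank, then match length, then alias order."""
    if not candidates:
        return None
    # Negate order so that an earlier alias compares as larger under max().
    best = max(candidates, key=lambda c: (MODE_RANK[c.mode], c.length, -c.order))
    return best.merchant

candidates = [
    Candidate("DM", "contains", 11, 1),
    Candidate("DM Markt", "word", 2, 2),
    Candidate("EDEKA", "regex", 5, 0),
]
print(select_best(candidates))
```

Note that a `word` match wins here even though it is the shortest, because mode rank is compared before match length.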
Code Location
| File | Purpose |
|---|---|
| `app/ingest/importer/pipeline/normalize.py` | Matching logic |
| `app/ingest/importer/pipeline/rules_loader.py` | YAML parsing |
| `app/ingest/importer/pipeline/ruleset_loader.py` | Catalog loading |