Merchants

Merchant Normalization

Maps raw transaction text to canonical merchant names.

How It Works

graph LR
    Raw["REWE SAGT DANKE 1234"] --> Normalize[Text Normalization]
    Normalize --> Match[Alias Matching]
    Match --> Merchant["REWE"]

1. Build text source: combine the payee, description, and payer fields
2. Match aliases: check each merchant's aliases against the text
3. Select best match: prioritize by mode rank, then match length, then alias order
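
The first step can be sketched in Python as follows. This is a minimal illustration, not the code in normalize.py: the function name `build_text_source`, the lowercasing, and the whitespace collapsing are assumptions about what "text normalization" involves.

```python
import re


def build_text_source(txn: dict, fields: list[str]) -> str:
    """Join the configured fields into one lowercase, single-spaced string.

    Hypothetical helper: the real pipeline may normalize differently.
    """
    # Missing or None fields contribute nothing to the text source.
    parts = [str(txn.get(field) or "") for field in fields]
    text = " ".join(part for part in parts if part)
    # Collapse runs of whitespace so aliases match regardless of spacing.
    return re.sub(r"\s+", " ", text).strip().lower()


txn = {"payee": "REWE SAGT DANKE 1234", "description": "  Kartenzahlung ", "payer": None}
text = build_text_source(txn, ["payee", "description", "payer"])
print(text)  # rewe sagt danke 1234 kartenzahlung
```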

Alias Modes

Mode      Priority  Description          Example
word      3 (best)  Exact word boundary  { word: "rewe" }
contains  2         Substring match      { contains: "aldi sagt danke" }
regex     1         Pattern match        { regex: "e\\.?(center|deka)" }
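
One way the three modes could be evaluated is to compile each alias to a regular expression. The sketch below assumes case-insensitive matching and a hypothetical `match_alias` helper; the actual matching logic may differ.

```python
import re


def match_alias(alias: dict, text: str):
    """Return (mode, matched_length) if the alias matches text, else None.

    Hypothetical helper; an alias is a one-key dict such as {"word": "rewe"}.
    """
    mode, pattern = next(iter(alias.items()))
    if mode == "word":
        # Match only when not surrounded by other word characters.
        rx = re.compile(r"(?<!\w)" + re.escape(pattern) + r"(?!\w)", re.IGNORECASE)
    elif mode == "contains":
        # Plain substring match; the pattern is taken literally.
        rx = re.compile(re.escape(pattern), re.IGNORECASE)
    elif mode == "regex":
        # The pattern is used as written.
        rx = re.compile(pattern, re.IGNORECASE)
    else:
        raise ValueError(f"unknown alias mode: {mode!r}")
    m = rx.search(text)
    return (mode, len(m.group(0))) if m else None


print(match_alias({"word": "rewe"}, "rewe sagt danke 1234"))      # ('word', 4)
print(match_alias({"word": "dm"}, "admin fee"))                   # None
print(match_alias({"regex": r"e\.?(center|deka)"}, "e.center berlin"))  # ('regex', 8)
```

Note that the YAML string "e\\.?(center|deka)" decodes to the regex e\.?(center|deka), which is what the raw string above spells out.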

Merchant File Format

slug: merchants.germany
version: 2
normalize:
  fields: [payee, description, payer]
merchants:
  - name: "REWE"
    aliases:
      - { word: "rewe" }

  - name: "EDEKA"
    aliases:
      - { regex: "e\\.?(center|deka)" }

  - name: "DM"
    aliases:
      - { word: "dm" }
      - { contains: "dm drogerie" }
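
A file in this format could be sanity-checked after parsing with something like the sketch below. It operates on an already-parsed dictionary and does not reflect what rules_loader.py actually enforces; `validate_merchant_file` and the specific checks are assumptions for illustration.

```python
import re

ALIAS_MODES = {"word", "contains", "regex"}


def validate_merchant_file(doc: dict) -> list[str]:
    """Return a list of human-readable problems found in a parsed merchant file."""
    errors = []
    seen_names = set()
    for i, merchant in enumerate(doc.get("merchants", [])):
        name = merchant.get("name")
        if not name:
            errors.append(f"merchants[{i}]: missing name")
        elif name in seen_names:
            errors.append(f"merchants[{i}]: duplicate name {name!r}")
        else:
            seen_names.add(name)
        aliases = merchant.get("aliases") or []
        if not aliases:
            errors.append(f"merchants[{i}]: no aliases")
        for alias in aliases:
            # Each alias must be a one-key mapping with a known mode.
            if len(alias) != 1 or next(iter(alias)) not in ALIAS_MODES:
                errors.append(f"merchants[{i}]: invalid alias {alias!r}")
                continue
            if "regex" in alias:
                try:
                    re.compile(alias["regex"])
                except re.error as exc:
                    errors.append(f"merchants[{i}]: bad regex ({exc})")
    return errors


doc = {
    "slug": "merchants.germany",
    "version": 2,
    "merchants": [
        {"name": "REWE", "aliases": [{"word": "rewe"}]},
        {"name": "EDEKA", "aliases": [{"regex": r"e\.?(center|deka)"}]},
    ],
}
print(validate_merchant_file(doc))  # []
```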

Catalog Files

data/rules/merchants/
├── base.yaml           # Global merchants (IKEA, Amazon)
├── germany.yaml        # German retailers (REWE, Aldi, DM)
├── international.yaml  # International (Netflix, Spotify)
└── saas.yaml           # SaaS services
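
For illustration, the catalog directory could be enumerated as shown below. The sorted load order and the `list_catalog_files` name are assumptions; ruleset_loader.py may discover or order files differently.

```python
from pathlib import Path


def list_catalog_files(root: Path) -> list[Path]:
    """Return the merchant YAML files directly under root in a stable order."""
    # Sorting keeps the load order deterministic across platforms.
    return sorted(p for p in root.glob("*.yaml") if p.is_file())
```

Calling `list_catalog_files(Path("data/rules/merchants"))` on the layout above would return base.yaml, germany.yaml, international.yaml, and saas.yaml, in that order.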

Adding a Merchant

1. Find the right file (or create a new one)
2. Add an entry with a unique name
3. Add aliases (most specific mode first)

- name: "My Store"
  aliases:
    - { word: "mystore" }           # whole-word match first
    - { contains: "my store gmbh" } # substring fallback

Matching Priority

When multiple aliases match:

1. Higher mode rank wins (word > contains > regex)
2. Longer match length wins
3. Earlier alias order wins
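
These tie-breaking rules map naturally onto a tuple sort key. The sketch below is one possible reading, not the implementation in normalize.py: the `Candidate` type and `select_best` are hypothetical, and `order` is assumed to be the position at which the alias was defined.

```python
from typing import NamedTuple, Optional

MODE_RANK = {"word": 3, "contains": 2, "regex": 1}


class Candidate(NamedTuple):
    merchant: str
    mode: str          # "word", "contains", or "regex"
    match_length: int  # number of characters matched
    order: int         # position of the alias in the catalog (lower is earlier)


def select_best(candidates: list[Candidate]) -> Optional[str]:
    """Pick the winning merchant: mode rank, then match length, then alias order."""
    if not candidates:
        return None
    # Negate order so that an earlier alias compares as larger.
    best = max(candidates, key=lambda c: (MODE_RANK[c.mode], c.match_length, -c.order))
    return best.merchant


candidates = [
    Candidate("EDEKA", "regex", 8, 0),
    Candidate("DM", "contains", 11, 1),
    Candidate("DM", "word", 2, 2),
]
print(select_best(candidates))  # DM
```

In this example the word alias wins even though it is the shortest match, because mode rank is compared before match length.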

Code Location

File                                            Purpose
app/ingest/importer/pipeline/normalize.py       Matching logic
app/ingest/importer/pipeline/rules_loader.py    YAML parsing
app/ingest/importer/pipeline/ruleset_loader.py  Catalog loading