
Router recipes

Regex-driven query routers that scope retrieval to the right corner of your index.

What routers do

A router is an ordered {match, filters} entry under retrieval.routers in docmancer.yaml. When a query matches the regex, the router's filters are merged into the retrieval call, scoping the search to a slice of the index. The first matching router wins; routers do not stack.

There is no default router list. Routing rules are inherently domain-specific, so a default that suits a trademark database would mis-route a developer docs site (and vice versa). Pick the recipes you need from below and add them to your own config.
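
The first-match-wins behavior described above can be sketched in a few lines of Python. This is an illustration, not docmancer's implementation: the `route` function, the `ROUTERS` list, and the rule that router filters override existing keys when merged are all assumptions made for the example.

```python
import re

# Hypothetical in-memory form of retrieval.routers (not docmancer's internal types).
ROUTERS = [
    {"match": r"\bbilling\b", "filters": {"docset_root": "https://docs.example.com/billing"}},
    {"match": r"\bauth\b", "filters": {"docset_root": "https://docs.example.com/auth"}},
]


def route(query, routers, base_filters=None):
    """Return base_filters merged with the filters of the first matching router."""
    merged = dict(base_filters or {})
    for router in routers:
        if re.search(router["match"], query, re.IGNORECASE):
            # Assumption: router filters win on key conflicts.
            merged.update(router["filters"])
            break  # first match wins; later routers are never consulted
    return merged
```

A query that mentions both "billing" and "auth" picks up only the billing filter here, because that entry comes first in the list. A query that matches nothing leaves the filters unchanged.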

Routers only fire in dispatcher modes

Routers are evaluated by the retrieval dispatcher, which only runs for --mode dense, --mode sparse, or --mode hybrid (or when retrieval.default_mode is set to one of those in docmancer.yaml). The default mode is lexical, and lexical-only queries bypass the dispatcher and call the SQLite store directly, so a router list with no other config changes will have no effect on a default docmancer query "...". Either run with an explicit --mode or set retrieval.default_mode: hybrid (with a vector store configured) before relying on routers.
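
As a rough mental model, the gating rule above reduces to a set-membership check on the effective mode. The function name and arguments below are hypothetical, and the sketch assumes an explicit --mode takes precedence over retrieval.default_mode.

```python
# Modes that go through the retrieval dispatcher, per the paragraph above.
DISPATCHER_MODES = {"dense", "sparse", "hybrid"}


def routers_can_fire(cli_mode=None, default_mode="lexical"):
    """Return True when the effective mode runs the dispatcher.

    cli_mode stands in for --mode; default_mode for retrieval.default_mode.
    Assumption: an explicit cli_mode overrides default_mode.
    """
    effective = cli_mode or default_mode
    return effective in DISPATCHER_MODES
```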

Anatomy

retrieval:
  routers:
    - match: "<regex>"
      filters:
        <payload-field>: <value-or-clause>
      description: "<optional human-readable label>"

The filter payload is passed directly to the vector store, so any field your loader writes into a section's payload is fair game. Common shapes are a plain scalar (sdk: python), an in clause over a list of accepted values (tag: {in: [api, sdk]}), or a numeric comparison if your store supports it.
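
To make the scalar and in shapes concrete, here is a small sketch of how a store might evaluate them against a payload. Actual semantics depend on your vector store. The helper names are hypothetical, and the behavior shown for list-valued fields (a match when any element is accepted) is an assumption, not a guarantee.

```python
def clause_matches(value, clause):
    """Check one payload value against a scalar or {"in": [...]} clause."""
    # Treat a scalar payload value as a one-element list so both cases share a path.
    values = value if isinstance(value, list) else [value]
    if isinstance(clause, dict) and "in" in clause:
        return any(v in clause["in"] for v in values)
    return clause in values


def payload_matches(payload, filters):
    """Every filter field must be present in the payload and satisfy its clause."""
    return all(
        field in payload and clause_matches(payload[field], clause)
        for field, clause in filters.items()
    )


record = {"status_code": "700", "international_classes": ["030", "035"]}
```

Under these assumptions, `payload_matches(record, {"status_code": {"in": ["700", "701"]}})` is true, and a filter on a field the payload lacks never matches.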

Recipe: route "live" queries to current USPTO records

USPTO trademark records carry a numeric status_code stored as a string (for example "700"), not a LIVE / DEAD literal. The "live registration" range is roughly 600 to 899; the full mapping is in the USPTO status-code reference. Match on the live status codes you care about using an in clause:

retrieval:
  routers:
    - match: "(?i)\\b(live|current|active|registered)\\b"
      filters:
        status_code:
          in: ["700", "701", "702", "710"]
      description: "Restrict to common live registration status codes"

Adjust the list to the codes your corpus actually contains. If your store supports numeric range filters and your payload encodes status_code as a number, a range clause is cleaner than enumerating values; otherwise stick to in with strings since that is the shape the USPTO normalizer emits.

Recipe: route a Nice trademark class

USPTO records expose every class on the case under the list field international_classes (plural). A single record can carry multiple classes when its goods and services span them. Use an in clause so a record with ["030", "035"] matches a class-30 query:

retrieval:
  routers:
    - match: "(?i)class\\s*0?30\\b"
      filters:
        international_classes:
          in: ["030"]
      description: "Nice class 30 (coffee, tea, sugar...)"

Add one entry per class you care about, in the order they should be checked. Classes are zero-padded three-character strings ("030", not "30"), matching the on-disk payload.
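
Before adding a class pattern to your config, it can help to check it against a few sample queries. The sketch below uses Python's re module with the class-30 pattern from the recipe, written as a raw string, so \s and \b appear with a single backslash rather than the doubled form used inside the double-quoted YAML value. The sample queries are made up for illustration.

```python
import re

# Same pattern as the recipe, after YAML unescaping.
CLASS_30 = re.compile(r"(?i)class\s*0?30\b")


def matches_class_30(query):
    """Return True when the query mentions class 30 in a form the pattern accepts."""
    return CLASS_30.search(query) is not None


samples = [
    "coffee marks in class 30",  # plain form
    "Class 030 filings",         # zero-padded, capitalized
    "class30 tea",               # no space
    "class 300",                 # different number; trailing \b rejects it
    "subclass 30",               # matches too, since there is no leading \b
]
results = {q: matches_class_30(q) for q in samples}
```

Note that the pattern has no leading word boundary, so "subclass 30" also matches. Prefix the pattern with \b if that matters for your corpus.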

Recipe: scope to a product in a multi-product portal

When the corpus covers several products, route mentions of a product name to its docset root so retrieval doesn't bleed across products. Pair this with retrieval.hierarchical for portals with many pages per product.

retrieval:
  routers:
    - match: "(?i)\\bbilling\\b"
      filters:
        docset_root: "https://docs.example.com/billing"
    - match: "(?i)\\bauth\\b"
      filters:
        docset_root: "https://docs.example.com/auth"

Recipe: scope to a language or framework

If you index multiple language SDKs, route language keywords to the matching SDK section.

retrieval:
  routers:
    - match: "(?i)\\b(python|py|pip)\\b"
      filters:
        sdk: python
    - match: "(?i)\\b(typescript|ts|node|npm)\\b"
      filters:
        sdk: typescript

Verifying a router fires

Run a query with --explain to see which retrieval sources contributed. If your filter is wrong, dense and sparse will return empty and the dispatcher will fall back to lexical-only. The router match itself is logged at debug level, so DOCMANCER_LOG_LEVEL=debug docmancer query "..." will show which router matched.

Tips

  1. Anchor your regex. Use word boundaries (\b, written \\b inside a double-quoted YAML string) so "auth" doesn't also catch "author".
  2. Case-insensitive by default. The dispatcher uses re.IGNORECASE, so an inline (?i) is redundant; the recipes above include it only to be explicit.
  3. Order matters. The first match wins. Put more specific patterns before broader ones.
  4. Invalid regex is skipped, not fatal. A malformed pattern logs a warning and the dispatcher moves on, so a typo in one router does not break the rest.
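
Tips 1 and 4 can be illustrated together. The sketch below is an assumption about how skip-on-invalid compilation could look, not docmancer's actual loader. `compile_routers` and the logger name are hypothetical, and the first router is deliberately malformed.

```python
import logging
import re

log = logging.getLogger("router_sketch")  # hypothetical logger name


def compile_routers(routers):
    """Compile each router's pattern, skipping (not raising on) invalid ones."""
    compiled = []
    for router in routers:
        try:
            pattern = re.compile(router["match"], re.IGNORECASE)
        except re.error as exc:
            # A malformed pattern is logged and dropped; the rest still load.
            log.warning("skipping router %r: %s", router["match"], exc)
            continue
        compiled.append((pattern, router["filters"]))
    return compiled


routers = compile_routers([
    {"match": r"\bauth(", "filters": {"docset_root": "broken"}},  # unbalanced paren
    {"match": r"\bauth\b", "filters": {"docset_root": "https://docs.example.com/auth"}},
])
```

Only the second router survives compilation. Its word boundaries let it match "AUTH tokens" but not "Author guidelines".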