Two ingest commands
| Command | Use it for |
|---|---|
| `docmancer add <url>` | URLs and GitHub repos |
| `docmancer ingest <path>` | Local directories or files |
Both commands write into the same hybrid index: SQLite FTS5 for lexical search, plus dense and sparse vectors in Qdrant.
Add from a URL
docmancer add https://docs.example.com
Docmancer auto-detects the docs platform and chooses the best fetching strategy:
| Platform | Detection | Strategy |
|---|---|---|
| GitBook | llms-full.txt endpoint | Full-text download |
| Mintlify | llms.txt or sitemap.xml | Sitemap crawl |
| GitHub | Repository URL | README + docs directory extraction |
| Generic web | sitemap.xml or nav crawl | Page-by-page fetch |
Force a specific provider:
docmancer add https://docs.example.com --provider mintlify
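To make the detection table concrete, here is a minimal sketch of a probe order that would produce those results. It is not docmancer's actual detector: the function name `detect_platform` and the `exists` callable are hypothetical, and treating `llms.txt` as the only Mintlify signal is a simplification, since the table lists `sitemap.xml` for both Mintlify and generic sites.

```python
from typing import Callable
from urllib.parse import urlparse


def detect_platform(base_url: str, exists: Callable[[str], bool]) -> str:
    """Guess a provider name for base_url (illustrative only).

    `exists` is a hypothetical callable that reports whether a URL
    responds successfully, for example via an HTTP HEAD request.
    """
    base = base_url.rstrip("/")

    # Repository URLs are recognized from the host name alone.
    if urlparse(base).netloc.lower() in ("github.com", "www.github.com"):
        return "github"

    # GitBook row: an llms-full.txt endpoint is present.
    if exists(f"{base}/llms-full.txt"):
        return "gitbook"

    # Mintlify row (simplified): only the llms.txt signal is checked here.
    if exists(f"{base}/llms.txt"):
        return "mintlify"

    # Everything else falls back to the generic web provider.
    return "web"
```

The returned names match the values listed for `--provider`, so a wrong guess can be overridden by passing the intended provider explicitly.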
Add from a GitHub repo
docmancer add https://github.com/owner/repo
For a repository URL, docmancer extracts the README and the contents of any docs/ directory.
Ingest local files
docmancer ingest ./my-internal-docs
Supported file formats:
| Format | Extra needed |
|---|---|
| Markdown (`.md`, `.mdx`) | none |
| Plain text (`.txt`) | none |
| PDF (`.pdf`) | `docmancer[local]` |
| DOCX (`.docx`) | `docmancer[local]` |
| RTF (`.rtf`) | `docmancer[local]` |
| HTML (`.html`, `.htm`) | `docmancer[local]` |
Install the parsers when you need them:
pip install 'docmancer[local]'
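Before ingesting a directory, it can help to know whether the `docmancer[local]` extra is needed. The sketch below encodes the format table as a lookup; the function name `required_extra` is hypothetical, and the `"unsupported"` label only means the extension is not listed in the table, not that docmancer is known to reject the file.

```python
from pathlib import Path

# Extensions copied from the supported-formats table.
NO_EXTRA = {".md", ".mdx", ".txt"}
NEEDS_LOCAL_EXTRA = {".pdf", ".docx", ".rtf", ".html", ".htm"}


def required_extra(filename: str) -> str:
    """Return which extra (if any) the table says this file needs."""
    suffix = Path(filename).suffix.lower()
    if suffix in NO_EXTRA:
        return "none"
    if suffix in NEEDS_LOCAL_EXTRA:
        return "docmancer[local]"
    return "unsupported"  # not listed in the table


def directory_needs_local_extra(root: str) -> bool:
    """True if any file under root needs the docmancer[local] parsers."""
    return any(
        required_extra(p.name) == "docmancer[local]"
        for p in Path(root).rglob("*")
        if p.is_file()
    )
```

If `directory_needs_local_extra("./my-internal-docs")` returns `True`, install the extra before running `docmancer ingest`.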
Options
docmancer add flags:
| Flag | Default | Description |
|---|---|---|
| `--provider` | auto | `auto`, `gitbook`, `mintlify`, `web`, `github`, `crawl4ai` |
| `--strategy` | auto | Force discovery strategy |
| `--max-pages` | 500 | Limit pages fetched |
| `--browser` | off | Use Playwright for JS-heavy sites (needs `docmancer[browser]`) |
| `--fetch-workers` | auto | Number of concurrent page fetch workers |
| `--recreate` | off | Drop and rebuild the index for this source |
docmancer ingest flags:
| Flag | Default | Description |
|---|---|---|
| `--no-vectors` | off | Skip Qdrant; index lexical-only (FTS5) |
| `--recreate` | off | Drop and rebuild the index for this source |
| `--workers` | auto | Concurrent parsing workers |
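When these commands are driven from a script, building the argument list in one place keeps flag usage consistent. The helpers below are a hypothetical sketch, not part of docmancer. They use only flags from the tables above and assume that `--max-pages` takes a space-separated value the way `--provider` does, and that flags with an "off" default are plain switches.

```python
import shlex
from typing import List, Optional


def build_add_command(
    url: str,
    provider: Optional[str] = None,
    max_pages: Optional[int] = None,
    browser: bool = False,
    recreate: bool = False,
) -> List[str]:
    """Assemble argv for `docmancer add` (assumed flag syntax)."""
    argv = ["docmancer", "add", url]
    if provider is not None:
        argv += ["--provider", provider]
    if max_pages is not None:
        argv += ["--max-pages", str(max_pages)]
    if browser:
        argv.append("--browser")  # assumed to be a plain switch
    if recreate:
        argv.append("--recreate")
    return argv


def build_ingest_command(
    path: str, no_vectors: bool = False, recreate: bool = False
) -> List[str]:
    """Assemble argv for `docmancer ingest` (assumed flag syntax)."""
    argv = ["docmancer", "ingest", path]
    if no_vectors:
        argv.append("--no-vectors")
    if recreate:
        argv.append("--recreate")
    return argv


# Render a shell-safe command string for logging or copy-paste.
command_line = shlex.join(
    build_add_command("https://docs.example.com", provider="mintlify", max_pages=100)
)
```

The resulting list can be passed directly to `subprocess.run(argv, check=True)`, which avoids shell quoting problems with URLs and paths.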
Update existing sources
Re-fetch and re-index when upstream docs change:
docmancer update
docmancer update https://docs.example.com
Updates reuse the content-hash-keyed embeddings cache, so unchanged sections skip re-embedding.
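The idea behind a content-hash-keyed cache can be sketched in a few lines. This illustrates the general technique, not docmancer's implementation: the class name `EmbeddingCache`, the use of SHA-256, and the in-memory dictionary are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Cache embeddings under a hash of the section text (illustrative)."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self._embed = embed                      # the expensive embedding call
        self._store: Dict[str, List[float]] = {}
        self.misses = 0                          # how often embed() actually ran

    @staticmethod
    def key(text: str) -> str:
        # Identical text always hashes to the same key (SHA-256 is an assumption).
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        k = self.key(text)
        if k not in self._store:
            # Only new or changed content reaches the embedding model.
            self.misses += 1
            self._store[k] = self._embed(text)
        return self._store[k]
```

Because the key depends only on the text, a section that is re-fetched with identical content maps to an existing entry, while any edit to the section changes the hash and triggers a fresh embedding.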