Enterprises are drowning in unstructured documents. Legal contracts, compliance policies, technical specifications, audit reports, financial statements, research papers - massive document repositories containing critical business knowledge that remains locked away and unsearchable. This knowledge is scattered across departments, creating silos where finance can't access legal precedents, engineering can't find compliance requirements, and executives can't get a unified view of organizational commitments.
Current solutions fail at enterprise scale. Simple keyword search misses context. Vector search without proper preprocessing fails on real documents - tables lose their structure, acronyms aren't linked to their definitions, and cross-page references are lost. Generic AI tools hallucinate when precision matters most. And when auditors or regulators ask for evidence, you need the exact source with full context - not an AI's interpretation or an isolated paragraph missing the conditions and requirements around it.
Acronym Hell: Financial and regulated documents are packed with acronyms (SLA, KPI, GDPR, SOX). Vector embeddings can't connect "Service Level Agreement" with "SLA" appearing 50 pages later. Your search for "service levels" returns nothing because the document only uses "SLA".
Corporate Jargon: Industry-specific terms that don't exist in general training data. "Counterparty risk", "regulatory capital", "compliance framework" - these need domain understanding, not just semantic similarity.
Missing Context: Vector search finds individual paragraphs but misses the bigger picture. You ask about "payment terms" and get a random paragraph mentioning "30 days" without the surrounding context about penalties, conditions, or exceptions.
Evidence Requirements: In regulated environments, answers need the full context - not just text citations. When compliance metrics are in a table, you need that table in the response. When a process diagram explains the workflow, you need that image. Legal and financial work requires complete evidence: the text, the tables, the charts - everything relevant to support the answer.
Hybrid Intelligence: Vector search for semantic understanding + keyword extraction for precise terminology + relationship mapping to connect concepts.
Knowledge Graphs: We extract acronyms, validate their meanings with LLMs, and build relationships between terms. Now "SLA" searches also find "Service Level Agreement" content.
Structure Preservation: Tables, images, and document hierarchy stay intact. You get the actual compliance table, not a text description of it.
Source Attribution: Every answer includes exact page numbers and document sections. No hallucinations - if we don't know, we say so.
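The hybrid idea above can be sketched in a few lines. This is a toy illustration, not Catalyst's implementation: it stands in bag-of-words vectors for the real neural embedder and blends that "semantic" score with exact keyword overlap, which is the basic shape of hybrid retrieval:

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Toy stand-in for a neural embedding: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear verbatim in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_search(query: str, docs: list[str], alpha: float = 0.6):
    """Blend vector-style similarity with exact keyword overlap, best first."""
    scored = [
        (alpha * cosine(bow_vector(query), bow_vector(d))
         + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    return sorted(scored, reverse=True)

docs = ["SLA uptime guarantees and penalties", "office lunch menu for friday"]
print(hybrid_search("SLA penalties", docs)[0][1])  # SLA uptime guarantees and penalties
```

The `alpha` weight is the usual tuning knob: push it toward 1.0 for paraphrase-heavy queries, toward 0.0 when exact terminology (clause numbers, acronyms) must dominate.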
Install Catalyst directly from GitHub (PyPI release coming soon):

```bash
# Using uv (recommended)
uv add "blockether-catalyst[extraction,api] @ git+https://github.com/Blockether/catalyst.git"

# Or using pip
pip install "blockether-catalyst[extraction,api] @ git+https://github.com/Blockether/catalyst.git"
```

- 🔍 In-Memory Hybrid Search: Platform-independent vector + keyword search that runs anywhere (Lambda, containers, edge) - no external dependencies
- 📦 Embedded Model: Ships with a lightweight embedder (~32MB) directly in the library - no API calls, no latency, works offline
- 📄 PDF Intelligence: Extracts text, tables, images, and maintains document structure
- 🧠 LLM Consensus: Multiple validation passes ensure extraction accuracy
- 🔗 Knowledge Graphs: Automatically links acronyms, terms, and concepts across documents
- 📊 Structure Preservation: Tables and charts remain intact, not converted to text
- 🎯 Source Attribution: Every answer includes exact page numbers and document sections
- 🚀 Async Processing: Built on ASGI for high-performance document pipelines
- 🔧 Zero External Services: Fully self-contained - no vector DBs, no third-party APIs, deploy anywhere
- Web UI: Ready-to-deploy document Q&A interface with HTMX
- Workflow Engine: Agno integration for complex document processing pipelines
- Visualization: Knowledge graph and chunk relationship visualizations
- MCP Server: Model Context Protocol support for AI assistants
Think of Catalyst as a forensic document investigator rather than a simple search engine. While traditional RAG systems are great for general Q&A, Catalyst excels when you need deep understanding, complete evidence trails, and the ability to connect concepts across massive document repositories.
| Aspect | Traditional RAG | Catalyst |
|---|---|---|
| 📥 Document Processing | • Simple chunking<br>• Basic text extraction<br>• Vector embeddings only | 📄 Extract text, tables, images<br>🧠 Build knowledge graphs<br>🔗 Map relationships & acronyms<br>✅ LLM consensus validation |
| 🔍 Search Capabilities | • Semantic similarity only<br>• Isolated chunks<br>• No acronym understanding | ⚡ Hybrid search (semantic + keyword)<br>🔗 Cross-document linking<br>📍 Exact source attribution<br>🎯 Understands domain jargon |
| ⚙️ Infrastructure | • Requires vector database<br>• API dependencies<br>• Cloud-first design | 🚫 No vector DB needed<br>🚫 No external services<br>🔌 Runs offline anywhere<br>💨 Minimal footprint |
| ⏱️ Performance | • Fast indexing (seconds)<br>• Query: 100-500ms | 🐢 Slow extraction (2-5 min/doc)<br>⚡ Query: milliseconds<br>💾 One-time processing |
| 📊 Best For | • General Q&A<br>• Frequently changing docs<br>• Simple retrieval | 🏛️ Regulated industries<br>⚖️ Legal/compliance<br>📋 Audit trails<br>🔍 Deep understanding |
The key difference? Catalyst spends time upfront to deeply understand your documents - extracting not just text but relationships, definitions, and context. This investment pays off when you need answers that would require a human expert to manually read through hundreds of pages.
| Scenario | Example Query | Use Catalyst? | Why |
|---|---|---|---|
| 📋 Compliance Audits | "Show me all GDPR requirements with source pages" | ✅ YES | Full attribution, regulatory-grade accuracy |
| ⚖️ Legal Discovery | "Find all liability clauses across 50 contracts" | ✅ YES | Cross-document linking, relationship mapping |
| 💰 Financial Analysis | "Trace this risk metric to its definition" | ✅ YES | Connects terms, definitions, calculations |
| 🔧 Technical Docs | "Map all API dependencies in the system" | ✅ YES | Understands system relationships |
| 💬 Chat Support | "How do I reset my password?" | ❌ NO | Use simple RAG - overkill for FAQs |
| 📱 Mobile Apps | "Real-time in-app search" | ❌ NO | Extraction not suitable for mobile |
| 🔄 Dynamic Content | "Latest news updates" | ❌ NO | Re-extraction takes minutes |
| 🌍 General Knowledge | "Who won the World Cup?" | ❌ NO | Just use ChatGPT directly |
- Deploy an OpenAI-compatible multimodal model server (for image + text extraction).
  - Examples: MiniCPM-o-2_6 (https://huggingface.co/openbmb/MiniCPM-o-2_6) or QwenVL (https://huggingface.co/unsloth/gpt-4o-GGUF).
  - Why: a local or remote OpenAI-compatible endpoint enables accurate multimodal extraction (images, tables, captions).
  - Compatibility note: see the OpenAI-compatible deployment guidance at https://www.alibabacloud.com/help/en/model-studio/qwen-vl-compatible-with-openai
- Clone this repository:

  ```bash
  git clone https://github.com/Blockether/catalyst.git
  cd catalyst
  ```

- Run the knowledge extraction tool:

  ```bash
  python tools/KnowledgeExtraction.py "*.pdf"
  ```

- Ensure your model server endpoint and credentials (if any) are configured before running the extractor.
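As an illustration only: OpenAI-compatible clients conventionally read the endpoint and key from `OPENAI_BASE_URL` and `OPENAI_API_KEY` environment variables. Whether Catalyst's extractor uses these exact names is an assumption - check the repository docs before relying on it:

```shell
# Variable names follow the common OpenAI-client convention; verify against the Catalyst docs.
export OPENAI_BASE_URL="http://localhost:8000/v1"   # your multimodal model server
export OPENAI_API_KEY="dummy-key"                   # placeholder if the server is unauthenticated
python tools/KnowledgeExtraction.py "*.pdf"
```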
MIT License - see LICENSE for details.
