Enterprises are drowning in unstructured documents. Legal contracts, compliance policies, technical specifications, audit reports, financial statements, research papers - massive document repositories containing critical business knowledge that remains locked away and unsearchable. This knowledge is scattered across departments, creating silos where finance can't access legal precedents, engineering can't find compliance requirements, and executives can't get a unified view of organizational commitments.
Current solutions fail at enterprise scale. Simple keyword search misses context. Vector search without proper preprocessing fails on real documents - tables lose their structure, acronyms aren't linked to their definitions, and cross-page references are lost. Generic AI tools hallucinate when precision matters most. And when auditors or regulators ask for evidence, you need the exact source with full context - not an AI's interpretation or an isolated paragraph missing the conditions and requirements around it.
Acronym Hell: Financial and regulated documents are packed with acronyms (SLA, KPI, GDPR, SOX). Vector embeddings can't connect "Service Level Agreement" with "SLA" appearing 50 pages later. Your search for "service levels" returns nothing because the document only uses "SLA".
Corporate Jargon: Industry-specific terms that don't exist in general training data. "Counterparty risk", "regulatory capital", "compliance framework" - these need domain understanding, not just semantic similarity.
Missing Context: Vector search finds individual paragraphs but misses the bigger picture. You ask about "payment terms" and get a random paragraph mentioning "30 days" without the surrounding context about penalties, conditions, or exceptions.
Evidence Requirements: In regulated environments, answers need the full context - not just text citations. When compliance metrics are in a table, you need that table in the response. When a process diagram explains the workflow, you need that image. Legal and financial work requires complete evidence: the text, the tables, the charts - everything relevant to support the answer.
Hybrid Intelligence: Vector search for semantic understanding + keyword extraction for precise terminology + relationship mapping to connect concepts.
Knowledge Graphs: We extract acronyms, validate their meanings with LLMs, and build relationships between terms. Now "SLA" searches also find "Service Level Agreement" content.
Structure Preservation: Tables, images, and document hierarchy stay intact. You get the actual compliance table, not a text description of it.
Source Attribution: Every answer includes exact page numbers and document sections. No hallucinations - if we don't know, we say so.
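The hybrid idea above can be sketched in a few lines. This is a toy illustration, not Catalyst's implementation: it stands in bag-of-words vectors for the real neural embedder and blends that "semantic" score with exact keyword overlap, which is the basic shape of hybrid retrieval:

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Toy stand-in for a neural embedding: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear verbatim in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_search(query: str, docs: list[str], alpha: float = 0.6):
    """Blend vector-style similarity with exact keyword overlap, best first."""
    scored = [
        (alpha * cosine(bow_vector(query), bow_vector(d))
         + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    return sorted(scored, reverse=True)

docs = ["SLA uptime guarantees and penalties", "office lunch menu for friday"]
print(hybrid_search("SLA penalties", docs)[0][1])  # SLA uptime guarantees and penalties
```

The `alpha` weight is the usual tuning knob: push it toward 1.0 for paraphrase-heavy queries, toward 0.0 when exact terminology (clause numbers, acronyms) must dominate.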
Install Catalyst directly from GitHub (PyPI release coming soon):

```bash
# Using uv (recommended)
uv add "blockether-catalyst[extraction,api] @ git+https://github.com/Blockether/catalyst.git"

# Or using pip
pip install "blockether-catalyst[extraction,api] @ git+https://github.com/Blockether/catalyst.git"
```

- 🔍 In-Memory Hybrid Search: Platform-independent vector + keyword search that runs anywhere (Lambda, containers, edge) - no external dependencies
- 📦 Embedded Model: Ships with a lightweight embedder (~32MB) directly in the library - no API calls, no latency, works offline
- 📄 PDF Intelligence: Extracts text, tables, images, and maintains document structure
- 🧠 LLM Consensus: Multiple validation passes ensure extraction accuracy
- 🔗 Knowledge Graphs: Automatically links acronyms, terms, and concepts across documents
- 📊 Structure Preservation: Tables and charts remain intact, not converted to text
- 🎯 Source Attribution: Every answer includes exact page numbers and document sections
- 🚀 Async Processing: Built on ASGI for high-performance document pipelines
- 🔧 Zero External Services: Fully self-contained - no vector DBs, no third-party APIs, deploy anywhere
- Web UI: Ready-to-deploy document Q&A interface with HTMX
- Workflow Engine: Agno integration for complex document processing pipelines
- Visualization: Knowledge graph and chunk relationship visualizations
- MCP Server: Model Context Protocol support for AI assistants
Think of Catalyst as a forensic document investigator rather than a simple search engine. While traditional RAG systems are great for general Q&A, Catalyst excels when you need deep understanding, complete evidence trails, and the ability to connect concepts across massive document repositories.
| Aspect | Traditional RAG | Catalyst |
|---|---|---|
| 📥 Document Processing | • Simple chunking<br>• Basic text extraction<br>• Vector embeddings only | 📄 Extract text, tables, images<br>🧠 Build knowledge graphs<br>🔗 Map relationships & acronyms<br>✅ LLM consensus validation |
| 🔍 Search Capabilities | • Semantic similarity only<br>• Isolated chunks<br>• No acronym understanding | ⚡ Hybrid search (semantic + keyword)<br>🔗 Cross-document linking<br>📍 Exact source attribution<br>🎯 Understands domain jargon |
| ⚙️ Infrastructure | • Requires vector database<br>• API dependencies<br>• Cloud-first design | 🚫 No vector DB needed<br>🚫 No external services<br>🔌 Runs offline anywhere<br>💨 Minimal footprint |
| ⏱️ Performance | • Fast indexing (seconds)<br>• Query: 100-500ms | 🐢 Slow extraction (2-5 min/doc)<br>⚡ Query: milliseconds<br>💾 One-time processing |
| 📊 Best For | • General Q&A<br>• Frequently changing docs<br>• Simple retrieval | 🏛️ Regulated industries<br>⚖️ Legal/compliance<br>📋 Audit trails<br>🔍 Deep understanding |
The key difference? Catalyst spends time upfront to deeply understand your documents - extracting not just text but relationships, definitions, and context. This investment pays off when you need answers that would require a human expert to manually read through hundreds of pages.
| Scenario | Example Query | Use Catalyst? | Why |
|---|---|---|---|
| 📋 Compliance Audits | "Show me all GDPR requirements with source pages" | ✅ YES | Full attribution, regulatory-grade accuracy |
| ⚖️ Legal Discovery | "Find all liability clauses across 50 contracts" | ✅ YES | Cross-document linking, relationship mapping |
| 💰 Financial Analysis | "Trace this risk metric to its definition" | ✅ YES | Connects terms, definitions, calculations |
| 🔧 Technical Docs | "Map all API dependencies in the system" | ✅ YES | Understands system relationships |
| 💬 Chat Support | "How do I reset my password?" | ❌ NO | Use simple RAG - overkill for FAQs |
| 📱 Mobile Apps | "Real-time in-app search" | ❌ NO | Extraction not suitable for mobile |
| 🔄 Dynamic Content | "Latest news updates" | ❌ NO | Re-extraction takes minutes |
| 🌍 General Knowledge | "Who won the World Cup?" | ❌ NO | Just use ChatGPT directly |
- Deploy an OpenAI-compatible multimodal model server (for image + text extraction).
  - Examples: MiniCPM-o-2_6 (https://huggingface.co/openbmb/MiniCPM-o-2_6) or QwenVL (https://huggingface.co/unsloth/gpt-4o-GGUF).
  - Why: a local or remote OpenAI-compatible endpoint enables accurate multimodal extraction (images, tables, captions).
  - Compatibility note: see the OpenAI-compatible deployment guidance at https://www.alibabacloud.com/help/en/model-studio/qwen-vl-compatible-with-openai
- Clone this repository:

  ```bash
  git clone https://github.com/Blockether/catalyst.git
  cd catalyst
  ```

- Run the knowledge extraction tool:

  ```bash
  python tools/KnowledgeExtraction.py "*.pdf"
  ```

- Ensure your model server endpoint and credentials (if any) are configured before running the extractor.
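As an illustration only: OpenAI-compatible clients conventionally read the endpoint and key from `OPENAI_BASE_URL` and `OPENAI_API_KEY` environment variables. Whether Catalyst's extractor uses these exact names is an assumption - check the repository docs before relying on it:

```shell
# Variable names follow the common OpenAI-client convention; verify against the Catalyst docs.
export OPENAI_BASE_URL="http://localhost:8000/v1"   # your multimodal model server
export OPENAI_API_KEY="dummy-key"                   # placeholder if the server is unauthenticated
python tools/KnowledgeExtraction.py "*.pdf"
```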
MIT License - see LICENSE for details.
