Semantic Memory
Knowledge Pipeline — Curated External Knowledge for AI Agents
Architecture
Challenge
AI agents need curated external knowledge, not raw documents. Feeding PDFs to LLMs with fixed-size chunking destroys semantic boundaries, fragments meaning, and produces retrieval results that lack context.
Solution
The pipeline chunks documents at three levels (Document → Section → Paragraph) using semantic boundary detection. It ingests from multiple sources (arXiv, PDF, manual upload), supports hybrid search that combines semantic similarity, keyword matching, and graph traversal, and exposes the resulting knowledge to any connected AI agent through an MCP server.
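To make the Document → Section → Paragraph hierarchy concrete, here is a minimal sketch of how such a tree could be built. It is not the project's implementation: the `Chunk` class and `chunk_document` function are hypothetical names, and Markdown-style headings plus blank lines stand in for the semantic boundary detection, which is not specified here.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    level: str  # "document", "section", or "paragraph"
    title: str  # document title, or the heading of the enclosing section
    text: str
    children: list[Chunk] = field(default_factory=list)


# Assumption: a section starts at a Markdown-style heading line ("# Title").
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_document(title: str, text: str) -> Chunk:
    """Split text into a Document -> Section -> Paragraph tree."""
    doc = Chunk("document", title, text)
    section_title = "Preamble"  # text that appears before the first heading
    buffer: list[str] = []

    def flush() -> None:
        # Close the current section and attach its paragraphs.
        body = "\n".join(buffer).strip()
        buffer.clear()
        if not body:
            return
        section = Chunk("section", section_title, body)
        # Assumption: paragraphs are separated by one or more blank lines.
        for para in re.split(r"\n\s*\n", body):
            if para.strip():
                section.children.append(
                    Chunk("paragraph", section_title, para.strip())
                )
        doc.children.append(section)

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            section_title = match.group(1).strip()
        else:
            buffer.append(line)
    flush()
    return doc
```

Because every paragraph keeps the title of its section and sits under its document, a retrieval hit can be returned together with its surrounding context instead of as an isolated span of tokens.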
Built an MCP server on PostgreSQL + pgvector with FastMCP. Designed hierarchical chunking that preserves document structure. Implemented ingestion pipelines for arXiv (direct API) and PDF (marker-pdf). Built a CLI with ~15 subcommands for ingestion, search, and graph operations. Integrated graph analysis for knowledge visualization.
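How the semantic, keyword, and graph results are merged is not described here. One common option is reciprocal rank fusion, sketched below as an illustration rather than as the method this project uses. The function name, the chunk IDs, and the constant `k = 60` are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one scored list.

    Each retriever contributes 1 / (k + rank) for every chunk it returns,
    so chunks ranked highly by several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the three retrievers.
fused = reciprocal_rank_fusion(
    {
        "semantic": ["c1", "c2", "c3"],
        "keyword": ["c2", "c4"],
        "graph": ["c2", "c1"],
    }
)
```

Rank-based fusion avoids having to normalize cosine distances, keyword scores, and graph hop counts onto a shared scale, which is why it is a convenient default when the retrievers are heterogeneous.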
Outcome
The result is a knowledge module that complements Cognitive Memory in a multi-agent ecosystem. Semantic Memory provides curated external knowledge; Cognitive Memory provides relational and episodic memory. Both serve the same agents (tethr, I/O) via MCP.
Learnings
- 3-level chunking preserves meaning that fixed-size approaches destroy — Document → Section → Paragraph hierarchy respects how knowledge is structured, not just how tokens are counted.
- The distinction between knowledge and memory matters architecturally — external facts (Semantic Memory) and personal experience (Cognitive Memory) need different storage, retrieval, and update semantics.
- CLI-first design accelerated development — ingesting, searching, and debugging via terminal before building the MCP layer meant the core logic was solid before any agent connected.
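The CLI-first approach from the last point can be illustrated with a small `argparse` sketch. The program name `knowledge`, the two subcommands, and their arguments are hypothetical and stand in for the roughly 15 real subcommands, which are not listed here. The handlers only return a description of what they would do.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="knowledge")
    sub = parser.add_subparsers(dest="command", required=True)

    # Hypothetical "ingest" subcommand covering the three source types.
    ingest = sub.add_parser("ingest", help="Ingest a source into the store")
    ingest.add_argument("source", choices=["arxiv", "pdf", "manual"])
    ingest.add_argument("target", help="arXiv ID or file path")

    # Hypothetical "search" subcommand.
    search = sub.add_parser("search", help="Query the store")
    search.add_argument("query")
    search.add_argument("--limit", type=int, default=5)
    return parser


def run(argv: list[str]) -> str:
    """Parse arguments and dispatch to the core logic (stubbed here)."""
    args = build_parser().parse_args(argv)
    if args.command == "ingest":
        return f"would ingest {args.source} source {args.target!r}"
    return f"would search for {args.query!r} (limit {args.limit})"
```

Keeping dispatch in a plain function such as `run` means the same core logic can be exercised from a terminal, from tests, and later from an MCP tool without duplication.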
Stack
Connections
| Metric | Value | Significance |
|---|---|---|
| Chunking Strategy | 3-Level | Document → Section → Paragraph hierarchy preserves semantic meaning |
| Ingestion Sources | arXiv, PDF, manual | Multiple ingestion pipelines for different knowledge sources |
| CLI Commands | ~15 | Full CLI for ingestion, search, graph operations, and exploration |