RAGLite: lightweight RAG with DuckDB or PostgreSQL and late chunking
Post author: @githubprojects

Building a RAG pipeline usually means stitching together a bunch of heavy dependencies — vector databases, embedding services, orchestration frameworks. It’s powerful, but often overkill if you just want to query a few PDFs or a codebase. RAGLite takes the opposite approach: keep it small, keep it local, and only add complexity when you actually need it.

It uses DuckDB or PostgreSQL as the vector store, relies on your existing SQL skills, and uses late chunking so that each chunk’s embedding keeps context from the text around it. No Docker containers for a small project. No cloud API key just to test an idea.


What It Does

RAGLite is a minimal Python library for retrieval-augmented generation. You give it documents (PDFs, Markdown, code, whatever), and it chunks them, embeds them, and stores them in DuckDB or PostgreSQL. Then you ask questions, and it retrieves the relevant chunks, hands them to an LLM, and gives you an answer.
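
To make that flow concrete, here is a toy sketch of the retrieve-then-prompt step in plain Python. It is not RAGLite’s code: the word-count "embedding" and the helper names (embed, retrieve, build_prompt) are assumptions made up for illustration, and a real setup would use an embedding model and send the prompt to an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list, k: int = 2) -> list:
    """Rank stored chunks by similarity to the question and keep the top k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: list) -> str:
    """Assemble the text that would be handed to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


chunks = [
    "DuckDB runs in-process and needs no server.",
    "PostgreSQL suits multi-user deployments.",
    "Late chunking pools token embeddings after encoding the full text.",
]
question = "Does DuckDB need a server?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The only chunk that shares words with the question is the DuckDB one, so that is what ends up in the prompt.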

That part is standard. What makes RAGLite interesting is how it handles chunking and storage.

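  • Late chunking: Instead of splitting a document first and embedding each chunk in isolation, a long-context embedding model reads the whole document (or as much as fits in its context window) and produces token-level embeddings. Chunk boundaries are applied only afterwards, when those token embeddings are pooled into one vector per chunk. Each chunk’s embedding therefore carries context from the surrounding text, so a chunk that only says “it” still reflects what “it” refers to. Note that this happens when documents are indexed, not at query time. The repo has a clear diagram showing this in action.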
  • BYO database: DuckDB for single-user, local use (no server setup). PostgreSQL for multi-user or production, with pgvector for similarity search.
  • SQL-based vectors: You write SQL to query embeddings. If you know PostgreSQL or DuckDB, you already know how to use this.
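
Here is a small, dependency-free sketch of the late chunking idea from the first bullet. Everything in it is a toy assumption: the one-hot "token embeddings" and the toy_encoder function (which just blends each token with the mean of whatever text it can see) stand in for a real long-context embedding model, and none of it is RAGLite’s implementation. It only shows the difference in ordering: split then encode, versus encode then pool per chunk.

```python
from statistics import fmean


def static_vectors(tokens, vocab):
    """One-hot vector per token: a stand-in for context-free input embeddings."""
    return [[1.0 if tok == v else 0.0 for v in vocab] for tok in tokens]


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    return [fmean(col) for col in zip(*vectors)]


def toy_encoder(tokens, vocab):
    """Toy 'transformer': blend each token with the mean of everything it can see."""
    static = static_vectors(tokens, vocab)
    context = mean_pool(static)
    return [[0.5 * s + 0.5 * c for s, c in zip(vec, context)] for vec in static]


def naive_chunking(tokens, spans, vocab):
    """Split first, then encode each chunk in isolation."""
    return [mean_pool(toy_encoder(tokens[a:b], vocab)) for a, b in spans]


def late_chunking(tokens, spans, vocab):
    """Encode the whole document once, then pool token embeddings per chunk span."""
    encoded = toy_encoder(tokens, vocab)
    return [mean_pool(encoded[a:b]) for a, b in spans]


tokens = "berlin is the capital of germany it has many people".split()
spans = [(0, 6), (6, 10)]  # chunk 1: first six tokens, chunk 2: the rest
vocab = sorted(set(tokens))
berlin = vocab.index("berlin")

naive = naive_chunking(tokens, spans, vocab)
late = late_chunking(tokens, spans, vocab)

# How much "berlin" signal ends up in the second chunk ("it has many people")?
print("naive:", naive[1][berlin], "late:", late[1][berlin])
```

With naive chunking the second chunk has no trace of "berlin" at all, while the late-chunked version keeps a small amount of it, which is the effect that helps a query about Berlin find the chunk that only says "it".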

Why It’s Cool

Three things jumped out when I looked at the code:

  1. Late chunking actually works. Most RAG tools split text into fixed-size chunks before embedding, which can sever the semantic links between neighboring sentences or paragraphs. Late chunking lets the embedding model see the full context before the chunk vectors are pooled, so retrieval tends to be more accurate. The paper they reference is solid, and the implementation feels lightweight.
  2. No infrastructure hell. You can run this on a laptop with DuckDB, no server process. When you need to scale, you swap DuckDB for PostgreSQL with a config change. That’s it.
  3. Transparent embeddings. The embeddings sit in ordinary SQL rows. You can inspect them, filter them, join them with other tables. No black-box vector database magic.
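
As a rough illustration of the third point, this sketch stores embeddings next to regular columns and then filters, joins, and ranks them in a single SQL query. To keep it runnable with nothing installed, it swaps in Python’s built-in sqlite3 for DuckDB or PostgreSQL and registers a small cosine function by hand. The table layout, column names, and JSON-encoded vectors are all assumptions for the example, not RAGLite’s actual schema.

```python
import json
import math
import sqlite3


def cosine(a_json: str, b_json: str) -> float:
    """Cosine similarity between two JSON-encoded vectors."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


conn = sqlite3.connect(":memory:")
conn.create_function("cosine", 2, cosine)  # expose the helper to SQL
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, team TEXT);
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        document_id INTEGER REFERENCES documents(id),
        body TEXT,
        embedding TEXT  -- JSON array; hypothetical layout for illustration
    );
    """
)
conn.executemany("INSERT INTO documents VALUES (?, ?)", [(1, "platform"), (2, "sales")])
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [
        (1, 1, "How to rotate API keys", json.dumps([0.9, 0.1, 0.0])),
        (2, 1, "Deploy checklist", json.dumps([0.1, 0.9, 0.0])),
        (3, 2, "Quarterly pricing notes", json.dumps([0.8, 0.2, 0.1])),
    ],
)

query_vec = json.dumps([1.0, 0.0, 0.0])
rows = conn.execute(
    """
    SELECT c.body, cosine(c.embedding, ?) AS score
    FROM chunks AS c
    JOIN documents AS d ON d.id = c.document_id
    WHERE d.team = 'platform'          -- ordinary SQL filter on metadata
    ORDER BY score DESC
    LIMIT 2
    """,
    (query_vec,),
).fetchall()

for body, score in rows:
    print(f"{score:.3f}  {body}")
```

The sales chunk never shows up even though its vector is close to the query, because the join and WHERE clause filter it out before ranking. That kind of mixing of metadata and similarity is what plain SQL storage makes easy.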

Use cases: internal documentation QA, personal note retrieval, lightweight code search, prototyping RAG for demos. Anything where you want a working system without the AWS bill.


How to Try It

Install it:

pip install raglite

Point it at your documents:

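from raglite import RAGLite

# Open (or create) a local DuckDB file as the store.
rag = RAGLite(db_url="duckdb:///my_rag.db")
# Chunk, embed, and index every document under docs/.
rag.insert_directory("docs/")
# Retrieve the relevant chunks and ask the LLM for an answer.
answer = rag.query("What is late chunking?")
print(answer)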

That’s it. For PostgreSQL, change the db_url to postgresql://user:pass@host/db. The repo has a full example with PDFs.
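
If you want the DuckDB-to-PostgreSQL switch to be a deployment setting rather than a code change, one option is to read the URL from the environment. This is only a sketch: the RAG_DB_URL variable name and the resolve_db_url helper are made up for this example and are not part of RAGLite.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

DEFAULT_DB_URL = "duckdb:///my_rag.db"  # local, single-file store
SUPPORTED_SCHEMES = {"duckdb", "postgresql"}


def resolve_db_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the database URL from a (hypothetical) RAG_DB_URL variable, else use DuckDB."""
    env = os.environ if env is None else env
    url = env.get("RAG_DB_URL", DEFAULT_DB_URL)
    scheme = urlsplit(url).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported database scheme: {scheme!r}")
    return url


# On a laptop nothing is set, so the local DuckDB file is used.
local_url = resolve_db_url({})
# In production the variable points at PostgreSQL instead.
prod_url = resolve_db_url({"RAG_DB_URL": "postgresql://user:pass@host/db"})
print(local_url)
print(prod_url)
```

The returned string is what you would pass as db_url, so the rest of the script stays identical in both environments.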

You’ll need an LLM backend (OpenAI, Ollama, or any OpenAI-compatible endpoint) and an embedding model. If you haven’t configured an embedding model, the default falls back to a local sentence transformer.


Final Thoughts

RAGLite feels like the right tool for the 80% case where you don’t need a vector database cluster. It’s simple, the late chunking is a genuinely clever optimization, and it fits into a SQL workflow most devs already know. If you’ve been avoiding RAG because it feels like “too much setup,” this is worth a look.

It’s still early — the API might change, and it won’t handle millions of documents out of the box — but for a weekend project or a small internal tool, it’s exactly what I’d reach for.


Found via @githubprojects

Last updated: June 17, 2026 at 05:08 AM