Build document intelligence into your polyglot applications

Post author: @githubprojects

Build Document Intelligence into Your Polyglot Stack with Kreuzberg

Ever found yourself needing to pull data from a PDF invoice, parse a scanned report, or extract structured info from a contract, only to be met with a wall of complex APIs, vendor lock-in, or the dread of writing yet another parser? In a world where applications speak multiple languages and data comes in messy, real-world formats, adding document intelligence often feels heavier than it should.

What if you could drop a single, self-contained service into your stack that handles the document heavy lifting, regardless of whether your main app is in Go, Python, Node, or Rust? That’s the gap Kreuzberg aims to fill.

What It Does

Kreuzberg is a developer-focused document intelligence service. In simple terms, you send it documents (like PDFs, images, or office files), and it sends back structured, usable data. It wraps powerful machine learning models for tasks like optical character recognition (OCR), document classification, and data extraction into a single, containerized service you run yourself. Think of it as a private, programmable brain for your documents that you control.

Why It's Cool

The clever part isn't just the ML magic under the hood; it's the approach. Kreuzberg is built as a polyglot-friendly gRPC service first. This means you get strongly typed contracts (via Protobuf) and high-performance communication out of the box. Your Go microservice can talk to it as naturally as your Python data pipeline or your TypeScript backend.

Instead of wrestling with different SDKs or cloud-specific quirks, you generate a client from the Protobuf definition in your language of choice and start making calls. It’s designed to be infrastructure, not just a library. This makes it a perfect fit for microservices architectures or any setup where you need consistent document processing across different parts of your system.
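
As a rough illustration of the client-generation step, here is a minimal Python sketch that builds the standard grpc_tools.protoc invocation for producing Python stubs. It assumes the grpcio-tools package is installed and that the repository ships a .proto file. The path path/to/service.proto and the helper name protoc_command are placeholders for illustration, not names taken from the Kreuzberg repository.

```python
import sys
from pathlib import Path


def protoc_command(proto_file: str, out_dir: str = "generated") -> list[str]:
    """Build the command line that generates Python gRPC stubs from a .proto file.

    proto_file is a placeholder path (an assumption); point it at the actual
    Protobuf definition found in the repository.
    """
    proto = Path(proto_file)
    return [
        sys.executable, "-m", "grpc_tools.protoc",  # protoc as shipped by grpcio-tools
        f"-I{proto.parent}",                         # directory searched for imports
        f"--python_out={out_dir}",                   # generated message classes
        f"--grpc_python_out={out_dir}",              # generated client and server stubs
        str(proto),
    ]


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(protoc_command("path/to/service.proto")))
```

To actually generate the stubs, create the output directory first and pass the returned list to subprocess.run(cmd, check=True). Other languages follow the same pattern with their own protoc plugins.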

By being self-hosted, it also keeps your sensitive documents in your environment, addressing a major concern for many industries. It’s the kind of tool that feels built for engineers who care about both capability and clean integration.

How to Try It

The quickest way to get started is with Docker. The repository provides a straightforward example.

  1. Clone the repo:

    git clone https://github.com/kreuzberg-dev/kreuzberg
    cd kreuzberg
    
  2. Fire up the service using Docker Compose (with Compose V2, run docker compose up instead):

    docker-compose up
    

    This spins up the Kreuzberg service, ready to accept requests.

  3. The repository includes example clients. Check the examples/ directory to see how to interact with the gRPC API from different languages. The core idea is to describe your processing pipeline (like "extract all text" or "find these specific fields") in a request and send it to the service.
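
Before pointing a client at the freshly started container, it can help to wait until the service port actually accepts connections. The following is a minimal sketch using only the Python standard library; the helper name wait_for_port is made up for illustration, and the host and port you pass in are assumptions to be checked against the repository's Compose file.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    This only checks that something is listening; it does not verify that the
    listener is a healthy gRPC service.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the port is open; close it right away.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port("localhost", 50051) would poll for up to 30 seconds. Port 50051 is only the port commonly used in gRPC examples, not a value confirmed by the project, so substitute whatever port the Compose file publishes.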

For a deeper dive into the API definitions and capabilities, head straight to the GitHub repository.

Final Thoughts

Kreuzberg resonates with me because it solves a practical problem in a developer-centric way. It doesn’t try to be a flashy AI demo; it tries to be a reliable component. If you’re building a system that needs to ingest PDFs, forms, or images and you’re tired of gluing together disparate services or dealing with opaque cloud pricing, this is worth a look. It lets you add sophisticated document intelligence as simply as you’d add a database or a cache to your stack, keeping the power and the data where your code lives.



Project ID: 7f1dea4a-9507-4740-85f5-73e13af979af
Last updated: December 24, 2025 at 11:47 AM