Extract structured data from text using LLMs with source grounding

Post author: @githubprojects

Project Description

Extract Structured Data from Text with LLMs and Source Grounding

If you've ever tried to pull structured data out of unstructured text, you know the pain. Regular expressions only get you so far, and writing a custom parser for every new format is a slog. Large Language Models (LLMs) are good at understanding text, but using them for data extraction often feels like a black box: you get an answer with no indication of which part of the text it came from. Source grounding closes that gap.

Enter LangExtract, a new open-source library from Google that tackles this exact problem. It uses an LLM to pull structured data from text, and it also maps each extracted item back to the source text that supports it. You get structured output and a built-in audit trail.

What It Does

In short, LangExtract is a Python library built around a single function, extract(). You give it your text, a short prompt describing what to extract, and a few worked examples that show the output you want. It returns the extractions it found, each with a class, the extracted text, and optional attributes. The key difference from a plain LLM call is that each extraction is mapped to the span of source text (start and end character positions) it was taken from. This is the "source grounding."
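To make the span idea concrete, here is a minimal sketch of a grounded value as plain Python data. `GroundedValue` and its fields are hypothetical names chosen for illustration, not the library's own classes. The point is only that slicing the source with the stored indices recovers the supporting text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedValue:
    """Hypothetical container: an extracted value plus its source span."""
    value: str
    start: int  # index of the first character of the span
    end: int    # index one past the last character (Python slice style)

    def snippet(self, source: str) -> str:
        """Return the slice of the source text that backs this value."""
        return source[self.start:self.end]


source = "The Dark Knight is a 2008 superhero film."
year_start = source.index("2008")

title = GroundedValue(value="The Dark Knight", start=0, end=15)
year = GroundedValue(value="2008", start=year_start, end=year_start + 4)

print(title.snippet(source))
print(year.snippet(source))
```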

Why It's Cool

The source grounding feature is the standout here. It moves beyond "just trust the model" to a more verifiable, transparent approach. This matters most for:

  • Building Reliable Pipelines: You can automatically validate extracted data by checking it against the original source snippet.
  • Debugging & Improvement: When the model extracts something wrong, you can see exactly which passage it drew on. This makes iterating on your prompts or your source data much faster.
  • Auditability: For applications in legal, financial, or scientific domains, being able to cite the exact source for a piece of data is critical.
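The validation and debugging ideas in this list can be sketched with nothing but string slicing. The helper names `is_grounded` and `context` are hypothetical, and the spans are hand-written stand-ins for what an extraction step would return. Treat this as an illustration of the pattern, not as the library's API.

```python
def is_grounded(source: str, value: str, start: int, end: int) -> bool:
    """True if the span is in bounds and the sliced text matches the value."""
    if not (0 <= start < end <= len(source)):
        return False
    return source[start:end].strip().lower() == value.strip().lower()


def context(source: str, start: int, end: int, width: int = 20) -> str:
    """Return the span in [brackets] with up to `width` characters on each side."""
    left = source[max(0, start - width):start]
    right = source[end:end + width]
    return f"{left}[{source[start:end]}]{right}"


source = "The Dark Knight is a 2008 superhero film starring Christian Bale."

# Hand-written (value, start, end) triples; the last one is deliberately wrong.
candidates = [
    ("The Dark Knight", 0, 15),
    ("2008", 21, 25),
    ("Heath Ledger", 50, 64),
]

for value, start, end in candidates:
    status = "ok" if is_grounded(source, value, start, end) else "REJECTED"
    print(f"{status:8} {value!r} -> {context(source, start, end)}")
```

A rejected row shows the text the span actually points at, which is usually enough to tell whether the value or the span is at fault.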

It's also pragmatic. It's a regular Python package that uses the Gemini API by default and can also work with other backends, such as OpenAI models or local models served through Ollama, which makes it relatively straightforward to integrate into an existing workflow. It feels less like a research prototype and more like a practical tool meant for developers.

How to Try It

Getting started is pretty standard for a Python library. First, you'll need a Gemini API key from Google AI Studio.

  1. Install it:

    pip install langextract
    
  2. Set your API key:

    export LANGEXTRACT_API_KEY="your-key-here"
    
  3. Write a simple extraction script:

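    import langextract as lx

    # Describe the task in plain language.
    prompt = "Extract the movie title, release year, and lead actors. Use exact text from the source."

    # One worked example shows the model the output shape you expect.
    examples = [
        lx.data.ExampleData(
            text="Inception is a 2010 science fiction film starring Leonardo DiCaprio.",
            extractions=[
                lx.data.Extraction(extraction_class="title", extraction_text="Inception"),
                lx.data.Extraction(extraction_class="release_year", extraction_text="2010"),
                lx.data.Extraction(extraction_class="actor", extraction_text="Leonardo DiCaprio"),
            ],
        )
    ]

    text = """
    The Dark Knight is a 2008 superhero film. It features Christian Bale,
    the late Heath Ledger, and Michael Caine in leading roles.
    """

    result = lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash",
    )

    for item in result.extractions:
        span = item.char_interval  # may be None if the text could not be aligned
        source = text[span.start_pos:span.end_pos] if span else "(no span)"
        print(f"{item.extraction_class}: {item.extraction_text!r} <- {source!r}")
    # Expected to include a line like: title: 'The Dark Knight' <- 'The Dark Knight'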
    

Check out the GitHub repository for more detailed examples and the full API documentation.

Final Thoughts

LangExtract feels like a step in the right direction for production use of LLMs. It acknowledges that for many real-world tasks an answer isn't enough: you need justification. By baking source grounding into the core extraction process, it provides a cleaner, more trustworthy pattern than trying to bolt citation on after the fact.
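To see why after-the-fact citation is fragile, consider the naive approach of searching the source for each returned value. The sketch below uses only `str.find`. `naive_locate` is a hypothetical helper, and the sample text is made up for illustration.

```python
from typing import Optional, Tuple


def naive_locate(source: str, value: str) -> Optional[Tuple[int, int]]:
    """Post-hoc lookup: return the span of the FIRST occurrence, or None."""
    start = source.find(value)
    if start == -1:
        return None
    return start, start + len(value)


source = "Revenue was 12 million in 2023. Costs were 12 million in 2024."

# Ambiguous: "12 million" appears twice, so the lookup cannot tell
# whether the value referred to revenue or costs. It always picks the first.
print(naive_locate(source, "12 million"))

# Brittle: a value that was normalized ("$12M") is not in the text at all.
print(naive_locate(source, "$12M"))
```

Both failure modes are the kind of problem that grounding during extraction is meant to avoid.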

If you're building anything that needs to turn documents, emails, or web pages into structured data, this library is worth a close look. It might just save you from writing your next fragile, one-off parser.


Project ID: 8d41a87e-6e63-4abe-8d29-bbfe3b69befd
Last updated: December 23, 2025 at 06:47 AM