AI Engineering From Scratch – Master AI Without Black Boxes


By @githubprojects


AI Engineering From Scratch – No Black Boxes, Just Code

You’ve probably used OpenAI, Hugging Face, or LlamaIndex. They’re great, but they often hide the internals behind abstractions. If you’re the kind of developer who wants to know exactly how an embedding works, how RAG retrieval actually looks under the hood, or how to train a small transformer from scratch, you’ve probably felt the itch to peel back the curtain.

That’s exactly what this GitHub repo does. It’s a hands-on, code-first guide to building AI components from the ground up — no opaque libraries, no magic. Just Python, numpy, and a lot of clear explanations.


What It Does

ai-engineering-from-scratch is a collection of Jupyter notebooks and scripts that walk you through building core AI/ML components from scratch. It covers:

  • Tokenization – byte-pair encoding and word-level tokenizers
  • Embeddings – Word2Vec, GloVe, and positional encodings
  • Transformers – attention mechanisms, multi-head attention, and a full transformer from scratch
  • Retrieval-Augmented Generation (RAG) – chunking, vector search, and a basic RAG pipeline
  • Fine-tuning – simple examples of adapting pretrained models
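To give a feel for the level the notebooks work at, here is a minimal word-level tokenizer of the kind the tokenization material builds. The class and method names are illustrative, not the repo’s actual API:

```python
# A minimal word-level tokenizer: build a vocabulary from a corpus,
# then map text to integer IDs and back again.
class WordTokenizer:
    def __init__(self, corpus: str):
        words = sorted(set(corpus.split()))
        self.stoi = {w: i for i, w in enumerate(words)}  # word -> id
        self.itos = {i: w for w, i in self.stoi.items()}  # id -> word

    def encode(self, text: str) -> list[int]:
        return [self.stoi[w] for w in text.split()]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.itos[i] for i in ids)


tok = WordTokenizer("the cat sat on the mat")
ids = tok.encode("the cat sat")
print(ids)              # integer IDs for each word
print(tok.decode(ids))  # round-trips back to "the cat sat"
```

Byte-pair encoding adds one idea on top of this: instead of splitting on whitespace, you iteratively merge the most frequent adjacent symbol pairs into new vocabulary entries.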

Each component is built with readable Python, often using nothing more than numpy and basic math. You can run it locally, step through it line by line, and actually understand what’s happening.
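As an example of the “nothing more than numpy” style, the sinusoidal positional encoding from the original transformer paper fits in a few lines (this is a standard formulation, not copied from the repo):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]    # (seq_len, 1) token positions
    i = np.arange(d_model)[None, :]      # (1, d_model) dimension indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Because it is just an array, you can print it, plot it, or add it to an embedding matrix and watch exactly what changes.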


Why It’s Cool

Most tutorials stop at “use this library.” This one goes all the way down to “here’s the actual algorithm, line by line.” Here’s what makes it stand out:

  • No black boxes – Every layer of a transformer, every attention head, every embedding lookup is implemented in plain code. You can print the tensors, inspect the gradients, and see exactly what changes.
  • Educational by design – The code is heavily commented, with explanations written in a way that assumes you know Python but not necessarily ML theory. It’s not a production framework, it’s a learning tool.
  • Covers the whole pipeline – From tokenization to training to inference, you follow the full flow. You’re not learning isolated pieces; you’re building a mental model of how they connect.
  • RAG done right – The RAG section is particularly clean: it shows you how to chunk documents, create embeddings from scratch (not just use SentenceTransformers), and do a simple cosine similarity search without relying on FAISS or Elasticsearch.
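The core of that RAG retrieval step is small enough to sketch here. Assuming you already have one embedding vector per chunk (in the repo they come from the from-scratch models; the vectors below are stand-ins), nearest-neighbor search by cosine similarity is just normalization plus a matrix product:

```python
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 2) -> np.ndarray:
    """Indices of the k rows of `vectors` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity per chunk
    return np.argsort(sims)[::-1][:k]  # highest similarity first

chunks = ["doc about cats", "doc about dogs", "doc about stocks"]
vectors = np.array([[1.0, 0.1],      # stand-in 2-D embeddings
                    [0.9, 0.2],
                    [0.0, 1.0]])
query = np.array([1.0, 0.0])
print([chunks[i] for i in top_k(query, vectors)])
```

That is, at heart, what a vector database does; FAISS and friends add approximate indexing so the same lookup scales to millions of vectors.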

If you’ve ever wondered “but how does attention really work?” or “what does a vector database actually do inside?”, this repo answers those questions with code you can run.
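For the attention question specifically, the whole mechanism is one formula, softmax(QKᵀ/√d_k)·V, and it runs fine in plain numpy. A minimal single-head sketch (shapes and names chosen for illustration):

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query positions, d_k = 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
print(out.shape)              # (3, 4): one output vector per query
print(w.sum(axis=-1))         # each attention row sums to 1
```

Being able to print `w` and see which positions attend to which is exactly the kind of inspection the repo is built around.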


How to Try It

  1. Clone the repo:

    git clone https://github.com/rohitg00/ai-engineering-from-scratch.git
    cd ai-engineering-from-scratch
    
  2. Install dependencies (it’s minimal – mainly numpy, maybe tqdm):

    pip install numpy tqdm
    
  3. Open any notebook:

    jupyter notebook notebooks/
    
  4. Start with 01_tokenization.ipynb or 03_attention_from_scratch.ipynb – those are the most fun.

No GPU required. No API keys. Just a laptop and curiosity.


Final Thoughts

This isn’t a “build your own AI in 5 minutes” repo. It’s a “spend an afternoon and finally get it” repo. If you’re a developer who likes to understand the stack all the way down — or if you’ve ever felt frustrated by tutorials that hand-wave the hard parts — you’ll enjoy this.

It won’t replace your production stack, but it will make you a better engineer when you go back to using those libraries. Highly recommend for any dev who wants to stop treating AI like magic and start treating it like code.


Found this useful? Follow us for more hand-picked dev tools and open source projects.
@githubprojects

Project ID: 6f3e6367-91e9-4e33-ab7d-95fa3fde24d7
Last updated: April 29, 2026 at 06:47 AM