Watch unlabeled Minecraft videos teach an AI to play the game
Post author: @githubprojects

Teaching AI to Play Minecraft by Watching YouTube Videos

You know the usual way to train a game-playing AI: set up a reward function, run a million simulations, tweak hyperparameters, pray. But what if the AI could just watch unlabeled Minecraft videos on YouTube and figure out how to play? That's roughly what OpenAI's Video Pre-Training (VPT) project does.

Instead of hand-crafting reward signals or requiring massive amounts of labeled gameplay data, VPT learns mostly from tens of thousands of hours of raw, unlabeled video. It's like giving an AI a YouTube binge session and asking it to pick up the basics of Minecraft. No explicit reward functions, and only a small labeled set to bootstrap from. Just pixels.

What It Does

Video Pre-Training is a framework for training foundation models that can play Minecraft by watching unlabeled video data. The core idea is straightforward:

  1. Train an inverse dynamics model (IDM) on a small amount of labeled gameplay (about 2000 hours). The IDM learns to predict actions from video frames.
  2. Apply the IDM to a huge dataset of unlabeled Minecraft YouTube videos (70,000+ hours). This generates action labels for the video frames, creating a massive labeled dataset.
  3. Train the VPT foundation model on this pseudo-labeled dataset with behavioral cloning: given the frames seen so far, predict the action the IDM assigned to the next step.
  4. Fine-tune the VPT model on specific tasks using reinforcement learning or supervised learning.
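To make steps 1 to 3 concrete, here is a toy sketch in plain Python. It is not the VPT code: the "game" is a hypothetical 1-D world where a frame is just an integer position, the `human` player and every function name are made up for illustration, and the "models" are lookup tables rather than neural networks. The flow is the same, though: learn an IDM from a little labeled play, pseudo-label frame-only videos, then clone the behavior.

```python
from collections import Counter, defaultdict

def play(policy, start, steps):
    """Roll out a policy in a toy 1-D world; return (frames, actions)."""
    frames, actions = [start], []
    for _ in range(steps):
        action = policy(frames[-1])       # action is -1, 0, or +1
        actions.append(action)
        frames.append(frames[-1] + action)
    return frames, actions

def human(pos):
    """Hypothetical 'human' player: walk toward position 10, then idle."""
    return 1 if pos < 10 else (-1 if pos > 10 else 0)

def train_idm(labeled):
    """Step 1: learn (frame before, frame after) -> action from labeled play."""
    votes = defaultdict(Counter)
    for frames, actions in labeled:
        for before, after, action in zip(frames, frames[1:], actions):
            votes[after - before][action] += 1
    table = {delta: c.most_common(1)[0][0] for delta, c in votes.items()}
    return lambda before, after: table.get(after - before, 0)

def pseudo_label(idm, videos):
    """Step 2: run the IDM over frame-only videos to recover actions."""
    return [(f, [idm(a, b) for a, b in zip(f, f[1:])]) for f in videos]

def behavioral_cloning(dataset):
    """Step 3: fit a policy mapping a frame to its most frequent action."""
    votes = defaultdict(Counter)
    for frames, actions in dataset:
        for frame, action in zip(frames, actions):
            votes[frame][action] += 1
    table = {frame: c.most_common(1)[0][0] for frame, c in votes.items()}
    return lambda frame: table.get(frame, 0)

# A small labeled set (frames AND actions) ...
labeled = [play(human, 0, 15), play(human, 20, 15)]
# ... and a much larger "YouTube" set where only the frames are kept.
videos = [play(human, start, 60)[0] for start in range(-30, 51)]

idm = train_idm(labeled)
policy = behavioral_cloning(pseudo_label(idm, videos))

final_frames, _ = play(policy, -5, 30)
print(final_frames[-1])  # 10: the cloned policy walks to the same goal
```

The cloned policy never sees a true action label for the large dataset; everything it learns there comes from the IDM's guesses. Step 4 (fine-tuning) would start from `policy` and adjust it for a specific task.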

The result? The AI learns to do things like chop trees, craft tools, and navigate the world—without ever being explicitly told what to do.

Why It’s Cool

This isn't another "watch the AI fail at Minecraft" demo. With fine-tuning, OpenAI reports that VPT agents can craft a diamond pickaxe, a long multi-step task, and the pipeline gets there with limited human labeling. Here's what makes it stand out:

  • Almost no reward design. The small labeled dataset (about 2,000 hours) is used only to train the IDM, and the main 70,000 hours of video carry no action labels at all. Pretraining is pure imitation, so no reward function is needed until the optional RL fine-tuning step. Compare that to traditional reinforcement learning, where you'd hand-craft a reward function for each task.
  • Works from raw pixels. No need to compress the game state into clean vectors. The model consumes raw video frames like a human would.
  • Generalizes surprisingly well. The pretrained model isn't just good at one task. It learns a broad suite of Minecraft behaviors—mining, crafting, building—and can be fine-tuned for specific goals.
  • Data source is YouTube. They scraped real human gameplay. That’s messy, noisy, and imperfect, but also rich with the kind of behavior that matters in open-world games.

The clever part is the IDM approach. Because they only need a small amount of labeled data to teach the IDM how frames map to human actions, the labels for everything else come from the IDM's own predictions. It's a neat way to bootstrap from scarce labeled data to massive unlabeled datasets.
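One reason the IDM can get by with less data, as the VPT paper describes it, is that it is allowed to look at frames on both sides of a timestep, while the playing policy has to act from the past alone. Guessing what happened between two frames is easier than deciding what to do next. The snippet below only illustrates that difference in context windows; the function names and window sizes are made up for the example and are not the ones VPT uses.

```python
def idm_window(frames, t, radius=2):
    """Non-causal context: frames before AND after step t."""
    return frames[max(0, t - radius): t + radius + 1]

def policy_window(frames, t, length=3):
    """Causal context: only frames up to and including step t."""
    return frames[max(0, t - length + 1): t + 1]

frames = list("ABCDEFG")  # stand-ins for video frames

print(idm_window(frames, 3))     # ['B', 'C', 'D', 'E', 'F']
print(policy_window(frames, 3))  # ['B', 'C', 'D']
```

The IDM gets to see frames E and F, which already show the outcome of whatever was pressed at D. The policy has to commit to an action before those frames exist.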

How to Try It

The code and weights are open source on GitHub. You can download the pretrained VPT models, see how they were trained, and even try fine-tuning them for your own tasks.

Check the repo here: openai/Video-Pre-Training

To get started:

git clone https://github.com/openai/Video-Pre-Training
cd Video-Pre-Training
pip install -r requirements.txt

Then check the README for the entry points and the checkpoint download links. At the time of writing, run_agent.py runs a pretrained agent from a model file and a weights file. You'll want a machine with a GPU, but inference with the downloaded checkpoints runs on far smaller hardware than training did.

For training from scratch, be warned: the full pipeline requires massive compute. The paper reports roughly nine days on 720 V100 GPUs for the largest foundation model. But if you just want to see the model in action or fine-tune on a small task, the codebase is approachable.

Final Thoughts

VPT is a neat demonstration that high-quality behavior can emerge from passive observation. It's not AGI playing Minecraft, but it's a real step toward agents that learn from unstructured real-world data—like watching YouTube videos.

For developers and researchers, the key takeaway is the IDM + unlabeled video pattern. This approach could transfer to other domains where labeled action data is expensive but raw video is cheap. Robotics, autonomous driving, even user interface automation could benefit.

And honestly? It's just satisfying seeing an AI learn to punch a tree by watching hours of other people doing the same thing. That's the kind of emergent behavior that makes you smile.


Last updated: June 17, 2026 at 02:22 PM