A new educational workshop enables developers to build a working GPT-style language model from the ground up, using only public code and minimal computational resources. The project, inspired by Andrej Karpathy’s nanoGPT, distills large language models (LLMs) into an accessible, step-by-step training pipeline that runs on consumer hardware, including laptops with Apple Silicon, NVIDIA GPUs, or even CPU-only systems.
Overview
The workshop, hosted on GitHub under the repository llm-from-scratch, is designed to be completed in a single session. It guides users through writing every component of a GPT model in PyTorch, from tokenization to text generation, without relying on pre-trained weights or black-box libraries like AutoModel.from_pretrained(). The target model size is approximately 10 million parameters (Medium config), which trains in about 45 minutes on an M3 Pro chip.
Three model configurations are provided:
- Tiny: ~0.5M parameters, 2 layers, 2 attention heads, 128 embedding dimensions — trains in ~5 minutes
- Small: ~4M parameters, 4 layers, 4 heads, 256 embedding dimensions — ~20 minutes
- Medium (default): ~10M parameters, 6 layers, 6 heads, 384 embedding dimensions — ~45 minutes
All use character-level tokenization with a vocabulary size of 65 and a context length (block_size) of 256.
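Expressed as code, the presets might look like the following sketch; the ModelConfig class and its field names are illustrative assumptions, not the repository's actual definitions:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Hypothetical container for the workshop's model hyperparameters."""
    n_layer: int             # number of transformer blocks
    n_head: int              # attention heads per block
    n_embd: int              # embedding dimension
    vocab_size: int = 65     # character-level vocabulary (per the workshop)
    block_size: int = 256    # context length

# The three presets described above
TINY = ModelConfig(n_layer=2, n_head=2, n_embd=128)     # ~0.5M parameters
SMALL = ModelConfig(n_layer=4, n_head=4, n_embd=256)    # ~4M parameters
MEDIUM = ModelConfig(n_layer=6, n_head=6, n_embd=384)   # ~10M parameters
```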
What it does
The workshop is structured into six parts, each focusing on a core component of the LLM pipeline:
- Tokenization: Implement a character-level tokenizer. The guide explains why Byte Pair Encoding (BPE) fails on small datasets like Shakespeare (~1MB) due to rare token bigrams, making character-level encoding more effective at this scale (see the tokenizer sketch after this list).
- Transformer Architecture: Build the full GPT model, including token and positional embeddings, multi-head self-attention, layer normalization, MLP blocks, and residual connections (see the attention sketch below).
- Training Loop: Code the complete training process: forward pass, cross-entropy loss, backpropagation, AdamW optimizer, gradient clipping, and learning rate scheduling (see the training-loop sketch below).
- Text Generation: Implement inference with sampling techniques such as temperature scaling and top-k filtering for autoregressive text generation (see the sampling sketch below).
- Putting It All Together: Train the model on the provided shakespeare.txt dataset, analyze loss curves, and explore scaling effects.
- Competition: Challenge users to train the best AI poet by experimenting with datasets, hyperparameters, and model size.
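A character-level tokenizer of the kind the workshop describes can be sketched in a few lines; the variable names below are illustrative, not the repository's exact code:

```python
# Minimal character-level tokenizer sketch (illustrative, not the repo's code)
with open("shakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                      # 65 unique characters in the corpus
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("To be, or not to be")) == "To be, or not to be"
```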
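The core of the architecture part is multi-head causal self-attention. The module below is a minimal sketch in the spirit of nanoGPT, assuming hyperparameter names (n_embd, n_head, block_size) like those above; it is not the workshop's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Sketch of multi-head causal self-attention (illustrative)."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # project to queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_head, T, head_dim) for per-head attention.
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)   # scaled dot-product
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                  # weighted sum of values
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

In the full model this block sits inside a residual stack alongside layer normalization and an MLP, with token and positional embeddings feeding the first layer.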
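The training loop combines the pieces listed above. The following is a self-contained sketch; the stand-in model, random data, and hyperparameter values are assumptions for illustration, not the workshop's settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins so the sketch runs on its own; the workshop trains its GPT model
# on Shakespeare batches instead (names and values here are illustrative).
vocab_size, block_size, batch_size = 65, 256, 32
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size))
data = torch.randint(0, vocab_size, (10_000,))

def get_batch():
    ix = torch.randint(0, len(data) - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([data[i : i + block_size] for i in ix])        # inputs
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])  # next-token targets
    return x, y

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    xb, yb = get_batch()
    logits = model(xb)                                   # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                                      # backpropagation
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()                                     # learning-rate schedule
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```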
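For generation, temperature scaling and top-k filtering can be layered onto a simple autoregressive loop. This sketch assumes `model(idx)` returns logits of shape (B, T, vocab_size); the function signature is an assumption, not the workshop's exact API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx: torch.Tensor, max_new_tokens: int,
             temperature: float = 1.0, top_k: int | None = None,
             block_size: int = 256) -> torch.Tensor:
    """Autoregressive sampling sketch with temperature and top-k filtering."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]            # crop to the context window
        logits = model(idx_cond)[:, -1, :]         # logits for the last position
        logits = logits / temperature              # temperature scaling
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")  # keep only top-k logits
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token
        idx = torch.cat([idx, next_id], dim=1)     # append and continue
    return idx
```

Lower temperatures make the output more deterministic, while top-k filtering prevents the model from sampling very unlikely characters.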
The project supports local execution via uv (