kanthgithub/llm-from-scratch

llm-from-scratch

What is PyTorch

  • PyTorch is a tensor library, an automatic differentiation engine, and a deep learning library.
  • It lets you build, train, and fine-tune deep learning models.
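
As a brief illustration of the first two roles, the sketch below multiplies two tensors and then asks the autograd engine for a derivative. The function y = x^2 + 2x is chosen only for the example and does not come from the notebook.

```python
import torch

# Tensor library: n-dimensional arrays with vectorized math.
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = a @ a.T  # matrix product of a with its transpose

# Automatic differentiation: record operations on x, then differentiate.
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x
y.backward()          # fills x.grad with dy/dx
grad = x.grad.item()  # dy/dx = 2x + 2, evaluated at x = 3
```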

LLM Model From Scratch

This Jupyter notebook demonstrates the process of building a Large Language Model (LLM) from scratch, covering essential components such as data preparation, tokenization, model architecture (embedding, attention mechanisms, feed-forward networks), training, and evaluation. It aims to provide a fundamental understanding of LLM construction without relying on high-level frameworks for core components.

Introduction

This notebook provides a step-by-step guide to understanding the internal workings of a transformer-based LLM. It focuses on implementing the foundational blocks of such a model from first principles, making it an excellent resource for those who want to grasp the core concepts behind modern LLMs.

Credits

A few images and concepts in this notebook are inspired by Sebastian Raschka's blog.

Setup and Dependencies

To run this notebook, you will need the following libraries and modules (torch.nn, torch.nn.functional, and torch.utils.data ship with torch):

  • torch: For building and training the neural network.
  • torch.nn: Neural network modules.
  • torch.nn.functional (as F): Common neural network functions.
  • torch.utils.data: Utilities for data loading.
  • tiktoken: For advanced tokenization (specifically, OpenAI's cl100k_base tokenizer is used).
  • tqdm: For progress bars during training.

You can install these dependencies using pip:

pip install torch tiktoken tqdm

Dataset

The notebook utilizes a simple dataset for demonstration purposes. The dataset consists of pairs of input and target sequences.

  • Input Sample: [1, 5, 2, 6, 3, 7, 4, 8]

  • Target Sample: [5, 2, 6, 3, 7, 4, 8, 9]

The target is the input shifted left by one position, so each input token is paired with the token that follows it. This is the standard setup for next-token prediction in LLMs.
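
A minimal sketch of how such a pair can be produced from one longer sequence. The helper name `make_pair` and the nine-token source sequence are assumptions made for illustration; the notebook may build its samples differently.

```python
def make_pair(sequence):
    """Split one token sequence into (input, target), offset by one position."""
    if len(sequence) < 2:
        raise ValueError("need at least two tokens to form a pair")
    return sequence[:-1], sequence[1:]

# Assumed source sequence that reproduces the samples shown above.
tokens = [1, 5, 2, 6, 3, 7, 4, 8, 9]
inputs, targets = make_pair(tokens)
```

Here `targets[i]` is always the token that follows `inputs[i]` in the original sequence.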

Tokenization

The notebook implements a custom tokenization process and also leverages tiktoken for more robust tokenization.

  • Custom Tokenizer: A basic tokenizer is created to map characters to integers and vice versa. It builds a vocabulary from the input text and provides encode and decode functions.

  • tiktoken Integration: The cl100k_base tokenizer from tiktoken (used by OpenAI's models like gpt-4, gpt-3.5-turbo, text-embedding-ada-002) is also demonstrated for more practical tokenization.
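
The character-level tokenizer described above can be sketched roughly as follows. The class name `CharTokenizer` and its attribute names are hypothetical, not taken from the notebook.

```python
class CharTokenizer:
    """Character-level tokenizer: one integer id per distinct character."""

    def __init__(self, text):
        chars = sorted(set(text))  # vocabulary in a stable order
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self):
        return len(self.stoi)

    def encode(self, text):
        # Raises KeyError for characters that were not in the training text.
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer("hello world")
ids = tok.encode("hello")
text = tok.decode(ids)
```

With tiktoken, the corresponding calls are `tiktoken.get_encoding("cl100k_base")` followed by `encode` and `decode` on the returned encoding object.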

Model Architecture

The LLM is built component by component, detailing the implementation of each key layer:

Embedding Layer

  • Token Embeddings: Maps input tokens (integers) to dense vector representations.

  • Positional Embeddings: Adds positional information to the token embeddings, crucial for transformers to understand the order of tokens in a sequence. This notebook implements fixed positional embeddings (sine and cosine functions).
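
The two embedding steps can be sketched as below. This assumes the common sine/cosine formulation with a base of 10000 and an even embedding width; the function name `sinusoidal_positions` and all sizes are illustrative, not the notebook's.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(max_len, d_model):
    """Return a (max_len, d_model) table of fixed sine/cosine encodings (d_model even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                   # (d_model / 2,)
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)  # even columns
    table[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return table

vocab_size, d_model, max_len = 10, 16, 32            # illustrative sizes
token_emb = nn.Embedding(vocab_size, d_model)        # learned token embeddings
pos_table = sinusoidal_positions(max_len, d_model)   # fixed, not trained

ids = torch.tensor([[1, 5, 2, 6, 3, 7, 4, 8]])       # (batch=1, seq=8)
x = token_emb(ids) + pos_table[: ids.size(1)]        # broadcasts over the batch
```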

Attention Mechanism

The core of the Transformer architecture. This notebook details the SelfAttention mechanism.

  • Query, Key, Value (QKV): Explains how input sequences are transformed into Query, Key, and Value matrices.

  • Scaled Dot-Product Attention: Computes the attention output with the formula softmax(Q * K^T / sqrt(d_k)) * V, where d_k is the dimension of the key vectors.

  • Masked Self-Attention: For decoder-only models, a look-ahead mask is applied to prevent tokens from attending to future tokens during training.
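
The three points above fit in a few lines of PyTorch. This is a sketch under assumed shapes (batch, seq, d_k) with random projection matrices standing in for learned weights; the function name `causal_self_attention` is illustrative.

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V with a look-ahead mask; inputs are (batch, seq, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, seq, seq)
    seq_len = q.size(-2)
    # True above the diagonal marks "future" positions that must be hidden.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)                    # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(1, 4, 8)                                   # (batch, seq, d_model)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))      # stand-ins for learned weights
out, weights = causal_self_attention(x @ w_q, x @ w_k, x @ w_v)
```

Because masked scores are set to negative infinity before the softmax, the first token can attend only to itself, and every weight above the diagonal is exactly zero.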

Multi-Head Attention

Extends the single attention mechanism by performing multiple attention calculations in parallel.

  • Combines the outputs of multiple attention heads to capture diverse relationships within the sequence.
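
One common way to implement this is to project once, split the result into heads, attend per head, then concatenate and project again. The sketch below follows that pattern; the class name, the fused QKV projection, and the built-in causal mask are assumptions rather than the notebook's exact design.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.out = nn.Linear(d_model, d_model)       # mixes the concatenated heads

    def _split(self, z, b, t):
        # (b, t, d_model) -> (b, heads, t, d_head)
        return z.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = (self._split(z, b, t) for z in self.qkv(x).chunk(3, dim=-1))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (b, heads, t, t)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))            # causal mask
        ctx = F.softmax(scores, dim=-1) @ v                         # (b, heads, t, d_head)
        ctx = ctx.transpose(1, 2).contiguous().view(b, t, d)        # concatenate heads
        return self.out(ctx)

mha = MultiHeadSelfAttention(d_model=16, num_heads=4)
y = mha(torch.randn(2, 5, 16))
```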

Feed Forward Network (FFN)

  • A simple two-layer neural network with a ReLU activation, applied independently to each position in the sequence.

Transformer Block

  • Combines Multi-Head Attention and a Feed Forward Network, along with Layer Normalization and residual connections.
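
A rough sketch of how the feed-forward network and the block fit together. To stay short it uses PyTorch's built-in `nn.MultiheadAttention` in place of a hand-written attention layer, and it applies Layer Normalization before each sub-layer (pre-norm). Both choices are assumptions; the notebook may order these steps differently.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Two-layer position-wise FFN with ReLU, applied to each position independently.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        t = x.size(1)
        # Boolean mask: True marks positions that may NOT be attended to.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                   # residual connection around attention
        x = x + self.ffn(self.norm2(x))    # residual connection around the FFN
        return x

block = TransformerBlock(d_model=16, num_heads=4, d_ff=64)
out = block(torch.randn(2, 5, 16))
```

The block preserves the input shape, which is what allows several blocks to be stacked.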

Decoder Only Transformer

  • The complete LLM architecture, stacking multiple Transformer Blocks. This architecture is suitable for generative tasks like text completion.

    • Input: Token IDs.

    • Output: Logits for the next token in the sequence.
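
The overall data flow from token IDs to logits can be sketched as follows. For brevity this uses `nn.TransformerEncoderLayer` with a causal mask as a stand-in for the hand-built Transformer Block, and learned position embeddings instead of the fixed sine/cosine ones. The class name `TinyDecoderOnlyLM` and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class TinyDecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size, d_model=32, num_heads=4, num_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned positions (simplification)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model, num_heads, dim_feedforward=4 * d_model,
                dropout=0.0, batch_first=True,
            )
            for _ in range(num_layers)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size)   # hidden state -> vocabulary logits

    def forward(self, ids):
        t = ids.size(1)
        pos = torch.arange(t, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        # Causal mask: True marks future positions that are hidden from each token.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=ids.device), diagonal=1)
        for block in self.blocks:
            x = block(x, src_mask=mask)
        return self.lm_head(x)                          # (batch, seq, vocab_size)

model = TinyDecoderOnlyLM(vocab_size=10)
logits = model(torch.tensor([[1, 5, 2, 6, 3, 7, 4, 8]]))
```

Each position in `logits` holds one score per vocabulary entry for the token that should come next.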

Training

The training process involves:

  • Loss Function: Cross-entropy loss is used, suitable for multi-class classification (predicting the next token from the vocabulary).

  • Optimizer: AdamW optimizer is employed for efficient training.

  • Batching: Data is processed in batches to improve training stability and speed.

  • Training Loop: Iterates over the dataset, performs forward and backward passes, and updates model weights.

  • Evaluation: The model is evaluated on a test set to monitor performance.
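
The loop structure can be sketched on the toy pair from the Dataset section. The model here is a deliberately small stand-in (an embedding followed by a linear layer), not the notebook's transformer, and the learning rate, epoch count, and vocabulary size are illustrative guesses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
vocab_size = 10
inputs = torch.tensor([[1, 5, 2, 6, 3, 7, 4, 8]])
targets = torch.tensor([[5, 2, 6, 3, 7, 4, 8, 9]])
loader = DataLoader(TensorDataset(inputs, targets), batch_size=1, shuffle=True)

# Stand-in model: maps each token to logits over the next token.
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

losses = []
for epoch in range(50):
    for x, y in loader:
        logits = model(x)                                  # (batch, seq, vocab)
        # Flatten so every position counts as one classification example.
        loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()                                    # backward pass
        optimizer.step()                                   # weight update
        losses.append(loss.item())
```

Tracking `losses` over time gives a quick check that training is making progress before evaluating on held-out data.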

Inference

After training, the notebook demonstrates how to use the trained model for inference to generate new sequences.

  • Greedy Decoding: The model predicts the next token with the highest probability at each step.
  • Generating Text: Shows how to seed the model with an initial sequence and have it generate a continuation.
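
Greedy decoding reduces to a short loop: ask the model for next-token logits, take the argmax, append it, and repeat. In the sketch below, `toy_logits` is a hypothetical stand-in for a trained model so that the example runs on its own.

```python
def greedy_generate(next_token_logits, seed, max_new_tokens):
    """Extend `seed` by repeatedly appending the highest-scoring next token."""
    tokens = list(seed)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_id)
    return tokens

def toy_logits(tokens, vocab_size=10):
    """Hypothetical model stand-in: favors (last token + 1) modulo vocab_size."""
    preferred = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == preferred else 0.0 for i in range(vocab_size)]

generated = greedy_generate(toy_logits, seed=[7, 8], max_new_tokens=4)
```

With a real model, `next_token_logits` would run a forward pass and return the logits at the last position.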

This README provides a high-level overview of the llm-model-from-scratch.ipynb notebook. For detailed implementation and further understanding, please refer to the notebook itself.
