This article is based on Andrej Karpathy's 4-hour video reproducing GPT-2. It is an excellent video and works well as the concluding chapter of the LLM-evolution series; this article is a written companion to it. For the earlier installments, see https://blog.nagi.fun, where the blogger covers them very thoroughly.
This series is planned in three parts: the basic implementation, the accelerated implementation, and distributed training.
Implementing GPT-2 nn.Module#
Config Configuration#
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # Sequence length limit (context window length)
    vocab_size: int = 50257  # Vocabulary size
    n_layer: int = 12        # Number of Transformer layers
    n_head: int = 12         # Number of attention heads
    n_embd: int = 768        # Embedding dimension (vector length for each token)
The @dataclass decorator defines a configuration class named GPTConfig. (If you are unfamiliar with decorators, you can look them up on CSDN or Zhihu.)
Why use dataclass:
• A regular class requires writing the __init__ method by hand; with the decorator this boilerplate disappears.
• Fields are declared explicitly, and you can directly print(GPTConfig(n_head=16)) to inspect the parameters.
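For example, the auto-generated __init__ and __repr__ make the config easy to tweak and inspect (a quick check using the GPTConfig defined above):
# dataclass auto-generates __init__ and __repr__, so overriding one field is enough
cfg = GPTConfig(n_head=16)
print(cfg)
# GPTConfig(block_size=1024, vocab_size=50257, n_layer=12, n_head=16, n_embd=768)
print(cfg.n_embd // cfg.n_head)  # per-head dimension: 48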
BackBone#
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            # word token embedding
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            # word position embedding
            wpe = nn.Embedding(config.block_size, config.n_embd),
            # main blocks
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            # final layer norm
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
• self.transformer: the core of the transformer architecture.
• nn.ModuleDict: a dictionary of submodules inside nn.Module; nn.ModuleDict(dict(ln_f=nn.LayerNorm(config.n_embd))) can be understood as {"ln_f": nn.LayerNorm(config.n_embd)}.
• wte: token-embedding lookup of shape [vocabulary size, embedding dimension]; turns tokens into feature vectors.
• wpe: position-embedding lookup of shape [sequence length, embedding dimension]; turns positional information into feature vectors.
• h: the stack of Transformer Blocks, the core of the model; each Block consists of an attention module and an MLP.
• ln_f: a final LayerNorm that normalizes the large variance accumulated by Pre-Norm; more on this below.
• lm_head: the final output layer, converting each token's feature vector back into vocabulary logits.
• Block: the Transformer body is a stack of identical Blocks.
Tips⚠️: Notice that in GPT-2 the LayerNorm sits before the attention and the MLP, unlike the original Transformer paper (and its figure), where the residual is added first and normalization is applied afterwards.
Karpathy's explanation: in the original model the residual is added first and then normalized, so the residual branch itself gets normalized, which is undesirable. A clean residual path is preferable because during backpropagation addition distributes the gradient equally to its two branches, so the gradient flows straight back to the input through the residual path, which is better from an optimization standpoint. To be honest, I did not fully understand this explanation, so I looked up related material: GPT-2's arrangement is called Pre-Norm, while the arrangement in "Attention Is All You Need" is called Post-Norm.
Su's explanation of the difference between the two is very insightful. A residual connection has the form $x_{t+1} = x_t + F_t(x_t)$. If $x_t$ has variance $\sigma_1^2$ and $F_t(x_t)$ has variance $\sigma_2^2$, then the output of the residual connection has variance $\sigma_1^2 + \sigma_2^2$; the residual amplifies the variance, so we need some way to bring it back down. A naive method is to add normalization, i.e. $x_{t+1} = \text{Norm}(x_t + F_t(x_t))$. However, while this stabilizes the forward-pass variance, it severely weakens the identity branch of the residual and therefore loses the residual's "easy to train" advantage; it typically needs a warm-up phase and a sufficiently small learning rate to converge. This matches two characteristics of the Transformer: sensitivity to hyperparameters during warm-up and slow convergence during optimization (the author does not know the underlying reason either). In short, under Post-Norm the model is harder to train and the training cost increases.
Now let's see how the identity branch of the residual gets weakened (this is what Karpathy means by a clean residual). Suppose that initially $x_t$ and $F_t(x_t)$ both have variance 1; then $x_t + F_t(x_t)$ has variance 2, and the normalization is responsible for scaling the variance back down to 1. So in the initial stage, Post-Norm is equivalent to

$$x_{t+1} = \frac{x_t + F_t(x_t)}{\sqrt{2}}$$

Expanding recursively,

$$x_l = \frac{x_{l-1}}{\sqrt{2}} + \frac{F_{l-1}(x_{l-1})}{\sqrt{2}} = \frac{x_{l-2}}{2} + \frac{F_{l-2}(x_{l-2})}{2} + \frac{F_{l-1}(x_{l-1})}{\sqrt{2}} = \cdots = \frac{x_0}{2^{l/2}} + \frac{F_0(x_0)}{2^{l/2}} + \frac{F_1(x_1)}{2^{(l-1)/2}} + \cdots + \frac{F_{l-1}(x_{l-1})}{2^{1/2}}$$

The original purpose of the residual is to give earlier layers a "green channel" through which gradients can flow back more directly. Under Post-Norm this green channel is severely weakened: the earlier a term appears, the smaller its weight, so after many residual connections the early layers can barely feel the gradient signal from the output. The residual exists in name only, and the network becomes hard to train. For details, see the paper "On Layer Normalization in the Transformer Architecture".
The corrected Pre-Norm takes the form

$$x_{t+1} = x_t + F_t(\text{Norm}(x_t))$$

Expanding the iteration:

$$x_l = x_0 + F_0(\text{Norm}(x_0)) + F_1(\text{Norm}(x_1)) + \cdots + F_{l-1}(\text{Norm}(x_{l-1}))$$

Every residual branch carries equal weight, so the effect of the residual is much more pronounced than under Post-Norm, and the model is easier to optimize. The cost is that the variance of the final output keeps growing with depth, so one more normalization is needed before the prediction layer, and that is exactly ln_f.
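A tiny numerical sketch (my own illustration, not taken from the video) of how the residual stream's variance grows under Pre-Norm, which is exactly what ln_f has to clean up at the end:
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, n_layer = 768, 12
x = torch.randn(1024, n_embd)   # stand-in residual stream, variance ≈ 1
ln = nn.LayerNorm(n_embd)       # untrained LayerNorm (weight=1, bias=0)

for _ in range(n_layer):
    # stand-in for an attention/MLP branch: a random projection of the normalized input
    f = torch.randn(n_embd, n_embd) / n_embd ** 0.5
    x = x + ln(x) @ f           # Pre-Norm style residual update
print(x.var().item())           # grows roughly linearly with depth, far above 1
print(ln(x).var().item())       # a final LayerNorm (ln_f) brings it back to ≈ 1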
Karpathy notes that attention is where tokens communicate: it is a pooling function, a weighted-sum function, a reduce operation. The MLP acts on each token individually, with no information collected or exchanged between tokens: it is a map operation.
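A schematic way to see the map/reduce distinction (a sketch with made-up shapes, not code from the video):
import torch
import torch.nn as nn

B, T, C = 2, 5, 8
x = torch.randn(B, T, C)

# "reduce": attention mixes information across the T (token) dimension
w = torch.softmax(torch.randn(B, T, T), dim=-1)   # attention weights, each row sums to 1
attn_out = w @ x                                  # every position is a weighted sum over all positions

# "map": the MLP is applied to each token independently, no cross-token mixing
mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
mlp_out = mlp(x)                                  # the same function applied at every (b, t) position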
MLP#
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, config.n_embd * 4)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(config.n_embd * 4, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x
A very simple MLP: a linear map from n_embd up to 4 * n_embd, then back down from 4 * n_embd to n_embd, with a GELU non-linearity in between. The GELU curve looks very much like ReLU, but it is smooth and has a non-zero gradient for negative inputs, avoiding ReLU's problem of a zero derivative when x is less than 0; this smoothness tends to produce better results.
Karpathy discusses why the tanh approximation is used, mentioning that this is a historical legacy issue. During the TensorFlow era, using the exact GELU was particularly slow, so a function using tanh to approximate GELU was developed.
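A quick comparison (a small check I added, not from the video) of how close the tanh approximation is to the exact, erf-based GELU:
import torch
import torch.nn as nn

x = torch.linspace(-4, 4, steps=9)
gelu_exact = nn.GELU()                     # exact, erf-based GELU
gelu_tanh = nn.GELU(approximate='tanh')    # GPT-2 style tanh approximation
print((gelu_exact(x) - gelu_tanh(x)).abs().max())  # difference is tiny (well below 1e-2)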
Attention#
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, config.n_embd * 3)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_embd = config.n_embd
        self.n_head = config.n_head
        # The role of this flag is discussed in the weight-initialization section later
        self.c_proj.NANOGPT_SCALE_INIT = 1
        # causal mask: lower-triangular matrix of shape (1, 1, block_size, block_size)
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        # obtain q, k, v from one combined projection
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        # query, key, value are each reshaped to [B, n_head, T, n_embd // n_head]
        query = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        key = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        value = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # QK^T / sqrt(d_head)
        att = query @ key.transpose(-1, -2) * (1.0 / math.sqrt(key.size(-1)))
        # mask out future positions, then softmax over the key dimension
        mask_att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        wei = F.softmax(mask_att, dim=-1)
        out = wei @ value
        # re-assemble the heads: [B, n_head, T, head_dim] -> [B, T, C]
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.c_proj(out)
        return out
• self.c_attn: the combined $W_q$, $W_k$, $W_v$ projection; a single matrix multiply turns the input $x$ into the $q$, $k$, $v$ tensors.
• self.c_proj: the output linear layer applied after the attention-weighted values have been computed.
• self.n_embd: the dimension of each token's feature vector.
• self.n_head: the number of heads in the multi-head attention mechanism.
• self.bias: despite the name, this is the causal mask: a lower-triangular matrix that prevents earlier tokens from attending to later ones. Every attention-score position above the diagonal is filled with -inf, and after the subsequent softmax those entries become essentially 0, so future tokens contribute nothing to the attention weights.
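A minimal demonstration of the masking (a standalone sketch with T = 3, not taken from the article's code):
import torch
import torch.nn.functional as F

T = 3
att = torch.zeros(T, T)                    # pretend attention scores
mask = torch.tril(torch.ones(T, T))        # lower-triangular causal mask
att = att.masked_fill(mask == 0, float('-inf'))
print(F.softmax(att, dim=-1))
# tensor([[1.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000],
#         [0.3333, 0.3333, 0.3333]])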
• contiguous(): transpose does not change the physical storage order, only the logical (strided) view, and contiguous() materializes the new order in memory. For example, a tensor laid out as [[1, 2], [3, 4]] becomes [[1, 3], [2, 4]] after a transpose, yet both share the same underlying storage [1, 2, 3, 4], so calling view on the transposed (non-contiguous) tensor raises an error.
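A quick check of this behaviour (a sketch I added):
import torch

a = torch.arange(6).view(2, 3)   # underlying storage: [0, 1, 2, 3, 4, 5]
b = a.transpose(0, 1)            # logical shape (3, 2), same storage
print(b.is_contiguous())         # False
try:
    b.view(6)                    # fails: view requires contiguous memory
except RuntimeError as e:
    print("view failed:", e)
print(b.contiguous().view(6))    # works after materializing the new order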
Download from Hugging Face#
# Inside class GPT (requires: from transformers import GPT2LMHeadModel)
@classmethod
def from_pretrained(cls, model_type):
    """Loads pretrained GPT-2 model weights from huggingface"""
    # Four model sizes
    assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
    # Print which one is being loaded
    print("Loading weights from pretrained gpt: %s" % model_type)
    # Each GPT-2 variant has its own hyperparameters
    config_args = {
        'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
        'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
        'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
        'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
    }[model_type]
    # Vocabulary size is always 50257
    config_args['vocab_size'] = 50257
    # Context window is always 1024
    config_args['block_size'] = 1024
    # Build our model with these hyperparameters
    config = GPTConfig(**config_args)
    model = GPT(config)
    # sd is our model's state dict (parameter name -> tensor)
    sd = model.state_dict()
    sd_keys = sd.keys()
    sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # discard the causal-mask buffer
    # Download the weights from HF; sd_hf is the HF model's state dict
    model_hf = GPT2LMHeadModel.from_pretrained(model_type, cache_dir="/home/shong_Tan/project/gpt_2/model_weight", local_files_only=True)
    sd_hf = model_hf.state_dict()
    sd_keys_hf = sd_hf.keys()
    # Discard the mask bias in the HF weights
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]
    # Discard the causal-mask buffer in the HF weights
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]
    # These weights are stored as Conv1D in HF and must be transposed into our Linear layout
    transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
    # Ensure sd and sd_hf contain the same number of parameter names
    assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
    # Copy every HF tensor into our model, transposing where necessary
    for k in sd_keys_hf:
        if any(k.endswith(w) for w in transposed):
            assert sd_hf[k].shape[::-1] == sd[k].shape
            with torch.no_grad():
                sd[k].copy_(sd_hf[k].t())
        else:
            assert sd_hf[k].shape == sd[k].shape, f"mismatched keys: {sd_hf[k].shape} != {sd[k].shape}"
            with torch.no_grad():
                sd[k].copy_(sd_hf[k])
    return model
Just read the code comments.
Tips⚠️: The lm_head.weight and transformer.wte.weight tensors downloaded from HF have the same shape, [vocab_size, n_embd] = [50257, 768]; one is the input embedding and the other produces the output logits. The two should be tied: a token is embedded into a feature vector, and after all the interactions the same matrix maps feature vectors back into tokens. Sharing them also saves a lot of GPU memory: 50257 × 768 ≈ 38.6M parameters, roughly 30% of the 124M-parameter model.
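In code, the tying is a single assignment inside GPT.__init__ (a sketch; this mirrors how nanoGPT does it):
# Inside GPT.__init__, after self.lm_head has been defined:
# weight sharing scheme: the token embedding and the output projection use the same tensor
self.transformer.wte.weight = self.lm_head.weight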
Forward#
# Inside class GPT
def forward(self, idx, target=None):
    # idx has shape [batch, token length]
    B, T = idx.size()
    # The sequence length cannot exceed the context window
    assert T <= self.config.block_size, f"Sequence of length {T} exceeds block_size {self.config.block_size}"
    # pos = [0, 1, 2, ..., T-1], placed on the same device as idx
    pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
    # Position embedding
    pos_emb = self.transformer.wpe(pos)   # (T, n_embd)
    # Token embedding
    tok_emb = self.transformer.wte(idx)   # (B, T, n_embd)
    # The addition broadcasts over the batch dimension
    x = tok_emb + pos_emb
    # Pass through the transformer blocks
    for block in self.transformer.h:
        x = block(x)
    # Final layer normalization
    x = self.transformer.ln_f(x)
    # Output projection
    logits = self.lm_head(x)              # (B, T, vocab_size)
    loss = None
    # If a target (label) is given we are training and compute the loss; otherwise we are doing inference
    if target is not None:
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target.view(-1))
    return logits, loss
# A small test
num_return_sequences = 5
max_length = 30
model = GPT.from_pretrained('gpt2')
# eval() switches layers such as dropout and batchnorm to inference behaviour (it does not freeze parameters)
model.eval()
# Move the model to the GPU
model.to('cuda')
# Tokenization: just use OpenAI's tiktoken library. For the underlying principle, see the blog recommended at the start of the article.
import tiktoken
enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Hello, I'm a language model, ")
tokens = torch.tensor(tokens, dtype=torch.long)               # (T,)
tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1)  # (5, T)
x = tokens.to('cuda')

while x.size(1) < max_length:
    with torch.no_grad():
        # Run the model to get predictions
        logits, loss = model(x)            # x: (B, T), logits: (B, T, C)
        # Take the prediction at the last position
        logits = logits[:, -1, :]          # (B, C)
        # Softmax over the vocabulary dimension
        probs = F.softmax(logits, dim=-1)  # (B, C)
        # Keep the top-k (k = 50) probabilities and their token indices
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
        # Sample one of the top-k tokens according to its probability
        ix = torch.multinomial(topk_probs, 1)
        # Look up the vocabulary index of the sampled token
        xcol = torch.gather(topk_indices, -1, ix)
        # Append the sampled token to x, giving an input of shape (B, T+1)
        x = torch.cat((x, xcol), dim=1)
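To actually see the generations, decode each row back to text (a small addition, mirroring what the video does at this point):
for i in range(num_return_sequences):
    tokens_out = x[i, :max_length].tolist()
    print(">", enc.decode(tokens_out))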
Tokenization turns "Hello, I'm a language model, " into [15496, 11, 314, 1101, 257, 3303, 2746, 11, 220]. You can try it yourself at https://tiktokenizer.vercel.app/
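The same round trip in code, using tiktoken:
import tiktoken

enc = tiktoken.get_encoding('gpt2')
ids = enc.encode("Hello, I'm a language model, ")
print(ids)              # [15496, 11, 314, 1101, 257, 3303, 2746, 11, 220]
print(enc.decode(ids))  # Hello, I'm a language model,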
Initialization#
Dataset#
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
# This is for Apple's M-series chips
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = 'mps'
print("Using device: ", device)

import tiktoken
enc = tiktoken.get_encoding('gpt2')
with open('input.txt', 'r') as f:
    data = f.read()
text = data[:1000]
tokens = enc.encode(text)

B, T = 4, 32
buf = torch.tensor(tokens[:B*T + 1])
buf = buf.to(device)
# Essentially: predict token n+1 from the first n tokens
x = buf[:-1].view(B, T)
y = buf[1:].view(B, T)

model = GPT(GPTConfig())
model.to(device)
logits, loss = model(x, y)
print(loss.item())
Here, loss is approximately 11, because at initialization the model's predictions are roughly uniform over the vocabulary, and the cross-entropy of a uniform distribution is $-\ln(1/50257) = \ln(50257) \approx 10.8$.
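A one-line sanity check of that number:
import math
print(math.log(50257))  # ≈ 10.82, the expected cross-entropy at initialization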
Code for training on a single batch:
# Use the AdamW optimizer; look up the difference between Adam and SGD yourself
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    # Clear the gradients held by the optimizer
    optimizer.zero_grad()
    # Get logits and loss
    logits, loss = model(x, y)
    # Backpropagate to compute gradients
    loss.backward()
    # Update the parameters using the gradients
    optimizer.step()
The Adam optimizer can converge faster than SGD.
Dataloader class:
class DataLoaderLite():
    def __init__(self, B, T):
        self.B = B
        self.T = T
        # Read the entire input.txt
        with open('input.txt', 'r') as f:
            data = f.read()
        enc = tiktoken.get_encoding('gpt2')
        tokens = enc.encode(data)
        self.tokens = torch.tensor(tokens, dtype=torch.long)
        print(f"loaded {len(self.tokens)} tokens")
        print(f"1 epoch = {len(self.tokens)//(B*T)} batches")
        # Current position within the token stream
        self.current_position = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position: self.current_position + B*T + 1]
        x = buf[:-1].view(B, T)
        y = buf[1:].view(B, T)
        # Each batch consumes B*T tokens
        self.current_position += B*T
        # If the next batch would run past the end, wrap around to the start
        if self.current_position + B*T + 1 > len(self.tokens):
            self.current_position = 0
        return x, y
Corrected training code:
train_loader = DataLoaderLite(4, 32)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    optimizer.zero_grad()
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    print(f"step: {i}, loss: {loss.item()}")
Weights#
def _init_weights(self, module):
    # called in GPT.__init__ via self.apply(self._init_weights)
    if isinstance(module, nn.Linear):
        std = 0.02
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            std *= (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0, std=0.02)
std *= (2 * self.config.n_layer) ** -0.5: this accounts for the contribution of the residual stream. Each residual connection adds another roughly equal contribution into the stream, so with $N$ additions the variance grows by a factor of $N$; scaling the branch output projections by $1/\sqrt{N}$ keeps the stream's variance under control. Here $N = 2 \times n\_layer$, the factor of 2 being there because every layer uses the residual twice, once for attention and once for the MLP. This tames the excessive variance caused by the Pre-Norm residual connections.
std: the base value of 0.02 follows the GPT-2 code and documentation; it is also roughly of the order of $1/\sqrt{n\_embd}$ (for example $1/\sqrt{768} \approx 0.036$), so 0.02 is a reasonable choice.
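A small numerical sketch (my own, with random noise standing in for the blocks' outputs) of why the $1/\sqrt{N}$ factor matters: adding $N$ unit-variance contributions grows the standard deviation like $\sqrt{N}$, and the rescaling cancels exactly that growth.
import torch

torch.manual_seed(0)
n_layer = 12
N = 2 * n_layer                  # residual additions per forward pass (attention + MLP per layer)

x = torch.zeros(1000)
for _ in range(N):
    x = x + torch.randn(1000)    # each branch contributes unit-variance noise
print(x.std())                   # ≈ sqrt(24) ≈ 4.9: the stream blows up

x = torch.zeros(1000)
for _ in range(N):
    x = x + torch.randn(1000) * (2 * n_layer) ** -0.5   # scaled as in _init_weights
print(x.std())                   # ≈ 1.0: variance stays controlled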