This article is based on Andrej Karpathy's 4-hour video reproducing GPT-2. It is an excellent video and works well as the concluding chapter of the LLM-evolution series; this article is a written companion to it. For the earlier installments, see https://blog.nagi.fun, where the blogger covers them very thoroughly.
This series is planned in three parts: the basic implementation, the accelerated implementation, and distributed training.
Implementing GPT-2 nn.Module#
Config Configuration#
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # Sequence length limit (context window length)
    vocab_size: int = 50257  # Vocabulary size
    n_layer: int = 12        # Number of Transformer layers
    n_head: int = 12         # Number of attention heads
    n_embd: int = 768        # Embedding dimension (vector length for each token)
The @dataclass decorator defines a configuration class named GPTConfig. (If you are unfamiliar with decorators, you can look them up on CSDN or Zhihu.)
Why use dataclass:
• A regular class requires writing the __init__ method by hand; with the decorator this boilerplate disappears.
• Fields are declared explicitly, and you can directly print(GPTConfig(n_head=16)) to inspect the parameters.
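For example, the auto-generated __init__ and __repr__ make the config easy to tweak and inspect (a quick check using the GPTConfig defined above):
# dataclass auto-generates __init__ and __repr__, so overriding one field is enough
cfg = GPTConfig(n_head=16)
print(cfg)
# GPTConfig(block_size=1024, vocab_size=50257, n_layer=12, n_head=16, n_embd=768)
print(cfg.n_embd // cfg.n_head)  # per-head dimension: 48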
BackBone#
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            # word token embedding
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            # word position embedding
            wpe = nn.Embedding(config.block_size, config.n_embd),
            # main blocks
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            # final layer norm
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
• self.transformer: the core of the transformer architecture.
• nn.ModuleDict: a dictionary of submodules inside nn.Module; nn.ModuleDict(dict(ln_f=nn.LayerNorm(config.n_embd))) can be understood as {"ln_f": nn.LayerNorm(config.n_embd)}.
• wte: token-embedding lookup of shape [vocabulary size, embedding dimension]; turns tokens into feature vectors.
• wpe: position-embedding lookup of shape [sequence length, embedding dimension]; turns positional information into feature vectors.
• h: the stack of Transformer Blocks, the core of the model; each Block consists of an attention module and an MLP.
• ln_f: a final LayerNorm that normalizes the large variance accumulated by Pre-Norm; more on this below.
• lm_head: the final output layer, converting each token's feature vector back into vocabulary logits.
• Block: the Transformer body is a stack of identical Blocks.
Tips⚠️: Notice that in GPT-2 the LayerNorm sits before the attention and the MLP, unlike the original Transformer paper (and its figure), where the residual is added first and normalization is applied afterwards.
Karpathy's explanation: in the original model the residual is added first and then normalized, so the residual branch itself gets normalized, which is undesirable. A clean residual path is preferable because during backpropagation addition distributes the gradient equally to its two branches, so the gradient flows straight back to the input through the residual path, which is better from an optimization standpoint. To be honest, I did not fully understand this explanation, so I looked up related material: GPT-2's arrangement is called Pre-Norm, while the arrangement in "Attention Is All You Need" is called Post-Norm.
Su's explanation of the difference between the two is very insightful. A residual connection has the form $x_{t+1} = x_t + F_t(x_t)$. If $x_t$ has variance $\sigma_1^2$ and $F_t(x_t)$ has variance $\sigma_2^2$, then the output of the residual connection has variance $\sigma_1^2 + \sigma_2^2$; the residual amplifies the variance, so we need some way to bring it back down. A naive method is to add normalization, i.e. $x_{t+1} = \text{Norm}(x_t + F_t(x_t))$. However, while this stabilizes the forward-pass variance, it severely weakens the identity branch of the residual and therefore loses the residual's "easy to train" advantage; it typically needs a warm-up phase and a sufficiently small learning rate to converge. This matches two characteristics of the Transformer: sensitivity to hyperparameters during warm-up and slow convergence during optimization (the author does not know the underlying reason either). In short, under Post-Norm the model is harder to train and the training cost increases.
Now let's see how the identity branch of the residual gets weakened (this is what Karpathy means by a clean residual). Suppose that initially $x_t$ and $F_t(x_t)$ both have variance 1; then $x_t + F_t(x_t)$ has variance 2, and the normalization is responsible for scaling the variance back down to 1. So in the initial stage, Post-Norm is equivalent to

$$x_{t+1} = \frac{x_t + F_t(x_t)}{\sqrt{2}}$$

Expanding recursively,

$$x_l = \frac{x_{l-1}}{\sqrt{2}} + \frac{F_{l-1}(x_{l-1})}{\sqrt{2}} = \frac{x_{l-2}}{2} + \frac{F_{l-2}(x_{l-2})}{2} + \frac{F_{l-1}(x_{l-1})}{\sqrt{2}} = \cdots = \frac{x_0}{2^{l/2}} + \frac{F_0(x_0)}{2^{l/2}} + \frac{F_1(x_1)}{2^{(l-1)/2}} + \cdots + \frac{F_{l-1}(x_{l-1})}{2^{1/2}}$$

The original purpose of the residual is to give earlier layers a "green channel" through which gradients can flow back more directly. Under Post-Norm this green channel is severely weakened: the earlier a term appears, the smaller its weight, so after many residual connections the early layers can barely feel the gradient signal from the output. The residual exists in name only, and the network becomes hard to train. For details, see the paper "On Layer Normalization in the Transformer Architecture".
The corrected Pre-Norm takes the form

$$x_{t+1} = x_t + F_t(\text{Norm}(x_t))$$

Expanding the iteration:

$$x_l = x_0 + F_0(\text{Norm}(x_0)) + F_1(\text{Norm}(x_1)) + \cdots + F_{l-1}(\text{Norm}(x_{l-1}))$$

Every residual branch carries equal weight, so the effect of the residual is much more pronounced than under Post-Norm, and the model is easier to optimize. The cost is that the variance of the final output keeps growing with depth, so one more normalization is needed before the prediction layer, and that is exactly ln_f.
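A tiny numerical sketch (my own illustration, not taken from the video) of how the residual stream's variance grows under Pre-Norm, which is exactly what ln_f has to clean up at the end:
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, n_layer = 768, 12
x = torch.randn(1024, n_embd)   # stand-in residual stream, variance ≈ 1
ln = nn.LayerNorm(n_embd)       # untrained LayerNorm (weight=1, bias=0)

for _ in range(n_layer):
    # stand-in for an attention/MLP branch: a random projection of the normalized input
    f = torch.randn(n_embd, n_embd) / n_embd ** 0.5
    x = x + ln(x) @ f           # Pre-Norm style residual update
print(x.var().item())           # grows roughly linearly with depth, far above 1
print(ln(x).var().item())       # a final LayerNorm (ln_f) brings it back to ≈ 1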
Karpathy notes that attention is where tokens communicate: it is a pooling function, a weighted-sum function, a reduce operation. The MLP acts on each token individually, with no information collected or exchanged between tokens: it is a map operation.
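A schematic way to see the map/reduce distinction (a sketch with made-up shapes, not code from the video):
import torch
import torch.nn as nn

B, T, C = 2, 5, 8
x = torch.randn(B, T, C)

# "reduce": attention mixes information across the T (token) dimension
w = torch.softmax(torch.randn(B, T, T), dim=-1)   # attention weights, each row sums to 1
attn_out = w @ x                                  # every position is a weighted sum over all positions

# "map": the MLP is applied to each token independently, no cross-token mixing
mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
mlp_out = mlp(x)                                  # the same function applied at every (b, t) position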
MLP#
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, config.n_embd * 4)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(config.n_embd * 4, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x
A very simple MLP: a linear map from n_embd up to 4 * n_embd, then back down from 4 * n_embd to n_embd, with a GELU non-linearity in between. The GELU curve looks very much like ReLU, but it is smooth and has a non-zero gradient for negative inputs, avoiding ReLU's problem of a zero derivative when x is less than 0; this smoothness tends to produce better results.
Karpathy discusses why the tanh approximation is used, mentioning that this is a historical legacy issue. During the TensorFlow era, using the exact GELU was particularly slow, so a function using tanh to approximate GELU was developed.
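A quick comparison (a small check I added, not from the video) of how close the tanh approximation is to the exact, erf-based GELU:
import torch
import torch.nn as nn

x = torch.linspace(-4, 4, steps=9)
gelu_exact = nn.GELU()                     # exact, erf-based GELU
gelu_tanh = nn.GELU(approximate='tanh')    # GPT-2 style tanh approximation
print((gelu_exact(x) - gelu_tanh(x)).abs().max())  # difference is tiny (well below 1e-2)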
Attention#
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, config.n_embd * 3)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_embd = config.n_embd
        self.n_head = config.n_head
        # The role of this flag is discussed in the weight-initialization section later
        self.c_proj.NANOGPT_SCALE_INIT = 1
        # causal mask: lower-triangular matrix of shape (1, 1, block_size, block_size)
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        # obtain q, k, v from one combined projection
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        # query, key, value are each reshaped to [B, n_head, T, n_embd // n_head]
        query = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        key = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        value = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # QK^T / sqrt(d_head)
        att = query @ key.transpose(-1, -2) * (1.0 / math.sqrt(key.size(-1)))
        # mask out future positions, then softmax over the key dimension
        mask_att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        wei = F.softmax(mask_att, dim=-1)
        out = wei @ value
        # re-assemble the heads: [B, n_head, T, head_dim] -> [B, T, C]
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.c_proj(out)
        return out
• self.c_attn: the combined $W_q$, $W_k$, $W_v$ projection; a single matrix multiply turns the input $x$ into the $q$, $k$, $v$ tensors.
• self.c_proj: the output linear layer applied after the attention-weighted values have been computed.
• self.n_embd: the dimension of each token's feature vector.
• self.n_head: the number of heads in the multi-head attention mechanism.
• self.bias: despite the name, this is the causal mask: a lower-triangular matrix that prevents earlier tokens from attending to later ones. Every attention-score position above the diagonal is filled with -inf, and after the subsequent softmax those entries become essentially 0, so future tokens contribute nothing to the attention weights.
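A minimal demonstration of the masking (a standalone sketch with T = 3, not taken from the article's code):
import torch
import torch.nn.functional as F

T = 3
att = torch.zeros(T, T)                    # pretend attention scores
mask = torch.tril(torch.ones(T, T))        # lower-triangular causal mask
att = att.masked_fill(mask == 0, float('-inf'))
print(F.softmax(att, dim=-1))
# tensor([[1.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000],
#         [0.3333, 0.3333, 0.3333]])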
• contiguous(): transpose does not change the physical storage order, only the logical (strided) view, and contiguous() materializes the new order in memory. For example, a tensor laid out as [[1, 2], [3, 4]] becomes [[1, 3], [2, 4]] after a transpose, yet both share the same underlying storage [1, 2, 3, 4], so calling view on the transposed (non-contiguous) tensor raises an error.
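A quick check of this behaviour (a sketch I added):
import torch

a = torch.arange(6).view(2, 3)   # underlying storage: [0, 1, 2, 3, 4, 5]
b = a.transpose(0, 1)            # logical shape (3, 2), same storage
print(b.is_contiguous())         # False
try:
    b.view(6)                    # fails: view requires contiguous memory
except RuntimeError as e:
    print("view failed:", e)
print(b.contiguous().view(6))    # works after materializing the new order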
Download from Hugging Face#
# Inside class GPT (requires: from transformers import GPT2LMHeadModel)
@classmethod
def from_pretrained(cls, model_type):
    """Loads pretrained GPT-2 model weights from huggingface"""
    # Four model sizes
    assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
    # Print which one is being loaded
    print("Loading weights from pretrained gpt: %s" % model_type)
    # Each GPT-2 variant has its own hyperparameters
    config_args = {
        'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
        'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
        'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
        'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
    }[model_type]
    # Vocabulary size is always 50257
    config_args['vocab_size'] = 50257
    # Context window is always 1024
    config_args['block_size'] = 1024
    # Build our model with these hyperparameters
    config = GPTConfig(**config_args)
    model = GPT(config)
    # sd is our model's state dict (parameter name -> tensor)
    sd = model.state_dict()
    sd_keys = sd.keys()
    sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # discard the causal-mask buffer
    # Download the weights from HF; sd_hf is the HF model's state dict
    model_hf = GPT2LMHeadModel.from_pretrained(model_type, cache_dir="/home/shong_Tan/project/gpt_2/model_weight", local_files_only=True)
    sd_hf = model_hf.state_dict()
    sd_keys_hf = sd_hf.keys()
    # Discard the mask bias in the HF weights
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]
    # Discard the causal-mask buffer in the HF weights
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]
    # These weights are stored as Conv1D in HF and must be transposed into our Linear layout
    transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
    # Ensure sd and sd_hf contain the same number of parameter names
    assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
    # Copy every HF tensor into our model, transposing where necessary
    for k in sd_keys_hf:
        if any(k.endswith(w) for w in transposed):
            assert sd_hf[k].shape[::-1] == sd[k].shape
            with torch.no_grad():
                sd[k].copy_(sd_hf[k].t())
        else:
            assert sd_hf[k].shape == sd[k].shape, f"mismatched keys: {sd_hf[k].shape} != {sd[k].shape}"
            with torch.no_grad():
                sd[k].copy_(sd_hf[k])
    return model
Just read the code comments.
Tips⚠️: The lm_head.weight and transformer.wte.weight tensors downloaded from HF have the same shape, [vocab_size, n_embd] = [50257, 768]; one is the input embedding and the other produces the output logits. The two should be tied: a token is embedded into a feature vector, and after all the interactions the same matrix maps feature vectors back into tokens. Sharing them also saves a lot of GPU memory: 50257 × 768 ≈ 38.6M parameters, roughly 30% of the 124M-parameter model.
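In code, the tying is a single assignment inside GPT.__init__ (a sketch; this mirrors how nanoGPT does it):
# Inside GPT.__init__, after self.lm_head has been defined:
# weight sharing scheme: the token embedding and the output projection use the same tensor
self.transformer.wte.weight = self.lm_head.weight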
Forward#
# Inside class GPT
def forward(self, idx, target=None):
    # idx has shape [batch, token length]
    B, T = idx.size()
    # The sequence length cannot exceed the context window
    assert T <= self.config.block_size, f"Sequence of length {T} exceeds block_size {self.config.block_size}"
    # pos = [0, 1, 2, ..., T-1], placed on the same device as idx
    pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
    # Position embedding
    pos_emb = self.transformer.wpe(pos)   # (T, n_embd)
    # Token embedding
    tok_emb = self.transformer.wte(idx)   # (B, T, n_embd)
    # The addition broadcasts over the batch dimension
    x = tok_emb + pos_emb
    # Pass through the transformer blocks
    for block in self.transformer.h:
        x = block(x)
    # Final layer normalization
    x = self.transformer.ln_f(x)
    # Output projection
    logits = self.lm_head(x)              # (B, T, vocab_size)
    loss = None
    # If a target (label) is given we are training and compute the loss; otherwise we are doing inference
    if target is not None:
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target.view(-1))
    return logits, loss
# A small test
num_return_sequences = 5
max_length = 30
model = GPT.from_pretrained('gpt2')
# eval() switches layers such as dropout and batchnorm to inference behaviour (it does not freeze parameters)
model.eval()
# Move the model to the GPU
model.to('cuda')
# Tokenization: just use OpenAI's tiktoken library. For the underlying principle, see the blog recommended at the start of the article.
import tiktoken
enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Hello, I'm a language model, ")
tokens = torch.tensor(tokens, dtype=torch.long)               # (T,)
tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1)  # (5, T)
x = tokens.to('cuda')

while x.size(1) < max_length:
    with torch.no_grad():
        # Run the model to get predictions
        logits, loss = model(x)            # x: (B, T), logits: (B, T, C)
        # Take the prediction at the last position
        logits = logits[:, -1, :]          # (B, C)
        # Softmax over the vocabulary dimension
        probs = F.softmax(logits, dim=-1)  # (B, C)
        # Keep the top-k (k = 50) probabilities and their token indices
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
        # Sample one of the top-k tokens according to its probability
        ix = torch.multinomial(topk_probs, 1)
        # Look up the vocabulary index of the sampled token
        xcol = torch.gather(topk_indices, -1, ix)
        # Append the sampled token to x, giving an input of shape (B, T+1)
        x = torch.cat((x, xcol), dim=1)
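To actually see the generations, decode each row back to text (a small addition, mirroring what the video does at this point):
for i in range(num_return_sequences):
    tokens_out = x[i, :max_length].tolist()
    print(">", enc.decode(tokens_out))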
Tokenization turns "Hello, I'm a language model, " into [15496, 11, 314, 1101, 257, 3303, 2746, 11, 220]. You can try it yourself at https://tiktokenizer.vercel.app/
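The same round trip in code, using tiktoken:
import tiktoken

enc = tiktoken.get_encoding('gpt2')
ids = enc.encode("Hello, I'm a language model, ")
print(ids)              # [15496, 11, 314, 1101, 257, 3303, 2746, 11, 220]
print(enc.decode(ids))  # Hello, I'm a language model,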
Initialization#
Dataset#
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
# This is for Apple's M-series chips
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = 'mps'
print("Using device: ", device)

import tiktoken
enc = tiktoken.get_encoding('gpt2')
with open('input.txt', 'r') as f:
    data = f.read()
text = data[:1000]
tokens = enc.encode(text)

B, T = 4, 32
buf = torch.tensor(tokens[:B*T + 1])
buf = buf.to(device)
# Essentially: predict token n+1 from the first n tokens
x = buf[:-1].view(B, T)
y = buf[1:].view(B, T)

model = GPT(GPTConfig())
model.to(device)
logits, loss = model(x, y)
print(loss.item())
Here, loss is approximately 11, because at initialization the model's predictions are roughly uniform over the vocabulary, and the cross-entropy of a uniform distribution is $-\ln(1/50257) = \ln(50257) \approx 10.8$.
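A one-line sanity check of that number:
import math
print(math.log(50257))  # ≈ 10.82, the expected cross-entropy at initialization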
Code for training on a single batch:
# Use the AdamW optimizer; look up the difference between Adam and SGD yourself
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    # Clear the gradients held by the optimizer
    optimizer.zero_grad()
    # Get logits and loss
    logits, loss = model(x, y)
    # Backpropagate to compute gradients
    loss.backward()
    # Update the parameters using the gradients
    optimizer.step()
The Adam optimizer can converge faster than SGD.
Dataloader class:
class DataLoaderLite():
    def __init__(self, B, T):
        self.B = B
        self.T = T
        # Read the entire input.txt
        with open('input.txt', 'r') as f:
            data = f.read()
        enc = tiktoken.get_encoding('gpt2')
        tokens = enc.encode(data)
        self.tokens = torch.tensor(tokens, dtype=torch.long)
        print(f"loaded {len(self.tokens)} tokens")
        print(f"1 epoch = {len(self.tokens)//(B*T)} batches")
        # Current position within the token stream
        self.current_position = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position: self.current_position + B*T + 1]
        x = buf[:-1].view(B, T)
        y = buf[1:].view(B, T)
        # Each batch consumes B*T tokens
        self.current_position += B*T
        # If the next batch would run past the end, wrap around to the start
        if self.current_position + B*T + 1 > len(self.tokens):
            self.current_position = 0
        return x, y
Corrected training code:
train_loader = DataLoaderLite(4, 32)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    optimizer.zero_grad()
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    print(f"step: {i}, loss: {loss.item()}")
Weights#
def _init_weights(self, module):
    # called in GPT.__init__ via self.apply(self._init_weights)
    if isinstance(module, nn.Linear):
        std = 0.02
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            std *= (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0, std=0.02)
std *= (2 * self.config.n_layer) ** -0.5: this accounts for the contribution of the residual stream. Each residual connection adds another roughly equal contribution into the stream, so with $N$ additions the variance grows by a factor of $N$; scaling the branch output projections by $1/\sqrt{N}$ keeps the stream's variance under control. Here $N = 2 \times n\_layer$, the factor of 2 being there because every layer uses the residual twice, once for attention and once for the MLP. This tames the excessive variance caused by the Pre-Norm residual connections.
std: the base value of 0.02 follows the GPT-2 code and documentation; it is also roughly of the order of $1/\sqrt{n\_embd}$ (for example $1/\sqrt{768} \approx 0.036$), so 0.02 is a reasonable choice.
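A small numerical sketch (my own, with random noise standing in for the blocks' outputs) of why the $1/\sqrt{N}$ factor matters: adding $N$ unit-variance contributions grows the standard deviation like $\sqrt{N}$, and the rescaling cancels exactly that growth.
import torch

torch.manual_seed(0)
n_layer = 12
N = 2 * n_layer                  # residual additions per forward pass (attention + MLP per layer)

x = torch.zeros(1000)
for _ in range(N):
    x = x + torch.randn(1000)    # each branch contributes unit-variance noise
print(x.std())                   # ≈ sqrt(24) ≈ 4.9: the stream blows up

x = torch.zeros(1000)
for _ in range(N):
    x = x + torch.randn(1000) * (2 * n_layer) ** -0.5   # scaled as in _init_weights
print(x.std())                   # ≈ 1.0: variance stays controlled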