# 6. Pre-training & Loading models

{{#include ../../banners/hacktricks-training.md}}

## Text Generation

To train a model we need that model to be able to generate new tokens. Then we compare the generated tokens with the expected ones in order to train the model into **learning the tokens it needs to generate**.

As in the previous examples we already predicted some tokens, it's possible to reuse that function for this purpose.

> [!TIP]
> The goal of this sixth phase is very simple: **Train the model from scratch**. For this, the previous LLM architecture is used, with some loops going over the data sets, using the defined loss functions and an optimizer to train all the parameters of the model.

## Text Evaluation

In order to perform a correct training, it's needed to measure the predictions obtained for the expected token. The goal of the training is to maximize the likelihood of the correct token, which involves increasing its probability relative to the other tokens.

In order to maximize the probability of the correct token, the weights of the model must be modified so that that probability is maximised. The updates of the weights are done via **backpropagation**. This requires a **loss function to minimize**. In this case, the function will be the **difference between the performed prediction and the desired one**.

However, instead of working with the raw predictions, it works with a logarithm of base n. So, if the current prediction of the expected token was 7.4541e-05, the natural logarithm (base *e*) of **7.4541e-05** is approximately **-9.5042**.\
Then, for each entry with a context length of 5 tokens for example, the model will need to predict 5 tokens, where the first 4 tokens are the last ones of the input and the fifth is the predicted one. Therefore, for each entry we will have 5 predictions in that case (even if the first 4 were in the input, the model doesn't know this) with 5 expected tokens and therefore 5 probabilities to maximize.

Therefore, after applying the natural logarithm to each prediction, the **average** is computed, the **minus sign is removed** (this is called _cross entropy loss_) and that's the **number to reduce as close to 0 as possible**, because the natural logarithm of 1 is 0:

<figure><img src="../../images/image (10) (1).png" alt="" width="563"><figcaption><p><a href="https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233">https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233</a></p></figcaption></figure>
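As a quick sanity check (a minimal toy sketch, not part of the original notebook; the tensor values are made up), the cross entropy used later by `calc_loss_batch` is exactly the negative mean of the natural log of the probabilities assigned to the expected tokens:
```python
import torch
import torch.nn.functional as F

# Toy values: batch of 1 sequence, 5 positions to predict, vocabulary of 10 tokens
logits = torch.randn(1, 5, 10)          # model output: (batch, seq_len, vocab_size)
targets = torch.randint(0, 10, (1, 5))  # expected token ids: (batch, seq_len)

# Manual computation: probability of each expected token, then -mean(ln(p))
probs = torch.softmax(logits, dim=-1)
target_probs = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
manual_loss = -torch.log(target_probs).mean()

# Same value using the built-in helper (this is what calc_loss_batch does below)
builtin_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(manual_loss.item(), builtin_loss.item())  # both match
```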

Another way to measure how good the model is is called perplexity. **Perplexity** is a metric used to evaluate how well a probability model predicts a sample. In language modelling, it represents the **model's uncertainty** when predicting the next token in a sequence.\
For example, a perplexity value of 48725 means that, when it needs to predict a token, the model is unsure about which among 48,725 tokens of the vocabulary is the right one.

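Since the cross entropy loss is the mean negative natural log of the assigned probabilities, perplexity is just its exponential (the same relation is used later in the plotting code with `math.exp(loss)`):
```python
import math

loss = 10.794                # example cross entropy value of an untrained model (illustrative)
perplexity = math.exp(loss)  # ~48,725 -> as unsure as picking among ~48,725 vocabulary tokens
print(perplexity)
```
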
## Pre-Train Example

This is the initial code proposed in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb), slightly modified.

<details>

<summary>Previous code used here but already explained in previous sections</summary>

```python
"""
This is code explained before so it won't be explained again
"""

import tiktoken
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader


class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by n_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec


class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)


class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)   # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed-forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
```

</details>

```python
# Download contents to train the data with
import os
import urllib.request

file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

total_characters = len(text_data)
tokenizer = tiktoken.get_encoding("gpt2")
total_tokens = len(tokenizer.encode(text_data))

print("Data downloaded")
print("Characters:", total_characters)
print("Tokens:", total_tokens)

# Model initialization
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 256,  # Shortened context length (orig: 1024)
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
print("Model initialized")


# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dimension
    return tokenizer.decode(flat.tolist())


# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches


# Apply Train/validation ratio and create dataloaders
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)


# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)


# Indicate the device to use
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device)  # no assignment model = model.to(device) necessary for nn.Module classes


# Pre-calculate losses without starting yet
torch.manual_seed(123)  # For reproducibility due to the shuffling in the data loader

with torch.no_grad():  # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)


# Functions to train the data
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # Calculate loss gradients
            optimizer.step()  # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        # Note: generate_text is defined later, in the "Generate text functions" section
        token_ids = generate_text(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()


# Start training!
import time
start_time = time.time()

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")


# Show graphics with the training process
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

# Compute perplexity from the loss values
train_ppls = [math.exp(loss) for loss in train_losses]
val_ppls = [math.exp(loss) for loss in val_losses]

# Plot perplexity over tokens seen
plt.figure()
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
plt.xlabel('Tokens Seen')
plt.ylabel('Perplexity')
plt.title('Perplexity over Training')
plt.legend()
plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)


torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "/tmp/model_and_optimizer.pth"
)
```

### Functions to transform text <--> ids

These are some simple functions that can be used to transform text from the vocabulary into ids and vice versa. This is needed at the beginning of the text handling and at the end of the predictions:

```python
# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dimension
    return tokenizer.decode(flat.tolist())
```
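For example, a quick round trip (assuming `tiktoken` is installed; the exact ids depend on the tokenizer):
```python
tokenizer = tiktoken.get_encoding("gpt2")

ids = text_to_token_ids("Every effort moves you", tokenizer)
print(ids)                                # tensor of token ids with a batch dimension, e.g. shape (1, 4)
print(token_ids_to_text(ids, tokenizer))  # "Every effort moves you"
```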
### Generate text functions

In a previous section, a function just got the **most probable token** after obtaining the logits. However, this means that for each input the same output is always going to be generated, which makes the model very deterministic.

The following `generate_text` function applies the `top-k`, `temperature` and `multinomial` concepts.

- **`top-k`** means that we start reducing to `-inf` the probabilities of all the tokens except the top k tokens. So, if k=3, before making a decision only the 3 most probable tokens will have a probability different from `-inf`.
- **`temperature`** means that every probability is divided by the temperature value. A value of `0.1` will boost the highest probability compared with the lowest one, while a temperature of `5`, for example, will make the distribution flatter. This helps to introduce the variation in responses we would like the LLM to have.
- After applying the temperature, a **`softmax`** function is applied again to make all the remaining tokens have a total probability of 1.
- Finally, instead of choosing the token with the biggest probability, the **`multinomial`** function is applied to **predict the next token according to the final probabilities**. So if token 1 had 70% of probability, token 2 20% and token 3 10%, then 70% of the time token 1 will be selected, 20% of the time it will be token 2 and 10% of the time it will be token 3.

```python
# Generate text function
def generate_text(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    # For-loop is the same as before: Get logits, and only focus on last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: Filter logits with top_k sampling
        if top_k is not None:
            # Keep only top_k values
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)

        # New: Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, context_len)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Otherwise same as before: get idx of the vocab entry with the highest logits value
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        if idx_next == eos_id:  # Stop generating early if end-of-sequence token is encountered and eos_id is specified
            break

        # Same as before: append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx
```

> [!TIP]
> There is a common alternative to `top-k` called [**`top-p`**](https://en.wikipedia.org/wiki/Top-p_sampling), also known as nucleus sampling, which instead of getting the k samples with the highest probability, **orders** the whole resulting vocabulary by probability and **sums** the probabilities from highest to lowest until a **threshold is reached**.
>
> Then, **only those words** of the vocabulary will be considered according to their relative probabilities.
>
> This removes the need to pick a number of `k` samples, as the optimal k might be different in each case; instead, **only a threshold** is needed.
>
> _Note that this improvement isn't included in the previous code._

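A minimal sketch of how such a top-p filter could look (a hypothetical helper, not part of the original code), which could replace the top-k filtering step inside `generate_text` for a batch of size 1:
```python
def top_p_filter(logits, top_p=0.9):
    # logits: 1D tensor over the vocabulary for the next token
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)

    # Mask every token whose preceding cumulative probability already exceeds the
    # threshold (the most probable token is always kept)
    mask = cumulative - probs > top_p
    sorted_logits[mask] = float("-inf")

    # Scatter the filtered logits back to their original vocabulary positions
    filtered = torch.full_like(logits, float("-inf"))
    filtered.scatter_(dim=-1, index=sorted_idx, src=sorted_logits)
    return filtered

# Possible usage inside the sampling loop (batch size 1):
# logits = top_p_filter(logits[0], top_p=0.9).unsqueeze(0)
```
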
> [!TIP]
> Another way to improve the generated text is by using **Beam search** instead of the greedy search used in this example.\
> Unlike greedy search, which selects the most probable next word at each step and builds a single sequence, **beam search keeps track of the top k highest-scoring partial sequences** (called "beams") at each step. By exploring multiple possibilities simultaneously, it balances efficiency and quality, increasing the chances of **finding a better overall sequence** that might be missed by the greedy approach due to early, suboptimal choices.
>
> _Note that this improvement isn't included in the previous code._

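A compact sketch of what beam search could look like on top of the same `model` interface (a hypothetical helper; no length normalization or early stopping, for brevity):
```python
def generate_beam_search(model, idx, max_new_tokens, context_size, beam_width=3):
    # Each beam is a tuple: (sequence of shape (1, seq_len), cumulative log-probability)
    beams = [(idx, 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = model(seq[:, -context_size:])
            log_probs = torch.log_softmax(logits[:, -1, :], dim=-1)  # (1, vocab_size)
            top_log_probs, top_ids = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top_log_probs[0], top_ids[0]):
                new_seq = torch.cat((seq, tok.view(1, 1)), dim=1)
                candidates.append((new_seq, score + lp.item()))
        # Keep only the beam_width best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-scoring sequence
```
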
### Loss functions

The **`calc_loss_batch`** function computes the cross entropy of the predictions of a single batch.\
The **`calc_loss_loader`** function gets the cross entropy of all the batches and calculates the **average cross entropy loss**.

```python
# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches
```

> [!TIP]
> **Gradient clipping** is a technique used to enhance **training stability** in large neural networks by setting a **maximum threshold** for gradient magnitudes. When gradients exceed this predefined `max_norm`, they are scaled down proportionally to ensure that the updates to the model's parameters stay within a manageable range, preventing issues like exploding gradients and keeping training more controlled and stable.
>
> _Note that this improvement isn't included in the previous code._
>
> Check the following example:

<figure><img src="../../images/image (6) (1).png" alt=""><figcaption></figcaption></figure>
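A minimal sketch of how the inner training loop of `train_model_simple` could apply it (the `max_norm=1.0` value is just an illustrative choice):
```python
for input_batch, target_batch in train_loader:
    optimizer.zero_grad()
    loss = calc_loss_batch(input_batch, target_batch, model, device)
    loss.backward()
    # Rescale the gradients in place so their global L2 norm never exceeds max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```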

### Loading Data

The `create_dataloader_v1` function and the `GPTDatasetV1` class it relies on were already discussed in a previous section.

From here, note how it's defined that 90% of the text is going to be used for training while 10% is used for validation, and both sets are stored in 2 different data loaders.\
Note that sometimes part of the data set is also kept as a test set in order to evaluate the performance of the model better.

Both data loaders use the same batch size, maximum length, stride and number of workers (0 in this case).\
The main differences are the data used by each one, and that the validation loader doesn't drop the last batch nor shuffle the data, as those aren't needed for validation purposes.

Also, the fact that the **stride is as big as the context length** means that there won't be any overlapping between the contexts used to train the data (this reduces overfitting, but also shrinks the training data set).

Moreover, note that the batch size in this case is 2, dividing the data into 2 batches; the main goal of this is to allow parallel processing and to reduce the consumption per batch.

```python
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)
```

## Sanity Checks

The goal is to check that there are enough tokens for training, that the shapes are the expected ones, and to get some info about the number of tokens used for training and for validation:

```python
# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)
```

### Select device for training & pre-calculations

The following code selects the device to use and computes a training loss and a validation loss (without having trained anything yet) as a starting point.

```python
# Indicate the device to use
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device)  # no assignment model = model.to(device) necessary for nn.Module classes

# Pre-calculate losses without starting yet
torch.manual_seed(123)  # For reproducibility due to the shuffling in the data loader

with torch.no_grad():  # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```

### Training functions

The function `generate_and_print_sample` just takes a context and generates some tokens in order to get a feeling of how good the model is at that point. This is called by `train_model_simple` after each epoch.

The function `evaluate_model` is called as frequently as indicated to the training function and it's used to measure the train loss and the validation loss at that point of the model training.

Then the big function `train_model_simple` is the one that actually trains the model. It expects:

- The train data loader (with the data already separated and prepared for training)
- The validation loader
- The **optimizer** to use during training: this is the function that takes the gradients and updates the parameters to reduce the loss. In this case, as you will see, `AdamW` is used, but there are many more.
  - `optimizer.zero_grad()` is called to reset the gradients on each round so they don't accumulate.
  - The **`lr`** param is the **learning rate**, which determines the **size of the steps** taken during the optimization process when updating the model's parameters. A **smaller** learning rate means the optimizer **makes smaller updates** to the weights, which can lead to more **precise** convergence but might **slow down** training. A **larger** learning rate can speed up training but **risks overshooting** the minimum of the loss function (**jumping over** the point where the loss function is minimized).
  - **Weight decay** modifies the **loss calculation** step by adding an extra term that penalizes large weights. This encourages the optimizer to find solutions with smaller weights, balancing between fitting the data well and keeping the model simple, preventing overfitting by discouraging the model from assigning too much importance to any single feature.
  - Traditional optimizers like SGD with L2 regularization couple weight decay with the gradient of the loss function. However, **AdamW** (a variant of the Adam optimizer) decouples weight decay from the gradient update, leading to more effective regularization.
- The device to use for training
- The number of epochs: number of times to go over the training data
- The evaluation frequency: how often to call `evaluate_model`
- The evaluation iterations: the number of batches to use when `evaluate_model` measures the current state of the model
- The start context: the starting sentence to use when calling `generate_and_print_sample`
- The tokenizer

```python
# Functions to train the data
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # Calculate loss gradients
            optimizer.step()  # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  # Set in eval mode to avoid dropout
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()  # Back to training mode, applying all the configurations
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()  # Set in eval mode to avoid dropout
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()  # Back to training mode, applying all the configurations
```

> [!TIP]
> To improve the learning rate there are a couple of relevant techniques called **linear warmup** and **cosine decay**.
>
> **Linear warmup** consists of defining an initial learning rate and a maximum one, and consistently updating the rate after each epoch. This is because starting the training with smaller weight updates decreases the risk of the model encountering large, destabilizing updates during its early training phase.\
> **Cosine decay** is a technique that **gradually reduces the learning rate** following a half-cosine curve **after the warmup phase**, slowing weight updates to **minimize the risk of overshooting** the loss minima and ensuring training stability in later phases.
>
> _Note that these improvements aren't included in the previous code._

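A minimal per-step sketch of both ideas combined (illustrative values; `peak_lr`, `warmup_steps`, etc. are assumptions, not taken from the book code):
```python
import math

peak_lr = 0.001
initial_lr = 0.0001
min_lr = 0.00001
warmup_steps = 20
total_steps = num_epochs * len(train_loader)

def lr_at(step):
    if step < warmup_steps:
        # Linear warmup: grow from initial_lr up to peak_lr
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay: follow half a cosine from peak_lr down to min_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# Inside the training loop, before optimizer.step():
# for param_group in optimizer.param_groups:
#     param_group["lr"] = lr_at(global_step)
```
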
### Start training

```python
import time
start_time = time.time()

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
```

### Print training evolution

With the following function it's possible to print the evolution of the model while it was being trained.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

# Compute perplexity from the loss values
train_ppls = [math.exp(loss) for loss in train_losses]
val_ppls = [math.exp(loss) for loss in val_losses]

# Plot perplexity over tokens seen
plt.figure()
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
plt.xlabel('Tokens Seen')
plt.ylabel('Perplexity')
plt.title('Perplexity over Training')
plt.legend()
plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
```

### Save the model

It's possible to save the model + optimizer if you want to continue training later:

```python
# Save the model and the optimizer for later training
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "/tmp/model_and_optimizer.pth"
)
# Note that this model with the optimizer occupied close to 2GB

# Restore model and optimizer for training
checkpoint = torch.load("/tmp/model_and_optimizer.pth", map_location=device)

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()  # Put in training mode
```

Or just the model if you are only planning on using it:

```python
# Save the model
torch.save(model.state_dict(), "model.pth")

# Load it
model = GPTModel(GPT_CONFIG_124M)

model.load_state_dict(torch.load("model.pth", map_location=device))

model.eval()  # Put in eval mode
```

## Loading GPT2 weights

There are 2 quick scripts to load the GPT2 weights locally. For both of them you can clone the repository [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch) locally, then:

- The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/gpt_generate.py](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/gpt_generate.py) will download all the weights and transform the formats from OpenAI to the ones expected by our LLM. The script is also prepared with the needed configuration and with the prompt: "Every effort moves you"
- The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb) will allow you to load any of the GPT2 weights locally (just change the `CHOOSE_MODEL` var) and predict text from some prompts.

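As a quick alternative to those scripts (a minimal sketch, assuming the `transformers` package is installed), the official GPT2 weights can also be pulled and queried directly through Hugging Face before mapping them into a custom architecture:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

hf_model = GPT2LMHeadModel.from_pretrained("gpt2")  # also "gpt2-medium", "gpt2-large", "gpt2-xl"
hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

inputs = hf_tokenizer("Every effort moves you", return_tensors="pt")
outputs = hf_model.generate(**inputs, max_new_tokens=20)
print(hf_tokenizer.decode(outputs[0]))
```
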
## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

{{#include ../../banners/hacktricks-training.md}}