# 2. Data Sampling

{{#include ../../banners/hacktricks-training.md}}

## **Data Sampling**

**Data Sampling** is a key step in preparing data for training large language models (LLMs) such as GPT. It involves organizing text data into input and target sequences that the model uses to learn to predict the next word (or token) from the preceding ones. Sampling the data correctly helps the model capture language patterns and dependencies effectively.

> [!TIP]
> The goal of this second phase is very simple: **Sample the input data and prepare it for the training phase, usually by splitting the dataset into sequences of a specific length and also generating the expected response.**

### **Why Data Sampling Matters**

LLMs such as GPT are trained to generate or predict text by understanding the context provided by the previous words. To achieve this, the training data must be structured so that the model can learn the relationship between sequences of words and the words that follow them. This structure is what allows the model to generalize and to generate coherent, contextually relevant text.

### **Key Concepts in Data Sampling**

1. **Tokenization:** Breaking text down into smaller units called tokens (e.g., words, subwords, or characters).
2. **Sequence Length (max_length):** The number of tokens in each input sequence.
3. **Sliding Window:** A method for creating overlapping input sequences by moving a window over the tokenized text.
4. **Stride:** The number of tokens the sliding window moves forward to create the next sequence.

### **Step-by-Step Example**

Let's walk through an example to illustrate data sampling.

**Example Text**

```text
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
```

**Tokenization**

Assume we use a **basic tokenizer** that splits the text on whitespace, so punctuation stays attached to the preceding word:

```text
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
```
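
A minimal sketch of such a basic tokenizer, assuming it does nothing more than split on whitespace (which is why `amet,` and `elit.` remain single tokens). The function name `basic_tokenize` is hypothetical and only for illustration; real LLM pipelines use subword tokenizers such as BPE instead.

```python
def basic_tokenize(text: str) -> list[str]:
    """Split on whitespace; punctuation stays attached to the preceding word."""
    return text.split()


tokens = basic_tokenize("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
print(tokens)
```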

**Parameters**

- **Max Sequence Length (max_length):** 4 tokens
- **Sliding Window Stride:** 1 token

**Creating Input and Target Sequences**

1. **Sliding Window Approach:**
   - **Input Sequences:** Each input sequence consists of `max_length` tokens.
   - **Target Sequences:** Each target sequence is the corresponding input sequence shifted one token forward, so the target at every position is the token that immediately follows it.
2. **Generating Sequences:**

<table><thead><tr><th width="177">Window Position</th><th>Input Sequence</th><th>Target Sequence</th></tr></thead><tbody><tr><td>1</td><td>["Lorem", "ipsum", "dolor", "sit"]</td><td>["ipsum", "dolor", "sit", "amet,"]</td></tr><tr><td>2</td><td>["ipsum", "dolor", "sit", "amet,"]</td><td>["dolor", "sit", "amet,", "consectetur"]</td></tr><tr><td>3</td><td>["dolor", "sit", "amet,", "consectetur"]</td><td>["sit", "amet,", "consectetur", "adipiscing"]</td></tr><tr><td>4</td><td>["sit", "amet,", "consectetur", "adipiscing"]</td><td>["amet,", "consectetur", "adipiscing", "elit."]</td></tr></tbody></table>

3. **Resulting Input and Target Arrays:**

- **Input:**

```python
[
    ["Lorem", "ipsum", "dolor", "sit"],
    ["ipsum", "dolor", "sit", "amet,"],
    ["dolor", "sit", "amet,", "consectetur"],
    ["sit", "amet,", "consectetur", "adipiscing"],
]
```

- **Target:**

```python
[
    ["ipsum", "dolor", "sit", "amet,"],
    ["dolor", "sit", "amet,", "consectetur"],
    ["sit", "amet,", "consectetur", "adipiscing"],
    ["amet,", "consectetur", "adipiscing", "elit."],
]
```
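
The input and target arrays above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: `sliding_windows` is a hypothetical helper name, and it works on string tokens rather than the integer token IDs a real pipeline would use.

```python
def sliding_windows(tokens, max_length, stride):
    """Return (inputs, targets); each target is its input shifted forward by one token."""
    inputs, targets = [], []
    # Stop early enough that every target still contains max_length tokens
    for i in range(0, len(tokens) - max_length, stride):
        inputs.append(tokens[i:i + max_length])
        targets.append(tokens[i + 1:i + max_length + 1])
    return inputs, targets


tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
inputs, targets = sliding_windows(tokens, max_length=4, stride=1)

for window, (x, y) in enumerate(zip(inputs, targets), start=1):
    print(window, x, "->", y)
```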

**Visual Representation**

<table><thead><tr><th width="222">Token Position</th><th>Token</th></tr></thead><tbody><tr><td>1</td><td>Lorem</td></tr><tr><td>2</td><td>ipsum</td></tr><tr><td>3</td><td>dolor</td></tr><tr><td>4</td><td>sit</td></tr><tr><td>5</td><td>amet,</td></tr><tr><td>6</td><td>consectetur</td></tr><tr><td>7</td><td>adipiscing</td></tr><tr><td>8</td><td>elit.</td></tr></tbody></table>

**Sliding Window with Stride 1:**

- **First Window (Positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
- **Second Window (Positions 2-5):** \["ipsum", "dolor", "sit", "amet,"] → **Target:** \["dolor", "sit", "amet,", "consectetur"]
- **Third Window (Positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
- **Fourth Window (Positions 4-7):** \["sit", "amet,", "consectetur", "adipiscing"] → **Target:** \["amet,", "consectetur", "adipiscing", "elit."]

**Understanding Stride**

- **Stride of 1:** The window moves forward by one token each time, producing highly overlapping sequences. This can improve learning of contextual relationships, but it may increase the risk of overfitting because very similar data points are repeated.
- **Stride of 2:** The window moves forward by two tokens each time, which reduces the overlap. This lowers redundancy and computational load, but may miss some contextual nuances.
- **Stride Equal to max_length:** The window moves forward by the full window size, producing non-overlapping sequences. This minimizes data redundancy, but may limit the model's ability to learn dependencies that span sequence boundaries.
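
To make this trade-off concrete, the sketch below counts how many windows each stride yields for the eight example tokens and how many tokens consecutive windows share. It assumes the same loop bounds as `GPTDatasetV1` in the code example on this page, i.e. `range(0, n_tokens - max_length, stride)`; `window_stats` is a hypothetical helper name.

```python
def window_stats(n_tokens, max_length, stride):
    """Return (number of windows, tokens shared by two consecutive windows)."""
    n_windows = len(range(0, n_tokens - max_length, stride))
    overlap = max(max_length - stride, 0)
    return n_windows, overlap


for stride in (1, 2, 4):
    n_windows, overlap = window_stats(n_tokens=8, max_length=4, stride=stride)
    print(f"stride={stride}: {n_windows} windows, {overlap} shared tokens")
```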

**Example with Stride of 2:**

Using the same tokenized text and a `max_length` of 4:

- **First Window (Positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
- **Second Window (Positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
- **Third Window (Positions 5-8):** \["amet,", "consectetur", "adipiscing", "elit."] → **Target:** \["consectetur", "adipiscing", "elit.", "sed"] _(assuming the text continues with a ninth token, "sed"; with only the eight tokens above, this window has no complete target)_
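
As a sketch of why the third window depends on that assumed continuation: a generator that requires a full-length target yields only two pairs from the eight example tokens at stride 2, and the third pair appears once a ninth token exists. The helper `windows` is hypothetical, and "sed" is the assumed next token, not part of the example text.

```python
def windows(tokens, max_length, stride):
    """Yield (input, target) pairs where the target is the input shifted by one token."""
    for i in range(0, len(tokens) - max_length, stride):
        yield tokens[i:i + max_length], tokens[i + 1:i + max_length + 1]


tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]

pairs_8 = list(windows(tokens, max_length=4, stride=2))
pairs_9 = list(windows(tokens + ["sed"], max_length=4, stride=2))  # "sed" is assumed

print(len(pairs_8))  # windows starting at positions 1 and 3
print(len(pairs_9))  # also includes the window starting at position 5
print(pairs_9[-1])
```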

## Code Example

The following code example, taken from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb), shows how this works in practice:

```python
# Download the text used to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

"""
Create a class that receives some params, like the tokenizer and the text,
and prepares the input chunks and the target chunks so that
the LLM can learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            # The target is the input shifted forward by one token
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


"""
Create a data loader which, given the text and some params, will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with that data
"""

import tiktoken

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


"""
Finally, create the data loader with the params we want:
- The text used for training
- batch_size: The number of entries in each batch
- max_length: The number of tokens in each entry of a batch
- stride: How many tokens the sliding window advances from one entry to the next. A smaller stride means more overlap and a higher risk of overfitting; it is usually set equal to max_length so that the same tokens aren't repeated.
- shuffle: Randomly re-order the entries
"""
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Output. Note the batch_size of 8, the max_length of 4 and the stride of 1
# [
# # Input
# tensor([[   40,   367,  2885,  1464],
#         [  367,  2885,  1464,  1807],
#         [ 2885,  1464,  1807,  3619],
#         [ 1464,  1807,  3619,   402],
#         [ 1807,  3619,   402,   271],
#         [ 3619,   402,   271, 10899],
#         [  402,   271, 10899,  2138],
#         [  271, 10899,  2138,   257]]),
# # Target
# tensor([[  367,  2885,  1464,  1807],
#         [ 2885,  1464,  1807,  3619],
#         [ 1464,  1807,  3619,   402],
#         [ 1807,  3619,   402,   271],
#         [ 3619,   402,   271, 10899],
#         [  402,   271, 10899,  2138],
#         [  271, 10899,  2138,   257],
#         [10899,  2138,   257,  7026]])
# ]

# With stride=4 this will be the result:
# [
# # Input
# tensor([[   40,   367,  2885,  1464],
#         [ 1807,  3619,   402,   271],
#         [10899,  2138,   257,  7026],
#         [15632,   438,  2016,   257],
#         [  922,  5891,  1576,   438],
#         [  568,   340,   373,   645],
#         [ 1049,  5975,   284,   502],
#         [  284,  3285,   326,    11]]),
# # Target
# tensor([[  367,  2885,  1464,  1807],
#         [ 3619,   402,   271, 10899],
#         [ 2138,   257,  7026, 15632],
#         [  438,  2016,   257,   922],
#         [ 5891,  1576,   438,   568],
#         [  340,   373,   645,  1049],
#         [ 5975,   284,   502,   284],
#         [ 3285,   326,    11,   287]])
# ]
```
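
A quick way to sanity-check batches like the ones printed in the example is to confirm that every target row is its input row shifted by one token. The sketch below hard-codes the first three rows of the `stride=1` output as plain Python lists, so it needs neither `torch` nor `tiktoken`; `is_shifted_by_one` is a hypothetical helper name.

```python
def is_shifted_by_one(input_row, target_row):
    """True if the target repeats the input from its second token onward."""
    return input_row[1:] == target_row[:-1]


# First three rows of the stride=1 batch shown in the example output
inputs = [[40, 367, 2885, 1464], [367, 2885, 1464, 1807], [2885, 1464, 1807, 3619]]
targets = [[367, 2885, 1464, 1807], [2885, 1464, 1807, 3619], [1464, 1807, 3619, 402]]

print(all(is_shifted_by_one(x, y) for x, y in zip(inputs, targets)))
```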

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

{{#include ../../banners/hacktricks-training.md}}