# 2. Data Sampling

{{#include ../../banners/hacktricks-training.md}}

## **Data Sampling**

**Data sampling** is a key step in preparing data for training large language models (LLMs) such as GPT. It organizes text into input and target sequences that the model uses to learn to predict the next word (or token) from the preceding ones. Proper data sampling helps the model capture language patterns and dependencies effectively.

> [!TIP]
> The goal of this second phase is very simple: **Sample the input data and prepare it for the training phase, usually by splitting the dataset into sequences of a fixed length and generating the expected response for each one.**

### **Why Data Sampling Matters**

LLMs such as GPT are trained to generate or predict text by understanding the context provided by the preceding words. To make this possible, the training data must be structured so that the model can learn the relationship between sequences of words and the words that follow them. This structured approach lets the model generalize and generate coherent, contextually relevant text.

### **Key Concepts in Data Sampling**

1. **Tokenization:** Breaking text into smaller units called tokens (e.g., words, subwords, or characters).
2. **Sequence Length (max_length):** The number of tokens in each input sequence.
3. **Sliding Window:** A method for creating overlapping input sequences by moving a window over the tokenized text.
4. **Stride:** The number of tokens the sliding window moves forward to create the next sequence.

### **Step-by-Step Example**

Let's walk through an example to illustrate data sampling.

**Example Text**

```arduino
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
```

**Tokenization**

Assume we use a **basic tokenizer** that splits the text on whitespace, so punctuation stays attached to the preceding word:

```vbnet
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
```

**Parameters**

- **Maximum sequence length (max_length):** 4 tokens
- **Sliding window stride:** 1 token

**Creating Input and Target Sequences**

1. **Sliding window approach:**
   - **Input sequences:** Each input sequence consists of `max_length` tokens.
   - **Target sequences:** Each target sequence is the corresponding input sequence shifted one token to the right, so it also contains `max_length` tokens.
2. **Generating sequences:**

| Window position | Input sequence | Target sequence |
| --- | --- | --- |
| 1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"] |
| 2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"] |
| 3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"] |
| 4 | ["sit", "amet,", "consectetur", "adipiscing"] | ["amet,", "consectetur", "adipiscing", "elit."] |

3. **Resulting input and target arrays:**

- **Input:**

```python
[
["Lorem", "ipsum", "dolor", "sit"],
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
]
```

- **Target:**

```python
[
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
["amet,", "consectetur", "adipiscing", "elit."],
]
```

**Visual Representation**

| Token position | Token |
| --- | --- |
| 1 | Lorem |
| 2 | ipsum |
| 3 | dolor |
| 4 | sit |
| 5 | amet, |
| 6 | consectetur |
| 7 | adipiscing |
| 8 | elit. |
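The sliding-window construction from this example can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the referenced notebook: the helper name `sliding_windows` is hypothetical, and it works on word strings rather than token IDs.

```python
def sliding_windows(tokens, max_length, stride):
    """Return (inputs, targets); each target is its input shifted by one token."""
    inputs, targets = [], []
    # Stop early enough that every target window is complete.
    for start in range(0, len(tokens) - max_length, stride):
        inputs.append(tokens[start:start + max_length])
        targets.append(tokens[start + 1:start + max_length + 1])
    return inputs, targets


tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit."]
inputs, targets = sliding_windows(tokens, max_length=4, stride=1)
for x, y in zip(inputs, targets):
    print(x, "->", y)
```

With 8 tokens, `max_length=4` and `stride=1`, this yields the four input/target pairs listed in the table of window positions.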

**Sliding window with stride 1:**

- **First window (positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
- **Second window (positions 2-5):** \["ipsum", "dolor", "sit", "amet,"] → **Target:** \["dolor", "sit", "amet,", "consectetur"]
- **Third window (positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
- **Fourth window (positions 4-7):** \["sit", "amet,", "consectetur", "adipiscing"] → **Target:** \["amet,", "consectetur", "adipiscing", "elit."]

**Understanding Stride**

- **Stride of 1:** The window moves forward by one token each time, so consecutive sequences overlap heavily. This can improve learning of contextual relationships, but it may increase the risk of overfitting because very similar data points are repeated.
- **Stride of 2:** The window moves forward by two tokens each time, reducing the overlap. This lowers redundancy and computational load, but may miss some contextual nuances.
- **Stride equal to max_length:** The window moves forward by the full window size, producing non-overlapping sequences. This minimizes data redundancy, but may limit the model's ability to learn dependencies that cross sequence boundaries.
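To make the trade-off concrete, the sketch below counts how many complete input/target windows a given stride produces and how many tokens two neighbouring inputs share. The helper `window_stats` is a hypothetical name used only for illustration; it assumes the target is the input shifted by one token, so the last window start must leave one extra token available.

```python
def window_stats(num_tokens, max_length, stride):
    """Count complete (input, target) windows and the overlap between neighbours."""
    # The target is shifted by one, so a window starting at `start`
    # needs tokens up to index start + max_length.
    num_windows = len(range(0, num_tokens - max_length, stride))
    shared = max(0, max_length - stride)  # tokens shared by two consecutive inputs
    return num_windows, shared


for stride in (1, 2, 4):
    n, shared = window_stats(num_tokens=8, max_length=4, stride=stride)
    print(f"stride={stride}: {n} windows, {shared} shared tokens between neighbours")
```

Smaller strides give more training examples from the same text at the cost of more repeated tokens; a stride equal to `max_length` removes the overlap entirely.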
**Example with a stride of 2:**

Using the same tokenized text and a `max_length` of 4:

- **First window (positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
- **Second window (positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
- **Third window (positions 5-8):** \["amet,", "consectetur", "adipiscing", "elit."] → **Target:** \["consectetur", "adipiscing", "elit.", "sed"] _(assuming the text continues with "sed"; with only 8 tokens this window has no complete target)_

## Code Example

Let's understand this better with a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb):

```python
# Download the text to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

"""
Create a class that will receive some params like tokenizer and text
and will prepare the input chunks and the target chunks to prepare
the LLM to learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


"""
Create a data loader which given the text and some params will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with the info
"""
import tiktoken

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


"""
Finally, create the data loader with the params we want:
- The text used for training
- batch_size: The size of each batch
- max_length: The size of each entry in each batch
- stride: The sliding window step (how many tokens the next entry advances
  compared to the previous one). The smaller it is, the higher the risk of
  overfitting; usually it equals max_length so the same tokens aren't repeated.
- shuffle: Re-order randomly
"""
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Note the batch_size of 8, the max_length of 4 and the stride of 1
# [
# # Input
# tensor([[   40,   367,  2885,  1464],
#         [  367,  2885,  1464,  1807],
#         [ 2885,  1464,  1807,  3619],
#         [ 1464,  1807,  3619,   402],
#         [ 1807,  3619,   402,   271],
#         [ 3619,   402,   271, 10899],
#         [  402,   271, 10899,  2138],
#         [  271, 10899,  2138,   257]]),
# # Target
# tensor([[  367,  2885,  1464,  1807],
#         [ 2885,  1464,  1807,  3619],
#         [ 1464,  1807,  3619,   402],
#         [ 1807,  3619,   402,   271],
#         [ 3619,   402,   271, 10899],
#         [  402,   271, 10899,  2138],
#         [  271, 10899,  2138,   257],
#         [10899,  2138,   257,  7026]])
# ]

# With stride=4 this will be the result:
# [
# # Input
# tensor([[   40,   367,  2885,  1464],
#         [ 1807,  3619,   402,   271],
#         [10899,  2138,   257,  7026],
#         [15632,   438,  2016,   257],
#         [  922,  5891,  1576,   438],
#         [  568,   340,   373,   645],
#         [ 1049,  5975,   284,   502],
#         [  284,  3285,   326,    11]]),
# # Target
# tensor([[  367,  2885,  1464,  1807],
#         [ 3619,   402,   271, 10899],
#         [ 2138,   257,  7026, 15632],
#         [  438,  2016,   257,   922],
#         [ 5891,  1576,   438,   568],
#         [  340,   373,   645,  1049],
#         [ 5975,   284,   502,   284],
#         [ 3285,   326,    11,   287]])
# ]
```

## Advanced Sampling Strategies (2023-2025)

### 1. Temperature-Based Mixture Weighting

State-of-the-art LLMs are rarely trained on a single corpus. Instead, they sample from several heterogeneous data sources (code, web, academic papers, forums…). The relative share of each source can strongly affect downstream performance. Recent open-weight models such as Llama 2 reportedly use a **temperature-based sampling scheme**, in which the probability of drawing a document from corpus *i* becomes

```
p(i) = \frac{w_i^{\alpha}}{\sum_j w_j^{\alpha}}
```

- *w_i*: the raw token share of corpus *i*
- *α* ("temperature"): a value in (0,1]. α < 1 flattens the distribution, giving more weight to smaller, high-quality corpora.
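As a quick numerical illustration of the formula, the sketch below applies it to three made-up corpus shares and several values of α. The corpus names and shares are hypothetical, chosen only to show the flattening effect, and `temperature_mix` is an illustrative helper name.

```python
def temperature_mix(weights, alpha):
    """Apply p(i) = w_i**alpha / sum_j w_j**alpha to raw corpus shares."""
    scaled = {name: w ** alpha for name, w in weights.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


# Hypothetical raw token shares (they sum to 1).
raw = {"web": 0.80, "code": 0.15, "papers": 0.05}

for alpha in (1.0, 0.7, 0.3):
    probs = temperature_mix(raw, alpha)
    print(alpha, {name: round(p, 3) for name, p in probs.items()})
```

With α = 1 the raw shares are returned unchanged; as α decreases, the largest corpus loses probability mass and the smallest ones gain it.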
Llama 2 reportedly used α = 0.7 and showed that lowering α raises evaluation scores on knowledge-heavy tasks while keeping the training mix stable. The same tactic is reportedly used in Mistral (2023) and Claude 3.

```python
from collections import Counter

def temperature_sample(corpus_ids, alpha=0.7):
    # corpus_ids holds one corpus label per token, so counts[c] is the token count of corpus c
    counts = Counter(corpus_ids)
    probs = {c: c_count**alpha for c, c_count in counts.items()}
    Z = sum(probs.values())  # normalization constant
    probs = {c: p/Z for c, p in probs.items()}
    # Draw corpora according to probs to fill every batch
    return probs
```

### 2. Sequence Packing / Dynamic Batching

GPU memory is wasted when every sequence in a batch is padded to the length of the longest example. "Packing" concatenates multiple shorter sequences until the **exact** `max_length` is reached and builds a parallel `attention_mask` so that tokens do not attend across segment boundaries. Packing can improve throughput by 20–40 % without changing the gradients and is reported to be supported out-of-the-box in

* PyTorch `torchtext.experimental.agents.PackedBatch`
* HuggingFace `DataCollatorForLanguageModeling(pad_to_multiple_of=…)`

Dynamic batching frameworks (e.g. FlashAttention 2, vLLM 2024) combine sequence packing with just-in-time kernel selection, reportedly enabling thousand-token context training at 400+ K tokens/s on A100-80G.

### 3. Deduplication & Quality Filtering

Repeated passages cause memorization and provide an easy channel for data poisoning. Modern pipelines therefore:

1. Run MinHash/FAISS near-duplicate detection at **document** and **128-gram** level.
2. Filter out documents whose perplexity under a small reference model is > µ + 3σ (noisy OCR, garbled HTML).
3. Block-list documents that contain PII or CWE keywords using regex & spaCy NER.

The Llama 2 team reportedly deduplicated with 8-gram MinHash and removed ~15 % of CommonCrawl before sampling. OpenAI’s 2024 "Deduplicate Everything" paper is cited as showing that a duplicate ratio ≤ 0.04 reduces overfitting and speeds up convergence.
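The near-duplicate detection step can be sketched with a small pure-Python MinHash. This is an illustrative toy under stated assumptions, not the pipeline of any particular model: it uses word 3-grams instead of the much longer n-grams mentioned above, 64 seeded `hashlib.blake2b` hashes, and hypothetical helper names (`shingles`, `minhash_signature`, `estimated_jaccard`). A production pipeline would typically add locality-sensitive hashing to avoid comparing every pair of documents.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams for a document (n=3 keeps the demo small)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "sliding windows turn a token stream into input and target pairs"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))  # near-duplicates score high
print("a vs c:", estimated_jaccard(sig_a, sig_c))  # unrelated documents score low
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates, keeping only one copy before sampling; the threshold itself is a tuning decision.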
## Security & Privacy Considerations During Sampling

### Data-Poisoning / Backdoor Attacks

Researchers showed that inserting <1 % backdoored sentences can make a model obey a hidden trigger ("PoisonGPT", 2023). Recommended mitigations:

* **Shuffled mixing**: make sure adjacent training examples originate from different sources; this dilutes the gradient alignment of malicious spans.
* **Gradient similarity scoring**: compute the cosine similarity of each example's gradient to the batch average; outliers are candidates for removal.
* **Dataset versioning & hashes**: freeze immutable tarballs and verify their SHA-256 before each training run.

### Membership-Inference & Memorization

Long overlaps between sliding-window samples increase the chance that rare strings (telephone numbers, secret keys) are memorized. OpenAI’s 2024 study on ChatGPT memorization reportedly found that raising the stride from 1 × `max_length` to 4 × `max_length` reduces verbatim leakage by ≈50 % with negligible loss in perplexity.

Practical recommendations:

* Use **stride ≥ max_length**, except for <1B parameter models where data volume is scarce.
* Add random masking of 1-3 tokens per window during training; this lowers memorization while preserving utility.

---

## References

- [Build a Large Language Model from Scratch (Manning, 2024)](https://www.manning.com/books/build-a-large-language-model-from-scratch)
- [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)
- [PoisonGPT: Assessing Backdoor Vulnerabilities in Large Language Models (BlackHat EU 2023)](https://arxiv.org/abs/2308.12364)

{{#include ../../banners/hacktricks-training.md}}