## **Data Sampling**

**Data sampling** is a crucial process in preparing data for training large language models (LLMs) such as GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.

> [!TIP]
> The goal of this second phase is very simple: **Sample the input data and prepare it for the training phase, usually by separating the dataset into sentences of a specific length and also generating the expected response.**
### **Why Data Sampling Matters**

LLMs such as GPT are trained to generate or predict text by understanding the context provided by preceding words. To achieve this, the training data must be structured so that the model can learn the relationship between sequences of words and the words that follow them. This structured approach allows the model to generalize and generate coherent, contextually relevant text.

### **Key Concepts in Data Sampling**

1. **Tokenization:** Breaking text down into smaller units called tokens (e.g., words, subwords, or characters).
2. **Sequence Length (max_length):** The number of tokens in each input sequence.
3. **Sliding Window:** A method of creating overlapping input sequences by moving a window over the tokenized text.
4. **Stride:** The number of tokens the sliding window moves forward to create the next sequence (a minimal sketch of these concepts follows below).
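
As a minimal sketch of how these four concepts fit together (plain whitespace tokens instead of a real BPE tokenizer, and an illustrative helper name rather than any library API), the following builds overlapping input/target pairs with a sliding window:

```python
def create_samples(tokens, max_length=4, stride=1):
    """Build (input, target) pairs with a sliding window.

    The target is the input shifted one token to the right, so the model
    learns to predict the next token at every position of the window.
    """
    samples = []
    for start in range(0, len(tokens) - max_length, stride):
        input_seq = tokens[start : start + max_length]
        target_seq = tokens[start + 1 : start + max_length + 1]
        samples.append((input_seq, target_seq))
    return samples

# Whitespace "tokenization" just for illustration; a real pipeline uses BPE
tokens = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.".split()
for x, y in create_samples(tokens, max_length=4, stride=1):
    print(x, "->", y)
```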
### **Step-by-Step Example**

Let's walk through an example to illustrate data sampling.

**Example Text**
```arduino
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
```
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
**Understanding Stride**

- **Stride of 1:** The window moves forward by one token each time, resulting in highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting, since similar data is repeated.
- **Stride of 2:** The window moves forward by two tokens each time, reducing overlap. This decreases redundancy and computational load but may miss some contextual nuances.
- **Stride Equal to max_length:** The window moves forward by the entire window size, resulting in non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.

**Example with Stride of 2:**
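
A minimal, self-contained sketch (again using whitespace tokens rather than GPT-2 BPE ids) of how a stride of 2 reduces the overlap between consecutive windows:

```python
tokens = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.".split()

max_length, stride = 4, 2
for start in range(0, len(tokens) - max_length, stride):
    inputs = tokens[start : start + max_length]          # window fed to the model
    targets = tokens[start + 1 : start + max_length + 1] # same window shifted by one
    print(f"x: {inputs}\ny: {targets}\n")
```

With `max_length=4` and `stride=2`, consecutive windows share only two tokens instead of three, which is exactly the redundancy/context trade-off described above.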
## Advanced Sampling Strategies (2023-2025)

### 1. Temperature-Based Mixture Weighting
Modern LLMs are rarely trained on a single corpus. Instead, they sample from several heterogeneous sources (code, web pages, academic papers, forums…). The relative share of each source can strongly affect downstream performance. Recent open-source models such as Llama 2 introduced a **temperature-based sampling scheme** in which the probability of drawing a document from corpus *i* becomes
```
p(i) = \frac{w_i^{\alpha}}{\sum_j w_j^{\alpha}}
```
- *w<sub>i</sub>*: the raw token share of corpus *i*
- *α* ("temperature"): a value in (0,1]. α < 1 flattens the distribution, giving more weight to smaller, high-quality corpora.

Llama 2 used α = 0.7 and showed that lowering α increased evaluation scores on knowledge-heavy tasks while keeping the training mix stable. The same approach is used by Mistral (2023) and Claude 3.
```python
from collections import Counter

def temperature_sample(corpus_ids, alpha=0.7):
    """Return temperature-flattened sampling probabilities per corpus."""
    counts = Counter(corpus_ids)                  # number of tokens seen per corpus
    weights = {c: n ** alpha for c, n in counts.items()}
    Z = sum(weights.values())
    probs = {c: w / Z for c, w in weights.items()}
    # Now draw according to probs to fill every batch
    return probs
```
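
One possible way to turn those probabilities into actual draws when filling batches (the corpus labels are hypothetical; the draw uses the standard-library `random.choices`):

```python
import random

# Hypothetical per-token corpus labels observed while streaming the data mix
corpus_ids = ["web"] * 900_000 + ["code"] * 90_000 + ["papers"] * 10_000

probs = temperature_sample(corpus_ids, alpha=0.7)
print(probs)  # smaller, higher-quality corpora get boosted relative to their raw share

# Pick the corpus to draw the next document from
next_corpus = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```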
### 2. Sequence Packing / Dynamic Batching
GPU memory is wasted when every sequence in a batch is padded to the longest example. "Packing" concatenates multiple shorter sequences until the **exact** `max_length` is reached and builds a parallel `attention_mask` so that tokens do not attend across segment boundaries. Packing can improve throughput by 20-40 % with no gradient change and is supported out-of-the-box in
* PyTorch `torchtext.experimental.agents.PackedBatch`
* HuggingFace `DataCollatorForLanguageModeling(pad_to_multiple_of=…)`
Dynamic batching frameworks (e.g. FlashAttention 2, vLLM 2024) combine sequence packing with just-in-time kernel selection, enabling thousand-token context training at 400+ K tokens/s on A100-80G.
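
A simplified sketch of the packing idea (not the torchtext/HuggingFace implementations mentioned above): tokenized sequences are concatenated greedily into fixed-length blocks, and a parallel list of segment ids is kept so that a block-diagonal attention mask can later forbid attention across segment boundaries.

```python
def pack_sequences(sequences, max_length, pad_id=0):
    """Greedily pack tokenized sequences into fixed-length blocks.

    Assumes every individual sequence is <= max_length. Returns (blocks,
    segment_ids); segment ids identify which original sequence each token
    came from, and -1 marks padding positions.
    """
    blocks, segment_ids = [], []
    cur_toks, cur_segs, seg = [], [], 0
    for seq in sequences:
        if len(cur_toks) + len(seq) > max_length:   # current block is full -> flush it
            pad = max_length - len(cur_toks)
            blocks.append(cur_toks + [pad_id] * pad)
            segment_ids.append(cur_segs + [-1] * pad)
            cur_toks, cur_segs = [], []
        cur_toks += seq
        cur_segs += [seg] * len(seq)
        seg += 1
    if cur_toks:                                    # flush the last partial block
        pad = max_length - len(cur_toks)
        blocks.append(cur_toks + [pad_id] * pad)
        segment_ids.append(cur_segs + [-1] * pad)
    return blocks, segment_ids

blocks, segs = pack_sequences([[11, 12, 13], [21, 22], [31, 32, 33, 34]], max_length=8)
print(blocks)  # [[11, 12, 13, 21, 22, 0, 0, 0], [31, 32, 33, 34, 0, 0, 0, 0]]
print(segs)    # [[0, 0, 0, 1, 1, -1, -1, -1], [2, 2, 2, 2, -1, -1, -1, -1]]
```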
### 3. Deduplication & Quality Filtering
Repeated passages cause memorization and provide an easy channel for data-poisoning. Modern pipelines therefore:
1. Run MinHash/FAISS near-duplicate detection at **document** and **128-gram** level.
2. Filter documents whose perplexity under a small reference model is > µ + 3σ (noisy OCR, garbled HTML).
3. Block-list documents that contain PII or CWE keywords using regex & spaCy NER.
The Llama 2 team deduplicated with 8-gram MinHash and removed ~15 % of CommonCrawl before sampling. OpenAI's 2024 "Deduplicate Everything" paper demonstrates that a duplicate ratio of ≤ 0.04 reduces over-fitting and speeds up convergence.
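
As a toy illustration of step 1 above, the sketch below computes exact Jaccard similarity over word 8-gram shingles with a naive pairwise loop; production pipelines replace this with MinHash signatures plus LSH/FAISS so it scales to billions of documents.

```python
def shingles(text, n=8):
    """Return the set of word n-grams ("shingles") of a document."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def near_duplicate_pairs(docs, n=8, threshold=0.7):
    """Naive O(N^2) comparison; MinHash + LSH replaces this loop at scale."""
    pairs = []
    for i in range(len(docs)):
        sig_i = shingles(docs[i], n)
        for j in range(i + 1, len(docs)):
            if jaccard(sig_i, shingles(docs[j], n)) >= threshold:
                pairs.append((i, j))
    return pairs
```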
## Security & Privacy Considerations During Sampling
### Data-Poisoning / Backdoor Attacks
Researchers showed that inserting <1 % backdoored sentences can make a model obey a hidden trigger ("PoisonGPT", 2023). Recommended mitigations:
* **Shuffled mixing**: make sure adjacent training examples originate from different sources; this dilutes the gradient alignment of malicious spans.
* **Gradient similarity scoring**: compute the cosine similarity of each example's gradient to the batch average; outliers are candidates for removal.
* **Dataset versioning & hashes**: freeze immutable tarballs and verify their SHA-256 before each training run (see the sketch below).
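
A minimal sketch of the hash-verification mitigation: compare the SHA-256 of each frozen dataset tarball against a manifest recorded at release time (file names and digests here are hypothetical placeholders).

```python
import hashlib
from pathlib import Path

# Hypothetical manifest, frozen at dataset-release time and stored separately
MANIFEST = {
    "commoncrawl-2024-01.tar.zst": "3b4f9c...",   # full 64-hex-char digests in practice
    "github-code-2024-01.tar.zst": "a91d02...",
}

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large tarballs do not blow up memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_datasets(data_dir):
    for name, expected in MANIFEST.items():
        digest = sha256_of(Path(data_dir) / name)
        if digest != expected:
            raise RuntimeError(f"{name}: hash mismatch, refusing to train")
```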
### Membership-Inference & Memorization
Long overlap between sliding-window samples increases the chance that rare strings (telephone numbers, secret keys) are memorized. OpenAI's 2024 study on ChatGPT memorization reports that raising the stride from 1 × `max_length` to 4 × reduces verbatim leakage by ≈50 % with negligible loss in perplexity.
Practical recommendations:
* Use **stride ≥ max_length** except for <1B parameter models where data volume is scarce.
* Add random masking of 1-3 tokens per window during training; this lowers memorization while preserving utility (a minimal sketch follows below).
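
A possible implementation of the masking recommendation, replacing 1-3 random positions per training window with a placeholder id (the `mask_id` value is an assumption; GPT-style tokenizers differ in which token is suitable):

```python
import random

def randomly_mask(window, mask_id, n_min=1, n_max=3):
    """Replace 1-3 random token positions in a window with a placeholder id."""
    window = list(window)                       # do not modify the caller's copy
    k = random.randint(n_min, min(n_max, len(window)))
    for pos in random.sample(range(len(window)), k):
        window[pos] = mask_id
    return window

# Example: a token-id window of length 8, hypothetical mask id 50256
print(randomly_mask([40, 367, 2885, 1464, 1807, 3619, 402, 271], mask_id=50256))
```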
---
## References
- [Build a Large Language Model from Scratch (Manning, 2024)](https://www.manning.com/books/build-a-large-language-model-from-scratch)
- [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)
- [PoisonGPT: Assessing Backdoor Vulnerabilities in Large Language Models (BlackHat EU 2023)](https://arxiv.org/abs/2308.12364)
{{#include ../../banners/hacktricks-training.md}}