Translated ['src/AI/AI-llm-architecture/2.-data-sampling.md'] to uk

2025-10-10 18:36:50 +00:00 · 2025-08-04 12:29:56 +00:00 · 2025-08-04 12:29:56 +00:00 · 8b3bd9c0a7
commit 8b3bd9c0a7
parent dd50bb6c02
1 changed files with 64 additions and 3 deletions
--- a/src/AI/AI-llm-architecture/2.-data-sampling.md
+++ b/src/AI/AI-llm-architecture/2.-data-sampling.md
@@ -87,7 +87,7 @@ Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing",
 - **Stride of 1:** The window moves forward by one token each time, producing heavily overlapping sequences. This can help the model learn contextual relationships but may increase the risk of overfitting, since similar data is repeated.
 - **Stride of 2:** The window moves forward by two tokens each time, reducing overlap. This lowers redundancy and computational load but may miss some contextual nuances.
- **Stride equal to max_length:** The window moves forward by the full extent of the window, producing non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.
+- **Stride equal to max_length:** The window moves forward by the full window size, producing non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.
 **Example with a stride of 2:**
@@ -230,9 +230,70 @@ tensor([[  367,  2885,  1464,  1807],
 [ 3285,   326,    11,   287]])
 ]
 ```
-## References
+## Advanced Sampling Strategies (2023-2025)
- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)
+### 1. Temperature-Based Mixture Weighting
 Modern LLMs are rarely trained on a single corpus. Instead, they sample from several heterogeneous data sources (code, web, academic papers, forums…). The relative proportion of each source can strongly affect downstream performance. Recent open-source models such as Llama 2 are reported to use a **temperature-based sampling scheme**, where the probability of drawing a document from corpus *i* becomes
 ```
 p(i) = \frac{w_i^{\alpha}}{\sum_j w_j^{\alpha}}
 ```
 • *w<sub>i</sub>*  – raw token percentage of corpus *i*  
 • *α* ("temperature") – a value in (0,1].  α < 1 flattens the distribution, giving more weight to smaller high-quality corpora.  
 Llama 2 reportedly used α = 0.7 and showed that decreasing α raises scores on knowledge-heavy tasks while keeping the training mix stable.  The same trick is reportedly applied by Mistral (2023) and Claude 3.
 ```python
 from collections import Counter

 def temperature_sample(corpus_ids, alpha=0.7):
     """Return sampling probabilities p(i) proportional to w_i ** alpha."""
     counts = Counter(corpus_ids)           # number of tokens seen per corpus
     weights = {c: n ** alpha for c, n in counts.items()}
     Z = sum(weights.values())              # normalization constant
     probs = {c: w / Z for c, w in weights.items()}
     return probs  # draw corpora according to probs to fill every batch
 ```
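To make the effect of α concrete, the following is a minimal sketch that compares the raw mixture (α = 1) with a flattened one (α = 0.7) and then draws source names for a batch. The corpus names, token counts, and the helper names `mixture_probs` and `draw_sources` are hypothetical and chosen only for illustration.

```python
import random

def mixture_probs(token_counts, alpha=0.7):
    """p(i) = w_i**alpha / sum_j w_j**alpha, with w_i given as raw token counts."""
    weights = {name: n ** alpha for name, n in token_counts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def draw_sources(probs, k, seed=0):
    """Draw k corpus names according to probs (seeded for reproducibility)."""
    rng = random.Random(seed)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=k)

# Hypothetical token counts per corpus (illustrative numbers only).
token_counts = {"web": 900, "code": 90, "papers": 10}

raw = mixture_probs(token_counts, alpha=1.0)   # proportional to corpus size
flat = mixture_probs(token_counts, alpha=0.7)  # flattened: small corpora gain weight

batch_sources = draw_sources(flat, k=8)
print(raw)
print(flat)
print(batch_sources)
```

With α < 1 the smallest corpus receives a larger share than its raw token fraction, at the expense of the largest one, which is the flattening behaviour described above.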
 ### 2. Sequence Packing / Dynamic Batching
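 GPU memory is wasted when every sequence in a batch is padded to the longest example.  "Packing" concatenates multiple shorter sequences until `max_length` is reached (padding only the small remainder) and builds a matching `attention_mask` so that tokens do not attend across segment boundaries.  Packing can improve throughput by 20–40 % without changing the per-sequence gradients, as long as the mask respects segment boundaries. It is listed as supported out-of-the-box in
 * PyTorch `torchtext.experimental.agents.PackedBatch` (verify that this class exists in your torchtext version before relying on it)
 * HuggingFace `DataCollatorForLanguageModeling(pad_to_multiple_of=…)` (note that `pad_to_multiple_of` only rounds the padded length up to a multiple; it does not pack sequences by itself)
 Variable-length attention kernels and batching runtimes (e.g. FlashAttention 2 for training, vLLM for inference) build on the same idea of avoiding padding; the figure quoted for such setups is thousand-token context training at 400+ K tokens/s on A100-80G.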
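The packing idea described above can be sketched in a few lines of NumPy. This is a simplified greedy packer, not the implementation used by any particular library: the function names `pack_sequences` and `segment_causal_mask`, the choice of `pad_id=0`, and the convention that segment id 0 marks padding are all assumptions made for this example.

```python
import numpy as np

def pack_sequences(seqs, max_length, pad_id=0):
    """Greedily pack token lists into rows of exactly max_length tokens.

    Returns (tokens, segment_ids); segment id 0 marks padding positions.
    """
    rows, segs = [], []
    cur_tok, cur_seg, seg = [], [], 1
    for s in seqs:
        s = list(s[:max_length])                  # truncate over-long sequences
        if len(cur_tok) + len(s) > max_length:    # row would overflow: flush it
            pad = max_length - len(cur_tok)
            rows.append(cur_tok + [pad_id] * pad)
            segs.append(cur_seg + [0] * pad)
            cur_tok, cur_seg, seg = [], [], 1
        cur_tok = cur_tok + s
        cur_seg = cur_seg + [seg] * len(s)
        seg += 1
    if cur_tok:                                   # flush the last partial row
        pad = max_length - len(cur_tok)
        rows.append(cur_tok + [pad_id] * pad)
        segs.append(cur_seg + [0] * pad)
    return np.array(rows), np.array(segs)

def segment_causal_mask(segment_ids):
    """Boolean mask [rows, L, L]: query q may attend to key k only if both
    belong to the same non-padding segment and k <= q."""
    same = segment_ids[:, :, None] == segment_ids[:, None, :]
    real = segment_ids != 0
    not_pad = real[:, :, None] & real[:, None, :]
    length = segment_ids.shape[1]
    causal = np.tril(np.ones((length, length), dtype=bool))
    return same & not_pad & causal

# Four short sequences packed into rows of six tokens.
tokens, segment_ids = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], max_length=6)
mask = segment_causal_mask(segment_ids)
print(tokens)
print(segment_ids)
```

In a real training loop the position ids would usually be reset at each segment start as well; that detail is left out here to keep the sketch short.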
 ### 3. Deduplication & Quality Filtering
 Repeated passages cause memorization and provide an easy channel for data poisoning.  Modern pipelines therefore:
 1. Run MinHash/FAISS near-duplicate detection at the **document** and **128-gram** level.
 2. Filter out documents whose perplexity under a small reference model exceeds µ + 3σ (noisy OCR, garbled HTML).
 3. Block-list documents that contain PII or CWE keywords, using regex and spaCy NER.
 The Llama 2 team reportedly deduplicated with 8-gram MinHash and removed ~15 % of CommonCrawl before sampling.  OpenAI’s 2024 "Deduplicate Everything" paper is cited as showing that a duplicate ratio ≤ 0.04 reduces overfitting and speeds up convergence.
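Steps 1 and 2 of the pipeline above can be illustrated with the standard library alone. The sketch below is a toy MinHash over word 3-grams plus a µ + 3σ perplexity filter; the shingle size, the number of hash functions, the sample documents, and all function names are assumptions for this example, and the perplexity values are expected to come from whatever reference model the pipeline uses.

```python
import hashlib
import statistics

def shingles(text, n=3):
    """Set of word n-grams (n=3 is an illustrative choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_perm=64):
    """One minimum per seeded hash function; the seeds emulate random permutations."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def perplexity_outliers(perplexities, k=3.0):
    """Indices of documents whose perplexity exceeds mean + k * std."""
    mu = statistics.mean(perplexities)
    sigma = statistics.pstdev(perplexities)
    return [i for i, p in enumerate(perplexities) if p > mu + k * sigma]

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "completely unrelated text about gradient descent and learning rates"
sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))

print(estimated_jaccard(sig_a, sig_b))  # near-duplicates: high similarity
print(estimated_jaccard(sig_a, sig_c))  # unrelated documents: close to zero
```

A production pipeline would typically add locality-sensitive hashing on top of the signatures so that candidate pairs can be found without comparing every document against every other one.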
 ## Security & Privacy Considerations During Sampling
 ### Data-Poisoning / Backdoor Attacks
 Researchers showed that inserting <1 % backdoored sentences can make a model obey a hidden trigger ("PoisonGPT", 2023).  Recommended mitigations:
 * **Shuffled mixing** – make sure adjacent training examples originate from different sources; this dilutes gradient alignment of malicious spans.
 * **Gradient similarity scoring** – compute cosine similarity of example gradient to batch average; outliers are candidates for removal.
 * **Dataset versioning & hashes** – freeze immutable tarballs and verify SHA-256 before each training run.
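Two of the mitigations above lend themselves to a short sketch: gradient similarity scoring and hash verification. The code assumes per-example gradients are already available as flat vectors (how they are obtained depends on the training framework), and the z-score threshold of 2.0 as well as the function names are illustrative choices rather than recommended values.

```python
import hashlib
import numpy as np

def gradient_outliers(per_example_grads, z_threshold=2.0):
    """Cosine similarity of each example gradient to the batch mean.

    Returns (cosines, indices) where indices are examples whose cosine lies
    more than z_threshold standard deviations below the batch average.
    """
    g = np.asarray(per_example_grads, dtype=float)
    mean_grad = g.mean(axis=0)
    denom = np.linalg.norm(g, axis=1) * np.linalg.norm(mean_grad) + 1e-12
    cos = (g @ mean_grad) / denom
    z = (cos - cos.mean()) / (cos.std() + 1e-12)
    return cos, np.flatnonzero(z < -z_threshold)

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_dataset(path, expected_hex):
    """True if the file's SHA-256 matches the recorded hash."""
    return sha256_file(path) == expected_hex.lower()

# Toy batch: nine aligned gradients and one pointing the opposite way.
grads = [[1.0, 0.0]] * 9 + [[-1.0, 0.0]]
cosines, suspicious = gradient_outliers(grads)
print(suspicious)
```

Flagged examples are only candidates for review; a low cosine can also come from rare but legitimate data.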
 ### Membership-Inference & Memorization
 Heavy overlap between sliding-window samples increases the chance that rare strings (telephone numbers, secret keys) are memorized, because with a stride below `max_length` each token is seen in roughly `max_length / stride` windows.  OpenAI’s 2024 study on ChatGPT memorization reportedly found that raising the stride from 1 × `max_length` to 4 × reduces verbatim leakage by ≈50 % with negligible loss in perplexity.
 Practical recommendations:
 * Use **stride ≥ max_length** except for <1B parameter models where data volume is scarce.
 * Add random masking of 1-3 tokens per window during training; this lowers memorization while preserving utility.
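The two recommendations above can be made concrete with a small sketch that counts how many windows each token lands in for a given stride, and that masks 1 to 3 random positions per window. The helper names, the use of `-1` as a mask id, and the toy sizes are assumptions for illustration; a real pipeline would use the tokenizer's own mask or padding token.

```python
import random

def token_exposure(n_tokens, max_length, stride):
    """How many sliding windows each token position appears in."""
    counts = [0] * n_tokens
    for start in range(0, n_tokens - max_length + 1, stride):
        for pos in range(start, start + max_length):
            counts[pos] += 1
    return counts

def mask_random_tokens(window, mask_id, rng, k_min=1, k_max=3):
    """Replace between k_min and k_max random positions with mask_id.

    Assumes a non-empty window; the input list is left unchanged.
    """
    k = rng.randint(k_min, min(k_max, len(window)))
    masked = list(window)
    for pos in rng.sample(range(len(window)), k):
        masked[pos] = mask_id
    return masked

overlapping = token_exposure(n_tokens=12, max_length=4, stride=1)
disjoint = token_exposure(n_tokens=12, max_length=4, stride=4)
print(overlapping)  # interior tokens are repeated max_length times
print(disjoint)     # every token is seen exactly once

rng = random.Random(0)
masked_window = mask_random_tokens([101, 102, 103, 104], mask_id=-1, rng=rng)
print(masked_window)
```

With stride 1 an interior token is presented to the model `max_length` times per epoch, whereas with stride equal to `max_length` it is presented once, which is the exposure difference the memorization argument relies on.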
 ---
 ## References
 - [Build a Large Language Model from Scratch (Manning, 2024)](https://www.manning.com/books/build-a-large-language-model-from-scratch)
 - [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)
 - [PoisonGPT: Assessing Backdoor Vulnerabilities in Large Language Models (BlackHat EU 2023)](https://arxiv.org/abs/2308.12364)
 {{#include ../../banners/hacktricks-training.md}}