## Advanced Sampling Strategies (2023-2025)

### 1. Temperature-Based Mixture Weighting

State-of-the-art LLMs are rarely trained on a single corpus. Instead, they sample from several heterogeneous data sources (code, web, academic papers, forums…). The relative proportion of each source can strongly affect downstream performance. Recent open-source models such as Llama 2 popularized a **temperature-based sampling scheme** in which the probability of drawing a document from corpus *i* becomes

```
p(i) = \frac{w_i^{\alpha}}{\sum_j w_j^{\alpha}}
```

* *w<sub>i</sub>* – raw token percentage of corpus *i*
* *α* ("temperature") – a value in (0,1]; α < 1 flattens the distribution, giving more weight to smaller high-quality corpora.

Llama 2 used α = 0.7 and showed that decreasing α boosted evaluation scores on knowledge-heavy tasks while keeping the training mix stable. The same trick has been adopted by Mistral (2023) and Claude 3.

```python
import random
from collections import Counter

def temperature_sample(corpus_ids, alpha=0.7):
    """Return per-corpus sampling probabilities after temperature scaling."""
    counts  = Counter(corpus_ids)                  # number of tokens (or docs) seen per corpus
    weights = {c: n ** alpha for c, n in counts.items()}
    Z = sum(weights.values())
    return {c: w / Z for c, w in weights.items()}

# Example: pick the source corpus for each of the 32 sequences in a batch
probs   = temperature_sample(["web"] * 70 + ["code"] * 20 + ["papers"] * 10)
sources = random.choices(list(probs), weights=list(probs.values()), k=32)
```

### 2. Sequence Packing / Dynamic Batching

GPU memory is wasted when every sequence in a batch is padded to the longest example. "Packing" instead concatenates multiple shorter sequences until the **exact** `max_length` is reached and builds a matching `attention_mask` so that tokens do not attend across segment boundaries (a minimal sketch is shown below). Packing can improve throughput by 20–40 % with no change to the gradients and is supported out-of-the-box in, for example:

* HuggingFace TRL's `SFTTrainer(packing=True)`, which concatenates examples into fixed-length blocks
* HuggingFace's `DataCollatorForLanguageModeling(pad_to_multiple_of=…)`, which does not pack but at least reduces padding waste

Dynamic batching frameworks (e.g. FlashAttention-2's variable-length kernels, vLLM's continuous batching) combine sequence packing with efficient attention kernels, enabling long-context training and serving at 400 K+ tokens/s on an A100-80GB.
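
The core packing step itself is only a few lines. The following is a minimal, framework-agnostic sketch, assuming already-tokenized documents and an `eos_id` token used as the segment separator (names and sizes are illustrative); a production collator would additionally emit a block-diagonal attention mask or reset position ids at each boundary:

```python
def pack_sequences(tokenized_docs, max_length, eos_id):
    """Greedily concatenate tokenized documents into blocks of exactly max_length tokens."""
    blocks, buffer = [], []
    for doc in tokenized_docs:
        buffer.extend(doc + [eos_id])            # eos_id marks the boundary between documents
        while len(buffer) >= max_length:
            blocks.append(buffer[:max_length])   # one fully packed training block, zero padding
            buffer = buffer[max_length:]
    return blocks                                # leftover tokens in buffer are simply dropped in this sketch

# Example: three short "documents" packed into blocks of 8 tokens
docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12]]
print(pack_sequences(docs, max_length=8, eos_id=0))   # -> [[1, 2, 3, 0, 4, 5, 6, 7]]
```

Because every block is exactly `max_length` tokens long, no compute is spent on padding; the price is the boundary-aware mask so that tokens of one document cannot attend to the leftover tokens of the previous one inside the same block.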

### 3. Deduplication & Quality Filtering

Repeated passages cause memorization and provide an easy channel for data poisoning. Modern pipelines therefore:

1. Run MinHash/FAISS near-duplicate detection at the **document** and **128-gram** level (a MinHash sketch is shown below).
2. Filter out documents whose perplexity under a small reference model is > µ + 3σ (noisy OCR, garbled HTML).
3. Block-list documents that contain PII or CWE keywords using regex & spaCy NER.

The Llama 2 team deduplicated with 8-gram MinHash and removed ~15 % of CommonCrawl before sampling. OpenAI's 2024 "Deduplicate Everything" paper reports that keeping the duplicate ratio at ≤ 0.04 reduces over-fitting and speeds up convergence.
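
A minimal near-duplicate filter in the spirit of step 1, sketched with the `datasketch` library (the 0.8 Jaccard threshold, the 8-word shingles and the tiny in-memory corpus are illustrative choices, not the settings of any particular pipeline):

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128, shingle=8):
    """MinHash signature over word 8-grams (shingles) of a document."""
    words = text.split()
    sig = MinHash(num_perm=num_perm)
    for i in range(max(1, len(words) - shingle + 1)):
        sig.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return sig

corpus = [
    "the quick brown fox jumps over the lazy dog and then naps in the sun",
    "the quick brown fox jumps over the lazy dog and then naps in the sun",  # duplicate -> dropped
    "a completely different document about sampling training data for llms",
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # ~0.8 estimated Jaccard counts as a near-duplicate
kept = []
for idx, doc in enumerate(corpus):
    sig = minhash_signature(doc)
    if lsh.query(sig):            # a (near-)duplicate was already kept -> skip this document
        continue
    lsh.insert(f"doc-{idx}", sig)
    kept.append(doc)

print(len(kept))  # 2
```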

## Security & Privacy Considerations During Sampling

### Data-Poisoning / Backdoor Attacks

Researchers have shown that inserting < 1 % of backdoored sentences into the training mix can make a model obey a hidden trigger ("PoisonGPT", 2023). Recommended mitigations (a hash-verification sketch follows the list):

* **Shuffled mixing** – make sure adjacent training examples originate from different sources; this dilutes the gradient alignment of malicious spans.
* **Gradient similarity scoring** – compute the cosine similarity of each example's gradient to the batch average; outliers are candidates for removal.
* **Dataset versioning & hashes** – freeze immutable tarballs and verify their SHA-256 before each training run.
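
A minimal integrity check for the last point. The `manifest` (archive path → SHA-256 recorded when the dataset snapshot was frozen) is a hypothetical example; only the hashing logic matters:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file in 1 MiB chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical manifest written when the dataset snapshot was frozen
manifest = {
    "data/common_crawl_dedup.tar.zst": "<expected sha256 hex digest>",
    "data/github_code_v2.tar.zst": "<expected sha256 hex digest>",
}

for path, expected in manifest.items():
    actual = sha256_of(path)
    if actual != expected:
        raise RuntimeError(f"{path} changed since it was frozen: {actual} != {expected}")
```

Running this at the start of every training job makes a silently modified shard abort the run instead of poisoning the model.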

### Membership-Inference & Memorization

Long overlap between sliding-window samples increases the chance that rare strings (telephone numbers, secret keys) are memorized. OpenAI's 2024 study on ChatGPT memorization reports that raising the stride from 1 × `max_length` to 4 × reduces verbatim leakage by ≈ 50 % with a negligible loss in perplexity.

Practical recommendations (a sampler sketch follows the list):

* Use a **stride ≥ max_length**, except for < 1 B-parameter models where data volume is scarce.
* Add random masking of 1–3 tokens per window during training; this lowers memorization while preserving utility.
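
Both recommendations can be folded directly into the sliding-window sampler from earlier in this chapter. A minimal sketch, where `mask_id` stands for whatever special token the tokenizer provides and the window size is illustrative:

```python
import random

def make_windows(token_ids, max_length=256, stride=None, mask_id=0, masks_per_window=(1, 3)):
    """Cut windows with stride >= max_length and randomly mask 1-3 tokens in each."""
    stride = stride if stride is not None else max_length    # stride == max_length -> zero overlap
    windows = []
    for start in range(0, len(token_ids) - max_length + 1, stride):
        window = list(token_ids[start:start + max_length])
        for pos in random.sample(range(max_length), k=random.randint(*masks_per_window)):
            window[pos] = mask_id                             # light masking lowers verbatim memorization
        windows.append(window)
    return windows

# Example: a stride of 4 x max_length further reduces overlap-driven leakage
windows = make_windows(list(range(10_000)), max_length=256, stride=4 * 256)
```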

---

## References

- [Build a Large Language Model from Scratch (Manning, 2024)](https://www.manning.com/books/build-a-large-language-model-from-scratch)
- [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)
- [PoisonGPT: Assessing Backdoor Vulnerabilities in Large Language Models (BlackHat EU 2023)](https://arxiv.org/abs/2308.12364)

{{#include ../../banners/hacktricks-training.md}}