Lambda Labs has GH200s at half off right now to get more people used to the ARM tooling. This means you might actually be able to afford to run the biggest open-source models! The only caveat is that you will occasionally have to build something from source. Here is how I got llama 405b running at full precision on GH200s.
Llama 405b is about 750GB, so you want around ten 96GB GPUs to run it. (The GH200 has quite good CPU-GPU memory swap speed -- that's kind of the whole point of the GH200 -- so you can use as few as 3. Time per token will be terrible, but total throughput is acceptable if you're doing batch processing.) Log into Lambda Labs and create a bunch of GH200 instances. Make sure to give them all the same shared network filesystem.
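The GPU count follows from rough arithmetic (a back-of-the-envelope check; bf16 is 2 bytes per parameter, and the checkpoint on disk is a bit smaller than the raw estimate):

# 405e9 params * 2 bytes ≈ 810 GB of raw weights; 810 / 96 ≈ 9 GPUs for the weights alone,
# so ~10 leaves a little headroom for activations and KV cache.
python3 -c 'import math; print(math.ceil(405e9 * 2 / 96e9), "GPUs just for the weights")'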
Save the IP addresses to ~/ips.txt.
I prefer plain bash & ssh over anything fancy like kubernetes or slurm. It's manageable with a few helpers.
# skip fingerprint confirmation
for ip in $(cat ~/ips.txt); do
    echo "doing $ip"
    ssh-keyscan $ip >> ~/.ssh/known_hosts
done

function run_ip() {
    ssh -i ~/.ssh/lambda_id_ed25519 ubuntu@$ip -- stdbuf -oL -eL bash -l -c "$(printf "%q" "$*")" < /dev/null
}
function run_k() { ip=$(sed -n "$k"p ~/ips.txt) run_ip "$@"; }
function runhead() { ip="$(head -n1 ~/ips.txt)" run_ip "$@"; }

function run_ips() {
    for ip in $ips; do
        ip=$ip run_ip "$@" |& sed "s/^/$ip\t /" &
        # pids="$pids $!"
    done
    wait &> /dev/null
}
function runall() { ips="$(cat ~/ips.txt)" run_ips "$@"; }
function runrest() { ips="$(tail -n+2 ~/ips.txt)" run_ips "$@"; }

function ssh_k() {
    ip=$(sed -n "$k"p ~/ips.txt)
    ssh -i ~/.ssh/lambda_id_ed25519 ubuntu@$ip
}
alias ssh_head='k=1 ssh_k'

function killall() {
    pkill -ife '.ssh/lambda_id_ed25519'
    sleep 1
    pkill -ife -9 '.ssh/lambda_id_ed25519'
    while [[ -n "$(jobs -p)" ]]; do fg || true; done
}
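For example, you can sanity-check the fleet with these helpers (an illustration; any command works):

runall 'uname -m' # should print aarch64 on every node
runall 'nvidia-smi --query-gpu=name,memory.total --format=csv,noheader'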
We'll put the python environment and the model weights on the NFS. It will load much faster if we cache it.
# First, check the NFS works.
# runall ln -s my_other_fs_name shared
runhead 'echo world > shared/hello'
runall cat shared/hello

# Install and enable cachefilesd
runall sudo apt-get update
runall sudo apt-get install -y cachefilesd
runall "echo '
RUN=yes
CACHE_TAG=mycache
CACHE_BACKEND=Path=/var/cache/fscache
CACHEFS_RECLAIM=0
' | sudo tee -a /etc/default/cachefilesd"
runall sudo systemctl restart cachefilesd
runall 'sudo journalctl -u cachefilesd | tail -n2'

# Set the "fsc" option on the NFS mount
runhead cat /etc/fstab # should have mount to ~/shared
runall cp /etc/fstab etc-fstab-bak.txt
runall sudo sed -i 's/,proto=tcp,/,proto=tcp,fsc,/g' /etc/fstab
runall cat /etc/fstab

# Remount
runall sudo umount /home/ubuntu/wash2
runall sudo mount /home/ubuntu/wash2
runall cat /proc/fs/nfsfs/volumes # FSC column should say "yes"

# Test cache speedup
runhead dd if=/dev/urandom of=shared/bigfile bs=1M count=8192
runall dd if=shared/bigfile of=/dev/null bs=1M # First one takes 8 seconds
runall dd if=shared/bigfile of=/dev/null bs=1M # Second takes 0.6 seconds
Instead of carefully running the same commands on every machine, we can put a conda environment on the NFS and just control it from the head node.
# We'll also use a shared script instead of changing ~/.profile directly.
# Easier to fix mistakes that way.
runhead 'echo ". /opt/miniconda/etc/profile.d/conda.sh" >> shared/common.sh'
runall 'echo "source /home/ubuntu/shared/common.sh" >> ~/.profile'
runall which conda

# Create the environment
runhead 'conda create --prefix ~/shared/311 -y python=3.11'
runhead '~/shared/311/bin/python --version' # double-check that it is executable
runhead 'echo "conda activate ~/shared/311" >> shared/common.sh'
runall which python
Aphrodite is a fork of vllm that starts up a bit faster and has some extra features. It runs the openai-compatible inference API and the model itself. You need torch, triton, and flash attention. You can get aarch64 torch builds from pytorch.org (you do not want to build that one yourself). The other two you can build yourself or use the wheels I made. If you build from source, you can save a little time by running python setup.py bdist_wheel for triton, flash-attention, and aphrodite in parallel on three different machines. Or you can do them one at a time on the same machine.
runhead pip install 'numpy<2' torch==2.4.0 --index-url 'https://download.pytorch.org/whl/cu124'
# fix for "libstdc++.so.6: version `GLIBCXX_3.4.30' not found" error:
runhead conda install -y -c conda-forge libstdcxx-ng=12
runhead python -c 'import torch; print(torch.tensor(2).cuda() + 2, "torch ok")'
runhead pip install 'https://github.com/qpwo/lambda-gh200-llama-405b-tutorial/releases/download/v0.1/triton-3.2.0+git755d4164-cp311-cp311-linux_aarch64.whl'
runhead pip install 'https://github.com/qpwo/lambda-gh200-llama-405b-tutorial/releases/download/v0.1/aphrodite_flash_attn-2.6.1.post2-cp311-cp311-linux_aarch64.whl'
k=1 ssh_k # ssh into first machine

pip install -U pip setuptools wheel ninja cmake setuptools_scm
git config --global feature.manyFiles true # faster clones
git clone https://github.com/triton-lang/triton.git ~/shared/triton
cd ~/shared/triton/python
git checkout 755d4164 # <-- optional, tested versions
# Note that ninja already parallelizes everything to the extent possible,
# so no sense trying to change the cmake flags or anything.
python setup.py bdist_wheel
pip install --no-deps dist/*.whl # good idea to download this too for later
python -c 'import triton; print("triton ok")'
k=2 ssh_k # go into second machine

git clone https://github.com/AlpinDale/flash-attention ~/shared/flash-attention
cd ~/shared/flash-attention
python setup.py bdist_wheel
pip install --no-deps dist/*.whl
python -c 'import aphrodite_flash_attn; import aphrodite_flash_attn_2_cuda; print("flash attn ok")'
For aphrodite itself, you can use my wheel or build it yourself:

runhead pip install 'https://github.com/qpwo/lambda-gh200-llama-405b-tutorial/releases/download/v0.1/aphrodite_engine-0.6.4.post1-cp311-cp311-linux_aarch64.whl'
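If you'd rather build it from source, the pattern mirrors the triton and flash-attention builds above. This is only a sketch (the repo URL and steps are my assumption, not tested here), using a third machine so all three builds can run in parallel:

k=3 ssh_k # go into third machine

git clone https://github.com/PygmalionAI/aphrodite-engine ~/shared/aphrodite-engine
cd ~/shared/aphrodite-engine
python setup.py bdist_wheel
pip install --no-deps dist/*.whl
python -c 'import aphrodite; print("aphrodite ok")'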
Go to https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct and make sure you have the right permissions. Approval usually takes about an hour. Get a token from https://huggingface.co/settings/tokens
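Then download the weights onto the shared filesystem so every node can read them. This is a sketch using huggingface-cli; the hf_transfer step and the --max-workers value are my choices, so adjust to taste:

runhead pip install hf_transfer 'huggingface_hub[cli]'
runhead 'huggingface-cli login --token <your-token>'
# pull the weights into the shared NFS so every node sees them
runhead 'HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download meta-llama/Llama-3.1-405B-Instruct --local-dir ~/shared/Llama-3.1-405B-Instruct --max-workers 16'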
We'll make the servers aware of each other by starting ray.
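A sketch of how that can look with the helpers above (6379 is ray's default port; everything else here is an assumption, e.g. use the head node's private IP if the public one isn't reachable between instances):

runhead 'ray start --head --port=6379'
runrest "ray start --address=$(head -n1 ~/ips.txt):6379"
runall 'ray status | tail -n20' # every node should report the full cluster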
We can start aphrodite in one terminal tab:
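Something along these lines, run on the head node. The flags are assumptions (aphrodite mirrors vllm's engine arguments); pick tensor-parallel-size times pipeline-parallel-size equal to the number of GPUs in your ray cluster:

ssh_head

aphrodite run ~/shared/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 5 \
    --max-model-len 8192 \
    --port 8000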
And run queries from your local machine in a second terminal:
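For example, with curl against the OpenAI-compatible endpoint (the port matches the launch sketch above; if it isn't reachable directly, forward it over ssh first):

ip=$(head -n1 ~/ips.txt)
# optional: ssh -i ~/.ssh/lambda_id_ed25519 -L 8000:localhost:8000 ubuntu@$ip
curl http://$ip:8000/v1/models # check the served model name first
curl http://$ip:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "/home/ubuntu/shared/Llama-3.1-405B-Instruct",
        "prompt": "Machine learning is",
        "max_tokens": 128
    }'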
Decent speed for text, but a bit slow for code. If you connect two 8xH100 servers you get closer to 16 tokens per second, but it costs three times as much.