Deploying and interfacing with an LLM on a KISSKI server
Note to myself
Links
- GWDG HPC Documentation (hosts KISSKI)
- KISSKI
- Our KISSKI project: FUBerlin-LLMs
- vLLM Documentation
- LMDeploy Documentation
- New candidates:
Connect to KISSKI via ssh
Add account (once)
nano ~/.ssh/config
Host KISSKI
    Hostname glogin-gpu.hpc.gwdg.de
    User XXXXXX
    IdentityFile ~/.ssh/XXX
Connect
ssh KISSKI
Start an interactive session on a GPU node
srun -p kisski --pty -n 1 -c 32 -G A100:4 bash
- -n: Number of tasks (more than one node would need additional configuration of the vLLM server)
- -c: CPU cores (max. 32; we don’t need that many, but if we take all GPUs of a node we might as well take all CPU cores)
- -G: Type of GPU (A100; H100 would be possible with kisski-h100) and number of GPUs (max. 4; 1, 2, or 4 for a simple vLLM set-up)
- bash: interactive bash console
Wait for resources; less than 1 minute in my recent experience.
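Once the interactive shell starts on the compute node, a quick sanity check is useful; the node name is also needed later for the ssh tunnel:
hostname      # e.g. ggpu162; note this for the port forwarding below
nvidia-smi    # confirm the requested A100 GPUs are visible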
tmux
So that the service keeps running if the connection is lost.
tmux new -s vllm_serve
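Useful tmux commands for later (detach from the session with Ctrl-b d):
tmux ls                     # list running sessions
tmux attach -t vllm_serve   # re-attach to the serving session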
Install server packages in conda environments (only once)
See the vLLM and LMDeploy documentation for details.
For vLLM:
conda create -n vllm_env python=3.10 -y
conda activate vllm_env
pip install vllm
For LMDeploy:
conda create -n lmdeploy_env python=3.8 -y
conda activate lmdeploy_env
pip install lmdeploy
Deploy model from Hugging Face with vLLM
conda activate vllm_env
For gated models:
export HF_TOKEN=XXXXX
- --download-dir is important because user storage is too small for model weights. Always specify it so that a second deployment does not trigger a new download.
- Look up specific vllm serve options for each model.
- Large models are large; the first deployment triggers the download and will take some time.
nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
vllm serve nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--tokenizer-mode auto \
--config-format auto \
--load-format auto \
--dtype auto \
--download-dir "$PROJECT_DIR/models"
allenai/Llama-3.1-Tulu-3-70B
vllm serve allenai/Llama-3.1-Tulu-3-70B \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--tokenizer-mode auto \
--config-format auto \
--load-format auto \
--dtype auto \
--download-dir "$PROJECT_DIR/models"
neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
vllm serve neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 \
--tensor-parallel-size 4 \
--max-model-len 4096 \
--tokenizer-mode auto \
--config-format auto \
--load-format auto \
--dtype auto \
--download-dir "$PROJECT_DIR/models"
mistralai/Mistral-Large-Instruct-2411
vllm serve mistralai/Mistral-Large-Instruct-2411 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tensor_parallel_size 4 \
--download-dir "$PROJECT_DIR/models"
- requires kisski-h100 node
mistralai/Mistral-Small-Instruct-2409
vllm serve mistralai/Mistral-Small-Instruct-2409 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--download-dir "$PROJECT_DIR/models"
- fits on a single GPU
- Add --tensor-parallel-size 4 for faster parallel calls.
microsoft/phi-4
vllm serve microsoft/phi-4 \
--tensor-parallel-size 2 \
--tokenizer_mode auto \
--config_format auto \
--load_format auto \
--download-dir "$PROJECT_DIR/models"
- fits on a single GPU
- supports only 1 or 2 GPUs, slower than Mistral-Small with 4 GPUs.
neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic
vllm serve neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic \
--enforce-eager \
--max-num-seqs 16 \
--tensor-parallel-size 4 \
--download-dir "$PROJECT_DIR/models"
Deploy model from Hugging Face with LMDeploy
conda activate lmdeploy_env
Download (once)
huggingface-cli download OpenGVLab/InternVL2_5-78B-MPO --local-dir "$PROJECT_DIR/models/OpenGVLab/InternVL2_5-78B-MPO"
Deploy
lmdeploy serve api_server "$PROJECT_DIR/models/OpenGVLab/InternVL2_5-78B-MPO" \
--server-port 8000 \
--tp 4
- No option found to specify a download directory. If the model weights fit into the personal cache at all, they could be moved to project storage manually. If not, we need to figure out how to download the model manually to project storage.
- OpenGVLab/InternVL2_5-78B-MPO and OpenGVLab/InternVL2_5-38B-MPO do not fit in the personal cache.
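A possible workaround (not tested here): point the Hugging Face cache itself at project storage via HF_HOME, so automatic downloads bypass the small personal cache:
export HF_HOME="$PROJECT_DIR/hf_cache"   # all Hugging Face downloads and caches then land in project storage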
Test on server
In new tmux window
curl http://localhost:8000/v1/models | json_pp
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" -d '{
"model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
"messages": [
{"role": "system", "content": "Talk like a pirate."},
{"role": "user", "content": "Tell me a joke."}
]
}' | json_pp
“Double” ssh through the login node to the compute node
squeue --me
ssh -t -t KISSKI -L 8000:localhost:8000 ssh -N ggpu162 -L 8000:localhost:8000
- ggpu162: replace with the name of the assigned node (shown by squeue --me)
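The node name can also be extracted directly, e.g.:
squeue --me --noheader --format=%N   # prints just the assigned node name, e.g. ggpu162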
Check availability in R
library(jsonlite)
library(httr2)
mdl = request(base_url = "http://localhost:8000/v1/models") |>
  req_perform() |>
  resp_body_json() |>
  _$data |>
  _[[1]] |>
  _$id
mdl
[1] "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16"
request(base_url = "http://localhost:8000/v1/chat/completions") |>
  req_body_json(list(
    model = mdl,
    messages = list(
      list(role = "system", content = "Talk like a pirate!"),
      list(role = "user", content = "Tell me a joke.")
    ),
    temperature = 0
  )) |>
  req_perform() |>
  resp_body_string() |>
  prettify()
{
"id": "chatcmpl-dd41debcba6748ab9294ffd0b0f3efe2",
"object": "chat.completion",
"created": 1736454675,
"model": "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Yer lookin' fer a joke, eh? Alright then, matey! Here be one fer ye:\n\nWhy did the pirate quit his job?\n\n(pause fer dramatic effect)\n\nBecause he was sick o' all the arrrr-guments! (get it? ahh, never mind, ye landlubber!)",
"tool_calls": [
]
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 25,
"total_tokens": 91,
"completion_tokens": 66,
"prompt_tokens_details": null
},
"prompt_logprobs": null
}
Next steps
- Try multi-node to deploy larger models, e.g., deepseek-ai/DeepSeek-V3
- Put everything into a SLURM script for convenience (rough sketch below).
- Check whether deployed model can be accessed by other project members. If not, figure out how to.
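A rough, untested sketch of such a SLURM batch script, based on the interactive steps above (partition and resources as used there; time limit and PROJECT_DIR are placeholders):
#!/bin/bash
#SBATCH --partition=kisski
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus=A100:4
#SBATCH --time=08:00:00            # placeholder; adjust to the intended serving window
#SBATCH --job-name=vllm_serve

# Make conda available in the non-interactive shell, then activate the environment
eval "$(conda shell.bash hook)"
conda activate vllm_env

export PROJECT_DIR=/path/to/kisski/project/storage   # placeholder; use the actual project directory

vllm serve nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --download-dir "$PROJECT_DIR/models"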
Informal test notes
- Walltime gains for Mistral Small: 20: 203.259 s, 40: 162.893 s, 80: 138.172 s
- Quantization seems to hurt performance more than I would have thought.