Deploying and interfacing with an LLM on a KISSKI server

Note to myself

Author

Marko Bachl

Published

January 9, 2025

Connect to KISSKI via ssh

Add host entry (once)

nano ~/.ssh/config 
host KISSKI
        Hostname glogin-gpu.hpc.gwdg.de
        User XXXXXX
        IdentityFile ~/.ssh/XXX

Connect

ssh KISSKI

Start an interactive session on a GPU node

srun -p kisski --pty -n 1 -c 32 -G A100:4 bash
  • -n: Number of tasks (1; spreading the server over more than one node would need additional vLLM configuration)
  • -c: CPU cores (max. 32; we don’t need that many, but if we take all GPUs of a node we might as well take all CPU cores)
  • -G: Type of GPU (A100; H100 is possible with kisski-h100) and number of GPUs (max. 4; 1, 2, or 4 for a simple vLLM setup)
  • bash: start an interactive bash shell

Wait for resources; less than 1 minute in my recent experience.
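
Once the session starts, a quick sanity check with standard SLURM and NVIDIA tooling confirms the allocation and shows the node name that is needed later for the ssh tunnel:

echo "$SLURM_JOB_NODELIST"   # assigned compute node, e.g., ggpu162
nvidia-smi                   # should list the requested A100 GPUs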

tmux

So that the server keeps running if the ssh connection is lost

tmux new -s vllm_serve
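
Detach with Ctrl-b d; standard tmux commands for getting back in or opening a second window (e.g., for the curl tests below):

tmux attach -t vllm_serve        # reattach to the running session
tmux new-window -t vllm_serve    # open a second window in the session (Ctrl-b c from inside also works)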

Install server packages in conda environments (only once)

See the vLLM and LMDeploy documentation for details.

For vLLM:

conda create -n vllm_env python=3.10 -y
conda activate vllm_env
pip install vllm

For LMDeploy:

conda create -n lmdeploy_env python=3.8 -y
conda activate lmdeploy_env
pip install lmdeploy
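
A quick check that both environments were created correctly (plain conda/pip, nothing cluster-specific):

conda env list                               # both environments should be listed
conda run -n vllm_env pip show vllm          # prints the installed vLLM version
conda run -n lmdeploy_env pip show lmdeploy  # prints the installed LMDeploy version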

Deploy a model from Hugging Face with vLLM

conda activate vllm_env

For gated models:

export HF_TOKEN=XXXXX
  • --download-dir is important because user storage is too small for the model weights. Always specify it so that a second deployment does not trigger a new download (see the storage note after this list).
  • Look up the specific vllm serve options for each model.
  • Large models are large; the first deployment triggers the download, which will take some time.
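
The deployment commands use $PROJECT_DIR as the location of the project storage; that is an assumption of these notes. If the cluster does not set it automatically, define it once and create the model directory:

export PROJECT_DIR=/projects/XXXXX   # placeholder: replace with the actual project storage path
mkdir -p "$PROJECT_DIR/models"       # target for --download-dir, so weights land on project storage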

nvidia/Llama-3.1-Nemotron-70B-Instruct-HF

vllm serve nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --tokenizer-mode auto \
    --config-format auto \
    --load-format auto \
    --dtype auto \
    --download-dir "$PROJECT_DIR/models"

allenai/Llama-3.1-Tulu-3-70B

vllm serve allenai/Llama-3.1-Tulu-3-70B \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --tokenizer-mode auto \
    --config-format auto \
    --load-format auto \
    --dtype auto \
    --download-dir "$PROJECT_DIR/models"

neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16

vllm serve neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 \
    --tensor-parallel-size 4 \
    --max-model-len 4096 \
    --tokenizer-mode auto \
    --config-format auto \
    --load-format auto \
    --dtype auto \
    --download-dir "$PROJECT_DIR/models"

mistralai/Mistral-Large-Instruct-2411

vllm serve mistralai/Mistral-Large-Instruct-2411 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --tensor_parallel_size 4 \
    --download-dir "$PROJECT_DIR/models"
  • requires a kisski-h100 node

mistralai/Mistral-Small-Instruct-2409

vllm serve mistralai/Mistral-Small-Instruct-2409 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --download-dir "$PROJECT_DIR/models"
  • fits on a single GPU
  • Add --tensor-parallel-size 4 for faster parallel calls.

microsoft/phi-4

vllm serve microsoft/phi-4 \
    --tensor-parallel-size 2 \
    --tokenizer_mode auto \
    --config_format auto \
    --load_format auto \
    --download-dir "$PROJECT_DIR/models"
  • fits on a single GPU
  • supports only 1 or 2 GPUs and is therefore slower than Mistral-Small with 4 GPUs.

neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic

vllm serve neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic \
    --enforce-eager \
    --max-num-seqs 16 \
    --tensor-parallel-size 4 \
    --download-dir "$PROJECT_DIR/models"
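
A hedged sketch of how an image request to this vision model might look via vLLM’s OpenAI-compatible chat API (the image URL is a placeholder; base64 data URLs should also work):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
            ]
        }]
    }' | json_pp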

Deploy a model from Hugging Face with LMDeploy

conda activate lmdeploy_env

OpenGVLab/InternVL2_5-78B-MPO

Download (once)

huggingface-cli download OpenGVLab/InternVL2_5-78B-MPO --local-dir "$PROJECT_DIR/models/OpenGVLab/InternVL2_5-78B-MPO"

Deploy

lmdeploy serve api_server "$PROJECT_DIR/models/OpenGVLab/InternVL2_5-78B-MPO" \
    --server-port 8000 \
    --tp 4
  • lmdeploy serve does not seem to offer an option to specify a download directory. If the model weights fit into the personal cache at all, they could be moved to project storage manually; if not, the model has to be downloaded to project storage beforehand, as done above with huggingface-cli (a cache workaround is sketched after this list).
  • OpenGVLab/InternVL2_5-78B-MPO and OpenGVLab/InternVL2_5-38B-MPO do not fit into the personal cache.
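
One possible workaround (an assumption, not tested in these notes): relocate the personal Hugging Face cache to project storage once and leave a symlink behind, so anything downloaded into the “personal cache” actually lands on project storage:

mkdir -p "$PROJECT_DIR/hf_cache"
mv ~/.cache/huggingface/* "$PROJECT_DIR/hf_cache"/ 2>/dev/null   # move any existing cache contents
rm -rf ~/.cache/huggingface
ln -s "$PROJECT_DIR/hf_cache" ~/.cache/huggingface               # cache path now points to project storage

Setting HF_HOME to a path on project storage would achieve the same without the symlink.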

Test on server

In a new tmux window

curl http://localhost:8000/v1/models | json_pp
curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json"     -d '{
        "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
        "messages": [
            {"role": "system", "content": "Talk like a pirate."},
            {"role": "user", "content": "Tell me a joke."}
        ]
    }' | json_pp

“Double” ssh via the login node to the compute node

squeue --me
ssh -t -t KISSKI -L 8000:localhost:8000 ssh -N ggpu162 -L 8000:localhost:8000
  • ggpu162: replace with the name of the assigned node from the squeue output
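
An alternative with the same effect, assuming a hypothetical addition to ~/.ssh/config that routes the compute nodes through the login node via ProxyJump:

host ggpu*
        User XXXXXX
        IdentityFile ~/.ssh/XXX
        ProxyJump KISSKI

Then a single command opens the tunnel:

ssh -N -L 8000:localhost:8000 ggpu162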

Check availability in R

library(jsonlite)
library(httr2)

mdl = request(base_url = "http://localhost:8000/v1/models") |> 
  req_perform() |> 
  resp_body_json() |>
  _$data |> 
  _[[1]] |> 
  _$id
mdl
[1] "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16"
request(base_url = "http://localhost:8000/v1/chat/completions") |> 
  req_body_json(list(
    model = mdl,
    messages = list(
      list(role = "system", content = "Talk like a pirate!"),
      list(role = "user", content = "Tell me a joke.")
    ),
    temperature = 0
  )) |> 
  req_perform() |> 
  resp_body_string() |> 
  prettify()
{
    "id": "chatcmpl-dd41debcba6748ab9294ffd0b0f3efe2",
    "object": "chat.completion",
    "created": 1736454675,
    "model": "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Yer lookin' fer a joke, eh? Alright then, matey! Here be one fer ye:\n\nWhy did the pirate quit his job?\n\n(pause fer dramatic effect)\n\nBecause he was sick o' all the arrrr-guments! (get it? ahh, never mind, ye landlubber!)",
                "tool_calls": [

                ]
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 25,
        "total_tokens": 91,
        "completion_tokens": 66,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null
}

Next steps

  • Try a multi-node setup to deploy larger models, e.g., deepseek-ai/DeepSeek-V3.
  • Put everything into a SLURM script for convenience (see the sketch after this list).
  • Check whether the deployed model can be accessed by other project members; if not, figure out how to enable that.
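
A hedged sketch of what such a SLURM batch script might look like (the time limit and the conda activation lines are assumptions and will need adjusting to the cluster’s configuration); submit it with sbatch:

#!/bin/bash
#SBATCH --partition=kisski
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus=A100:4
#SBATCH --job-name=vllm_serve
#SBATCH --time=08:00:00               # assumption: adjust to the project's walltime limits

# Make conda usable in the non-interactive batch shell, then start the server.
eval "$(conda shell.bash hook)"
conda activate vllm_env

vllm serve nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --download-dir "$PROJECT_DIR/models"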

Informal test notes

  • Walltime gains for Mistral Small: 20: 203.259 s; 40: 162.893 s; 80: 138.172 s
  • Quantization seems to hurt performance more than I would have thought.