Using NVIDIA NIM Containers
We illustrate using this Llama 3.1 (8B) model.
NVIDIA provides Docker images for the running the model locally but Docker is ill-suited to a server infrastructure. Marlowe uses Apptainer instead and so one needs to convert the provided Docker image into an Apptainer image. Also, since the image is hosted on the NVIDIA Developer Site that requires one to authenticate, one has to use a Docker tools on one’s local machine to download the image, and then upload it to Marlowe for conversion for use with Apptainer.
We will assume you have set up your Apptainer cache directory as noted here.
-
Create an NVIDIA Developer account if you haven’t already done so.
-
On your local machine, install Docker Desktop if you haven’t already done so and ensure it is running.
-
Get an API KEY for logging into NVIDIA GPU Cloud (NGC). For our example, you can obtain one by clicking on the the Get API Key at the top of the python code.
-
In a terminal on your local machine (
bluebird
), log into the NGC via:bluebird$ docker login nvcr.io
-
Pull down the Llama image and save it as a tar file. The image is about 13G!
bluebird$ docker pull nvcr.io/nim/meta/llama-3.1-8b-instruct:latest bluebird$ docker save nvcr.io/nim/meta/llama-3.1-8b-instruct:latest -o llama.tar
-
Upload to Marlowe project directory (assuming
m223813
is your project directory).bluebird$ scp llama.tar <SUnetID>@login.marlowe.stanford.edu:/scratch/m223813
-
Log in to Marlowe and convert the image to a
.sif
file. This takes a while to complete and is done once.you@login-02$ cd /scratch/m223813 you@login-02$ ml load apptainer you@login-02$ apptainer build llama.sif docker-archive://llama.tar
-
Run an interactive queue on the partition provided for you. We use 8 GPUs in our example. Note down the node number which is typically something like
n01
orn02
etc. We’ll assumen01
in what follows.you@login-02$ srun --partition=<your_partition> --gres=gpu:8 --ntasks=1 --time=1:00:00 --pty /bin/bash
-
Run the container on
n01
; this will take about 10 minutes the first time. Note the use of the API Key from step 2 which can be set up once in your~/.bash_profile
for convenience.export NGC_API_KEY=<your_api_key> ## fill in your API key ml load apptainer export LOCAL_NIM_CACHE=$PROJDIR/.cache/nim mkdir -p "$LOCAL_NIM_CACHE" apptainer run --nv \ --bind "$LOCAL_NIM_CACHE:/opt/nim/.cache" \ --env NGC_API_KEY=$NGC_API_KEY \ llama.sif
Once the API service has started, you will see lines like below:
INFO 2024-10-17 18:26:48.395 server.py:82] Started server process [394400] INFO 2024-10-17 18:26:48.395 on.py:48] Waiting for application startup. INFO 2024-10-17 18:26:48.409 on.py:62] Application startup complete. INFO 2024-10-17 18:26:48.411 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO 2024-10-17 18:27:10.522 httptools_impl.py:481] 127.0.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200
-
Typically, one would forward the port using
ssh
to access the service onlocalhost:8000
or similar, but that is not feasible in an HPC environment. One can do batch processing by creating another session on the job noden01
(log into Marlowe, and thenssh n01
) to hit the API service.
Below is the result of a test API call using curl
on n01
where we ask for a limerick about Marlowe:
curl -X 'POST' \
'http://localhost:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [{"role":"user", "content":"Write a limerick about Marlowe GPU Cluster (31 DGXs)"}],
"max_tokens": 64
}'
JSON Output:
{"id":"chat-b2f3244d98b647caaa0d32ca54ed57db","object":"chat.completion","created":1729214830,"model":"meta/llama-3.1-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"There once was a cluster so fine,\nMarlowe GPU's, with power divine,\nThirty-one DGXs to play,\n Made for work in a major way,\nApplied Math's problems did align."},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":28,"total_tokens":69,"completion_tokens":41}}
which contains:
There once was a cluster so fine,
Marlowe GPU's, with power divine,
Thirty-one DGXs to play,
Made for work in a major way,
Applied Math's problems did align.