Building a GenAI RAG Stack with NVIDIA NIM and LLaMA 3-8B on a Single Node

Deploying NVIDIA NIM and RAG Playground

Deploying Retrieval-Augmented Generation (RAG) applications has become simpler thanks to NVIDIA's NIM inference microservices and the RAG Blueprint. In this post, I'll walk you through how I deployed the LLaMA 3-8B language model using NIMs and connected it with the RAG Playground using Docker Compose — all on a single node.

Architecture Overview

We’re deploying two main components:

  • NVIDIA NIMs: Containerized inference microservices.
  • RAG Playground: A sample app from the NVIDIA RAG Blueprint.

System Prerequisites

  • Ubuntu 22.04+ node with an NVIDIA GPU
  • An NGC account and API key; Docker, Docker Compose, and the NVIDIA Container Toolkit
  • Internet access to pull containers from nvcr.io (authenticate first; see the snippet below)
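
Before pulling any NIM images, authenticate Docker against nvcr.io with your NGC API key. The username is the literal string $oauthtoken:

# Export your NGC API key (replace the placeholder with your own key)
export NGC_API_KEY=<your-ngc-api-key>

# Log Docker in to NVIDIA's container registry
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin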

Step 1: Deploy LLaMA 3-8B (or Any Other Model) Using NIMs

docker run -d --gpus all -e NGC_API_KEY \
  -p 8000:8000 --name nim-llm \
  nvcr.io/nim/nvidia/llama-3-8b-instruct-4bit-awq:1.3.0
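
Once the container is up, the NIM exposes an OpenAI-compatible API on port 8000. A minimal smoke test (the model name in the second call is an assumption; use whatever the first call returns):

# List the model name this NIM is serving
curl -s http://localhost:8000/v1/models

# Minimal chat completion (model name below is assumed from the image name)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/llama-3-8b-instruct-4bit-awq", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'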

Step 2: Deploy Embedding & Reranking Microservices

docker run -d --gpus all \
  -e NGC_API_KEY \
  -p 9080:8000 \
  --name nemo-retriever-embedding-microservice \
  nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0

docker run -d --gpus all \
  -e NGC_API_KEY \
  -p 1976:8000 \
  --name nemo-retriever-ranking-microservice \
  nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.3.0
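
Both retriever NIMs expose a readiness endpoint, so you can confirm they're up before wiring them into the playground. The embedding call is a sketch; the model field is assumed from the image name:

# Wait until both services report ready
curl -s http://localhost:9080/v1/health/ready
curl -s http://localhost:1976/v1/health/ready

# Embed a test query (model name is an assumption; check /v1/models for the exact value)
curl -s http://localhost:9080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["What is NVIDIA NIM?"], "model": "nvidia/llama-3.2-nv-embedqa-1b-v2", "input_type": "query"}'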

Step 3: Deploy RAG Playground

git clone https://github.com/NVIDIA-AI-Blueprints/rag.git
cd rag
git checkout v1.0.0
cd deploy/compose

Create the .env file dynamically (the unquoted EOF delimiter lets the shell expand $HOST_IP):

export HOST_IP=$(hostname -I | awk '{print $1}')

cat > .env <<EOF
EMBEDDING_MS_BASE=http://$HOST_IP:9080
RANKING_MS_BASE=http://$HOST_IP:1976
LLM_MS_BASE=http://$HOST_IP:8000
EOF

Start services:

docker compose --env-file .env up -d
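
Before opening the UI, confirm every container came up cleanly. Service names vary by blueprint release, so the log target below is illustrative:

# All services should be running (and healthy, where a healthcheck is defined)
docker compose ps

# Tail a specific service if something looks wrong (service name is an assumption)
docker compose logs -f rag-playground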

Step 4: Validate Setup

Visit the RAG Playground UI in your browser at: http://<YOUR_NODE_IP>:8090
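
If the page doesn't load, a quick HTTP check from the node itself helps separate a service failure from a network or firewall issue:

# Expect an HTTP 200 response from the playground frontend
curl -I http://$HOST_IP:8090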

Bonus Tips

  • Use --gpus "device=N" in Docker to pin workloads to specific GPUs (see the example below).
  • Run nvidia-smi to monitor utilization and memory.
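
For example, on a multi-GPU node you could keep the LLM on GPU 0 and pin the embedding service to GPU 1. This is a sketch; adjust device indices to your hardware:

# Pin the embedding NIM to GPU 1 instead of claiming all GPUs
docker run -d --gpus "device=1" \
  -e NGC_API_KEY \
  -p 9080:8000 \
  --name nemo-retriever-embedding-microservice \
  nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0

# Watch per-GPU utilization and memory at one-second intervals
watch -n 1 nvidia-smi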

Final Thoughts

I have created an Ansible playbook project to configure NVIDIA AI Enterprise and RAG. NVIDIA's NIM services provide a powerful way to run LLMs in production. Combined with the RAG Blueprint, this makes a production-grade GenAI setup that is ideal for internal POCs and prototyping.

Links

  • NVIDIA Blog for RAG
  • NVIDIA AI Enterprise
