Building a GenAI RAG Stack with NVIDIA NIM and LLaMA 3-8B on a Single Node

Deploying NVIDIA NIM and RAG Playground

Deploying Retrieval-Augmented Generation (RAG) applications has become simpler thanks to NVIDIA's NIM inference microservices and the RAG Blueprint. In this post, I'll walk you through how I deployed the LLaMA 3-8B language model using NIMs and connected it with the RAG Playground using Docker Compose — all on a single node.

Architecture Overview

We’re deploying two main components:

  • NVIDIA NIMs: Containerized inference microservices.
  • RAG Playground: A sample app from the NVIDIA RAG Blueprint.

System Prerequisites

  • Ubuntu 22.04+ node with an NVIDIA GPU
  • An NGC account and API key; Docker, Docker Compose, and the NVIDIA Container Toolkit
  • Internet access to pull containers from nvcr.io (authenticate first; see the snippet below)
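
Before pulling any NIM images, authenticate Docker against nvcr.io with your NGC API key. The username is the literal string $oauthtoken:

# Export your NGC API key (replace the placeholder with your own key)
export NGC_API_KEY=<your-ngc-api-key>

# Log Docker in to NVIDIA's container registry
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin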

Step 1: Deploy LLaMA 3-8B (or Any Other Model) Using NIMs

docker run -d --gpus all -e NGC_API_KEY \
  -p 8000:8000 --name nim-llm \
  nvcr.io/nim/nvidia/llama-3-8b-instruct-4bit-awq:1.3.0
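
Once the container is up, the NIM exposes an OpenAI-compatible API on port 8000. A minimal smoke test (the model name in the second call is an assumption; use whatever the first call returns):

# List the model name this NIM is serving
curl -s http://localhost:8000/v1/models

# Minimal chat completion (model name below is assumed from the image name)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/llama-3-8b-instruct-4bit-awq", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'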

Step 2: Deploy Embedding & Reranking Microservices

docker run -d --gpus all \
  -e NGC_API_KEY \
  -p 9080:8000 \
  --name nemo-retriever-embedding-microservice \
  nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0

docker run -d --gpus all \
  -e NGC_API_KEY \
  -p 1976:8000 \
  --name nemo-retriever-ranking-microservice \
  nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.3.0
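
Both retriever NIMs expose a readiness endpoint, so you can confirm they're up before wiring them into the playground. The embedding call is a sketch; the model field is assumed from the image name:

# Wait until both services report ready
curl -s http://localhost:9080/v1/health/ready
curl -s http://localhost:1976/v1/health/ready

# Embed a test query (model name is an assumption; check /v1/models for the exact value)
curl -s http://localhost:9080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["What is NVIDIA NIM?"], "model": "nvidia/llama-3.2-nv-embedqa-1b-v2", "input_type": "query"}'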

Step 3: Deploy RAG Playground

git clone https://github.com/NVIDIA-AI-Blueprints/rag.git
cd rag
git checkout v1.0.0
cd deploy/compose

Create the .env file dynamically (the unquoted EOF delimiter lets the shell expand $HOST_IP):

export HOST_IP=$(hostname -I | awk '{print $1}')

cat > .env <<EOF
EMBEDDING_MS_BASE=http://$HOST_IP:9080
RANKING_MS_BASE=http://$HOST_IP:1976
LLM_MS_BASE=http://$HOST_IP:8000
EOF

Start services:

docker compose --env-file .env up -d
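
Before opening the UI, confirm every container came up cleanly. Service names vary by blueprint release, so the log target below is illustrative:

# All services should be running (and healthy, where a healthcheck is defined)
docker compose ps

# Tail a specific service if something looks wrong (service name is an assumption)
docker compose logs -f rag-playground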

Step 4: Validate Setup

Visit the RAG Playground UI in your browser at: http://<YOUR_NODE_IP>:8090
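
If the page doesn't load, a quick HTTP check from the node itself helps separate a service failure from a network or firewall issue:

# Expect an HTTP 200 response from the playground frontend
curl -I http://$HOST_IP:8090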

Bonus Tips

  • Use --gpus "device=N" in Docker to pin workloads to specific GPUs (see the example below).
  • Run nvidia-smi to monitor utilization and memory.
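
For example, on a multi-GPU node you could keep the LLM on GPU 0 and pin the embedding service to GPU 1. This is a sketch; adjust device indices to your hardware:

# Pin the embedding NIM to GPU 1 instead of claiming all GPUs
docker run -d --gpus "device=1" \
  -e NGC_API_KEY \
  -p 9080:8000 \
  --name nemo-retriever-embedding-microservice \
  nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0

# Watch per-GPU utilization and memory at one-second intervals
watch -n 1 nvidia-smi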

Final Thoughts

I have created an Ansible playbook project to configure NVIDIA AI Enterprise and RAG. NVIDIA's NIM services provide a powerful way to run LLMs in production. Combined with the RAG Blueprint, this makes a production-grade GenAI setup that is ideal for internal POCs and prototyping.

Links

  • NVIDIA Blog for RAG
  • NVIDIA AI Enterprise
