reNgine Tutorial

GPU Setup for Local LLM (Ollama)

Configure local GPU-accelerated LLMs using Ollama for offline AI analysis.

Overview

This tutorial guides you through setting up Ollama, a local LLM runtime, to run AI-powered vulnerability analysis entirely on your infrastructure. This approach eliminates external API dependencies, ensures data privacy, reduces costs for high-volume scanning, and enables offline operation. We'll cover GPU requirements, NVIDIA driver installation, Ollama setup, and reNgine integration.

Prerequisites

  • reNgine Cloud instance (AWS/Azure) with GPU-enabled instance type
  • NVIDIA GPU with 8GB+ VRAM (16GB+ recommended)
  • Ubuntu 20.04/22.04 or similar Linux distribution
  • Root or sudo access to the server
  • At least 50GB free disk space for models

What You'll Learn

  • GPU requirements and instance selection
  • Installing NVIDIA drivers and CUDA toolkit
  • Setting up Ollama for local LLM inference
  • Downloading and running LLM models
  • Configuring reNgine to use Ollama
  • Model selection and performance tuning

Step 1: Choose GPU-Enabled Instance

Select a cloud instance with suitable GPU capabilities. Larger models require more VRAM.

Recommended AWS Instances

Instance Type | GPU         | VRAM | Best For
g4dn.xlarge   | NVIDIA T4   | 16GB | Small models (7B-13B)
g4dn.2xlarge  | NVIDIA T4   | 16GB | Medium models (13B-20B)
g5.xlarge     | NVIDIA A10G | 24GB | Large models (30B-70B)
p3.2xlarge    | NVIDIA V100 | 16GB | High performance needs

Azure GPU Instances

Instance Type   | GPU         | VRAM | Best For
NC4as_T4_v3     | NVIDIA T4   | 16GB | Small to medium models
NC6s_v3         | NVIDIA V100 | 16GB | Production workloads
NC24ads_A100_v4 | NVIDIA A100 | 80GB | Largest models (70B+)
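
As a rough rule of thumb, a 4-bit-quantized model needs about half a gigabyte of VRAM per billion parameters, plus runtime overhead; the tables above recommend extra headroom beyond that minimum. A quick back-of-the-envelope check, using illustrative numbers:

# Rough minimum-VRAM estimate for a 4-bit quantized model (illustrative, not exact)
PARAMS_B=13           # model size in billions of parameters (e.g. a 13B model)
BYTES_PER_PARAM=0.5   # ~0.5 bytes per parameter at 4-bit quantization
OVERHEAD_GB=2         # KV cache and runtime overhead (grows with context size)
echo "$PARAMS_B $BYTES_PER_PARAM $OVERHEAD_GB" | \
  awk '{printf "Minimum VRAM: ~%.1f GB (pick an instance with headroom)\n", $1*$2+$3}'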

Step 2: Install NVIDIA Drivers

Install NVIDIA GPU drivers and CUDA toolkit to enable GPU acceleration.

# SSH into your GPU instance
ssh user@your-rengine-server

# Update system packages
sudo apt update && sudo apt upgrade -y

# Check if GPU is detected
lspci | grep -i nvidia

# You should see output like:
# 00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

# Install NVIDIA driver
sudo apt install -y nvidia-driver-535

# Or use CUDA toolkit installer (includes drivers)
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda_12.3.0_545.23.06_linux.run
sudo sh cuda_12.3.0_545.23.06_linux.run

# Follow installer prompts:
# - Accept license
# - Install NVIDIA driver: YES
# - Install CUDA toolkit: YES
# - Install samples: Optional

# Reboot to load drivers
sudo reboot

After reboot, verify GPU is accessible:

# Check NVIDIA driver installation
nvidia-smi

# Expected output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 545.23.06    Driver Version: 545.23.06    CUDA Version: 12.3     |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
# |===============================+======================+======================|
# |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
# | N/A   34C    P0    27W /  70W |      0MiB / 15360MiB |      0%      Default |
# +-------------------------------+----------------------+----------------------+
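
For provisioning scripts, the same information is available in machine-readable form:

# Script-friendly driver check
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv,noheader
# Example output: 545.23.06, Tesla T4, 15360 MiB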

Step 3: Install Docker with NVIDIA Container Toolkit

If you run Ollama in Docker (one of the two install options in Step 4), Docker needs GPU support via the NVIDIA Container Toolkit.

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Add current user to docker group (log out and back in, or run 'newgrp docker', for this to take effect)
sudo usermod -aG docker $USER

# Install NVIDIA Container Toolkit
# (the older nvidia-docker repository and apt-key method are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit

# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access in Docker
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi

# You should see nvidia-smi output from inside the container
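
You can also confirm that the NVIDIA runtime was registered with the Docker daemon:

# Confirm the NVIDIA runtime is registered
docker info | grep -i runtime
# The output should list "nvidia" among the available runtimes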

Step 4: Install Ollama

Install Ollama to run LLM models locally with GPU acceleration.

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Or using Docker
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Verify Ollama is running
curl http://localhost:11434/api/version

# Example response (your version will differ):
# {"version":"0.1.17"}

Ollama is now running and ready to download models.
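
Before pulling anything, you can confirm the API is responsive and list the models available locally (empty on a fresh install):

# List locally available models via the API
curl -s http://localhost:11434/api/tags
# {"models":[]}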

Step 5: Download and Test LLM Models

Choose and download an appropriate model for vulnerability analysis.

Recommended Models for Security Analysis

Model         | Size  | VRAM Required | Quality
llama2:7b     | 3.8GB | 8GB           | Good for basic analysis
llama2:13b    | 7.4GB | 16GB          | Better reasoning
mixtral:8x7b  | 26GB  | 24GB+         | Excellent analysis
codellama:13b | 7.4GB | 16GB          | Code-focused analysis
mistral:7b    | 4.1GB | 8GB           | Fast and capable

# Download a model (this may take 5-15 minutes depending on model size)
ollama pull llama2:13b

# Or using Docker
docker exec -it ollama ollama pull llama2:13b

# Test the model
ollama run llama2:13b

# Interactive prompt will appear - test it:
>>> Analyze this vulnerability: SQL injection in login form parameter 'username'

# Model will respond with analysis

# Exit with: /bye

# List installed models
ollama list

# Expected output:
# NAME            ID              SIZE    MODIFIED
# llama2:13b      d5611f7b5f14    7.4 GB  2 minutes ago
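
For unattended provisioning, models can also be pulled through the HTTP API instead of the CLI; Ollama streams download progress as JSON lines:

# Pull a model via the API (useful in provisioning scripts)
curl -X POST http://localhost:11434/api/pull -d '{"name": "llama2:13b"}'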

Monitor GPU usage during model inference with watch -n 1 nvidia-smi

Step 6: Configure reNgine to Use Ollama

Update reNgine configuration to use local Ollama instead of OpenAI.

# SSH into reNgine server
cd /opt/rengine

# Edit environment configuration
nano .env

# Add/update these variables:
AI_ENABLED=true
AI_PROVIDER=ollama
OLLAMA_API_URL=http://localhost:11434
OLLAMA_MODEL=llama2:13b
# Seconds to wait for a response
OLLAMA_TIMEOUT=120
# Max tokens to generate
OLLAMA_NUM_PREDICT=2048

# Disable OpenAI (if previously configured)
OPENAI_ENABLED=false

# Save and exit (Ctrl+X, Y, Enter)

# Restart reNgine services
docker-compose restart web celery

# Check logs for successful connection
docker-compose logs -f celery | grep -i ollama

reNgine will now use your local Ollama installation for all AI analysis.
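
One common pitfall: if Ollama runs on the host while reNgine runs in Docker, localhost inside the reNgine containers does not reach the host's Ollama. A quick connectivity check from inside a container (the container name rengine-web is an assumption; adjust both it and OLLAMA_API_URL to whatever address works for your deployment):

# Verify the reNgine container can reach Ollama (assumes curl exists in the container)
docker exec -it rengine-web curl -s http://host.docker.internal:11434/api/version
# On Linux without a host-gateway mapping, try the Docker bridge IP instead,
# e.g. http://172.17.0.1:11434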

Step 7: Test AI Analysis with Ollama

Verify that vulnerability analysis works with your local LLM.

# Test Ollama API directly
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Analyze this SQL injection vulnerability and provide remediation steps.",
  "stream": false
}'

# Test via reNgine API
curl -X POST http://localhost:8000/api/vulnerabilities/1/analyze/ \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"use_ai": true}'

# Monitor GPU usage during analysis
watch -n 1 nvidia-smi
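
The raw API response is JSON; if jq is installed, you can extract just the generated text for quick spot checks:

# Extract only the generated text from the API response (requires jq)
curl -s -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Summarize the impact of a reflected XSS in one paragraph.",
  "stream": false
}' | jq -r '.response'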

Performance Monitoring

# Check GPU utilization
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total \
  --format=csv -l 1

# Monitor Ollama logs
docker logs -f ollama

# Check reNgine Celery worker processing
docker exec -it rengine-celery celery -A reNgine inspect active

# Benchmark inference speed
time ollama run llama2:13b "Analyze this XSS vulnerability"
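
To compare candidate models before committing to one, a simple wall-clock loop over the models you have installed works well (the model names here assume the pulls from Step 5):

# Compare inference latency across installed models
for m in mistral:7b llama2:13b; do
  echo "== $m =="
  time ollama run "$m" "In one sentence, what is SQL injection?" > /dev/null
done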

Performance Optimization

Optimize model performance for faster vulnerability analysis:

Ollama Configuration Options

# Save the following as 'Modelfile' to customize the model
FROM llama2:13b

# Layers to offload to the GPU (99 = offload all layers)
PARAMETER num_gpu 99

# Temperature (0-1, lower = more focused)
PARAMETER temperature 0.3

# Context window size
PARAMETER num_ctx 4096

# Stop sequence to end generation
PARAMETER stop "User:"

# Custom system prompt for security analysis
SYSTEM You are an expert security researcher analyzing vulnerabilities in web applications...

# Create custom model from Modelfile
ollama create rengine-security -f Modelfile

# Use the custom model
ollama run rengine-security

# Update reNgine to use custom model
# In .env: OLLAMA_MODEL=rengine-security
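
To confirm your parameters were applied, print the Modelfile back out of Ollama's local registry:

# Inspect the custom model's effective Modelfile
ollama show rengine-security --modelfile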

Model Selection Guide

Choosing the Right Model

Use Case             | Recommended Model       | Reasoning
Budget/Small GPU     | mistral:7b or llama2:7b | Low VRAM requirement, fast
Balanced Performance | llama2:13b              | Good quality/speed ratio
Code Analysis        | codellama:13b           | Trained on code
Best Quality         | mixtral:8x7b            | Excellent reasoning
High Volume          | mistral:7b              | Fast inference

Cost Comparison: Ollama vs OpenAI

Local LLMs with Ollama eliminate per-request API costs but require GPU infrastructure:

Monthly Cost Analysis (1000 scans/month)

Solution             | Setup              | Monthly Cost
OpenAI GPT-4         | API only           | $2,000-3,000
OpenAI GPT-3.5       | API only           | $100-200
Ollama (g4dn.xlarge) | AWS EC2 + T4 GPU   | $380
Ollama (g5.xlarge)   | AWS EC2 + A10G GPU | $800

Break-even analysis: against GPT-4, Ollama becomes cost-effective at roughly 200 scans/month; against GPT-3.5's much lower per-scan cost, the break-even is closer to 2,000+ scans/month. Exact figures depend on scan complexity and token usage.
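
The GPT-4 break-even follows directly from the table above: a fixed monthly GPU cost divided by an assumed per-scan API cost (the $2.00/scan figure below is derived from the table's $2,000 per 1,000 scans):

# Back-of-the-envelope break-even: fixed GPU cost vs per-scan API cost
GPU_MONTHLY=380    # g4dn.xlarge, USD/month (from the table above)
API_PER_SCAN=2.00  # assumed GPT-4 cost per scan
echo "$GPU_MONTHLY $API_PER_SCAN" | awk '{printf "Break-even: ~%d scans/month\n", $1/$2}'
# Break-even: ~190 scans/month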

Troubleshooting

Common Issues

GPU Not Detected:

  • Run nvidia-smi; if it errors, reinstall the drivers
  • Check Docker has GPU access: docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
  • Verify NVIDIA Container Toolkit is installed
  • Restart Docker daemon after driver installation

Slow Inference:

  • Model may be running on CPU - check nvidia-smi during inference
  • Reduce model size (use 7B instead of 13B)
  • Increase GPU instance size
  • Set num_gpu parameter higher in Modelfile

Out of Memory Errors:

  • Model too large for available VRAM
  • Use smaller model or larger GPU instance
  • Reduce num_ctx (context window) in configuration
  • Monitor memory with nvidia-smi (see the per-process query below)
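
To see exactly which process is holding VRAM, and how much, query per-process GPU memory:

# Show per-process GPU memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv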

Advanced: Multiple Models

Run different models for different types of analysis:

# Download multiple models
ollama pull mistral:7b         # Fast, for low/medium severity
ollama pull llama2:13b         # Balanced, for high severity
ollama pull codellama:13b      # Code-focused, for SAST findings

# Configure reNgine to use different models by severity
# In reNgine settings:
AI_MODEL_CRITICAL=llama2:13b
AI_MODEL_HIGH=llama2:13b
AI_MODEL_MEDIUM=mistral:7b
AI_MODEL_LOW=mistral:7b

Next Steps

Configure AI Analysis

Learn about OpenAI integration and AI-powered vulnerability prioritization.

View Tutorial →

Run Your First Scan

Execute comprehensive reconnaissance scans with reNgine Cloud.

View Tutorial →

Need Help?

Having trouble with GPU or Ollama setup? Our support team can assist.

Contact Support