Configure local GPU-accelerated LLMs using Ollama for offline AI analysis.
This tutorial guides you through setting up Ollama, a local LLM runtime, to run AI-powered vulnerability analysis entirely on your infrastructure. This approach eliminates external API dependencies, ensures data privacy, reduces costs for high-volume scanning, and enables offline operation. We'll cover GPU requirements, NVIDIA driver installation, Ollama setup, and reNgine integration.
Select a cloud instance with suitable GPU capabilities; larger models require more VRAM.

AWS GPU instances:
| Instance Type | GPU | VRAM | Best For |
|---|---|---|---|
| g4dn.xlarge | NVIDIA T4 | 16GB | Small models (7B-13B) |
| g4dn.2xlarge | NVIDIA T4 | 16GB | Medium models (13B-20B) |
| g5.xlarge | NVIDIA A10G | 24GB | Large models (30B-70B) |
| p3.2xlarge | NVIDIA V100 | 16GB | High performance needs |
Azure GPU instances:

| Instance Type | GPU | VRAM | Best For |
|---|---|---|---|
| NC4as_T4_v3 | NVIDIA T4 | 16GB | Small to medium models |
| NC6s_v3 | NVIDIA V100 | 16GB | Production workloads |
| NC24ads_A100_v4 | NVIDIA A100 | 80GB | Largest models (70B+) |
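As a rough rule of thumb (an approximation, not a vendor figure), a 4-bit quantized model needs about 0.6 GB of VRAM per billion parameters, plus a few GB for the context window and runtime overhead. The sketch below is a hypothetical helper for sanity-checking a model size against an instance's VRAM before you commit to it:

```bash
# Hypothetical helper: rough VRAM estimate (GB) for a 4-bit quantized model
# Usage: estimate_vram <billions_of_parameters>
estimate_vram() {
  # ~0.6 GB per billion parameters at 4-bit quantization, plus ~3 GB
  # for KV cache / runtime overhead (rule of thumb, not an exact figure)
  awk -v p="$1" 'BEGIN { printf "%.1f GB\n", p * 0.6 + 3 }'
}

estimate_vram 13   # ~10.8 GB -> fits a 16GB T4
estimate_vram 70   # ~45.0 GB -> needs an A100 80GB (or multi-GPU / CPU offload)
```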
Install NVIDIA GPU drivers and CUDA toolkit to enable GPU acceleration.
# SSH into your GPU instance
ssh user@your-rengine-server
# Update system packages
sudo apt update && sudo apt upgrade -y
# Check if GPU is detected
lspci | grep -i nvidia
# You should see output like:
# 00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
# Install NVIDIA driver
sudo apt install -y nvidia-driver-535
# Or use CUDA toolkit installer (includes drivers)
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda_12.3.0_545.23.06_linux.run
sudo sh cuda_12.3.0_545.23.06_linux.run
# Follow installer prompts:
# - Accept license
# - Install NVIDIA driver: YES
# - Install CUDA toolkit: YES
# - Install samples: Optional
# Reboot to load drivers
sudo reboot
After reboot, verify GPU is accessible:
# Check NVIDIA driver installation
nvidia-smi
# Expected output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3 |
# |-------------------------------+----------------------+----------------------+
# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
# |===============================+======================+======================|
# | 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
# | N/A 34C P0 27W / 70W | 0MiB / 15360MiB | 0% Default |
# +-------------------------------+----------------------+----------------------+
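If you used the CUDA runfile installer, you can also confirm the toolkit version; the path below assumes the default install location, so adjust it if you installed CUDA elsewhere:

```bash
# The runfile installer does not add CUDA binaries to PATH by default
export PATH=/usr/local/cuda/bin:$PATH

# Toolkit/compiler version should be compatible with the driver's reported CUDA version
nvcc --version
```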
Ollama can run natively or inside Docker; to use the containerized option (and to give other containers GPU access), install Docker with GPU support via the NVIDIA Container Toolkit.
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Add current user to docker group (log out and back in, or run 'newgrp docker', to apply)
sudo usermod -aG docker $USER
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test GPU access in Docker
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
# You should see nvidia-smi output from inside the container
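Optionally, double-check that `nvidia-ctk` actually registered the runtime; on most systems it writes the entry to `/etc/docker/daemon.json`:

```bash
# Should contain a "runtimes" entry pointing at nvidia-container-runtime
cat /etc/docker/daemon.json

# The nvidia runtime should also appear in Docker's runtime list
docker info | grep -i runtime
```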
Install Ollama to run LLM models locally with GPU acceleration.
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Or using Docker
docker run -d --gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# Verify Ollama is running
curl http://localhost:11434/api/version
# Response:
# {"version":"0.1.17"}
Ollama is now running and ready to download models.
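One networking note before you wire up reNgine: if you installed Ollama natively with the install script while reNgine itself runs in Docker on the same host, the reNgine containers cannot reach the host's `localhost:11434`. A sketch of one workaround (assuming the systemd service created by the install script) is to bind Ollama to all interfaces and point reNgine at the host's IP instead:

```bash
# Make the native Ollama service listen on all interfaces (systemd-based install)
sudo systemctl edit ollama
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama

# Verify it is reachable on the host address (replace <host-ip> with your server's IP)
curl http://<host-ip>:11434/api/version
```

If you run Ollama in Docker instead, attaching it to the same Docker network as the reNgine containers and using the container name (for example `http://ollama:11434`) achieves the same result.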
Choose and download an appropriate model for vulnerability analysis.
| Model | Size | VRAM Required | Quality |
|---|---|---|---|
| llama2:7b | 3.8GB | 8GB | Good for basic analysis |
| llama2:13b | 7.4GB | 16GB | Better reasoning |
| mixtral:8x7b | 26GB | 24GB+ | Excellent analysis |
| codellama:13b | 7.4GB | 16GB | Code-focused analysis |
| mistral:7b | 4.1GB | 8GB | Fast and capable |
# Download a model (this may take 5-15 minutes depending on model size)
ollama pull llama2:13b
# Or using Docker
docker exec -it ollama ollama pull llama2:13b
# Test the model
ollama run llama2:13b
# Interactive prompt will appear - test it:
>>> Analyze this vulnerability: SQL injection in login form parameter 'username'
# Model will respond with analysis
# Exit with: /bye
# List installed models
ollama list
# Expected output:
# NAME ID SIZE MODIFIED
# llama2:13b d5611f7b5f14 7.4 GB 2 minutes ago
Monitor GPU usage during model inference with `watch -n 1 nvidia-smi`.
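Loading a 13B model into VRAM can take a while on first use, and Ollama unloads idle models after a few minutes by default. To avoid that cold-start delay on the first scan, you can preload the model and extend how long it stays resident with the API's `keep_alive` parameter:

```bash
# Preload the model (empty request) and keep it in VRAM for 60 minutes of idle time
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "keep_alive": "60m"
}'
```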
Update reNgine configuration to use local Ollama instead of OpenAI.
# SSH into reNgine server
cd /opt/rengine
# Edit environment configuration
nano .env
# Add/update these variables:
AI_ENABLED=true
AI_PROVIDER=ollama
OLLAMA_API_URL=http://localhost:11434
OLLAMA_MODEL=llama2:13b
OLLAMA_TIMEOUT=120 # Seconds to wait for response
OLLAMA_NUM_PREDICT=2048 # Max tokens to generate
# Disable OpenAI (if previously configured)
OPENAI_ENABLED=false
# Save and exit (Ctrl+X, Y, Enter)
# Restart reNgine services
docker-compose restart web celery
# Check logs for successful connection
docker-compose logs -f celery | grep -i ollama
reNgine will now use your local Ollama installation for all AI analysis.
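Before kicking off a scan, it is worth confirming that the worker container can actually reach the endpoint you set in `OLLAMA_API_URL` (a quick sketch; it assumes `curl` is available inside the container and uses the `rengine-celery` container name from later in this tutorial):

```bash
# From inside the Celery worker, hit the configured Ollama endpoint
docker exec -it rengine-celery curl -s http://localhost:11434/api/version

# "connection refused" here usually means localhost inside the container is not the
# host -- point OLLAMA_API_URL at the host's IP or a shared Docker network name instead
```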
Verify that vulnerability analysis works with your local LLM.
# Test Ollama API directly
curl -X POST http://localhost:11434/api/generate -d '{
"model": "llama2:13b",
"prompt": "Analyze this SQL injection vulnerability and provide remediation steps.",
"stream": false
}'
# Test via reNgine API
curl -X POST http://localhost:8000/api/vulnerabilities/1/analyze/ \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"use_ai": true}'
# Monitor GPU usage during analysis
watch -n 1 nvidia-smi
# Check GPU utilization
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total \
--format=csv -l 1
# Monitor Ollama logs
docker logs -f ollama
# Check reNgine Celery worker processing
docker exec -it rengine-celery celery -A reNgine inspect active
# Benchmark inference speed
time ollama run llama2:13b "Analyze this XSS vulnerability"
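For a more quantitative benchmark, the `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), which give you tokens per second; the sketch below assumes `jq` is installed:

```bash
# Tokens per second = eval_count / (eval_duration in seconds)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Summarize the impact of a reflected XSS vulnerability.",
  "stream": false
}' | jq '.eval_count / (.eval_duration / 1e9)'
```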
Optimize model performance for faster vulnerability analysis:
# Save the following as a file named 'Modelfile' (custom configuration)
FROM llama2:13b
# Set GPU layers (higher = more GPU usage = faster)
PARAMETER num_gpu 99
# Temperature (0-1, lower = more focused)
PARAMETER temperature 0.3
# Context window size
PARAMETER num_ctx 4096
# Stop sequences
PARAMETER stop ""
PARAMETER stop "User:"
# Custom system prompt for security analysis
SYSTEM You are an expert security researcher analyzing vulnerabilities in web applications...
# Create custom model from Modelfile
ollama create rengine-security -f Modelfile
# Use the custom model
ollama run rengine-security
# Update reNgine to use custom model
# In .env: OLLAMA_MODEL=rengine-security
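To confirm the parameters and system prompt were baked into the custom model, print its stored Modelfile back out and run a quick one-off prompt:

```bash
# Show the Modelfile Ollama stored for the custom model
ollama show rengine-security --modelfile

# Non-interactive test prompt
ollama run rengine-security "Analyze this vulnerability: reflected XSS in a search parameter"
```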
| Use Case | Recommended Model | Reasoning |
|---|---|---|
| Budget/Small GPU | mistral:7b or llama2:7b | Low VRAM requirement, fast |
| Balanced Performance | llama2:13b | Good quality/speed ratio |
| Code Analysis | codellama:13b | Trained on code |
| Best Quality | mixtral:8x7b | Excellent reasoning |
| High Volume | mistral:7b | Fast inference |
Local LLMs with Ollama eliminate per-request API costs but require GPU infrastructure:
| Solution | Setup | Monthly Cost |
|---|---|---|
| OpenAI GPT-4 | API only | $2,000-3,000 |
| OpenAI GPT-3.5 | API only | $100-200 |
| Ollama (g4dn.xlarge) | AWS EC2 + T4 GPU | $380 |
| Ollama (g5.xlarge) | AWS EC2 + A10G GPU | $800 |
Break-even analysis: at these prices, the GPU instance pays for itself against GPT-4 at roughly 200 scans/month; against the much cheaper GPT-3.5, considerably higher scan volumes are needed before the fixed GPU cost wins out. The exact break-even point depends on scan complexity and token usage.
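The break-even point is simply the fixed monthly GPU cost divided by what one scan would cost in API tokens. The per-scan figure below is a hypothetical value chosen to be consistent with the ~200 scans/month estimate above, not a measured price:

```bash
# break_even_scans = monthly_gpu_cost / api_cost_per_scan (illustrative numbers only)
awk 'BEGIN {
  gpu_monthly   = 380    # g4dn.xlarge, USD/month (from the table above)
  gpt4_per_scan = 1.90   # hypothetical GPT-4 token cost per scan
  printf "Break-even vs GPT-4: ~%.0f scans/month\n", gpu_monthly / gpt4_per_scan
}'
```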
GPU Not Detected:

- Run `nvidia-smi` - if it errors, reinstall the drivers
- Test GPU access from Docker: `docker run --gpus all nvidia/cuda:12.3.0-base nvidia-smi`

Slow Inference:

- Check GPU utilization with `nvidia-smi` during inference
- Set the `num_gpu` parameter higher in the Modelfile

Out of Memory Errors:

- Reduce `num_ctx` (context window) in the configuration
- Check available VRAM with `nvidia-smi`

Run different models for different types of analysis:
# Download multiple models
ollama pull mistral:7b # Fast, for low/medium severity
ollama pull llama2:13b # Balanced, for high severity
ollama pull codellama:13b # Code-focused, for SAST findings
# Configure reNgine to use different models by severity
# In reNgine settings:
AI_MODEL_CRITICAL=llama2:13b
AI_MODEL_HIGH=llama2:13b
AI_MODEL_MEDIUM=mistral:7b
AI_MODEL_LOW=mistral:7b
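A quick sanity check that every model referenced in the per-severity settings is actually installed locally (a small sketch; the model list mirrors the example settings above):

```bash
# Verify each configured model exists in the local Ollama library
for model in llama2:13b mistral:7b codellama:13b; do
  if ollama list | grep -q "$model"; then
    echo "OK:      $model"
  else
    echo "MISSING: $model (run: ollama pull $model)"
  fi
done
```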
Learn about OpenAI integration and AI-powered vulnerability prioritization.