Configure local GPU-accelerated LLMs using Ollama for offline AI analysis.
This tutorial guides you through setting up Ollama, a local LLM runtime, to run AI-powered vulnerability analysis entirely on your infrastructure. This approach eliminates external API dependencies, ensures data privacy, reduces costs for high-volume scanning, and enables offline operation. We'll cover GPU requirements, NVIDIA driver installation, Ollama setup, and reNgine integration.
Select a cloud instance with suitable GPU capabilities; larger models require more VRAM.

AWS GPU instances:
| Instance Type | GPU | VRAM | Best For |
|---|---|---|---|
| g4dn.xlarge | NVIDIA T4 | 16GB | Small models (7B-13B) |
| g4dn.2xlarge | NVIDIA T4 | 16GB | Medium models (13B-20B) |
| g5.xlarge | NVIDIA A10G | 24GB | Large models (30B-70B) |
| p3.2xlarge | NVIDIA V100 | 16GB | High performance needs |
Azure GPU instances:

| Instance Type | GPU | VRAM | Best For |
|---|---|---|---|
| NC4as_T4_v3 | NVIDIA T4 | 16GB | Small to medium models |
| NC6s_v3 | NVIDIA V100 | 16GB | Production workloads |
| NC24ads_A100_v4 | NVIDIA A100 | 80GB | Largest models (70B+) |
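As a rough rule of thumb (an approximation, not a vendor figure), a 4-bit quantized model needs about 0.6 GB of VRAM per billion parameters, plus a few GB for the context window and runtime overhead. The sketch below is a hypothetical helper for sanity-checking a model size against an instance's VRAM before you commit to it:

```bash
# Hypothetical helper: rough VRAM estimate (GB) for a 4-bit quantized model
# Usage: estimate_vram <billions_of_parameters>
estimate_vram() {
  # ~0.6 GB per billion parameters at 4-bit quantization, plus ~3 GB
  # for KV cache / runtime overhead (rule of thumb, not an exact figure)
  awk -v p="$1" 'BEGIN { printf "%.1f GB\n", p * 0.6 + 3 }'
}

estimate_vram 13   # ~10.8 GB -> fits a 16GB T4
estimate_vram 70   # ~45.0 GB -> needs an A100 80GB (or multi-GPU / CPU offload)
```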
Install NVIDIA GPU drivers and CUDA toolkit to enable GPU acceleration.
# SSH into your GPU instance
ssh user@your-rengine-server
# Update system packages
sudo apt update && sudo apt upgrade -y
# Check if GPU is detected
lspci | grep -i nvidia
# You should see output like:
# 00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
# Install NVIDIA driver
sudo apt install -y nvidia-driver-535
# Or use CUDA toolkit installer (includes drivers)
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda_12.3.0_545.23.06_linux.run
sudo sh cuda_12.3.0_545.23.06_linux.run
# Follow installer prompts:
# - Accept license
# - Install NVIDIA driver: YES
# - Install CUDA toolkit: YES
# - Install samples: Optional
# Reboot to load drivers
sudo reboot
After reboot, verify GPU is accessible:
# Check NVIDIA driver installation
nvidia-smi
# Expected output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3 |
# |-------------------------------+----------------------+----------------------+
# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
# |===============================+======================+======================|
# | 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
# | N/A 34C P0 27W / 70W | 0MiB / 15360MiB | 0% Default |
# +-------------------------------+----------------------+----------------------+
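If you used the CUDA runfile installer, you can also confirm the toolkit version; the path below assumes the default install location, so adjust it if you installed CUDA elsewhere:

```bash
# The runfile installer does not add CUDA binaries to PATH by default
export PATH=/usr/local/cuda/bin:$PATH

# Toolkit/compiler version should be compatible with the driver's reported CUDA version
nvcc --version
```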
Ollama can run natively or inside Docker; to use the containerized option (and to give other containers GPU access), install Docker with GPU support via the NVIDIA Container Toolkit.
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Add current user to docker group (log out and back in, or run 'newgrp docker', to apply)
sudo usermod -aG docker $USER
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test GPU access in Docker
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
# You should see nvidia-smi output from inside the container
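Optionally, double-check that `nvidia-ctk` actually registered the runtime; on most systems it writes the entry to `/etc/docker/daemon.json`:

```bash
# Should contain a "runtimes" entry pointing at nvidia-container-runtime
cat /etc/docker/daemon.json

# The nvidia runtime should also appear in Docker's runtime list
docker info | grep -i runtime
```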
Install Ollama to run LLM models locally with GPU acceleration.
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Or using Docker
docker run -d --gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# Verify Ollama is running
curl http://localhost:11434/api/version
# Response:
# {"version":"0.1.17"}
Ollama is now running and ready to download models.
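One networking note before you wire up reNgine: if you installed Ollama natively with the install script while reNgine itself runs in Docker on the same host, the reNgine containers cannot reach the host's `localhost:11434`. A sketch of one workaround (assuming the systemd service created by the install script) is to bind Ollama to all interfaces and point reNgine at the host's IP instead:

```bash
# Make the native Ollama service listen on all interfaces (systemd-based install)
sudo systemctl edit ollama
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama

# Verify it is reachable on the host address (replace <host-ip> with your server's IP)
curl http://<host-ip>:11434/api/version
```

If you run Ollama in Docker instead, attaching it to the same Docker network as the reNgine containers and using the container name (for example `http://ollama:11434`) achieves the same result.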
Choose and download an appropriate model for vulnerability analysis.
| Model | Size | VRAM Required | Quality |
|---|---|---|---|
| llama2:7b | 3.8GB | 8GB | Good for basic analysis |
| llama2:13b | 7.4GB | 16GB | Better reasoning |
| mixtral:8x7b | 26GB | 24GB+ | Excellent analysis |
| codellama:13b | 7.4GB | 16GB | Code-focused analysis |
| mistral:7b | 4.1GB | 8GB | Fast and capable |
# Download a model (this may take 5-15 minutes depending on model size)
ollama pull llama2:13b
# Or using Docker
docker exec -it ollama ollama pull llama2:13b
# Test the model
ollama run llama2:13b
# Interactive prompt will appear - test it:
>>> Analyze this vulnerability: SQL injection in login form parameter 'username'
# Model will respond with analysis
# Exit with: /bye
# List installed models
ollama list
# Expected output:
# NAME ID SIZE MODIFIED
# llama2:13b d5611f7b5f14 7.4 GB 2 minutes ago
Monitor GPU usage during model inference with `watch -n 1 nvidia-smi`.
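Loading a 13B model into VRAM can take a while on first use, and Ollama unloads idle models after a few minutes by default. To avoid that cold-start delay on the first scan, you can preload the model and extend how long it stays resident with the API's `keep_alive` parameter:

```bash
# Preload the model (empty request) and keep it in VRAM for 60 minutes of idle time
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "keep_alive": "60m"
}'
```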
Update reNgine configuration to use local Ollama instead of OpenAI.
# SSH into reNgine server
cd /opt/rengine
# Edit environment configuration
nano .env
# Add/update these variables:
AI_ENABLED=true
AI_PROVIDER=ollama
OLLAMA_API_URL=http://localhost:11434
OLLAMA_MODEL=llama2:13b
OLLAMA_TIMEOUT=120 # Seconds to wait for response
OLLAMA_NUM_PREDICT=2048 # Max tokens to generate
# Disable OpenAI (if previously configured)
OPENAI_ENABLED=false
# Save and exit (Ctrl+X, Y, Enter)
# Restart reNgine services
docker-compose restart web celery
# Check logs for successful connection
docker-compose logs -f celery | grep -i ollama
reNgine will now use your local Ollama installation for all AI analysis.
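Before kicking off a scan, it is worth confirming that the worker container can actually reach the endpoint you set in `OLLAMA_API_URL` (a quick sketch; it assumes `curl` is available inside the container and uses the `rengine-celery` container name from later in this tutorial):

```bash
# From inside the Celery worker, hit the configured Ollama endpoint
docker exec -it rengine-celery curl -s http://localhost:11434/api/version

# "connection refused" here usually means localhost inside the container is not the
# host -- point OLLAMA_API_URL at the host's IP or a shared Docker network name instead
```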
Verify that vulnerability analysis works with your local LLM.
# Test Ollama API directly
curl -X POST http://localhost:11434/api/generate -d '{
"model": "llama2:13b",
"prompt": "Analyze this SQL injection vulnerability and provide remediation steps.",
"stream": false
}'
# Test via reNgine API
curl -X POST http://localhost:8000/api/vulnerabilities/1/analyze/ \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"use_ai": true}'
# Monitor GPU usage during analysis
watch -n 1 nvidia-smi
# Check GPU utilization
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total \
--format=csv -l 1
# Monitor Ollama logs
docker logs -f ollama
# Check reNgine Celery worker processing
docker exec -it rengine-celery celery -A reNgine inspect active
# Benchmark inference speed
time ollama run llama2:13b "Analyze this XSS vulnerability"
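For a more quantitative benchmark, the `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), which give you tokens per second; the sketch below assumes `jq` is installed:

```bash
# Tokens per second = eval_count / (eval_duration in seconds)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Summarize the impact of a reflected XSS vulnerability.",
  "stream": false
}' | jq '.eval_count / (.eval_duration / 1e9)'
```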
Optimize model performance for faster vulnerability analysis:
# Save the following as a file named 'Modelfile' (custom configuration)
FROM llama2:13b
# Set GPU layers (higher = more GPU usage = faster)
PARAMETER num_gpu 99
# Temperature (0-1, lower = more focused)
PARAMETER temperature 0.3
# Context window size
PARAMETER num_ctx 4096
# Stop sequences
PARAMETER stop ""
PARAMETER stop "User:"
# Custom system prompt for security analysis
SYSTEM You are an expert security researcher analyzing vulnerabilities in web applications...
# Create custom model from Modelfile
ollama create rengine-security -f Modelfile
# Use the custom model
ollama run rengine-security
# Update reNgine to use custom model
# In .env: OLLAMA_MODEL=rengine-security
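To confirm the parameters and system prompt were baked into the custom model, print its stored Modelfile back out and run a quick one-off prompt:

```bash
# Show the Modelfile Ollama stored for the custom model
ollama show rengine-security --modelfile

# Non-interactive test prompt
ollama run rengine-security "Analyze this vulnerability: reflected XSS in a search parameter"
```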
| Use Case | Recommended Model | Reasoning |
|---|---|---|
| Budget/Small GPU | mistral:7b or llama2:7b | Low VRAM requirement, fast |
| Balanced Performance | llama2:13b | Good quality/speed ratio |
| Code Analysis | codellama:13b | Trained on code |
| Best Quality | mixtral:8x7b | Excellent reasoning |
| High Volume | mistral:7b | Fast inference |
Local LLMs with Ollama eliminate per-request API costs but require GPU infrastructure:
| Solution | Setup | Monthly Cost |
|---|---|---|
| OpenAI GPT-4 | API only | $2,000-3,000 |
| OpenAI GPT-3.5 | API only | $100-200 |
| Ollama (g4dn.xlarge) | AWS EC2 + T4 GPU | $380 |
| Ollama (g5.xlarge) | AWS EC2 + A10G GPU | $800 |
Break-even analysis: at these prices, the GPU instance pays for itself against GPT-4 at roughly 200 scans/month; against the much cheaper GPT-3.5, considerably higher scan volumes are needed before the fixed GPU cost wins out. The exact break-even point depends on scan complexity and token usage.
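The break-even point is simply the fixed monthly GPU cost divided by what one scan would cost in API tokens. The per-scan figure below is a hypothetical value chosen to be consistent with the ~200 scans/month estimate above, not a measured price:

```bash
# break_even_scans = monthly_gpu_cost / api_cost_per_scan (illustrative numbers only)
awk 'BEGIN {
  gpu_monthly   = 380    # g4dn.xlarge, USD/month (from the table above)
  gpt4_per_scan = 1.90   # hypothetical GPT-4 token cost per scan
  printf "Break-even vs GPT-4: ~%.0f scans/month\n", gpu_monthly / gpt4_per_scan
}'
```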
GPU Not Detected:

- Run `nvidia-smi` - if it errors, reinstall the drivers
- Test GPU access from Docker: `docker run --gpus all nvidia/cuda:12.3.0-base nvidia-smi`

Slow Inference:

- Check GPU utilization with `nvidia-smi` during inference
- Set the `num_gpu` parameter higher in the Modelfile

Out of Memory Errors:

- Reduce `num_ctx` (context window) in the configuration
- Check available VRAM with `nvidia-smi`

Run different models for different types of analysis:
# Download multiple models
ollama pull mistral:7b # Fast, for low/medium severity
ollama pull llama2:13b # Balanced, for high severity
ollama pull codellama:13b # Code-focused, for SAST findings
# Configure reNgine to use different models by severity
# In reNgine settings:
AI_MODEL_CRITICAL=llama2:13b
AI_MODEL_HIGH=llama2:13b
AI_MODEL_MEDIUM=mistral:7b
AI_MODEL_LOW=mistral:7b
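A quick sanity check that every model referenced in the per-severity settings is actually installed locally (a small sketch; the model list mirrors the example settings above):

```bash
# Verify each configured model exists in the local Ollama library
for model in llama2:13b mistral:7b codellama:13b; do
  if ollama list | grep -q "$model"; then
    echo "OK:      $model"
  else
    echo "MISSING: $model (run: ollama pull $model)"
  fi
done
```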
Learn about OpenAI integration and AI-powered vulnerability prioritization.