HailBytes ASM Tutorial

GPU Setup for Local LLM (Ollama)

Configure local GPU-accelerated LLMs using Ollama for offline AI analysis on NVIDIA CUDA or AMD ROCm.

Overview

This tutorial guides you through setting up Ollama, a local LLM runtime, to run AI-powered vulnerability analysis entirely on your own infrastructure. This approach eliminates external API dependencies, keeps scan data private (nothing leaves your environment, and the setup can run fully air-gapped), reduces costs for high-volume scanning, and enables offline operation. We'll cover GPU requirements, NVIDIA (CUDA) and AMD (ROCm) driver installation, Ollama setup, and HailBytes ASM integration.

Prerequisites

  • HailBytes ASM instance (AWS/Azure) with GPU-enabled instance type
  • NVIDIA GPU with 8GB+ VRAM (16GB+ recommended), or AMD GPU with ROCm support (e.g., MI210, MI300X, Radeon RX 7900 XT/XTX)
  • Ubuntu 22.04 or 24.04 (matches the hardened HailBytes ASM Marketplace image)
  • Root or sudo access to the server
  • At least 50GB free disk space for models

What You'll Learn

  • GPU requirements and instance selection (NVIDIA and AMD)
  • Installing NVIDIA drivers + CUDA toolkit
  • Installing AMD ROCm for Radeon Instinct / Radeon GPUs
  • Setting up Ollama for local LLM inference on either vendor
  • Downloading and running LLM models
  • Configuring HailBytes ASM to use Ollama
  • Model selection and performance tuning

Step 1: Choose GPU-Enabled Instance

Select a cloud instance with suitable GPU capabilities. Larger models require more VRAM. HailBytes ASM supports both NVIDIA (CUDA) and AMD (ROCm) GPUs.

Recommended AWS Instances

Instance Type    GPU                    VRAM    Best For
g4dn.xlarge      NVIDIA T4              16GB    Small models (7B-13B)
g4dn.2xlarge     NVIDIA T4              16GB    Medium models (13B-20B)
g5.xlarge        NVIDIA A10G            24GB    Large models (30B-70B)
p3.2xlarge       NVIDIA V100            16GB    High performance needs
g4ad.xlarge      AMD Radeon Pro V520    8GB     AMD ROCm path, 7B models

Azure GPU Instances

Instance Type      GPU                    VRAM     Best For
NC4as_T4_v3        NVIDIA T4              16GB     Small to medium models
NC6s_v3            NVIDIA V100            16GB     Production workloads
NC24ads_A100_v4    NVIDIA A100            80GB     Largest models (70B+)
ND_MI300X_v5       AMD Instinct MI300X    192GB    AMD ROCm path, very large models
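
Once you've chosen an instance type, you can launch it from the AWS CLI. The snippet below is only a sketch: the AMI ID, key pair, security group, and subnet are placeholders to replace with your own values (for example, the HailBytes ASM Marketplace AMI for your region), and the 100GB volume leaves room for model downloads.

# Launch a GPU instance for HailBytes ASM (all IDs below are placeholders)
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type g4dn.xlarge \
  --key-name your-key-pair \
  --security-group-ids sg-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100,"VolumeType":"gp3"}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=hailbytes-asm-gpu}]'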

Step 2: Install GPU Drivers

Step 2a: Install NVIDIA Drivers (NVIDIA path)

If you're using an NVIDIA GPU, install NVIDIA drivers and the CUDA toolkit. (For AMD GPUs, skip to Step 2b.)

# SSH into your GPU instance
ssh user@your-asm-server

# Update system packages
sudo apt update && sudo apt upgrade -y

# Check if GPU is detected
lspci | grep -i nvidia

# You should see output like:
# 00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

# Install NVIDIA driver
sudo apt install -y nvidia-driver-535

# Or use CUDA toolkit installer (includes drivers)
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda_12.3.0_545.23.06_linux.run
sudo sh cuda_12.3.0_545.23.06_linux.run

# Follow installer prompts:
# - Accept license
# - Install NVIDIA driver: YES
# - Install CUDA toolkit: YES
# - Install samples: Optional

# Reboot to load drivers
sudo reboot

After reboot, verify GPU is accessible:

# Check NVIDIA driver installation
nvidia-smi

# Expected output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 545.23.06    Driver Version: 545.23.06    CUDA Version: 12.3     |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
# |===============================+======================+======================|
# |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
# | N/A   34C    P0    27W /  70W |      0MiB / 15360MiB |      0%      Default |
# +-------------------------------+----------------------+----------------------+
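
If you used the CUDA runfile installer, the toolkit typically lands in /usr/local/cuda but is not added to your shell path automatically. A common addition to ~/.bashrc (adjust if your CUDA install path differs):

# Make the CUDA toolkit binaries and libraries visible to your shell
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify the toolkit version
nvcc --version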

Step 2b: Install AMD ROCm (AMD path)

If you're using an AMD GPU (Radeon Instinct MI210/MI300X, Radeon RX 7900 XT/XTX, or any ROCm-supported card), install AMD ROCm instead of CUDA.

# SSH into your AMD GPU instance
ssh user@your-asm-server

# Confirm the GPU is visible
lspci | grep -i amd | grep -i -E 'vga|3d|display'

# Install ROCm 6.x on Ubuntu 22.04 / 24.04
sudo apt update && sudo apt install -y wget gnupg

# Add the AMD ROCm apt repository
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | \
  sudo gpg --dearmor -o /etc/apt/keyrings/rocm.gpg

# Use "jammy" for Ubuntu 22.04; on Ubuntu 24.04 substitute "noble" and a ROCm release that supports it
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.1 jammy main" | \
  sudo tee /etc/apt/sources.list.d/rocm.list

sudo apt update
sudo apt install -y rocm-hip-libraries rocm-smi rocminfo

# Add your user to render and video groups
sudo usermod -aG render,video $USER

# Reboot to load the kernel driver
sudo reboot

After reboot, verify ROCm sees the GPU:

# Check that ROCm sees your GPU
rocm-smi

# Expected output (excerpt):
# ============================ ROCm System Management ============================
# GPU  Temp   AvgPwr   SCLK    MCLK    Fan   Perf  PwrCap  VRAM%  GPU%
# 0    38.0c  92.0W    1700Mhz 1300Mhz 0%    auto  300.0W  0%     0%
# ================================================================================

# And confirm the ROCm runtime can enumerate it
rocminfo | grep -E 'Name:|Marketing Name:' | head
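
If rocm-smi or rocminfo reports no devices, confirm (after logging back in so the new group membership applies) that the device nodes exist and your user is in the render and video groups:

# Confirm the ROCm device nodes are present
ls -l /dev/kfd /dev/dri

# Confirm your user is in the render and video groups
id -nG | tr ' ' '\n' | grep -E '^(render|video)$'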

Step 3: Install Docker with GPU Container Toolkit

In this tutorial Ollama runs in Docker, so we need Docker with GPU support. Use the setup that matches your GPU vendor.

# Install Docker (both vendors)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Add current user to docker group
sudo usermod -aG docker $USER

# --- NVIDIA path ---
# Install NVIDIA Container Toolkit (the older nvidia-docker repo and apt-key method are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit

# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access in Docker
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi

# --- AMD ROCm path ---
# AMD GPUs are exposed through /dev/kfd and /dev/dri devices.
# No special container toolkit is required, just pass the devices through.
docker run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  rocm/rocm-terminal rocm-smi

Step 4: Install Ollama

Install Ollama to run LLM models locally with GPU acceleration. Ollama auto-detects NVIDIA CUDA and AMD ROCm at runtime.

# Install Ollama (works for both NVIDIA and AMD)
curl -fsSL https://ollama.ai/install.sh | sh

# --- NVIDIA: Run Ollama in Docker with CUDA ---
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# --- AMD: Run Ollama in Docker with ROCm ---
docker run -d \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm

# Verify Ollama is running (either path)
curl http://localhost:11434/api/version

# Response:
# {"version":"0.1.17"}

Ollama is now running and ready to download models.
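
Before pulling models, it's worth confirming that Ollama detected the GPU rather than silently falling back to CPU. The exact log lines vary by version, but the container's startup logs normally mention the compute backend it found:

# Look for GPU / CUDA / ROCm detection in Ollama's startup logs
docker logs ollama 2>&1 | grep -i -E 'cuda|rocm|gpu|vram'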

Step 5: Download and Test LLM Models

Choose and download an appropriate model for vulnerability analysis.

Recommended Models for Security Analysis

Model            Size     VRAM Required    Quality
llama2:7b        3.8GB    8GB              Good for basic analysis
llama2:13b       7.4GB    16GB             Better reasoning
mixtral:8x7b     26GB     24GB+            Excellent analysis
codellama:13b    7.4GB    16GB             Code-focused analysis
mistral:7b       4.1GB    8GB              Fast and capable

# Download a model (this may take 5-15 minutes depending on model size)
ollama pull llama2:13b

# Or using Docker
docker exec -it ollama ollama pull llama2:13b

# Test the model
ollama run llama2:13b

# Interactive prompt will appear - test it:
>>> Analyze this vulnerability: SQL injection in login form parameter 'username'

# Model will respond with analysis

# Exit with: /bye

# List installed models
ollama list

# Expected output:
# NAME            ID              SIZE    MODIFIED
# llama2:13b      d5611f7b5f14    7.4 GB  2 minutes ago

Monitor GPU usage during model inference with watch -n 1 nvidia-smi (NVIDIA) or watch -n 1 rocm-smi (AMD).

Step 6: Configure HailBytes ASM to Use Ollama

Update HailBytes ASM configuration to use local Ollama instead of OpenAI.

# SSH into HailBytes ASM server
cd /opt/hailbytes-asm

# Edit environment configuration
nano .env

# Add/update these variables:
AI_ENABLED=true
AI_PROVIDER=ollama
OLLAMA_API_URL=http://localhost:11434
OLLAMA_MODEL=llama2:13b
# Seconds to wait for a response
OLLAMA_TIMEOUT=120
# Max tokens to generate
OLLAMA_NUM_PREDICT=2048

# Disable OpenAI (if previously configured)
OPENAI_ENABLED=false

# Save and exit (Ctrl+X, Y, Enter)

# Restart HailBytes ASM services
docker-compose restart web celery

# Check logs for successful connection
docker-compose logs -f celery | grep -i ollama

HailBytes ASM will now use your local Ollama installation for all AI analysis. The same configuration works whether Ollama is backed by NVIDIA CUDA or AMD ROCm.
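
One networking note: if the HailBytes ASM web and Celery services run as Docker containers on this host (as the docker-compose commands above imply), localhost inside those containers refers to the container itself, not the host where port 11434 is published. In that case OLLAMA_API_URL should point at an address the containers can reach, typically the Docker bridge gateway. A quick sketch, assuming the default docker0 bridge:

# Find the Docker bridge gateway IP on the host (often 172.17.0.1)
ip -4 addr show docker0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}'

# Then in /opt/hailbytes-asm/.env use that address instead of localhost, e.g.:
# OLLAMA_API_URL=http://172.17.0.1:11434

# If Ollama runs natively rather than in Docker, also set OLLAMA_HOST=0.0.0.0
# in the Ollama service environment so it listens on that interface.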

Step 7: Test AI Analysis with Ollama

Verify that vulnerability analysis works with your local LLM.

# Test Ollama API directly
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Analyze this SQL injection vulnerability and provide remediation steps.",
  "stream": false
}'

# Test via HailBytes ASM API
curl -X POST http://localhost:8082/api/vulnerabilities/1/analyze/ \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"use_ai": true}'

# Monitor GPU usage during analysis
# NVIDIA:
watch -n 1 nvidia-smi
# AMD:
watch -n 1 rocm-smi
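
The raw /api/generate response is a JSON object. To read only the model's generated text, pipe it through jq (install jq first if needed):

# Extract just the generated analysis text from the Ollama response
curl -s -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Analyze this SQL injection vulnerability and provide remediation steps.",
  "stream": false
}' | jq -r '.response'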

Performance Monitoring

# Check GPU utilization (NVIDIA)
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total \
  --format=csv -l 1

# Check GPU utilization (AMD)
rocm-smi --showuse --showmemuse --showtemp -l 1

# Monitor Ollama logs
docker logs -f ollama

# Check HailBytes ASM Celery worker processing
docker exec -it hailbytes-asm-celery-1 celery -A hailbytes_asm inspect active

# Benchmark inference speed
time ollama run llama2:13b "Analyze this XSS vulnerability"
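
For a more precise throughput figure, the non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), so tokens per second can be computed directly. A small sketch using curl and jq:

# Compute generation throughput from Ollama's own timing fields
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Summarize the impact of a reflected XSS vulnerability.",
  "stream": false
}' | jq '{tokens: .eval_count, seconds: (.eval_duration / 1e9), tokens_per_sec: (.eval_count / (.eval_duration / 1e9))}'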

Performance Optimization

Optimize model performance for faster vulnerability analysis:

Ollama Configuration Options

# Create Modelfile for custom configuration
FROM llama2:13b

# Set GPU layers (higher = more GPU usage = faster)
PARAMETER num_gpu 99

# Temperature (0-1, lower = more focused)
PARAMETER temperature 0.3

# Context window size
PARAMETER num_ctx 4096

# Stop sequences
PARAMETER stop "User:"

# Custom system prompt for security analysis
SYSTEM You are an expert security researcher analyzing vulnerabilities in web applications...

# Save the above as "Modelfile", then create the custom model from it
ollama create hailbytes-asm-security -f Modelfile

# Use the custom model
ollama run hailbytes-asm-security

# Update HailBytes ASM to use custom model
# In .env: OLLAMA_MODEL=hailbytes-asm-security
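
Before pointing HailBytes ASM at the custom model, you can smoke-test it through the same Ollama API used earlier (the prompt is only an example):

# Quick smoke test of the custom model via the Ollama API
curl -s http://localhost:11434/api/generate -d '{
  "model": "hailbytes-asm-security",
  "prompt": "Analyze this vulnerability: outdated OpenSSH version exposed on port 22.",
  "stream": false
}' | jq -r '.response'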

Model Selection Guide

Choosing the Right Model

Use Case                Recommended Model          Reasoning
Budget/Small GPU        mistral:7b or llama2:7b    Low VRAM requirement, fast
Balanced Performance    llama2:13b                 Good quality/speed ratio
Code Analysis           codellama:13b              Trained on code
Best Quality            mixtral:8x7b               Excellent reasoning
High Volume             mistral:7b                 Fast inference
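
To confirm which model is loaded and whether it is fully offloaded to the GPU, newer Ollama releases also provide an ollama ps command; if your version predates it, watching nvidia-smi or rocm-smi during inference gives the same signal.

# Show loaded models and whether they run on GPU or CPU (output is illustrative)
ollama ps

# NAME          ID              SIZE     PROCESSOR    UNTIL
# llama2:13b    d5611f7b5f14    9.1 GB   100% GPU     4 minutes from now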

Cost Comparison: Ollama vs OpenAI

Local LLMs with Ollama eliminate per-request API costs but require GPU infrastructure:

Monthly Cost Analysis (1000 scans/month)

Solution                Setup                         Monthly Cost
OpenAI GPT-4            API only                      $2,000-3,000
OpenAI GPT-3.5          API only                      $100-200
Ollama (g4dn.xlarge)    AWS EC2 + NVIDIA T4 GPU       $380
Ollama (g5.xlarge)      AWS EC2 + NVIDIA A10G GPU     $800
Ollama (g4ad.xlarge)    AWS EC2 + AMD V520 (ROCm)     $280

Break-even analysis: using the table above, GPT-4 works out to roughly $2-3 per scan, so a $280-380/month GPU instance pays for itself at around 100-200 scans/month. GPT-3.5 works out to roughly $0.10-0.20 per scan, so the break-even point against it is far higher, on the order of 1,500-3,500 scans/month. Actual numbers depend on scan complexity and token usage. AMD ROCm instances are often the cheapest path for 7B-class models.

Troubleshooting

Common Issues

GPU Not Detected:

  • NVIDIA: run nvidia-smi — if it errors, reinstall drivers
  • AMD: run rocm-smi and rocminfo — confirm the user is in the render and video groups
  • Check Docker has GPU access (NVIDIA): docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
  • Check Docker has GPU access (AMD): docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --group-add render rocm/rocm-terminal rocm-smi
  • Restart Docker daemon after driver installation

Slow Inference:

  • Model may be running on CPU - check nvidia-smi / rocm-smi during inference
  • Reduce model size (use 7B instead of 13B)
  • Increase GPU instance size
  • Set num_gpu parameter higher in Modelfile

Out of Memory Errors:

  • Model too large for available VRAM
  • Use smaller model or larger GPU instance
  • Reduce num_ctx (context window) in configuration (see the example after this list)
  • Monitor memory with nvidia-smi or rocm-smi
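
A smaller context window can also be requested per call through Ollama's API options instead of a Modelfile; a minimal sketch (the value is illustrative, lower it until the model fits in VRAM):

# Request a reduced context window for a single generation
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Analyze this vulnerability...",
  "stream": false,
  "options": {"num_ctx": 2048}
}'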

Advanced: Multiple Models

Run different models for different types of analysis:

# Download multiple models
ollama pull mistral:7b         # Fast, for low/medium severity
ollama pull llama2:13b         # Balanced, for high severity
ollama pull codellama:13b      # Code-focused, for SAST findings

# Configure HailBytes ASM to use different models by severity
# In HailBytes ASM settings:
AI_MODEL_CRITICAL=llama2:13b
AI_MODEL_HIGH=llama2:13b
AI_MODEL_MEDIUM=mistral:7b
AI_MODEL_LOW=mistral:7b

Next Steps

Configure AI Analysis

Learn about OpenAI integration and AI-powered vulnerability prioritization.

View Tutorial →

Run Your First Scan

Execute comprehensive reconnaissance scans with HailBytes ASM.

View Tutorial →

Need Help?

Having trouble with GPU or Ollama setup? Our support team can assist.

Contact Support
