Benchmarking NextGenRedPVE: Running Local Offensive AI Agents on Proxmox

In modern offensive security operations, AI is no longer a futuristic concept; it is an active member of the red team. However, relying on commercial cloud LLMs (like GPT-4 or Claude) presents two major blockers for professional red teaming:
- Safety Refusal Loops: Cloud providers enforce strict, broad-brush alignment guardrails. Asking a cloud model to analyze an IDOR chain or generate a PoC HTTP exploit request often triggers a generic refusal: "As an AI, I cannot assist with hacking..."
- Data Privacy & OpSec: Feeding client network layouts, custom exploitation code, or proprietary vulnerability details into external third-party APIs is a compliance and confidentiality nightmare.
To solve this, we built NextGenRedPVE, our locally hosted, heterogeneous Proxmox cluster dedicated to running uncensored, abliterated local LLMs.
To ensure our AI agents can reason through complex offensive scenarios efficiently without melting our hardware, we conducted extensive benchmark testing across multiple model architectures (Dense and Mixture-of-Experts) ranging from 2B to 35B parameters.
Here is how we did it, what we tested, and the results that shaped our production setup.
Why We Tested (The Core Objective)
We needed a local model that could act as a competent Lead Offensive Security Researcher. The ideal model had to satisfy three criteria:
- Offensive Competence: It must understand vulnerabilities (like Insecure Direct Object References, business logic bypasses, and access control flaws) and generate clean, structured Markdown or HTTP PoC scripts.
- No Refusals: It must be "abliterated" or uncensored, ignoring standard guardrails to evaluate security weaknesses objectively.
- Hardware Efficiency: It must deliver acceptable tokens-per-second throughput on our local cluster without exceeding host resource constraints.
What & How We Tested
We benchmarked several models across three test rounds (v5, v6, and v7) on the NextGenRedPVE environment. The testbed host was a dedicated machine running Proxmox VE, with high-compute tasks allocated to our primary node, NextGenRedPVE, and GPU passthrough allocated to VM workloads such as NextGenRedBrain.
The Test Suite
Each model was subjected to a test suite of 5 real-world offensive security prompts (P1 to P5), including:
- P1: Evaluating Broken Access Control (BAC) and analyzing IDOR chaining.
- P2–P5: Assessing business logic bypasses, authentication flaws, and API vulnerabilities.
For each model, we measured:
- Average Ingestion Speed (t/s): How fast the model processes the context and prompt.
- Average Generation Speed (t/s): How fast the model streams the response tokens.
- Resource Footprint: Peak CPU (%) and Peak Memory (GB) usage on the host.
- Alignment & Pass Rate: Whether the model successfully completed the security task without refusing or throwing system errors.
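As a rough sketch of how these metrics can be computed: the post does not say which tool produced the numbers, so this example assumes each result is the final object returned by Ollama's `/api/generate` endpoint, which reports `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration` (durations in nanoseconds). The `REFUSAL_MARKERS` keyword list and the `summarize` helper are hypothetical, not the scoring method behind the tables below.

```python
from statistics import mean

NS_PER_S = 1_000_000_000

# Hypothetical refusal markers; real scoring would likely need manual review as well.
REFUSAL_MARKERS = ("as an ai", "i cannot assist", "i can't assist", "i can't help")


def tokens_per_second(token_count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens per second."""
    if duration_ns <= 0:
        return 0.0
    return token_count * NS_PER_S / duration_ns


def is_refusal(text: str) -> bool:
    """Crude keyword check for a refusal-style answer."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def summarize(responses: list[dict]) -> dict:
    """Aggregate per-prompt results into ingest speed, generation speed, and pass rate.

    Each item is assumed to be a final Ollama response object, or
    {"error": "..."} when the run failed.
    """
    ok = [r for r in responses if "error" not in r]
    passed = [r for r in ok if not is_refusal(r.get("response", ""))]
    ingest = [
        tokens_per_second(r.get("prompt_eval_count", 0), r.get("prompt_eval_duration", 0))
        for r in ok
    ]
    gen = [
        tokens_per_second(r.get("eval_count", 0), r.get("eval_duration", 0))
        for r in ok
    ]
    return {
        "avg_ingest_tps": mean(ingest) if ingest else 0.0,
        "avg_gen_tps": mean(gen) if gen else 0.0,
        "pass_rate": len(passed) / len(responses) if responses else 0.0,
    }
```

A prompt counts as failed here if the run errored or the answer looks like a refusal, which mirrors the pass-rate definition above only loosely.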
The Benchmark Data
Below are the consolidated metrics from our testing rounds, categorized by model type and size.
Dense Models (v5 & v7)
Dense models activate every parameter for every token, so the compute cost per token grows with total model size. We tested dense models from 9B to 27B parameters, plus a tiny 2B edge model.
| Model | Size / Type | Avg Ingest (t/s) | Avg Gen (t/s) | Peak CPU | Peak Mem | Prompt Pass Rate |
|---|---|---|---|---|---|---|
| Gemma-4-E2B-Abliterated | 2B / Edge | 2384.09 | 87.39 | 1.2% | 12.5 GB | 5 / 5 (100%) |
| Yi-Coder-9B-Chat | 9B / Dense | 1112.79 | 35.71 | 1.2% | 13.2 GB | 5 / 5 (100%) |
| GLM-4-9B-Abliterated | 9B / Dense | 524.96 | 33.26 | 81.8% | 7.6 GB | 5 / 5 (100%) |
| Qwen3.6-27B-Abliterated | 27B / Dense | 61.90 | 3.65 | 25.3% | 8.4 GB | 5 / 5 (100%) |
| Gemma-4-12B-Uncensored | 12B / Dense | 591.97 | 23.73 | 2.4% | 18.7 GB | 4 / 5 (80%) |
| Gemma-4-12B-Abliterated | 12B / Dense | 574.59 | 23.12 | 22.6% | 37.5 GB | 2 / 5 (40%) |
Mixture-of-Experts (MoE) Models (v6 & v7)
MoE models route each token to a few specific subnetworks ("experts"), so they use only a fraction of their total parameters per token. In principle, this gives them capacity closer to that of a larger model at a per-token compute cost closer to that of a smaller one.
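The routing idea can be illustrated with a toy top-k gate. This is a simplified sketch, not the router of any model in the table: real MoE layers route vectors inside each transformer block and add details such as load balancing, while this example uses scalar inputs and hypothetical helper names (`route_top_k`, `moe_forward`).

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x: float, experts: list, router_logits: list[float], k: int = 2) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy "experts"; only two are evaluated for this input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
logits = [2.0, 0.1, 2.0, -1.0]
output = moe_forward(4.0, experts, logits, k=2)
```

With `k=2` out of four experts, half of the expert computation is skipped for this input, which is the source of the speed advantage described above.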
| Model | Size / Type | Avg Ingest (t/s) | Avg Gen (t/s) | Peak CPU | Peak Mem | Prompt Pass Rate |
|---|---|---|---|---|---|---|
| DeepSeek-Coder-V2-Lite-16B | 16B / MoE | 260.82 | 55.09 | 14.3% | 20.7 GB | 5 / 5 (100%) |
| Gemma-4-26B-Abliterated | 26B / MoE | 154.50 | 32.44 | 4.8% | 14.1 GB | 4 / 5 (80%) |
| Qwen3.6-35B-A3B-MoE | 35B / MoE | 11.65 | 22.16 | 22.3% | 9.4 GB | 4 / 5 (80%) |
| DeepSeek-V4-Flash-Abliterated | Flash / MoE | 0.00 | 0.00 | 0.0% | 0.0 GB | CRASHED (0%) |
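To make the tables easier to query, the generation speed and pass counts from both tables can be loaded into a small structure and filtered. The thresholds below (`min_gen_tps=20`, a 100% pass rate) are illustrative assumptions, and `shortlist` is a hypothetical helper; this is not the selection procedure behind the production choices described later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    kind: str
    gen_tps: float
    passed: int
    total: int = 5

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


# Values copied from the two benchmark tables.
RESULTS = [
    Result("Gemma-4-E2B-Abliterated", "Edge", 87.39, 5),
    Result("Yi-Coder-9B-Chat", "Dense", 35.71, 5),
    Result("GLM-4-9B-Abliterated", "Dense", 33.26, 5),
    Result("Qwen3.6-27B-Abliterated", "Dense", 3.65, 5),
    Result("Gemma-4-12B-Uncensored", "Dense", 23.73, 4),
    Result("Gemma-4-12B-Abliterated", "Dense", 23.12, 2),
    Result("DeepSeek-Coder-V2-Lite-16B", "MoE", 55.09, 5),
    Result("Gemma-4-26B-Abliterated", "MoE", 32.44, 4),
    Result("Qwen3.6-35B-A3B-MoE", "MoE", 22.16, 4),
    Result("DeepSeek-V4-Flash-Abliterated", "MoE", 0.0, 0),
]


def shortlist(results, min_gen_tps: float, min_pass_rate: float):
    """Keep models that meet both thresholds, fastest generation first."""
    keep = [
        r for r in results
        if r.gen_tps >= min_gen_tps and r.pass_rate >= min_pass_rate
    ]
    return sorted(keep, key=lambda r: r.gen_tps, reverse=True)


for r in shortlist(RESULTS, min_gen_tps=20.0, min_pass_rate=1.0):
    print(f"{r.model:30s} {r.kind:6s} {r.gen_tps:6.2f} t/s  {r.passed}/{r.total}")
```

Changing the thresholds shows how sensitive the shortlist is to the speed requirement: lowering `min_pass_rate` to 0.8 brings several more models back into consideration.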
Key Takeaways & Performance Analysis
1. The Power of Mixture-of-Experts (MoE)
DeepSeek-Coder-V2-Lite-16B emerged as an absolute powerhouse. It maintained a generation speed of 55.09 t/s while using only 14.3% peak CPU and 20.7 GB of RAM. Because MoE routing activates only part of the network per token, it handled code generation and vulnerability analysis faster than both 9B dense models we tested, and it passed all 5 prompts.
2. Edge Speed Demons
For edge deployment or lightweight tasks (like parsing quick telemetry logs or running agents on low-resource devices like ZimaBoards), Gemma-4-E2B-Abliterated was a revelation. It ingested text at 2,384 t/s and generated at 87.39 t/s, the fastest of any model we tested, using almost no CPU (1.2%) and 12.5 GB of peak memory. Crucially, it passed all 5 prompts, making it a strong candidate for local agent orchestration on edge hardware.
3. Dense Bottlenecks
While Qwen3.6-27B-Abliterated was highly intelligent and passed all 5 prompts, its generation speed choked down to a painful 3.65 t/s. The compute overhead of a 27B dense model running without heavy GPU acceleration makes it impractical for real-time agent loops where speed is critical.
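A back-of-the-envelope calculation shows what these speeds mean for a single agent turn. The prompt and response sizes below (2,000 prompt tokens, 500 output tokens) are assumptions chosen for illustration; the throughput figures come from the tables above.

```python
def turn_latency_s(prompt_tokens: int, output_tokens: int,
                   ingest_tps: float, gen_tps: float) -> float:
    """Estimate seconds per turn as prompt ingestion time plus generation time."""
    return prompt_tokens / ingest_tps + output_tokens / gen_tps


PROMPT_TOKENS = 2000  # assumed context size per agent turn
OUTPUT_TOKENS = 500   # assumed response length per agent turn

# (avg ingest t/s, avg generation t/s) as reported in the benchmark tables.
MEASURED = {
    "Qwen3.6-27B-Abliterated": (61.90, 3.65),
    "Gemma-4-26B-Abliterated": (154.50, 32.44),
    "DeepSeek-Coder-V2-Lite-16B": (260.82, 55.09),
}

for model, (ingest, gen) in MEASURED.items():
    seconds = turn_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS, ingest, gen)
    print(f"{model}: {seconds:.1f} s per turn")
```

Under these assumptions, the 27B dense model needs roughly 170 seconds per turn, while DeepSeek-Coder-V2-Lite-16B stays under 20 seconds. An agent loop that chains ten such turns would take close to half an hour on the dense model.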
What We’re Running in Production
After analyzing the data, we settled on a hybrid architecture for NextGenRedPVE:
- Production Inference Host (NextGenRedBrain VM): We chose Gemma-4-26B-Abliterated as our primary reasoning engine. Running locally on Ollama, it balances deep vulnerability reasoning with acceptable generation speed (32.44 t/s) and low CPU overhead (4.8%). When we need extreme precision, we fall back to Gemini via API (with anonymized data).
- Production Agent Gateway (NextGenRedHermes LXC): This container runs our Hermes Agent Gateway, coordinating tasks and querying NextGenRedBrain for fast, local intelligence.
- Edge Scouts: For persistent network monitoring and telemetry parsing, we deploy Gemma-4-E2B-Abliterated directly to our Tailscale-connected ZimaBoards.
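The local-first pattern with an anonymized API fallback might look something like the sketch below. Everything here is hypothetical: the post does not describe how data is anonymized or how the gateway decides to fall back, so `anonymize`, `ask`, the `needs_precision` flag, and the redaction rules (IPv4 addresses plus a caller-supplied list of sensitive strings) are assumptions for illustration. The `local` and `remote` arguments stand in for whatever client functions call the local model and the external API.

```python
import re
from typing import Callable

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def anonymize(text: str, sensitive_terms: list[str]) -> str:
    """Replace IPv4 addresses and known client-specific strings with placeholders."""
    text = IPV4.sub("[REDACTED_IP]", text)
    for n, term in enumerate(sensitive_terms, start=1):
        text = re.sub(re.escape(term), f"[REDACTED_{n}]", text, flags=re.IGNORECASE)
    return text


def ask(prompt: str,
        local: Callable[[str], str],
        remote: Callable[[str], str],
        sensitive_terms: list[str],
        needs_precision: bool = False) -> str:
    """Use the local model by default; only a redacted prompt is sent to the remote API."""
    if not needs_precision:
        return local(prompt)
    return remote(anonymize(prompt, sensitive_terms))


# Stand-in clients so the sketch runs without any network access.
def fake_local(prompt: str) -> str:
    return "local:" + prompt


def fake_remote(prompt: str) -> str:
    return "remote:" + prompt


print(ask("Review access controls on 10.0.0.5 for acme.example",
          fake_local, fake_remote, ["acme.example"], needs_precision=True))
```

Regex-based redaction like this is easy to bypass with unusual formatting, so a production pipeline would need a stricter review step before anything leaves the host.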
The Wrap-Up
Self-hosting your AI stack isn't just about privacy; it’s about control. By benchmarking and selecting models like Gemma-4 and DeepSeek-Coder, we've built a local, highly responsive threat emulation pipeline that doesn't ask for permission to do its job.
Stay tuned for our next update, where we’ll dive into how we hook these local LLMs into our custom exploitation tools to automate safe, controlled defense validation.