Taming the Edge: Bypassing Axera AX8850 NPU Compiler Constraints

Edge Neural Processing Units (NPUs) represent a massive leap in deploying local, air-gapped LLMs for tactical offensive security operations. However, navigating the closed-source, highly rigid vendor compilers required to run models on bare silicon can feel like hacking a black box.
Recently, our team targeted the Radxa AI Core AX-M1 board, which features the Axera AX8850 (LAMBERT) architecture. While the chip has hardware parity with recent edge modules, its vendor software stack remains outdated. Attempting to compile dense and Mixture-of-Experts (MoE) transformer models onto its Contiguous Memory (CMM) blocks using the proprietary pulsar2 compiler suite led us directly into the weeds of reverse-engineering closed-source Python binaries.
Here is the technical breakdown of the compiler bugs we faced, the PyArmor runtime intercept we engineered, and how we bypassed silicon limits to achieve native inference.
The Target Environment
Our hardware testbed consists of:
- Host: Proxmox VE server
- Workload Container: A privileged LXC container passing through the PCIe NPU interface
- NPU Hardware: Radxa AI Core AX-M1 (AX8850 LAMBERT architecture)
- Compiler: pulsar2 (distributed inside the ax_pulsar2_6.0_package compiler suite)
Roadblock 1: The Closed-Source Compiler Loop
The pulsar2 compiler is distributed as an obfuscated Python package compiled using PyArmor. This prevents direct analysis or patch modification of files on disk.
When attempting to build for the LAMBERT target, we hit two initial roadblocks:
- Startup Wrappers: Symlinking the compiler binary directly to /usr/local/bin/pulsar2 broke internal path resolution loops because of how the script resolved ${BASH_SOURCE[0]}.
- The Backend Class Bug: The compiler expects to map the --chip LAMBERT argument to a LambertBackend class inside Python's backend.lambert.backend_impl. However, the compiled package threw a module resolution exception because it lacked the class definition.
The Fix: In-Memory Python Interception
To resolve the path loop, we wrapped the compiler execution in a forwarding script:
#!/bin/bash
exec /opt/axera/compiler/ax_pulsar2_6.0_package/bin/pulsar2 "$@"
To fix the backend class error without breaking PyArmor's signature checks, we created a custom sitecustomize.py file and injected it into the compiler's bundled Python site-packages directory (/opt/axera/compiler/ax_pulsar2_6.0_package/python3/lib/python3.12/site-packages/sitecustomize.py).
By hooking Python’s built-in __import__ function, we intercepted the import sequence in memory, dynamically mapped the backend.lambert requests to the existing backend.ax8860 classes, and aliased the backend classes in-flight:
import builtins
import sys
import importlib

original_import = builtins.__import__

def custom_import(name, globals=None, locals=None, fromlist=(), level=0):
    if name.startswith("backend.lambert"):
        # Map Lambert target to AX8860 backend classes in memory
        target_name = name.replace("backend.lambert", "backend.ax8860")
        mod = original_import(target_name, globals, locals, fromlist, level)
        sys.modules[name] = sys.modules[target_name]
        try:
            impl = importlib.import_module("backend.ax8860.backend_impl")
            if hasattr(impl, "AX8860Backend"):
                impl.LambertBackend = impl.AX8860Backend
            sys.modules["backend.lambert.backend_impl"] = impl
        except Exception:
            pass
        return mod
    return original_import(name, globals, locals, fromlist, level)

builtins.__import__ = custom_import
This bypass tricked the compiler into resolving the backend class successfully, allowing the code generation pipeline to start.
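The aliasing step can be tried in isolation, without the vendor package. The sketch below is a minimal, self-contained illustration: the demo_backend package and its modules are hypothetical in-memory stand-ins (not the real pulsar2 layout), and alias_backend is a helper name introduced here for illustration only.

```python
import importlib
import sys
import types


def make_module(name, is_package=False):
    """Register an empty in-memory module (stand-in for a vendor module)."""
    mod = types.ModuleType(name)
    if is_package:
        mod.__path__ = []  # a __path__ attribute marks the module as a package
    sys.modules[name] = mod
    return mod


def alias_backend(src_pkg, dst_pkg, src_class, dst_class):
    """Expose src_pkg.backend_impl under dst_pkg and alias its backend class."""
    impl = importlib.import_module(f"{src_pkg}.backend_impl")
    setattr(impl, dst_class, getattr(impl, src_class))
    sys.modules[dst_pkg] = sys.modules[src_pkg]
    sys.modules[f"{dst_pkg}.backend_impl"] = impl
    return impl


# Hypothetical stand-ins for the package layout described above.
make_module("demo_backend", is_package=True)
make_module("demo_backend.ax8860", is_package=True)
impl = make_module("demo_backend.ax8860.backend_impl")


class AX8860Backend:
    chip = "AX8860"


impl.AX8860Backend = AX8860Backend

alias_backend("demo_backend.ax8860", "demo_backend.lambert",
              "AX8860Backend", "LambertBackend")

# The import system checks sys.modules first, so the alias resolves
# without any file named lambert/backend_impl.py existing on disk.
resolved = importlib.import_module("demo_backend.lambert.backend_impl")
```

Because the lookup is satisfied from sys.modules, nothing on disk is modified, which is the property that matters when the files themselves are protected.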
Roadblock 2: SRAM Allocation & The TileFailException
Once the compiler began parsing the model, we ran into deep physical memory layout constraints on the chip's internal SRAM blocks.
The mRoPE Crash
When attempting to compile models like DeepSeek, the compiler immediately crashed.
- The Root Cause: These models utilize multimodal Rotary Position Embedding (mRoPE). The NPU's physical gating engine lacks the hardware logic to tile and compile multi-dimensional positional embeddings on silicon.
- The Fix: Stick to standard, single-dimensional RoPE architectures (such as Qwen 1.5/2.5 or Gemma). MoE models and dynamic embeddings are fundamentally incompatible with static CMM memory maps.
The TileFailException (Group Quantization)
When compiling standard float models (FP16/BF16) directly to INT4 using --weight_type s4, the compilation crashed with a TileFailException or AssertionError: invalid groupN.
- The Root Cause: The silicon's SRAM tiler requires a strictly structured, grouped quantization matrix (specifically groupN=128). When fed raw float weights, the compiler falls back to a per-tensor or channel-wide quantization scheme (groupN=1024), which exceeds the physical SRAM tile boundaries.
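To make the groupN distinction concrete, here is a minimal sketch of generic symmetric group-wise quantization to signed 4-bit values. It is not the pulsar2 algorithm: the scale rule (max absolute value divided by 7) and the function name quantize_groups are assumptions chosen for illustration.

```python
def quantize_groups(row, group_n=128, qmax=7):
    """Symmetric per-group quantization of one weight row to signed 4-bit codes."""
    if len(row) % group_n != 0:
        raise ValueError(f"row length {len(row)} is not a multiple of {group_n}")
    scales, codes = [], []
    for start in range(0, len(row), group_n):
        group = row[start:start + group_n]
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        codes.extend(max(-qmax - 1, min(qmax, round(w / scale))) for w in group)
    return scales, codes


# A synthetic 1024-element weight row (illustrative values only).
row = [((i % 17) - 8) / 10 for i in range(1024)]
scales_128, codes_128 = quantize_groups(row, group_n=128)
scales_1024, _ = quantize_groups(row, group_n=1024)
print(len(scales_128), len(scales_1024))  # 8 1
```

With a group size of 128 the row carries eight scales, one per 128-weight block; with 1024 a single scale covers the whole row, which is the coarser layout the tiler rejected in our case.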
The Fix: In-Flight Function Hooking
Using our sitecustomize.py injector, we targeted conv_common module functions to log internal shapes and dynamically hook the compiler's math routines. We intercepted get_group_info and overrode data type validations in get_ifm_dtype to force unsupported FP32 tensors to register as BF16 (represented as type 6 in the compiler engine):
# Part of our injected runtime hook (name and mod come from the import hook)
if "conv_common" in name:
    # Hook the group allocation assertion
    if hasattr(mod, "get_group_info"):
        orig_get_group_info = mod.get_group_info

        def wrapper_get_group_info(inputs_spec, attrs):
            try:
                return orig_get_group_info(inputs_spec, attrs)
            except AssertionError as ae:
                if "invalid groupN" in str(ae):
                    # Force override to let the tiler pass
                    return (0, 0)
                raise

        mod.get_group_info = wrapper_get_group_info

    # Patch data type mapping to avoid FP32 crashes
    if hasattr(mod, "get_ifm_dtype"):
        orig_get_ifm_dtype = mod.get_ifm_dtype

        def wrapper_get_ifm_dtype(dtype):
            try:
                return orig_get_ifm_dtype(dtype)
            except KeyError:
                if getattr(dtype, "name", "") == "FP32":
                    return 6  # Force override to BF16 (Type 6)
                raise  # any other unknown dtype is still an error

        mod.get_ifm_dtype = wrapper_get_ifm_dtype
Bypassing Compilation: The AX650 Cross-Compatibility Spoof
While the memory hooking allowed us to compile small models, the "Golden Combo" for compiling larger models from scratch requires:
- Standard RoPE Architecture (e.g., Qwen 1.5).
- Pre-Quantized Weights (e.g., Qwen1.5-0.5B-Chat-GPTQ-Int4 or AWQ) to ensure weight groupings are already set to groupN=128 prior to compilation.
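Before starting a compile, it can save time to confirm that a downloaded checkpoint really is 4-bit with a group size of 128. The sketch below assumes the checkpoint stores a quantization_config block in config.json, as Hugging Face Transformers does for GPTQ and AWQ models; some older checkpoints keep this in a separate file, which this sketch does not handle. The function names are hypothetical.

```python
import json
from pathlib import Path


def read_quant_info(model_dir):
    """Return (quant_method, bits, group_size), or None if no quantization_config exists."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    quant = config.get("quantization_config")
    if not quant:
        return None
    return quant.get("quant_method"), quant.get("bits"), quant.get("group_size")


def is_golden_combo_candidate(model_dir, required_bits=4, required_group=128):
    """True when the checkpoint is pre-quantized with the expected bits and group size."""
    info = read_quant_info(model_dir)
    if info is None:
        return False  # raw float checkpoint: no pre-set weight grouping
    _, bits, group_size = info
    return bits == required_bits and group_size == required_group
```

Calling is_golden_combo_candidate on a local model directory returns False for raw FP16/BF16 checkpoints, which are the ones that triggered the invalid groupN failure above.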
However, our biggest breakthrough was discovering a hardware cross-compatibility shortcut: The AX8850 silicon natively executes graph files (.axmodel) compiled for the older AX650 architecture.
Instead of wrestling with local compiler issues, we bypassed the entire compilation pipeline:
- Download a pre-compiled AX650 model from Hugging Face (such as AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4).
- Download the pre-compiled CMM slices directly to the target system.
- Launch the axllm PCIe inference gateway pointing to the model directory.
The AX8850 native memory allocator accepts the AX650 binary slices perfectly, serving the models natively on silicon without any magic bytes rejection.
# Verify CMM status
axcl-smi
# Spin up the gateway
axllm serve /opt/axera/models/gemma-4-E2B --host 0.0.0.0 --port 8000
Qwen 2.5-0.5B NPU Benchmarks
To verify the performance of pre-compiled AX650 binaries running natively on the AX8850 CMM blocks, we ran a standard inference test suite against Qwen 2.5-0.5B (AX650 Pre-Compiled). The model was served using the axllm gateway and queried via the completions API:
| Test Prompt | NPU Response Time | Key Output Snippet |
|---|---|---|
| P1: Capital of France? | 0.62s | "The capital of France is Paris." |
| P2: Define Zero Trust? | 2.52s | "Zero-trust architecture is a security model that..." |
| P3: HTTP GET Python Script? | 10.37s | Code output importing http.client |
| P4: Mitigate XSS? | 7.51s | "To mitigate XSS, use a Content Security Policy (CSP)..." |
| P5: Summarize OSI model? | 9.63s | OSI Layer summaries (Physical, Data Link, Network...) |
While the model's responses are direct and generation on the AX-M1 NPU is fast, we noticed minor repetition in longer outputs (e.g., the CSP header sentence repeated in P4), which is typical of small 0.5B-parameter models. However, the raw execution speed, and the fact that the model loads into memory and streams tokens in under a second, validates AX650 spoofing as a viable deployment pipeline.
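Response times like the ones in the table can be collected with a small wall-clock harness. The sketch below is an illustration rather than our exact test suite: the /v1/completions route and the prompt/max_tokens payload fields are assumptions about an OpenAI-style API and may need adjusting to what the gateway actually exposes.

```python
import json
import time
import urllib.request


def build_request(base_url, prompt, max_tokens=256):
    """Build a POST request for an OpenAI-style completions endpoint (assumed route)."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",  # assumption: adjust to the gateway's real route
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def http_send(base_url, timeout=120):
    """Return a callable that sends one prompt and returns the raw response text."""
    def send(prompt):
        with urllib.request.urlopen(build_request(base_url, prompt), timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    return send


def time_prompts(send, prompts):
    """Return a list of (prompt, elapsed_seconds, response) using wall-clock timing."""
    results = []
    for prompt in prompts:
        start = time.perf_counter()
        response = send(prompt)
        results.append((prompt, time.perf_counter() - start, response))
    return results
```

A run against the gateway would look like time_prompts(http_send("http://127.0.0.1:8000"), ["Capital of France?"]). Passing the sender in as a callable keeps the timing logic testable without a live NPU.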
Multi-Model Co-Existence: Running Concurrent Models on a Single NPU
Deploying local AI at the edge often hits a hard resource wall. While most edge deployments dedicate a single NPU board to a single running model, the Radxa AI Core AX-M1 provides 7040 MiB of CMM (contiguous memory). This memory layout allows us to partition the physical NPU allocation and run multiple models concurrently on a single silicon board.
By launching multiple instances of the axllm inference gateway on different network ports, we successfully served two independent models on the same NPU without memory overlap or graph compilation conflicts:
- Qwen 2B (INT4 AX650 Compiled) serving completions on port 8000
- Gemma 2B (INT4 AX650 Compiled) serving completions on port 8002
Concurrent Serving Execution
To start both models concurrently, we spin up two background axllm daemons. The NPU runtime automatically registers the model graphs into separate CMM memory blocks:
# Boot the Qwen 2B model on Port 8000
sudo sh -c 'nohup axllm serve /opt/axera/models/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047 \
--host 0.0.0.0 --port 8000 > /opt/axera/models/axllm_qwen_8000.log 2>&1 &'
# Boot the Gemma 2B model on Port 8002
sudo sh -c 'nohup axllm serve /opt/axera/models/gemma-4-E2B \
--host 0.0.0.0 --port 8002 > /opt/axera/models/axllm_gemma_8002.log 2>&1 &'
Verifying the active processes inside the LXC container confirms that both axllm gateway processes are running concurrently:
$ ps aux | grep axllm
root 29290 0.8 1.0 2034400 174232 ? Sl 21:58 0:07 axllm serve /opt/axera/models/Qwen3.5-2B... --port 8000
root 29915 0.0 0.7 651252 127736 ? Sl 22:14 0:00 axllm serve /opt/axera/models/gemma-4-E2B ... --port 8002
This multi-tenant NPU setup allows edge routers or localized offensive security hardware to route classification and analysis tasks to different specialized local models simultaneously, maximizing hardware utilization.
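A process listing shows that both daemons are alive, but not that they are accepting connections. A quick complementary check is a TCP probe of each port; the sketch below uses only the standard library, and the helper names are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def gateway_status(host, ports):
    """Map each port to whether something is listening on it."""
    return {port: port_is_open(host, port) for port in ports}
```

For the setup above, gateway_status("127.0.0.1", [8000, 8002]) should report True for both ports once the gateways have finished starting. A successful connect only proves a listener exists; it says nothing about whether the model behind it is responsive.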
The Catch: Concurrency Driver Bottlenecks & Loops
While the dual-model allocation is physically supported by the NPU's SRAM and CMM, our testing exposed significant stability bugs when querying both models simultaneously:
- API Timing Collisions & Cutoffs: When querying both Qwen 3.5 and Gemma 4 concurrently (e.g., asking both models to evaluate 2/2+2*2=), the axcl driver backend struggled to multiplex execution contexts. Qwen 3.5 successfully started its <think> reasoning block but froze mid-token.
- EOS Array Bugs & Loops: Gemma 4 started emitting repetitive garbage output in a loop ("content": "factfactfactfact..."). This occurred because the vendor gateway failed to handle Hugging Face token configs where eos_token_id is defined as an array ([1, 106]) instead of a single integer. We had to write an automated patching script to force eos_token_id = 106 in the on-disk configuration to restore proper end-of-sequence behavior.
- Driver Hangs: High-concurrency query execution eventually exhausted the NPU core scheduling queue, necessitating a hard service restart (pkill -f axllm) and cleanup of temporary files (/tmp/axcl/*).
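The EOS patch described above comes down to rewriting one JSON field. The following is a minimal sketch of that idea, not our actual script: which config file the gateway reads is left as a parameter, and force_single_eos is a hypothetical name.

```python
import json
from pathlib import Path


def force_single_eos(config_path, eos_id):
    """Rewrite eos_token_id to a single int if the config stores it as a list.

    Returns True when the file was patched, False when no change was needed.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    current = config.get("eos_token_id")
    if not isinstance(current, list):
        return False  # already a scalar (or absent): nothing to patch
    if eos_id not in current:
        raise ValueError(f"{eos_id} is not one of the configured EOS ids {current}")
    config["eos_token_id"] = eos_id
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

The function is idempotent, so it can run on every deployment: a config that already holds a single integer is left untouched, and an id that was never in the original list is refused rather than silently written.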
Model Support Quick Reference
Here is a quick compatibility matrix of the LLM architectures we tested against the AX8850 (AICore AX-M1) during our research:
| Model Name | Parameters | Source | Compatibility | Notes / Failure Cause |
|---|---|---|---|---|
| Qwen 2.5-0.5B | 0.5B | Pre-Compiled (AX650) | PASS | Runs natively. Fast token streaming. |
| Qwen 3.5-2B | 2.0B | Pre-Compiled (AX650) | PASS | Runs on port 8000. Susceptible to freeze under concurrent query load. |
| Qwen 1.5-0.5B | 0.5B | Local Compile (Pulsar2) | PASS | Compiled using our custom in-memory patcher hooks. |
| Gemma 4 E2B-it | 2.0B | Pre-Compiled (AX650) | PARTIAL | Runs on port 8002. Requires patching eos_token_id to single int 106 to prevent looping. |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | Pre-Compiled (AX650) | FAIL | Crashes with 0x8030070c driver execution error (often triggered by 0-byte HF stub config imports). |
| DeepSeek-Coder-V2 / MoE | Various | Direct compile attempt | FAIL | Compiler crash. SRAM tiler cannot process mRoPE (multimodal Rotary Position Embedding) shapes. |
| Raw float16 / bf16 models | Various | Direct compile attempt | FAIL | Compiler TileFailException / invalid groupN crash. SRAM requires pre-quantized weights (groupN=128). |
Deployment & Tooling Suite
To automate and streamline these deployment steps, we've developed and packaged a unified Python toolkit for NPU guest provisioning, model compiling, and testing. It is available under our public repositories as NextGenRedPVE-edge.
The Toolkit
- proxmox_create_vm.py
  - When to use: To spin up a clean Ubuntu template environment on a Proxmox cluster configured with proper storage drivers and QEMU guest agents.
  - What it does: Connects to Proxmox via SSH and automates VM template generation.
- proxmox_configure_vm.py
  - When to use: Immediately after VM generation.
  - What it does: Auto-resolves dynamic guest IPs from the guest agent, adds restricted developer users, and installs build dependencies and compilers.
- deploy_precompiled.py
  - When to use: The recommended path. Deploying a high-speed pre-compiled model from Hugging Face.
  - What it does: Connects to the guest container, downloads pre-compiled models (e.g., Qwen3.5 or Gemma), provisions cache folders, and launches the axllm serve gateway.
- deploy_golden_combo.py
  - When to use: Compiling custom standard RoPE models from raw HF quantized sources.
  - What it does: Automates the local compilation loop targeting the AX650 fallback architecture.
- benchmark_npu.py
  - When to use: After model boot to evaluate efficiency.
  - What it does: Runs a sequence of evaluation prompts and measures CPU, memory, and tokens/sec throughput during inference.
Next Steps
By leveraging the AX650 cross-compatibility spoof, we successfully booted a Gemma 2.6B model in under two minutes, running natively on the AX8850 edge NPU.
Now, we've taken the next leap: bypassing broken vendor C++ HTTP APIs, designing a custom FastAPI orchestrator, and piping bare-metal telemetry directly into OpenAI-formatted API outputs.
Check out the full story in Conquering the Silicon: Building a Custom 'Ollama' Orchestrator for the AX8850 NPU (Part II).