Qwen3-VL is a multimodal vision-language model family released by Alibaba's Qwen team on September 24, 2025; the 2B variant is a small, lightweight member of the series. We run this model on the RK3576. To use hardware acceleration (NPU), the model must be converted into Rockchip's formats, ultimately generating two core files:
- .rknn file: Contains the model's Visual Encoder part, responsible for converting images into feature vectors.
- .rkllm file: Contains the model's Large Language Model (LLM) part, responsible for understanding image features and performing text inference and generation.
To perform the conversion of these two files, two Rockchip tools are required:
- RKNN-Toolkit2: Specifically used to convert the vision part to the NPU-executable .rknn format.
- RKLLM-Toolkit: Specifically used to quantize and convert the language model part, generating the .rkllm format.
After completing the above, another Rockchip tool is needed to run the model on the board:
- RKLLM-Runtime: This is the inference engine (C++ library) running on the development board's Linux system. It is responsible for loading the two model files above and calling NPU drivers for high-performance inference.
1. Process Overview
2. Environment Preparation
- Host Environment: Ubuntu 22.04 (x86)
- Development Board: LCSC-TaishanPi-3M-RK3576
- Data Cable: Connect PC and development board for ADB file transfer.
RKNN-LLM
Clone the RKNN-LLM repository:
Repository: https://github.com/airockchip/rknn-llm
This is the official open-source repository provided by Rockchip
git clone https://github.com/airockchip/rknn-llm.git
Install miniforge3
To prevent Python environment issues caused by different environments on a single host, we use miniforge3 for management.
Install miniforge3:
# Download miniforge3 installation script
wget -c https://mirrors.bfsu.edu.cn/github-release/conda-forge/miniforge/LatestRelease/Miniforge3-Linux-x86_64.sh
# Run the installation script
bash Miniforge3-Linux-x86_64.sh
# 1. Press Enter to continue
# 2. Use the down arrow to scroll through the agreement
# 3. Enter yes at the end
# 4. When prompted "Proceed with initialization?", enter yes
You can check https://mirrors.bfsu.edu.cn/github-release/conda-forge/miniforge/LatestRelease/ to find the current latest .sh installer filename.
Initialize the conda environment variable:
source ~/miniforge3/bin/activate
After success, (base) will appear at the beginning of the command line.
Create RKLLM-Toolkit Environment
Create and activate a Conda environment: TaishanPi3-RKLLM-Toolkit (Python 3.10 is recommended)
# Create environment
conda create -n TaishanPi3-RKLLM-Toolkit python=3.10
# When prompted "Proceed ([y]/n)?"
# Enter y
Activate the Conda environment:
conda activate TaishanPi3-RKLLM-Toolkit
Install RKLLM-Toolkit:
In the rknn-llm/rkllm-toolkit/packages/ directory, there are several whl files to choose from:
- rkllm_toolkit-1.2.3-cp39-cp39-linux_x86_64.whl
- rkllm_toolkit-1.2.3-cp310-cp310-linux_x86_64.whl
- rkllm_toolkit-1.2.3-cp311-cp311-linux_x86_64.whl
- rkllm_toolkit-1.2.3-cp312-cp312-linux_x86_64.whl
We select the file based on the Python version. The Conda environment we created uses Python 3.10, so we select the file tagged cp310-cp310. For Python 3.12, you would use the file tagged cp312-cp312.
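If you are unsure which tag matches the interpreter in the active environment, a quick check (a minimal sketch, any Python interpreter will do):
import sys

# Print the CPython tag of the current interpreter, e.g. "cp310"
print(f"cp{sys.version_info.major}{sys.version_info.minor}")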
# Using Aliyun mirror https://mirrors.aliyun.com/pypi/simple
pip install rknn-llm/rkllm-toolkit/packages/rkllm_toolkit-1.2.3-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple
After installation, exit the TaishanPi3-RKLLM-Toolkit environment:
conda deactivate
Create RKNN-Toolkit2 Environment
Create and activate a Conda environment: TaishanPi3-RKNN-Toolkit2 (Python 3.10 is recommended)
# Create environment
conda create -n TaishanPi3-RKNN-Toolkit2 python=3.10
# When prompted "Proceed ([y]/n)?"
# Enter y
Activate the Conda environment:
conda activate TaishanPi3-RKNN-Toolkit2
Install RKNN-Toolkit2:
According to the official documentation, the version must be >= 2.3.2
# Using Aliyun mirror https://mirrors.aliyun.com/pypi/simple
pip install rknn-toolkit2 -i https://mirrors.aliyun.com/pypi/simple
After installation, exit the TaishanPi3-RKNN-Toolkit2 environment:
conda deactivate
3. Pulling the Model
We use the Qwen3-VL-2B-Instruct model and pull the model files from Hugging Face or ModelScope for our subsequent operations:
- Enter the TaishanPi3-RKLLM-Toolkit environment
conda activate TaishanPi3-RKLLM-Toolkit
- Install git-lfs
sudo apt update && sudo apt install git-lfs
- Pull the model
git clone https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
# Or use the domestic ModelScope community model
git clone https://www.modelscope.cn/Qwen/Qwen3-VL-2B-Instruct.git
4. Model Conversion
We continue operations in the TaishanPi3-RKLLM-Toolkit environment, exporting two model files:
- Export LLM model part (.rkllm)
- Export Vision part as ONNX (.onnx)
Navigate to the rknn-llm/examples/multimodal_model_demo directory to prevent path issues in the Python scripts:
cd rknn-llm/examples/multimodal_model_demo
Generate Dataset File
Modify the rknn-llm/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py script file as follows:
By default, the script is written for the qwen2_vl architecture. Since we need the qwen3_vl architecture, we modify the script so it handles the API differences between Qwen2-VL and Qwen3-VL and works with both.
import torch
import os
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from PIL import Image
import json
import numpy as np
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer, AutoProcessor
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--path', type=str, default='Qwen/Qwen2-VL-2B-Instruct', help='model path', required=False)
args = parser.parse_args()
path = args.path

if "Qwen3" in path:
    from transformers import Qwen3VLForConditionalGeneration as ModelClass
else:
    from transformers import Qwen2VLForConditionalGeneration as ModelClass

model = ModelClass.from_pretrained(
    path, torch_dtype="auto", device_map="cpu",
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(path)

datasets = json.load(open("data/datasets.json", 'r'))

for data in datasets:
    image_name = data["image"].split(".")[0]
    imgp = os.path.join(data["image_path"], data["image"])
    image = Image.open(imgp)
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                },
                {"type": "text", "text": data["input"]},
            ],
        }
    ]
    text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(
        text=[text_prompt], images=[image], padding=True, return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])
    pixel_values = inputs["pixel_values"].type(model.dtype)
    image_mask = inputs["input_ids"] == model.config.image_token_id
    image_embeds = model.visual(pixel_values, grid_thw=inputs["image_grid_thw"])
    if isinstance(image_embeds, tuple):
        image_embeds = image_embeds[0]
    image_embeds = image_embeds.to(inputs_embeds.device)
    inputs_embeds[image_mask] = image_embeds
    print("inputs_embeds", inputs_embeds.shape)
    os.makedirs("data/inputs_embeds/", exist_ok=True)
    np.save("data/inputs_embeds/{}".format(image_name), inputs_embeds.to(dtype=torch.float16).cpu().detach().numpy())

with open('data/inputs.json', 'w') as json_file:
    json_file.write('[\n')
    first = True
    for data in tqdm(datasets):
        input_embed = np.load(os.path.join("data/inputs_embeds", data["image"].split(".")[0] + '.npy'))
        target = data["target"]
        input_dict = {
            "input_embed": input_embed.tolist(),
            "target": target
        }
        if not first:
            json_file.write(',\n')
        else:
            first = False
        json.dump(input_dict, json_file)
    json_file.write('\n]')
print("Done")
The differences are as follows:
diff --git a/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py b/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py
index 2229b9a..3ef824e 100644
--- a/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py
+++ b/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py
@@ -6,7 +6,7 @@ from PIL import Image
import json
import numpy as np
from tqdm import tqdm
-from transformers import AutoModel, AutoTokenizer, AutoProcessor, Qwen2VLForConditionalGeneration
+from transformers import AutoModel, AutoTokenizer, AutoProcessor
import argparse
argparse = argparse.ArgumentParser()
@@ -14,7 +14,13 @@ argparse.add_argument('--path', type=str, default='Qwen/Qwen2-VL-2B-Instruct', h
args = argparse.parse_args()
path = args.path
-model = Qwen2VLForConditionalGeneration.from_pretrained(
+
+if "Qwen3" in path:
+ from transformers import Qwen3VLForConditionalGeneration as ModelClass
+else:
+ from transformers import Qwen2VLForConditionalGeneration as ModelClass
+
+model = ModelClass.from_pretrained(
path, torch_dtype="auto", device_map="cpu",
low_cpu_mem_usage=True,
trust_remote_code=True).eval()
@@ -43,10 +49,13 @@ for data in datasets:
text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)
- inputs_embeds = model.model.embed_tokens(inputs["input_ids"])
- pixel_values = inputs["pixel_values"].type(model.visual.get_dtype())
+ inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])
+ pixel_values = inputs["pixel_values"].type(model.dtype)
image_mask = inputs["input_ids"] == model.config.image_token_id
- image_embeds = model.visual(pixel_values, grid_thw=inputs["image_grid_thw"]).to(inputs_embeds.device)
+ image_embeds = model.visual(pixel_values, grid_thw=inputs["image_grid_thw"])
+ if isinstance(image_embeds, tuple):
+ image_embeds = image_embeds[0]
+ image_embeds = image_embeds.to(inputs_embeds.device)
inputs_embeds[image_mask] = image_embeds
print("inputs_embeds", inputs_embeds.shape)
     os.makedirs("data/inputs_embeds/", exist_ok=True)
Run the following command to generate the quantization calibration dataset file. This Python script reads information from data/datasets.json and, combined with the pulled model files, generates data/inputs.json:
python data/make_input_embeds_for_quantize.py \
    --path /home/lipeng/workspace/Qwen3-VL-2B-Instruct
--path: Use an absolute path; this points to the model directory we pulled.
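For reference, data/datasets.json is a list of calibration samples. Judging from the keys the script reads (image, image_path, input, target), an entry can be built roughly as follows (the file name, path, prompt, and answer below are hypothetical examples):
import json

# Hypothetical calibration entries; the keys match what
# make_input_embeds_for_quantize.py reads from data/datasets.json.
samples = [
    {
        "image": "demo.jpg",                        # image file name
        "image_path": "data",                       # directory containing the image
        "input": "Describe this image in detail.",  # user prompt
        "target": "A city street with cars."        # expected answer used as target
    }
]

with open("data/datasets.json", "w") as f:
    json.dump(samples, f, indent=2, ensure_ascii=False)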
Export LLM Model
python ./export/export_rkllm.py \
--path /home/lipeng/workspace/Qwen3-VL-2B-Instruct \
--target-platform rk3576 \
--num_npu_core 2 \
--quantized_dtype w8a8 \
    --device cpu
- --path: Use an absolute path; this points to the model directory we pulled, containing files such as config.json, model.safetensors, tokenizer.json, etc.
- --target-platform: Specifies the target board's SoC model.
- --num_npu_core: Number of NPU cores used for inference.
- --quantized_dtype: Quantization precision type
  - W8 (Weights 8-bit): Compresses the model weights from FP16 (16-bit floating point) to 8-bit integers, directly halving their size.
  - A8 (Activations 8-bit): The intermediate activation values generated during computation are also represented as 8-bit integers.
  - This is currently the most cost-effective option for edge devices: compared to FP16, W8A8 is much faster, uses less memory, and the accuracy loss is usually acceptable (see the rough size estimate after this list).
- --device: The hardware used on the PC during model conversion. CPU is slower but the safest choice with the best compatibility.
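As a back-of-the-envelope illustration of the W8 size claim (treating the model as roughly 2 billion weight parameters, which is an approximation):
# Rough weight-storage estimate for an ~2B-parameter model (approximation).
params = 2e9

fp16_gb = params * 2 / 1e9  # FP16: 2 bytes per weight
int8_gb = params * 1 / 1e9  # W8:   1 byte per weight

print(f"FP16 weights: ~{fp16_gb:.1f} GB")  # ~4.0 GB
print(f"W8 weights:   ~{int8_gb:.1f} GB")  # ~2.0 GB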
After export, an rkllm/ folder will be generated in the current directory, storing our exported model files.
Export ONNX
Exit the TaishanPi3-RKLLM-Toolkit environment:
conda deactivate
Enter the TaishanPi3-RKNN-Toolkit2 environment:
conda activate TaishanPi3-RKNN-Toolkit2
Install dependencies:
For specific instructions, refer to the README
# Install transformers 4.57.0
pip install transformers==4.57.0
# Install onnx 1.18.0
pip install onnx==1.18.0
# Install dependencies
sudo apt-get update && sudo apt-get install -y libgl1 libglib2.0-0 libsm6 libxext6
Export Vision part as ONNX (.onnx):
python export/export_vision.py \
--path=/home/lipeng/workspace/Qwen3-VL-2B-Instruct \
--model_name=qwen3-vl \
--height=448 \
    --width=448
A qwen3-vl_vision.onnx file will be generated in the onnx folder under the current directory (rknn-llm/examples/multimodal_model_demo/onnx/).
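Optionally, you can sanity-check the exported file with the onnx Python package (a minimal sketch; the exact input/output names depend on how the export script defines them):
import onnx

# Load and structurally validate the exported vision encoder
model = onnx.load("onnx/qwen3-vl_vision.onnx")
onnx.checker.check_model(model)

# List the graph's input and output tensor names
print("inputs: ", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])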
Convert RKNN Model
We continue operations in the TaishanPi3-RKNN-Toolkit2 environment to convert the exported .onnx model to a .rknn format vision model:
python export/export_vision_rknn.py \
--path=./onnx/qwen3-vl_vision.onnx \
--model_name=qwen3-vl \
--target-platform=rk3576 \
--height=448 \
    --width=448
Note: If the model is a qwen3***** variant, use qwen3-vl directly as the model_name.
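Conceptually, export_vision_rknn.py wraps the standard RKNN-Toolkit2 conversion flow. A simplified sketch under our settings (the real script additionally configures input shapes and other options, so treat this only as an outline):
from rknn.api import RKNN

# Create the converter and target the RK3576 NPU
rknn = RKNN(verbose=True)
rknn.config(target_platform="rk3576")

# Load the exported ONNX vision encoder and build the RKNN model
rknn.load_onnx(model="./onnx/qwen3-vl_vision.onnx")
rknn.build(do_quantization=False)

# Write out the .rknn file and release resources
rknn.export_rknn("./rknn/qwen3-vl_vision_rk3576.rknn")
rknn.release()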
5. Demo Compilation (C++)
Overview
The official Rockchip open-source project provides demos written in C++. The sample code can be compiled directly by running one of:
- rknn-llm/examples/multimodal_model_demo/deploy/build-linux.sh
- rknn-llm/examples/multimodal_model_demo/deploy/build-android.sh
These two scripts compile the sample code directly (after replacing the cross-compiler paths with your actual paths).
In the deploy directory, an install/demo_Linux_aarch64 or install/demo_Android_aarch64 folder will be generated, containing imgenc, llm, demo, and lib folders.
Exit Environment
conda deactivate
When (base) appears at the beginning of the command line, the environment has been exited.
Install Cross-Compiler
We need to compile the demo on the PC and run the resulting binaries on the LCSC-TaishanPi-3M-RK3576 board, so we install the aarch64-linux-gnu cross-compilation toolchain via apt:
sudo apt update && \
sudo apt install -y cmake make gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
Modify Build Script
Next, we need to modify the cross-compilation script so that it uses the cross-compiler we installed for compilation.
Modify the rknn-llm/examples/multimodal_model_demo/deploy/build-linux.sh script to:
set -e
rm -rf build
mkdir build && cd build
cmake .. -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
    -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_SYSTEM_PROCESSOR=aarch64
make -j8
make install
The specific differences are as follows:
diff --git a/examples/multimodal_model_demo/deploy/build-linux.sh b/examples/multimodal_model_demo/deploy/build-linux.sh
index c75d9c5..1c9b6b0 100755
--- a/examples/multimodal_model_demo/deploy/build-linux.sh
+++ b/examples/multimodal_model_demo/deploy/build-linux.sh
@@ -2,9 +2,8 @@ set -e
rm -rf build
mkdir build && cd build
-GCC_COMPILER=~/opts/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu
-cmake .. -DCMAKE_CXX_COMPILER=${GCC_COMPILER}/bin/aarch64-none-linux-gnu-g++ \
- -DCMAKE_C_COMPILER=${GCC_COMPILER}/bin/aarch64-none-linux-gnu-gcc \
+cmake .. -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
+ -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
     -DCMAKE_BUILD_TYPE=Release \
     -DCMAKE_SYSTEM_NAME=Linux \
     -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
Compile
Navigate to the specified directory:
cd rknn-llm/examples/multimodal_model_demo/deploy
Run the build script:
./build-linux.sh
The final generated install/ directory structure is as follows:
`-- demo_Linux_aarch64
    |-- demo       # Final executable
    |-- demo.jpg   # Multimodal test image
    |-- imgenc
    `-- lib        # Required dependency files
        |-- librkllmrt.so
        `-- librknnrt.so

2 directories, 5 files
6. Board Demo Presentation
Next, we need to transfer some files to our board:
- rknn-llm/examples/multimodal_model_demo/deploy/install/demo_Linux_aarch64
- rknn-llm/examples/multimodal_model_demo/rkllm/qwen3-vl-2b-instruct_w8a8_rk3576.rkllm
- rknn-llm/examples/multimodal_model_demo/rknn/qwen3-vl_vision_rk3576.rknn
Create a qwen3-vl-2b-instruct directory on the board to store the files we will transfer:
mkdir ~/qwen3-vl-2b-instruct
Copy install Folder
It is recommended to use the adb tool for transfer. The LCSC-TaishanPi-3M has ADB enabled by default. You can also use TF card, SSH, or USB drive.
Refer to: https://wiki.lckfb.com/zh-hans/tspi-3-rk3576/system-usage/debian12-usage/adb-usage.html
Push the entire install/demo_Linux_aarch64 directory to the board at /home/lckfb/qwen3-vl-2b-instruct/:
adb push rknn-llm/examples/multimodal_model_demo/deploy/install/demo_Linux_aarch64 /home/lckfb/qwen3-vl-2b-instruct/
Transfer Models to Board
Push the qwen3-vl-2b-instruct_w8a8_rk3576.rkllm model to the board at /home/lckfb/qwen3-vl-2b-instruct/:
adb push rknn-llm/examples/multimodal_model_demo/rkllm/qwen3-vl-2b-instruct_w8a8_rk3576.rkllm /home/lckfb/qwen3-vl-2b-instruct/
Push the qwen3-vl_vision_rk3576.rknn model to the board at /home/lckfb/qwen3-vl-2b-instruct/:
adb push rknn-llm/examples/multimodal_model_demo/rknn/qwen3-vl_vision_rk3576.rknn /home/lckfb/qwen3-vl-2b-instruct/
Running on Board
We enter the LCSC-TaishanPi-3M development board terminal and navigate to the /home/lckfb/qwen3-vl-2b-instruct/demo_Linux_aarch64/ directory:
# Navigate to the directory
cd /home/lckfb/qwen3-vl-2b-instruct/demo_Linux_aarch64/
Set the dynamic library path (located in the ./lib subdirectory):
# Set the dynamic library path (very important, otherwise errors will occur)
export LD_LIBRARY_PATH=./lib:$LD_LIBRARY_PATH
If you want to view performance statistics, set the following variable:
export RKLLM_LOG_LEVEL=1
Grant executable permission to the demo:
sudo chmod +x demo
Run the Demo:
Usage:
./demo [Image] [Vision Model] [Language Model] [Generation Length] [Context Length] [NPU Core Count] [Special Prompt Tokens...]
Note: Because the model files are in the parent directory, we reference them with ../
./demo demo.jpg \
../qwen3-vl_vision_rk3576.rknn \
../qwen3-vl-2b-instruct_w8a8_rk3576.rkllm \
    256 2048 2 "<|vision_start|>" "<|vision_end|>" "<|image_pad|>"
The three tokens "<|vision_start|>", "<|vision_end|>", and "<|image_pad|>" are actually special placeholder tokens in multimodal LLM input:
"<|vision_start|>": Visual Start Token- Indicates the starting position of image information in the LLM input sequence, telling the model "an image's content will be inserted next."
"<|vision_end|>": Visual End Token- Indicates where the image information ends, telling the model "image information input ends here."
"<|image_pad|>": Image Padding Token- When processing multiple images in batch inference, image patch/token lengths may vary. To align inputs, Pad tokens are often used to pad to a consistent length. This token is used for padding.
Essentially, these are special string tokens that tell the LLM "where the image content starts, where it ends, and what to pad with when not filled", used for multimodal inference input.
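For intuition, a hand-built sketch of the token positions (in reality the chat template is applied by the processor, and the number of <|image_pad|> tokens depends on the image resolution, so the count below is a hypothetical placeholder):
# Hypothetical illustration of the special tokens' positions in the prompt.
num_image_tokens = 4  # in practice determined by the image size / patching

prompt = (
    "<|im_start|>user\n"
    "<|vision_start|>" + "<|image_pad|>" * num_image_tokens + "<|vision_end|>"
    "Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(prompt)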
After successful execution, you can engage in Q&A.
The terminal will output the model's description of, or answer about, the demo.jpg image.
Question 1
Let's have the model analyze the demo.jpg that was transferred to the board earlier: