Qwen3-VL is a multimodal vision-language model family released by Alibaba's Qwen team on September 24, 2025; the 2B variant is a small, lightweight member of the series. We run this model on the RK3576. To use hardware acceleration (NPU), the model must be converted into Rockchip's formats, ultimately generating two core files:
- .rknn file: Contains the model's Visual Encoder part, responsible for converting images into feature vectors.
- .rkllm file: Contains the model's Large Language Model (LLM) part, responsible for understanding image features and performing text inference and generation.
To perform the conversion of these two files, two Rockchip tools are required:
- RKNN-Toolkit2: Specifically used to convert the vision part to the NPU-executable .rknn format.
- RKLLM-Toolkit: Specifically used to quantize and convert the language model part, generating the .rkllm format.
After completing the above, another Rockchip tool is needed to run the model on the board:
- RKLLM-Runtime: This is the inference engine (C++ library) running on the development board's Linux system. It is responsible for loading the two model files above and calling NPU drivers for high-performance inference.
1. Process Overview
2. Environment Preparation
- Host Environment: Ubuntu 22.04 (x86)
- Development Board: LCSC-TaishanPi-3M-RK3576
- Data Cable: Connect PC and development board for ADB file transfer.
RKNN-LLM
Clone the RKNN-LLM repository:
Repository: https://github.com/airockchip/rknn-llm
This is the official open-source repository provided by Rockchip
git clone https://github.com/airockchip/rknn-llm.git
Install miniforge3
To prevent Python environment issues caused by different environments on a single host, we use miniforge3 for management.
Install miniforge3:
# Download miniforge3 installation script
wget -c https://mirrors.bfsu.edu.cn/github-release/conda-forge/miniforge/LatestRelease/Miniforge3-Linux-x86_64.sh
# Run the installation script
bash Miniforge3-Linux-x86_64.sh
# 1. Press Enter to continue
# 2. Use the down arrow to scroll through the agreement
# 3. Enter yes at the end
# 4. When prompted "Proceed with initialization?", enter yes
You can check https://mirrors.bfsu.edu.cn/github-release/conda-forge/miniforge/LatestRelease/ to find the current latest .sh installer filename.
Initialize the conda environment variable:
source ~/miniforge3/bin/activate
After success, (base) will appear at the beginning of the command line.
Create RKLLM-Toolkit Environment
Create and activate a Conda environment: TaishanPi3-RKLLM-Toolkit (Python 3.10 is recommended)
# Create environment
conda create -n TaishanPi3-RKLLM-Toolkit python=3.10
# When prompted "Proceed ([y]/n)?"
# Enter y
Activate the Conda environment:
conda activate TaishanPi3-RKLLM-Toolkit
Install RKLLM-Toolkit:
In the rknn-llm/rkllm-toolkit/packages/ directory, there are several whl files to choose from:
- rkllm_toolkit-1.2.3-cp39-cp39-linux_x86_64.whl
- rkllm_toolkit-1.2.3-cp310-cp310-linux_x86_64.whl
- rkllm_toolkit-1.2.3-cp311-cp311-linux_x86_64.whl
- rkllm_toolkit-1.2.3-cp312-cp312-linux_x86_64.whl
We select the file based on the Python version. The Conda environment we created uses Python 3.10, so we select the file tagged cp310-cp310. For Python 3.12, you would use the file tagged cp312-cp312.
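If you are unsure which tag matches the interpreter in the active environment, a quick check (a minimal sketch, any Python interpreter will do):
import sys

# Print the CPython tag of the current interpreter, e.g. "cp310"
print(f"cp{sys.version_info.major}{sys.version_info.minor}")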
# Using Aliyun mirror https://mirrors.aliyun.com/pypi/simple
pip install rknn-llm/rkllm-toolkit/packages/rkllm_toolkit-1.2.3-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple
After installation, exit the TaishanPi3-RKLLM-Toolkit environment:
conda deactivate
Create RKNN-Toolkit2 Environment
Create and activate a Conda environment: TaishanPi3-RKNN-Toolkit2 (Python 3.10 is recommended)
# Create environment
conda create -n TaishanPi3-RKNN-Toolkit2 python=3.10
# When prompted "Proceed ([y]/n)?"
# Enter y
Activate the Conda environment:
conda activate TaishanPi3-RKNN-Toolkit2
Install RKNN-Toolkit2:
According to the official documentation, the version must be >= 2.3.2
# Using Aliyun mirror https://mirrors.aliyun.com/pypi/simple
pip install rknn-toolkit2 -i https://mirrors.aliyun.com/pypi/simple
After installation, exit the TaishanPi3-RKNN-Toolkit2 environment:
conda deactivate
3. Pulling the Model
We use the Qwen3-VL-2B-Instruct model and pull the model files from Hugging Face or ModelScope for our subsequent operations:
- Enter the TaishanPi3-RKLLM-Toolkit environment
conda activate TaishanPi3-RKLLM-Toolkit
- Install git-lfs
sudo apt update && sudo apt install git-lfs
- Pull the model
git clone https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
# Or use the domestic ModelScope community model
git clone https://www.modelscope.cn/Qwen/Qwen3-VL-2B-Instruct.git
4. Model Conversion
We continue operations in the TaishanPi3-RKLLM-Toolkit environment, exporting two model files:
- Export LLM model part (.rkllm)
- Export Vision part as ONNX (.onnx)
Navigate to the rknn-llm/examples/multimodal_model_demo directory to prevent path issues in the Python scripts:
cd rknn-llm/examples/multimodal_model_demo
Generate Dataset File
Modify the rknn-llm/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py script file as follows:
By default, the script is written for the qwen2_vl architecture. Since we need the qwen3_vl architecture, we modify the script so it handles the API differences between Qwen2-VL and Qwen3-VL and works with both.
import torch
import os
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from PIL import Image
import json
import numpy as np
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer, AutoProcessor
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--path', type=str, default='Qwen/Qwen2-VL-2B-Instruct', help='model path', required=False)
args = parser.parse_args()
path = args.path

if "Qwen3" in path:
    from transformers import Qwen3VLForConditionalGeneration as ModelClass
else:
    from transformers import Qwen2VLForConditionalGeneration as ModelClass

model = ModelClass.from_pretrained(
    path, torch_dtype="auto", device_map="cpu",
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(path)

datasets = json.load(open("data/datasets.json", 'r'))

for data in datasets:
    image_name = data["image"].split(".")[0]
    imgp = os.path.join(data["image_path"], data["image"])
    image = Image.open(imgp)
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                },
                {"type": "text", "text": data["input"]},
            ],
        }
    ]
    text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(
        text=[text_prompt], images=[image], padding=True, return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])
    pixel_values = inputs["pixel_values"].type(model.dtype)
    image_mask = inputs["input_ids"] == model.config.image_token_id
    image_embeds = model.visual(pixel_values, grid_thw=inputs["image_grid_thw"])
    if isinstance(image_embeds, tuple):
        image_embeds = image_embeds[0]
    image_embeds = image_embeds.to(inputs_embeds.device)
    inputs_embeds[image_mask] = image_embeds
    print("inputs_embeds", inputs_embeds.shape)
    os.makedirs("data/inputs_embeds/", exist_ok=True)
    np.save("data/inputs_embeds/{}".format(image_name), inputs_embeds.to(dtype=torch.float16).cpu().detach().numpy())

with open('data/inputs.json', 'w') as json_file:
    json_file.write('[\n')
    first = True
    for data in tqdm(datasets):
        input_embed = np.load(os.path.join("data/inputs_embeds", data["image"].split(".")[0] + '.npy'))
        target = data["target"]
        input_dict = {
            "input_embed": input_embed.tolist(),
            "target": target
        }
        if not first:
            json_file.write(',\n')
        else:
            first = False
        json.dump(input_dict, json_file)
    json_file.write('\n]')
print("Done")
The differences are as follows:
diff --git a/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py b/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py
index 2229b9a..3ef824e 100644
--- a/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py
+++ b/examples/multimodal_model_demo/data/make_input_embeds_for_quantize.py
@@ -6,7 +6,7 @@ from PIL import Image
import json
import numpy as np
from tqdm import tqdm
-from transformers import AutoModel, AutoTokenizer, AutoProcessor, Qwen2VLForConditionalGeneration
+from transformers import AutoModel, AutoTokenizer, AutoProcessor
import argparse
argparse = argparse.ArgumentParser()
@@ -14,7 +14,13 @@ argparse.add_argument('--path', type=str, default='Qwen/Qwen2-VL-2B-Instruct', h
args = argparse.parse_args()
path = args.path
-model = Qwen2VLForConditionalGeneration.from_pretrained(
+
+if "Qwen3" in path:
+ from transformers import Qwen3VLForConditionalGeneration as ModelClass
+else:
+ from transformers import Qwen2VLForConditionalGeneration as ModelClass
+
+model = ModelClass.from_pretrained(
path, torch_dtype="auto", device_map="cpu",
low_cpu_mem_usage=True,
trust_remote_code=True).eval()
@@ -43,10 +49,13 @@ for data in datasets:
text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)
- inputs_embeds = model.model.embed_tokens(inputs["input_ids"])
- pixel_values = inputs["pixel_values"].type(model.visual.get_dtype())
+ inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])
+ pixel_values = inputs["pixel_values"].type(model.dtype)
image_mask = inputs["input_ids"] == model.config.image_token_id
- image_embeds = model.visual(pixel_values, grid_thw=inputs["image_grid_thw"]).to(inputs_embeds.device)
+ image_embeds = model.visual(pixel_values, grid_thw=inputs["image_grid_thw"])
+ if isinstance(image_embeds, tuple):
+ image_embeds = image_embeds[0]
+ image_embeds = image_embeds.to(inputs_embeds.device)
inputs_embeds[image_mask] = image_embeds
print("inputs_embeds", inputs_embeds.shape)
     os.makedirs("data/inputs_embeds/", exist_ok=True)
Run the following command to generate the quantization calibration dataset file. This Python script reads information from data/datasets.json and, combined with the pulled model files, generates data/inputs.json:
python data/make_input_embeds_for_quantize.py \
    --path /home/lipeng/workspace/Qwen3-VL-2B-Instruct
--path: Use an absolute path; this points to the model directory we pulled.
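For reference, data/datasets.json is a list of calibration samples. Judging from the keys the script reads (image, image_path, input, target), an entry can be built roughly as follows (the file name, path, prompt, and answer below are hypothetical examples):
import json

# Hypothetical calibration entries; the keys match what
# make_input_embeds_for_quantize.py reads from data/datasets.json.
samples = [
    {
        "image": "demo.jpg",                        # image file name
        "image_path": "data",                       # directory containing the image
        "input": "Describe this image in detail.",  # user prompt
        "target": "A city street with cars."        # expected answer used as target
    }
]

with open("data/datasets.json", "w") as f:
    json.dump(samples, f, indent=2, ensure_ascii=False)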
Export LLM Model
python ./export/export_rkllm.py \
--path /home/lipeng/workspace/Qwen3-VL-2B-Instruct \
--target-platform rk3576 \
--num_npu_core 2 \
--quantized_dtype w8a8 \
    --device cpu
- --path: Use an absolute path; this points to the model directory we pulled, containing files such as config.json, model.safetensors, tokenizer.json, etc.
- --target-platform: Specifies the target board's SoC model.
- --num_npu_core: Number of NPU cores used for inference.
- --quantized_dtype: Quantization precision type
  - W8 (Weights 8-bit): Compresses the model weights from FP16 (16-bit floating point) to 8-bit integers, directly halving their size.
  - A8 (Activations 8-bit): The intermediate activation values generated during computation are also represented as 8-bit integers.
  - This is currently the most cost-effective option for edge devices: compared to FP16, W8A8 is much faster, uses less memory, and the accuracy loss is usually acceptable (see the rough size estimate after this list).
- --device: The hardware used on the PC during model conversion. CPU is slower but the safest choice with the best compatibility.
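As a back-of-the-envelope illustration of the W8 size claim (treating the model as roughly 2 billion weight parameters, which is an approximation):
# Rough weight-storage estimate for an ~2B-parameter model (approximation).
params = 2e9

fp16_gb = params * 2 / 1e9  # FP16: 2 bytes per weight
int8_gb = params * 1 / 1e9  # W8:   1 byte per weight

print(f"FP16 weights: ~{fp16_gb:.1f} GB")  # ~4.0 GB
print(f"W8 weights:   ~{int8_gb:.1f} GB")  # ~2.0 GB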
After export, an rkllm/ folder will be generated in the current directory, storing our exported model files.
Export ONNX
Exit the TaishanPi3-RKLLM-Toolkit environment:
conda deactivate
Enter the TaishanPi3-RKNN-Toolkit2 environment:
conda activate TaishanPi3-RKNN-Toolkit2
Install dependencies:
For specific instructions, refer to the README
# Install transformers 4.57.0
pip install transformers==4.57.0
# Install onnx 1.18.0
pip install onnx==1.18.0
# Install dependencies
sudo apt-get update && sudo apt-get install -y libgl1 libglib2.0-0 libsm6 libxext6
Export Vision part as ONNX (.onnx):
python export/export_vision.py \
--path=/home/lipeng/workspace/Qwen3-VL-2B-Instruct \
--model_name=qwen3-vl \
--height=448 \
    --width=448
A qwen3-vl_vision.onnx file will be generated in the onnx folder under the current directory (rknn-llm/examples/multimodal_model_demo/onnx/).
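Optionally, you can sanity-check the exported file with the onnx Python package (a minimal sketch; the exact input/output names depend on how the export script defines them):
import onnx

# Load and structurally validate the exported vision encoder
model = onnx.load("onnx/qwen3-vl_vision.onnx")
onnx.checker.check_model(model)

# List the graph's input and output tensor names
print("inputs: ", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])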
Convert RKNN Model
We continue operations in the TaishanPi3-RKNN-Toolkit2 environment to convert the exported .onnx model to a .rknn format vision model:
python export/export_vision_rknn.py \
--path=./onnx/qwen3-vl_vision.onnx \
--model_name=qwen3-vl \
--target-platform=rk3576 \
--height=448 \
    --width=448
Note: If the model is a qwen3***** variant, use qwen3-vl directly as the model_name.
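Conceptually, export_vision_rknn.py wraps the standard RKNN-Toolkit2 conversion flow. A simplified sketch under our settings (the real script additionally configures input shapes and other options, so treat this only as an outline):
from rknn.api import RKNN

# Create the converter and target the RK3576 NPU
rknn = RKNN(verbose=True)
rknn.config(target_platform="rk3576")

# Load the exported ONNX vision encoder and build the RKNN model
rknn.load_onnx(model="./onnx/qwen3-vl_vision.onnx")
rknn.build(do_quantization=False)

# Write out the .rknn file and release resources
rknn.export_rknn("./rknn/qwen3-vl_vision_rk3576.rknn")
rknn.release()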
5. Demo Compilation (C++)
Overview
The official Rockchip open-source project provides demos written in C++. The sample code can be compiled directly by running one of:
- rknn-llm/examples/multimodal_model_demo/deploy/build-linux.sh
- rknn-llm/examples/multimodal_model_demo/deploy/build-android.sh
These two scripts compile the sample code directly (after replacing the cross-compiler paths with your actual paths).
In the deploy directory, an install/demo_Linux_aarch64 or install/demo_Android_aarch64 folder will be generated, containing imgenc, llm, demo, and lib folders.
Exit Environment
conda deactivate
When (base) appears at the beginning of the command line, the environment has been exited.
Install Cross-Compiler
We need to compile the demo on the PC and run the resulting binaries on the LCSC-TaishanPi-3M-RK3576 board, so we install the aarch64-linux-gnu cross-compilation toolchain via apt:
sudo apt update && \
sudo apt install -y cmake make gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
Modify Build Script
Next, we need to modify the cross-compilation script so that it uses the cross-compiler we installed for compilation.
Modify the rknn-llm/examples/multimodal_model_demo/deploy/build-linux.sh script to:
set -e
rm -rf build
mkdir build && cd build
cmake .. -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
    -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_SYSTEM_PROCESSOR=aarch64
make -j8
make install
The specific differences are as follows:
diff --git a/examples/multimodal_model_demo/deploy/build-linux.sh b/examples/multimodal_model_demo/deploy/build-linux.sh
index c75d9c5..1c9b6b0 100755
--- a/examples/multimodal_model_demo/deploy/build-linux.sh
+++ b/examples/multimodal_model_demo/deploy/build-linux.sh
@@ -2,9 +2,8 @@ set -e
rm -rf build
mkdir build && cd build
-GCC_COMPILER=~/opts/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu
-cmake .. -DCMAKE_CXX_COMPILER=${GCC_COMPILER}/bin/aarch64-none-linux-gnu-g++ \
- -DCMAKE_C_COMPILER=${GCC_COMPILER}/bin/aarch64-none-linux-gnu-gcc \
+cmake .. -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
+ -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
     -DCMAKE_BUILD_TYPE=Release \
     -DCMAKE_SYSTEM_NAME=Linux \
     -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
Compile
Navigate to the specified directory:
cd rknn-llm/examples/multimodal_model_demo/deploy
Run the build script:
./build-linux.sh
The final generated install/ directory structure is as follows:
`-- demo_Linux_aarch64
    |-- demo       # Final executable
    |-- demo.jpg   # Multimodal test image
    |-- imgenc
    `-- lib        # Required dependency files
        |-- librkllmrt.so
        `-- librknnrt.so

2 directories, 5 files
6. Board Demo Presentation
Next, we need to transfer some files to our board:
- rknn-llm/examples/multimodal_model_demo/deploy/install/demo_Linux_aarch64
- rknn-llm/examples/multimodal_model_demo/rkllm/qwen3-vl-2b-instruct_w8a8_rk3576.rkllm
- rknn-llm/examples/multimodal_model_demo/rknn/qwen3-vl_vision_rk3576.rknn
Create a qwen3-vl-2b-instruct directory on the board to store the files we will transfer:
mkdir ~/qwen3-vl-2b-instruct
Copy install Folder
It is recommended to use the adb tool for transfer. The LCSC-TaishanPi-3M has ADB enabled by default. You can also use TF card, SSH, or USB drive.
Refer to: https://wiki.lckfb.com/zh-hans/tspi-3-rk3576/system-usage/debian12-usage/adb-usage.html
Push the entire install/demo_Linux_aarch64 directory to the board at /home/lckfb/qwen3-vl-2b-instruct/:
adb push rknn-llm/examples/multimodal_model_demo/deploy/install/demo_Linux_aarch64 /home/lckfb/qwen3-vl-2b-instruct/
Transfer Models to Board
Push the qwen3-vl-2b-instruct_w8a8_rk3576.rkllm model to the board at /home/lckfb/qwen3-vl-2b-instruct/:
adb push rknn-llm/examples/multimodal_model_demo/rkllm/qwen3-vl-2b-instruct_w8a8_rk3576.rkllm /home/lckfb/qwen3-vl-2b-instruct/
Push the qwen3-vl_vision_rk3576.rknn model to the board at /home/lckfb/qwen3-vl-2b-instruct/:
adb push rknn-llm/examples/multimodal_model_demo/rknn/qwen3-vl_vision_rk3576.rknn /home/lckfb/qwen3-vl-2b-instruct/
Running on Board
We enter the LCSC-TaishanPi-3M development board terminal and navigate to the /home/lckfb/qwen3-vl-2b-instruct/demo_Linux_aarch64/ directory:
# Navigate to the directory
cd /home/lckfb/qwen3-vl-2b-instruct/demo_Linux_aarch64/
Set the dynamic library path (located in the ./lib subdirectory):
# Set the dynamic library path (very important, otherwise errors will occur)
export LD_LIBRARY_PATH=./lib:$LD_LIBRARY_PATH
If you want to view performance statistics, set the following variable:
export RKLLM_LOG_LEVEL=1
Grant executable permission to the demo:
sudo chmod +x demo
Run the Demo:
Usage:
./demo [Image] [Vision Model] [Language Model] [Generation Length] [Context Length] [NPU Core Count] [Special Prompt Tokens...]
Note: Because the model files are in the parent directory, we reference them with ../
./demo demo.jpg \
../qwen3-vl_vision_rk3576.rknn \
../qwen3-vl-2b-instruct_w8a8_rk3576.rkllm \
    256 2048 2 "<|vision_start|>" "<|vision_end|>" "<|image_pad|>"
The three tokens "<|vision_start|>", "<|vision_end|>", and "<|image_pad|>" are actually special placeholder tokens in multimodal LLM input:
"<|vision_start|>": Visual Start Token- Indicates the starting position of image information in the LLM input sequence, telling the model "an image's content will be inserted next."
"<|vision_end|>": Visual End Token- Indicates where the image information ends, telling the model "image information input ends here."
"<|image_pad|>": Image Padding Token- When processing multiple images in batch inference, image patch/token lengths may vary. To align inputs, Pad tokens are often used to pad to a consistent length. This token is used for padding.
Essentially, these are special string tokens that tell the LLM "where the image content starts, where it ends, and what to pad with when not filled", used for multimodal inference input.
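For intuition, a hand-built sketch of the token positions (in reality the chat template is applied by the processor, and the number of <|image_pad|> tokens depends on the image resolution, so the count below is a hypothetical placeholder):
# Hypothetical illustration of the special tokens' positions in the prompt.
num_image_tokens = 4  # in practice determined by the image size / patching

prompt = (
    "<|im_start|>user\n"
    "<|vision_start|>" + "<|image_pad|>" * num_image_tokens + "<|vision_end|>"
    "Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(prompt)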
After successful execution, you can engage in Q&A.
The terminal will output the model's description of, or answer about, the demo.jpg image.
Question 1
Let's have the model analyze the demo.jpg that was transferred to the board earlier: