Running the Whisper Model on MaixPy MaixCAM

2026-01-05

Update history

Date	Version	Author	Update content
2026-01-05	1.0.0	lxowalle	Added Whisper documentation

Whisper Model Overview

Whisper is a general-purpose speech recognition model open-sourced by OpenAI, designed for tasks such as multilingual speech recognition and speech translation.
The Whisper model currently ported to MaixCAM2 is the base version. It takes mono WAV audio files sampled at 16 kHz as input and can recognize Chinese and English.

Downloading the Model

Supported models:

Model	Platform	Memory Requirement	Description
whisper-base-maixcam2	MaixCAM2	1 GB	base

Refer to the Large Model User Guide to download the model.

Running the Model with MaixPy

Currently, only the base-size Whisper model is supported. It accepts mono, 16 kHz WAV audio files and supports Chinese and English recognition.
Below is a simple example demonstrating how to use Whisper for speech recognition:

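from maix import nn

# Load the base-size Whisper model (recognizes Chinese by default)
whisper = nn.Whisper(model="/root/models/whisper-base-maixcam2/whisper-base.mud")

# The input must be a mono, 16 kHz WAV file
wav_path = "/maixapp/share/audio/demo.wav"

# Run recognition; the result is returned directly
res = whisper.transcribe(wav_path)
print('res:', res)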

Notes:

First, import the nn module to create a Whisper model object:

from maix import nn

Select the model to load. Currently, only the base-size Whisper model is supported:

whisper = nn.Whisper(model="/root/models/whisper-base-maixcam2/whisper-base.mud")

Prepare a mono, 16 kHz WAV audio file and run inference. The recognition result will be returned directly:

wav_path = "/maixapp/share/audio/demo.wav"
res = whisper.transcribe(wav_path)
print('whisper:', res)

Output result:

whisper: 开始愉快的探索吧
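
Because the model expects a mono, 16 kHz WAV file, it can help to verify a recording before passing it to the model. The following is a minimal sketch that uses only Python's standard `wave` module. The helper name `check_wav_format` is hypothetical and is not part of MaixPy.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of mismatches; an empty list means the file looks usable.

    Hypothetical helper (not a MaixPy API): it only reads the WAV header.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
    if channels != expected_channels:
        problems.append(f"expected {expected_channels} channel(s), found {channels}")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    return problems
```

For example, `check_wav_format("/maixapp/share/audio/demo.wav")` should return an empty list if the file matches the required format. `wave.open` raises `wave.Error` for files it cannot parse as WAV.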

By default, the model recognizes Chinese.
To recognize English, specify the language parameter when initializing the object:

whisper = nn.Whisper(model="/root/models/whisper-base-maixcam2/whisper-base.mud", language="en")
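
The language choice can also be wrapped in a small helper so that the default stays with the model when no language is given. This is a sketch under assumptions: `whisper_kwargs` and `load_whisper` are hypothetical helper names, and only the `model` and `language` arguments shown above are used. `load_whisper` has to run on the device, since it imports `maix`.

```python
def whisper_kwargs(model_path, language=None):
    """Build keyword arguments for nn.Whisper (hypothetical helper).

    When language is None the key is omitted, so the model keeps its
    default language (Chinese, according to this document).
    """
    kwargs = {"model": model_path}
    if language is not None:
        kwargs["language"] = language
    return kwargs


def load_whisper(model_path, language=None):
    # Imported lazily: the maix package is only available on the device.
    from maix import nn
    return nn.Whisper(**whisper_kwargs(model_path, language))
```

Calling `load_whisper(model_path, language="en")` would then create an English recognizer, while `load_whisper(model_path)` leaves the default in place.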
