MaixCAM MaixPy speech synthesis
Update history
Date | Version | Author | Update content |
---|---|---|---|
2025-08-15 | 1.0.0 | lxowalle | Initial document |
Introduction
This document provides instructions on using the built-in TTS functionality to convert text into speech.
TTS Support List:
Model | MaixCAM | MaixCAM Pro | MaixCAM2 |
---|---|---|---|
MeloTTS | ❌ | ❌ | ✅ |
About TTS
TTS (Text-to-Speech) converts text into speech. You write a piece of text and feed it to a model that supports TTS. The model then outputs audio data containing the spoken version of the text.
In practice, TTS is commonly used for video dubbing, navigation guidance, public announcements, and more. Simply put, TTS is “technology that reads text aloud.”
MeloTTS
MeloTTS is a high-quality multilingual text-to-speech library jointly developed by MIT and MyShell.ai. Currently, the melotts-zh model is supported, which can synthesize both Chinese and English speech. However, English synthesis quality is not yet optimal.
The default output audio is PCM data with a sample rate of 44100 Hz, single channel, and 16-bit depth.
Sample rate: The number of times sound is sampled per second.
Channels: The number of audio channels captured per sample. Single channel means mono audio, and dual channel means stereo (left and right channels). To reduce AI inference complexity, single-channel audio is generally used.
Bit depth: The number of bits used to store each sample. With a 16-bit depth, each sample is usually a 16-bit signed integer. A higher bit depth captures finer audio detail.
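To make these three parameters concrete, here is a minimal sketch in plain Python that works out how many bytes one second of audio takes in the default format, and how long a given PCM buffer lasts. The constant names and the `pcm_duration_seconds` helper are hypothetical, introduced only for illustration. They are not part of the MaixPy API.

```python
# Default MeloTTS output format as described above.
SAMPLE_RATE = 44100    # samples per second
CHANNELS = 1           # single channel (mono)
BYTES_PER_SAMPLE = 2   # 16-bit depth = 2 bytes per sample


def pcm_duration_seconds(pcm_bytes: bytes,
                         sample_rate: int = SAMPLE_RATE,
                         channels: int = CHANNELS,
                         bytes_per_sample: int = BYTES_PER_SAMPLE) -> float:
    """Return the playback length of a raw PCM buffer (hypothetical helper)."""
    frame_size = channels * bytes_per_sample   # bytes per sampling instant
    num_frames = len(pcm_bytes) // frame_size  # ignore any trailing partial frame
    return num_frames / sample_rate


# One second of silence in the default format.
one_second = bytes(SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE)
print(len(one_second))                   # 88200
print(pcm_duration_seconds(one_second))  # 1.0
```

The same arithmetic shows why the player's sample rate must match the model's output: playing 44100 Hz data at a different rate changes the duration and the pitch.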
from maix import nn, audio
# Only MaixCAM2 supports this model.
sample_rate = 44100
p = audio.Player(sample_rate=sample_rate)
p.volume(80)
melotts = nn.MeloTTS(model="/root/models/melotts/melotts-zh.mud", speed = 0.8, language='zh')
pcm = melotts.infer('hello', output_pcm=True)
p.play(pcm)
Notes:
- Import the nn module first to create a MeloTTS model object:
from maix import nn
- Choose the model to load. Currently, the melotts-zh model is supported:
- speed sets the playback speed
- language sets the language type
melotts = nn.MeloTTS(model="/root/models/melotts/melotts-zh.mud", speed = 0.8, language='zh')
- Start inference:
- The text to infer here is 'hello'
- Set output_pcm=True to return PCM data
pcm = melotts.infer('hello', output_pcm=True)
- Use the audio playback module to play the generated audio:
- Make sure the sample rate matches the model’s output
- Use p.volume(80) to control the output volume (range: 0–100)
- Play the PCM generated by MeloTTS with p.play(pcm)
p = audio.Player(sample_rate=sample_rate)
p.volume(80)
p.play(pcm)
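If you want to keep the synthesized audio instead of only playing it, one option is to wrap the raw PCM in a WAV container using Python's standard wave module. The sketch below is not part of the MaixPy API: save_pcm_as_wav is a hypothetical helper, and it assumes the PCM data is available as Python bytes holding little-endian signed 16-bit samples in the default format (44100 Hz, single channel). The source does not confirm the byte order or the exact return type of infer, so check both on your device first.

```python
import os
import tempfile
import wave


def save_pcm_as_wav(pcm_bytes: bytes, path: str,
                    sample_rate: int = 44100,
                    channels: int = 1,
                    sample_width: int = 2) -> None:
    """Write raw PCM bytes to a WAV file (hypothetical helper).

    Assumes little-endian signed samples, which is what WAV expects
    for 16-bit audio.
    """
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)      # 1 = mono
        wav.setsampwidth(sample_width)  # 2 bytes = 16-bit depth
        wav.setframerate(sample_rate)   # must match the model's output rate
        wav.writeframes(pcm_bytes)


# Demo with one second of silence standing in for real MeloTTS output.
demo_pcm = bytes(44100 * 1 * 2)
demo_path = os.path.join(tempfile.mkdtemp(), "tts_demo.wav")
save_pcm_as_wav(demo_pcm, demo_path)

with wave.open(demo_path, "rb") as wav:
    print(wav.getframerate(), wav.getnchannels(), wav.getsampwidth())
```

On the device, you would pass the bytes obtained from the inference result in place of demo_pcm, keeping the sample rate identical to the one used for audio.Player.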