MaixCAM MaixPy speech synthesis

Update history
Date Version Author Update content
2025-08-15 1.0.0 lxowalle Initial document

Introduction

This document provides instructions on using the built-in TTS functionality to convert text into speech.

TTS Support List:

Model     MaixCAM   MaixCAM Pro   MaixCAM2
MeloTTS   No        No            Yes

About TTS

TTS (Text-to-Speech) converts text into speech. You write a piece of text and feed it to a model that supports TTS. The model then outputs audio data containing the spoken version of the text.
In practice, TTS is commonly used for video dubbing, navigation guidance, public announcements, and more. Simply put, TTS is “technology that reads text aloud.”

MeloTTS

MeloTTS is a high-quality multilingual text-to-speech library jointly developed by MIT and MyShell.ai. MaixPy currently supports the melotts-zh model, which can synthesize both Chinese and English speech. However, its English synthesis is not yet optimal.

The default output audio is PCM data with a sample rate of 44100 Hz, single channel, and 16-bit depth.

Sample rate: The number of times sound is sampled per second.

Channels: The number of audio channels captured per sample. Single channel means mono audio, and dual channel means stereo (left and right channels). To reduce AI inference complexity, single-channel audio is generally used.

Bit depth: The data range captured per sample. A 16-bit depth usually represents each sample as a 16-bit signed integer. Higher bit depth captures finer audio details.
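The three parameters above determine how many bytes one second of audio occupies. The following is a minimal sketch in plain Python, not part of the MaixPy API: the helper names `pcm_duration_seconds` and `decode_samples` are hypothetical, and the little-endian byte order is an assumption that is not stated in this document.

```python
import struct

SAMPLE_RATE = 44100   # samples per second
CHANNELS = 1          # mono
BIT_DEPTH = 16        # bits per sample


def pcm_duration_seconds(pcm: bytes, sample_rate: int = SAMPLE_RATE,
                         channels: int = CHANNELS,
                         bit_depth: int = BIT_DEPTH) -> float:
    """Return the playback length of raw PCM data in seconds."""
    bytes_per_frame = channels * bit_depth // 8
    return len(pcm) / (sample_rate * bytes_per_frame)


def decode_samples(pcm: bytes) -> list:
    """Unpack 16-bit signed mono PCM into Python integers.

    Assumes little-endian byte order (hypothetical, not confirmed here).
    """
    count = len(pcm) // 2
    return list(struct.unpack("<%dh" % count, pcm[:count * 2]))


# One second of silence in the default format: 44100 * 1 * 2 = 88200 bytes.
one_second = bytes(SAMPLE_RATE * CHANNELS * BIT_DEPTH // 8)
print(len(one_second))                   # 88200
print(pcm_duration_seconds(one_second))  # 1.0
```

Dividing the length of a PCM buffer by 88200 therefore gives a rough estimate of how long the synthesized speech will play at the default settings.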

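from maix import nn, audio

# Only MaixCAM2 supports this model.
# The player's sample rate must match the model's 44100 Hz output.
sample_rate = 44100
p = audio.Player(sample_rate=sample_rate)
p.volume(80)

# Load the melotts-zh model; speed and language are set at load time.
melotts = nn.MeloTTS(model="/root/models/melotts/melotts-zh.mud", speed=0.8, language='zh')

# Synthesize the text and play the returned PCM data.
pcm = melotts.infer('hello', output_pcm=True)
p.play(pcm)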

Notes:

  1. Import the nn module first so that you can create a MeloTTS model object:
from maix import nn
  2. Choose the model to load. Currently, only the melotts-zh model is supported:
    • speed sets the speaking speed of the synthesized speech
    • language sets the language type
melotts = nn.MeloTTS(model="/root/models/melotts/melotts-zh.mud", speed = 0.8, language='zh')
  3. Start inference:
    • The text to synthesize here is 'hello'
    • Set output_pcm=True to return PCM data
pcm = melotts.infer('hello', output_pcm=True)
  4. Use the audio playback module to play the generated audio:
    • Make sure the player's sample rate matches the model’s output (44100 Hz)
    • Use p.volume(80) to control the output volume (range: 0–100)
    • Play the PCM generated by MeloTTS with p.play(pcm)
p = audio.Player(sample_rate=sample_rate)
p.volume(80)
p.play(pcm)
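
Instead of playing the PCM data immediately, it may be useful to keep it as a file. The sketch below wraps raw PCM in a WAV container using Python's standard wave module. The helper name `save_pcm_as_wav` and the demo file name are hypothetical, and the sketch assumes the object returned by melotts.infer() can be converted to bytes, which this document does not confirm. Silence stands in for real model output so the example runs anywhere.

```python
import os
import tempfile
import wave


def save_pcm_as_wav(pcm: bytes, path: str, sample_rate: int = 44100,
                    channels: int = 1, sample_width: int = 2) -> None:
    """Write raw PCM bytes to a WAV file (hypothetical helper)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)       # 1 = mono
        wav.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)    # must match the model's output
        wav.writeframes(pcm)


# Stand-in for model output: 44100 zero bytes = 22050 mono 16-bit frames (0.5 s).
demo_pcm = bytes(44100)
demo_path = os.path.join(tempfile.gettempdir(), "melotts_demo.wav")
save_pcm_as_wav(demo_pcm, demo_path)
```

On the device, the call would look like save_pcm_as_wav(bytes(pcm), "/root/hello.wav"), provided the bytes conversion assumption above holds. The default arguments mirror the 44100 Hz, single-channel, 16-bit format described earlier.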