MaixCAM MaixPy Keyword recognition

Update history
Date         Version   Author    Update content
2024-10-08   1.0.0     916BGAI   Initial document

Introduction

MaixCAM has ported the Maix-Speech offline speech library, enabling continuous Chinese numeral recognition, keyword recognition, and large-vocabulary speech recognition. It supports recognition of PCM and WAV audio and can also take input directly from the onboard microphone.

Maix-Speech

Maix-Speech is an offline speech library designed specifically for embedded environments. Its speech recognition algorithms are deeply optimized, giving it a significant lead in memory usage while maintaining an excellent word error rate (WER). For more details on how it works, please refer to the open-source project.

Keyword recognition

from maix import app, nn

# Load the acoustic model
speech = nn.Speech("/root/models/am_3332_192_int8.mud")
# Use the onboard microphone as the audio source
speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")

# Keywords as space-separated Pinyin syllables, with one probability threshold each
kw_tbl = ['xiao3 ai4 tong2 xue2',
          'ni3 hao3',
          'tian1 qi4 zen3 me yang4']
kw_gate = [0.1, 0.1, 0.1]

# Called with the probability of each registered keyword for the latest frame
def callback(data: list[float], len: int):
    for i in range(len):
        print(f"\tkw{i}: {data[i]:.3f};", end=' ')
    print("\n")

# Register the kws decoder; True enables automatic near-sound processing
speech.kws(kw_tbl, kw_gate, callback, True)

while not app.need_exit():
    frames = speech.run(1)   # process one frame per call
    if frames < 1:
        print("run out\n")
        speech.deinit()
        break

Usage

  1. Import the app and nn modules
from maix import app, nn
  2. Load the acoustic model
speech = nn.Speech("/root/models/am_3332_192_int8.mud")
  • You can also load the am_7332 acoustic model; larger models provide higher accuracy but consume more resources.
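  • For example, a minimal sketch of loading the larger model (the file name below is an assumption; check the actual model name under /root/models/ on your device):
# Assumed file name for the larger acoustic model; verify it on your device
speech = nn.Speech("/root/models/am_7332_192_int8.mud")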
  3. Choose the corresponding audio device
speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
  • This example uses the onboard microphone. WAV and PCM audio files are also supported as input devices:
speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav")   # Using WAV audio input
speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm")   # Using PCM audio input
  • Note that WAV audio must have a 16 kHz sample rate and the S16_LE sample format. You can use the arecord tool to record audio in this format, for example:
arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
  • When recognizing PCM/WAV audio, if you want to reset the data source, for example to recognize the next WAV file, you can use the speech.devices method, which automatically clears the cache:
speech.devices(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
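  • As a minimal sketch of switching sources, assuming a decoder has already been registered (see the next step) and using placeholder file names, several WAV files could be recognized one after another:
# Sketch: recognize several WAV files in sequence (file names are placeholders)
wav_files = ["path/audio1.wav", "path/audio2.wav"]
for wav in wav_files:
    speech.devices(nn.SpeechDevice.DEVICE_WAV, wav)   # switch the source and clear the cache
    while speech.run(1) >= 1:                         # run until the file is exhausted
        pass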
  4. Set up the decoder
kw_tbl = ['xiao3 ai4 tong2 xue2',
          'ni3 hao3',
          'tian1 qi4 zen3 me yang4']
kw_gate = [0.1, 0.1, 0.1]

def callback(data:list[float], len: int):
    for i in range(len):
        print(f"\tkw{i}: {data[i]:.3f};", end=' ')
    print("\n")

speech.kws(kw_tbl, kw_gate, callback, True)
  • Users can register several decoders (or none). Each decoder decodes the output of the acoustic model and invokes the corresponding user callback. Here a kws decoder is registered; it outputs a list with the probability of every registered keyword for the most recent frame. Users can observe these probability values and apply their own thresholds to decide when a keyword is activated, as sketched at the end of this step. For other decoders, please refer to the sections on Real-time voice recognition and Continuous Chinese numeral recognition.

  • When setting up the kws decoder, you need to provide a keyword list (Pinyin syllables separated by spaces), a list of probability thresholds in the same order, and a flag specifying whether to enable automatic near-sound processing. If set to True, different tones of the same Pinyin syllable are treated as similar words and their probabilities are accumulated. Finally, set a callback function to handle the decoded data.

  • Users can also manually register near-sound words with the speech.similar method; up to 10 near-sound words can be registered for each Pinyin syllable. (Note that registering near-sound words through this interface overrides the near-sound table generated by automatic near-sound processing.)

similar_char = ['zhen3', 'zheng3']
speech.similar('zen3', similar_char)
  • After registering the decoder, use the speech.deinit() method to clear the initialization.
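  • As a sketch of the threshold handling mentioned above (the threshold values and the printed action are illustrative, not part of the API), the printing callback can be replaced with one that compares each keyword's probability against its own activation threshold:
# Sketch: trigger only when a keyword's probability exceeds its activation threshold
ACTIVATION_THRESHOLDS = [0.8, 0.8, 0.8]   # example values, tune per keyword

def callback(data: list[float], len: int):
    for i in range(len):
        if data[i] >= ACTIVATION_THRESHOLDS[i]:
            print(f"keyword {i} activated: {kw_tbl[i]} ({data[i]:.3f})")

speech.kws(kw_tbl, kw_gate, callback, True)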
  5. Recognition
while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        speech.deinit()
        break
  • Use the speech.run method to run speech recognition. The parameter specifies how many frames to process per call, and the return value is the number of frames actually processed. You can run one frame at a time and do other work in between, or run continuously in a dedicated thread and stop it from another thread, as sketched below.
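  • A minimal sketch of the continuous mode, assuming speech has been initialized and a decoder registered as above; the threading scaffolding is illustrative and not part of the Speech API. The worker loops over speech.run(1) and the main thread stops it by setting a flag:
import threading

stop_flag = False

def worker():
    # Run recognition until the main thread asks us to stop or the source runs out
    while not stop_flag:
        if speech.run(1) < 1:
            break

t = threading.Thread(target=worker)
t.start()
# ... the main thread does other work here, then stops recognition:
stop_flag = True
t.join()
speech.deinit()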

Recognition Results

If the above program runs successfully, speaking into the onboard microphone will yield keyword recognition results, such as:

kws log 2.048s, len 24
decoder_kws_init get 3 kws
  00, xiao3 ai4 tong2 xue2
  01, ni3 hao3
  02, tian1 qi4 zen3 me yang4
find shared memory(491520),  saved:491520
    kw0: 0.959; 	kw1: 0.000; 	kw2: 0.000;     # xiao3 ai4 tong2 xue2
    kw0: 0.000; 	kw1: 0.930; 	kw2: 0.000;     # ni3 hao3
    kw0: 0.000; 	kw1: 0.000; 	kw2: 0.961;     # tian1 qi4 zen3 me yang4