watch me try to do speech recognition on Linux: ffmpeg/avconv

Often, the source audio files will be in a compressed format such as MP3. The pocketsphinx system does not provide conversion facilities; it must receive audio in either WAV or RAW format. So you'll need to convert upstream. On Linux, this is often done with "ffmpeg".

However, note that on (at least my version of) Debian, the program I have is named "avconv" rather than "ffmpeg". It turns out this is more than a name change: "avconv" comes from Libav, a fork of the FFmpeg project, and Debian shipped Libav in place of FFmpeg for a while. The commandline parameters I use here are the same in both programs. On my system there's not even a manpage for "avconv"; see "man ffmpeg", and then apply the parameters to "avconv". Confusing!

(You could also use "audacity", "sox", or various other graphical or commandline programs under Linux to convert your audio files.)

pocketsphinx is very sensitive to the sample rate and format of the audio; data in the wrong format may be rejected or, worse yet, accepted but recognized with very poor accuracy. Models are provided (or available) for 16kHz and 8kHz sample rates, the latter only useful for telephone audio. So basically, you'll want a 16kHz sample rate, 16-bit depth (endianness can be controlled by a parameter), and a single channel (monophonic).
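
Since a wrong format can be silently accepted, it may be worth checking a file before feeding it to pocketsphinx. Here is a minimal sketch using only Python's standard "wave" module. The function name "check_wav" is my own invention, not part of pocketsphinx, and the expected values are simply the 16kHz / 16-bit / mono figures from above.

```python
import sys
import wave


def check_wav(path, rate=16000):
    """Return a list of problems; an empty list means the header looks right.

    Hypothetical helper: it only inspects the WAV header, not the audio itself.
    """
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != rate:
            problems.append("sample rate is %d Hz, expected %d" % (w.getframerate(), rate))
        if w.getsampwidth() != 2:  # sample width is reported in bytes; 2 bytes = 16 bits
            problems.append("sample width is %d bits, expected 16" % (8 * w.getsampwidth()))
        if w.getnchannels() != 1:
            problems.append("%d channels, expected 1 (mono)" % w.getnchannels())
    return problems


if __name__ == "__main__":
    # Check every file named on the command line.
    for path in sys.argv[1:]:
        issues = check_wav(path)
        print(path, "OK" if not issues else "; ".join(issues))
```

Pass rate=8000 if you are working with telephone audio and the 8kHz models instead.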

Here's how I convert my incoming .mp3 files to .wav for pocketsphinx:

avconv -i file0001.mp3 -ar 16000 -ac 1 input.wav

(Yes, the output file is named "input.wav", because that's what the pocketsphinx ctlfile.txt asks for.)
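
If you have to run on machines where the converter might be called either "ffmpeg" or "avconv", a small wrapper can pick whichever one is installed. This is only a sketch: the function names are made up for illustration, and it assumes both programs accept the same "-i", "-ar" and "-ac" parameters as in the command above.

```python
import shutil
import subprocess


def find_converter(candidates=("ffmpeg", "avconv")):
    """Return the name of the first candidate found on PATH, or None."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None


def build_command(tool, src, dst, rate=16000):
    """Build the same argument list as the avconv command line shown above."""
    return [tool, "-i", src, "-ar", str(rate), "-ac", "1", dst]


def convert(src, dst, rate=16000):
    """Convert src to a mono WAV at the given sample rate."""
    tool = find_converter()
    if tool is None:
        raise RuntimeError("neither ffmpeg nor avconv was found on PATH")
    # check=True raises CalledProcessError if the converter exits with an error.
    subprocess.run(build_command(tool, src, dst, rate), check=True)
```

With this, convert("file0001.mp3", "input.wav") runs the equivalent of the command above with whichever program is present.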

Tuesday, November 5, 2019
