watch me try to do speech recognition on Linux

run_pocketsphinx_continuous.sh

# This works, in the sense of executing and producing output.
# However, accuracy is so low that it's very hard to tell what the
# input speech even was. Not sure this will be any better than
# simply transcribing by hand. This is probably the fault of the
# models and configuration -- and does "8k" mean 8kHz? My audio
# is 16kHz... Anyway, the fun continues...

pocketsphinx_continuous \
    -hmm /usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -dict /usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic \
    -lm /usr/share/pocketsphinx/model/lm/en_US/hub4.5000.DMP \
    -infile input.wav \
    -adcin true \
    -hyp out.txt

ffmpeg/avconv

Often, the source audio files will be in a compressed format such as MP3. The pocketsphinx system does not provide conversion facilities; it must receive either WAV or RAW audio. So you'll need to convert upstream. On Linux, this is often done with "ffmpeg".

However, note that on (at least my version of) Debian, the program is named "avconv" rather than "ffmpeg". As far as I can tell, this is because Debian ships Libav, a fork of the FFmpeg project, rather than because of anything like copyright issues. The commandline parameters I needed are the same. On my system there's not even a manpage for "avconv"; see "man ffmpeg", and then apply the parameters to "avconv". Confusing!

(You could also use "audacity", "sox", or various other graphical or commandline programs under Linux to convert your audio files.)

pocketsphinx is very sensitive to the sample rate and format of the audio: data in the wrong format may be rejected or, worse yet, accepted but transcribed with very poor accuracy. Models are provided (or available) for 16kHz and 8kHz sample rates, the latter only useful for telephone audio. So basically, you'll want a 16kHz sample rate, 16-bit depth (endianness can be controlled by a parameter), and a single channel (monophonic).
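Since a wrong format can be silently accepted, it may be worth checking a file before feeding it in. Here is a rough sketch that reads the three values straight out of the WAV header. The function names (read_uint, wav_info, wav_ok) are my own inventions, and it assumes the plain 44-byte header layout and a little-endian machine; a WAV file with extra header chunks would need a real tool such as "sox" to inspect.

```shell
#!/bin/sh
# Sketch: print sample rate, channel count and bit depth of a WAV file.
# Assumes the canonical 44-byte RIFF header ("fmt " chunk directly after
# "WAVE") and a little-endian machine, since od reads in host byte order.

# read_uint FILE OFFSET SIZE: print the unsigned integer of SIZE bytes at OFFSET
read_uint() {
    od -An -t "u$3" -j "$2" -N "$3" "$1" | tr -d ' \t\n'
}

# wav_info FILE: print "rate channels bits"
wav_info() {
    rate=$(read_uint "$1" 24 4)
    channels=$(read_uint "$1" 22 2)
    bits=$(read_uint "$1" 34 2)
    echo "$rate $channels $bits"
}

# wav_ok FILE: succeed only for 16kHz, mono, 16-bit audio
wav_ok() {
    [ "$(wav_info "$1")" = "16000 1 16" ]
}

# Example use (input.wav is the file name used elsewhere in this post)
if [ -f input.wav ]; then
    if wav_ok input.wav; then
        echo "input.wav looks right for pocketsphinx"
    else
        echo "input.wav is $(wav_info input.wav), expected 16000 1 16"
    fi
fi
```

If the check fails, the numbers printed show which of the three settings is off.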

Here's how I convert my incoming .mp3 files to .wav for pocketsphinx:

avconv -i file0001.mp3 -ar 16000 -ac 1 input.wav

(Yes, the output file is named "input.wav", because that's what the pocketsphinx ctlfile.txt asks for.)
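If there are many files to convert, the same command can be wrapped in a loop. This is only a sketch: it uses "avconv" when it is installed and falls back to "ffmpeg" otherwise, and DRY_RUN is a made-up variable of mine (set it to 1 to print the commands instead of running them). Each output keeps the name of its source file rather than "input.wav".

```shell
#!/bin/sh
# Sketch: convert every .mp3 in the current directory to 16kHz mono WAV,
# using the same options as the single-file command above.

converter=avconv
command -v avconv >/dev/null 2>&1 || converter=ffmpeg

convert_all() {
    for f in *.mp3; do
        [ -e "$f" ] || continue          # no .mp3 files: the glob stays literal
        out="${f%.mp3}.wav"              # file0001.mp3 -> file0001.wav
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$converter -i $f -ar 16000 -ac 1 $out"
        else
            "$converter" -i "$f" -ar 16000 -ac 1 "$out"
        fi
    done
}

convert_all
```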

run_pocketsphinx_batch.sh

# run_pocketsphinx_batch.sh
# ctlfile.txt contains one line: "input"
# input.wav -> out.txt
# These are the exact parameters that happen to work for me.
# I've bypassed the -argfile, instead just including all
# the parameters here.
pocketsphinx_batch \
    -hmm /usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -dict /usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic \
    -lm /usr/share/pocketsphinx/model/lm/en_US/hub4.5000.DMP \
    -cepdir . \
    -ctl ctlfile.txt \
    -cepext .wav \
    -adcin true \
    -hyp out.txt

pocketsphinx_batch

The two programs, _batch and _continuous, take mostly the same commandline parameters. However, I've read that _continuous is more suitable for, well, continuous speech; I guess _batch is for collections of short utterances, such as voice-control commands. No, thanks, I hate voice interfaces!

However, here is an example of a suggested commandline for _batch:

pocketsphinx_batch \
-argfile argfile.txt \
-cepdir <path>/data \
-ctl ctlfile.txt \
-cepext .wav \
-adcin true \
-hyp out.txt

Where ctlfile.txt contains the name(s) of the audio input file(s), located in "cepdir", one filename per line. The names must *not* include the file extension, such as .wav; this is instead specified by -cepext.
("cepdir" can simply be the current directory: "-cepdir ." works, in Linux.)
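With more than a file or two, the control file can be generated rather than typed. A minimal sketch, assuming all the audio sits in one directory; make_ctl is a name I made up, not something that ships with pocketsphinx.

```shell
#!/bin/sh
# Sketch: write a control file listing every .wav in a directory, one
# name per line, without the extension (that part comes from -cepext).

# make_ctl DIR OUTFILE
make_ctl() {
    for f in "$1"/*.wav; do
        [ -e "$f" ] || continue      # directory has no .wav files
        name=${f##*/}                # strip the directory part
        echo "${name%.wav}"          # strip the extension
    done > "$2"
}

# Current directory, matching "-cepdir ." and "-ctl ctlfile.txt"
make_ctl . ctlfile.txt
```

With only input.wav in the directory, this produces the same one-line ctlfile.txt that I wrote by hand.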

argfile.txt contains three parameters with their values, each being a path: -hmm is the acoustic model directory, -dict is the pronunciation dictionary, and -lm is the language model. These are what the speech recognition is based on. If your results end up having poor accuracy, these will be the files to replace with your own, hopefully better, versions. In Debian, I found that the default install of the pocketsphinx package put all the files where they were expected: I just used the filenames and paths as given in many of the Linux examples, and everything worked. My parameters are as follows:

    -hmm /usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -dict /usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic \
    -lm /usr/share/pocketsphinx/model/lm/en_US/hub4.5000.DMP

So a lot of this should work with _continuous, though I guess not -ctl... I'll be trying it soon, in any case, because I think _continuous is the one I need.

(...later) Yes, indeed, when I try to run _batch on my file, I get:

FATAL_ERROR: "acmod.c", line 532: Batch processing can not process more than 32767 frames at once, requested 259043

(My audio is about 30 minutes of conversation.)
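The numbers in that error give a rough idea of how finely the audio would have to be cut for _batch to accept it. A quick sketch of the arithmetic: the two frame counts come from the message itself, but the 100 frames per second is my assumption (I believe it is the pocketsphinx default), so treat the seconds figure as approximate.

```shell
#!/bin/sh
# Sketch: how many pieces the audio must be cut into to stay under the
# frame limit quoted in the error message.

MAX_FRAMES=32767          # limit quoted in the error
REQUESTED=259043          # frames requested, from the same message
FRAMES_PER_SEC=100        # assumption: default pocketsphinx frame rate

# Round up: a partial piece still counts as a piece.
pieces=$(( (REQUESTED + MAX_FRAMES - 1) / MAX_FRAMES ))
max_secs=$(( MAX_FRAMES / FRAMES_PER_SEC ))

echo "need at least $pieces pieces, each no longer than $max_secs seconds"
```

The piece count does not depend on the assumed frame rate, only the length in seconds does.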

pocketsphinx

So I've decided to install and try "pocketsphinx".
On Debian, as root, I did:

apt-get install pocketsphinx

This installs several components, none of which is called "pocketsphinx" per se!
Instead, see the following:

man pocketsphinx_batch
man pocketsphinx_continuous

But the man pages, which contain the same info you get by running the programs with no args, just give an alphabetical list of the numerous commandline options, without any examples of simple ways to use the programs in a real situation. At first glance I couldn't find an option called, e.g., "-infile", and none called "-outfile". Just figuring out how to get data into and out of the program is a major pain, considering the feeble documentation! For this, I've had to wander the corridors of the Internet, picking up suggestions and example invocations from tech-help forums and FAQs in far-flung locations.

intro

Title says it all (sigh). So here we go...

(Spoiler alert: pocketsphinx, at least out-of-the-box, is a joke. Totally useless.)

The "free" server/cloud-based systems, such as Google's, are encumbered by licensing BS, and/or generally they want to do *anything but* provide the simple service that everyone wants, i.e., convert an arbitrary sound file into a plain text file: speech-to-text transcription. Google and their ilk want to force you to use their browser (Chrome); or the input can only come from your mobile-device microphone (if any!), not from a sound file; or the text output is embedded in some kind of graphical web interface that lets you do all sorts of colourful manipulations, yet somehow makes it remarkably hard to actually download a plain-text file of your results. I can't stand that kind of... stuff.

The little virtual keyboard that pops up on my Galaxy tablet when I want to type something has a little "microphone" button which I had never tried before. I assume it links up to Google's cloud. I tried it, and it gave by far the best results I've seen so far! But again, I just can't bring myself to do the project at hand by playing my audio files back on one little gizmo's speaker, while holding my tablet up to the speaker so that I can type an email to myself... Which century is this again? Reminds me of an acoustic modem...

So, I will try to use one of the Linux standalone systems, and see if I can make it work acceptably. Given how little useful documentation seems to be out there pertaining to this simple question:
How can I convert speech to text on Linux for absolutely free without restrictions, limitations, trial periods, or subscriptions of any kind?
I hope that the record of my own bumbling efforts towards a solution may help others in this world. Good luck to us all!