Voice Recognition on the iPhone
One of my projects has the potential to get a real usability boost from voice recognition. Voice Recognition is a utopia of mobile user interfaces, and like every utopia, it turns out not to work as well as you’d like in real life. I spent some time looking into Open Source Voice Recognition packages. If you want to use Voice Recognition for what is called “Very Large Vocabularies”, you are almost guaranteed to be disappointed. If you only need Voice Commands, the default performance may be acceptable.
Voice Recognition works by transforming each short slice (frame) of an audio stream, guessing what pronunciation that slice might represent, and in turn, what word it might belong to. This involves an Acoustic Model and a Language Model to guide the probabilities, plus a number of algorithms that pick a result based on those probabilities. One key point is that bad results are usually not code bugs, but training deficiencies. The Acoustic Models are trained on hundreds of hours of recorded speech, and the free models are not as robust as commercial offerings. If you want to do something about it, go to VoxForge and contribute some audio. Restricting the Language Model goes a long way towards improving the results, as does training the Acoustic Model on your own voice. For my project, restricting the Language Model was the only option, since my Acoustic Models need to be speaker independent.
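To make the effect of restricting the Language Model concrete, here is a minimal Python sketch. It is a toy, not how Pocket Sphinx or Julius are implemented: the words, the probabilities, and the decode function are all made up for illustration. It combines an acoustic score for each candidate word with a Language Model prior and picks the best total, assuming the speaker actually said "right".

```python
import math

# Hypothetical acoustic log-likelihoods, log P(audio | word), for one utterance.
# The speaker said "right", but "write" sounds the same and scores slightly higher.
ACOUSTIC_LOGP = {
    "write": math.log(0.30),
    "right": math.log(0.28),
    "light": math.log(0.22),
    "ride": math.log(0.20),
}

# Hypothetical Language Model priors, P(word).
LARGE_VOCAB_PRIOR = {"write": 0.40, "right": 0.35, "light": 0.15, "ride": 0.10}
COMMAND_PRIOR = {"right": 0.5, "light": 0.5}  # restricted command vocabulary


def decode(acoustic_logp, prior):
    """Return the word maximising log P(audio | word) + log P(word)."""
    best_word, best_score = None, -math.inf
    for word, p in prior.items():
        if word not in acoustic_logp:
            continue  # the acoustic model produced no score for this word
        score = acoustic_logp[word] + math.log(p)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


print(decode(ACOUSTIC_LOGP, LARGE_VOCAB_PRIOR))  # write (wrong guess)
print(decode(ACOUSTIC_LOGP, COMMAND_PRIOR))      # right (correct guess)
```

The acoustic scores are identical in both calls. Only the set of words the Language Model allows has changed, which is why a small command vocabulary can work acceptably on models that struggle with open dictation.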
I investigated two packages, Pocket Sphinx and Julius. Pocket Sphinx is the latest in a long line of C-based Voice Recognition packages developed by people at CMU. Julius is developed most actively for the Japanese language, but is itself language agnostic.
I had fun cleaning up the Mac experience for both of these projects. Julius built fine, but its CoreAudio driver was broken. Pocket Sphinx had a few compile errors under Xcode and no clear way to build iPhone-friendly libraries. So I submitted a new Julius driver based on the Audio Queue technology, and a patch that lets Pocket Sphinx build Mac- and iPhone-friendly binaries. It was great to submit patches to both of these projects. I still have an Audio Queue driver to write for Pocket Sphinx though!
More on my project soon, but Pocket Sphinx is the package that I’m going to push forward with. This is largely because its default Acoustic Models appeared to perform better than the ones that ship with Julius. I’m hoping to use it for aligning text and audio, not for Voice to Text or Voice Commands. My challenge now is to see whether this process helps even if it is only 60% accurate. But this spike solution is done, so it is time to flesh out the rest of the code!
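As a rough illustration of why an imperfect recognizer can still be useful for alignment, here is a small Python sketch. It is not the approach my project uses, just one plausible way to do it with the standard library: match the known transcript against the recognizer's (partly wrong) word list using difflib, and keep the timestamps of the words that agree. The anchor_times function and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def anchor_times(transcript, recognized):
    """Map transcript word index -> start time for words the recognizer got right.

    transcript: list of words from the known text.
    recognized: list of (word, start_seconds) pairs from a recognizer.
    """
    hyp_words = [word for word, _ in recognized]
    matcher = SequenceMatcher(a=transcript, b=hyp_words, autojunk=False)
    anchors = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            anchors[block.a + k] = recognized[block.b + k][1]
    return anchors


transcript = "the quick brown fox jumps over the lazy dog".split()
# Made-up recognizer output: "brown" was misheard as "crown", "the" as "a".
recognized = [
    ("the", 0.0), ("quick", 0.4), ("crown", 0.8), ("fox", 1.2), ("jumps", 1.6),
    ("over", 2.0), ("a", 2.3), ("lazy", 2.5), ("dog", 2.9),
]

anchors = anchor_times(transcript, recognized)
print(sorted(anchors))  # [0, 1, 3, 4, 5, 7, 8]
```

The two misrecognized words get no timestamp, but the words around them do, so their positions could be estimated by interpolating between neighbouring anchors. Whether that holds up at 60% accuracy on real speech is exactly what remains to be tested.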
