Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) conversion are promising areas of active research. The idea of systems that mimic the human voice and recognize human speech has been under development for a long time, but the field remains incomplete and leaves considerable room for improvement. There are many proprietary as well as open-source ventures in these areas, such as Nuance, AT&T Labs, VoxForge, and Sphinx, to name just a few.
One principal knowledge source that we can draw on to benefit machine speech recognition in long-term research is human speech perception, understanding, and cognition. This rich knowledge source has its basis in both psychological and physiological processes in humans. The physiological aspects of human speech perception of most interest include cortical processing in the auditory area as well as in the associated motor area of the brain. One important principle of auditory perception is its modular organization, and recent advances in functional neuroimaging technologies are motivating new studies aimed at an integrated, end-to-end understanding of the modularly organized auditory process. The psychological aspects of human speech perception embody the essential psychoacoustic properties that underlie auditory masking and attention. These properties equip human listeners with the remarkable capability of coping with the cocktail party effect, which no current automatic speech recognition technique can handle successfully. Intensive studies are needed for speech recognition and understanding applications to reach a new level and deliver performance comparable to that of humans.
Two specific issues to be resolved in the study of how the human brain processes spoken language are the way human listeners adapt to non-native accents and the time course over which listeners re-acquaint themselves with a language they already know. Humans have an amazing ability to adapt to non-native accents. Current ASR systems are extremely poor in this respect, and improvement is expected only once we have a sufficient understanding of human speech processing mechanisms.
One specific issue related to human speech perception, which is linked to human speech production, is the temporal span over which speech signals are represented and modeled. One prominent weakness of current hidden Markov models (HMMs) is their difficulty in representing long-span temporal dependency in the acoustic feature sequence of speech, which is nevertheless an essential property of speech dynamics in both perception and production. The main cause of this weakness is the conditional independence assumptions inherent in the HMM formalism. The HMM framework also assumes that speech can be described as a sequence of discrete units, usually phonemes. In this symbolic, invariant approach, the focus is on the linguistic and phonetic information, and the incoming speech signal is normalized during pre-processing in order to remove most of the paralinguistic information. However, human speech perception experiments have shown that paralinguistic information plays a crucial role in human speech perception.
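The conditional independence assumptions mentioned above can be made concrete with a small sketch. In a standard first-order HMM, each observation depends only on the state at the same time step, and each state depends only on the previous state, so the joint probability of an observation sequence O and a state sequence Q is a product of per-frame terms. The toy model below is purely illustrative: the two states, the three discrete observation symbols, all probability values, and the function name joint_log_prob are assumptions made for this example and do not come from any particular recognizer. It shows one consequence of the factorization: for a fixed state sequence, reordering the observations within a run of identical states leaves the probability unchanged, so the model cannot tell the two temporal orderings apart.

```python
import math

# Hypothetical two-state HMM over three discrete observation symbols.
# All numbers are invented for illustration only.
INITIAL = [0.6, 0.4]                            # P(q_1 = i)
TRANSITION = [[0.7, 0.3], [0.2, 0.8]]           # P(q_t = j | q_{t-1} = i)
EMISSION = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]   # P(o_t = k | q_t = i)


def joint_log_prob(states, observations):
    """Return log P(O, Q) under the first-order HMM factorization.

    P(O, Q) = P(q_1) P(o_1 | q_1) * prod_{t >= 2} P(q_t | q_{t-1}) P(o_t | q_t)
    """
    if not states or len(states) != len(observations):
        raise ValueError("states and observations must be non-empty and equal in length")
    logp = math.log(INITIAL[states[0]])
    logp += math.log(EMISSION[states[0]][observations[0]])
    for t in range(1, len(states)):
        # Each state depends only on the previous state.
        logp += math.log(TRANSITION[states[t - 1]][states[t]])
        # Each observation depends only on the current state.
        logp += math.log(EMISSION[states[t]][observations[t]])
    return logp


states = [0, 0, 0, 1, 1]
original = [0, 1, 2, 2, 1]
# Same symbols, reordered within each run of identical states.
shuffled = [2, 0, 1, 1, 2]

same = math.isclose(joint_log_prob(states, original),
                    joint_log_prob(states, shuffled))
print(same)  # True: the factorization ignores order within a state
```

Real acoustic models use continuous feature vectors and add derivative features to partially compensate, but the per-frame factorization illustrated here is the same.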