
Project - Audio-Visual Speech Processing


Team Member

Fu Jie Huang

jhuangfu@cmu.edu


Goal

A human listener can use visual cues, such as lip and tongue movements, to improve speech understanding, especially in a noisy environment. This process of combining the audio and visual modalities is referred to as speechreading, or lipreading. Inspired by human speechreading, the goal of this project is to enable a computer to use speechreading to achieve higher speech recognition accuracy.

There are many applications in which speech must be recognized under extremely adverse acoustic conditions. Examples include detecting a person's speech from a distance or through a glass window, understanding a person speaking in a very noisy crowd, and monitoring speech over a TV broadcast when the audio link is weak or corrupted. In these applications, the performance of traditional speech recognition is very limited. In this project, we use a video camera to track the lip movements of the speaker to assist acoustic speech recognition. We have developed a robust lip-tracking technique, and preliminary results have shown that speech recognition accuracy for noisy audio can be improved from less than 20% when only audio information is used to close to 60% when lip tracking assists the recognition. Even with the visual modality alone, i.e., without listening at all, lip tracking achieves a recognition accuracy close to 40%.


System Description

We explore the problem of enhancing speech recognition in noisy environments (both Gaussian white noise and cross-talk noise) by using visual information such as lip movements.

We use a novel Hidden Markov Model (HMM) to model the audio-visual bi-modal signal jointly, which shows promising recognition results. We also explore fusing the acoustic signal and the visual information with different combination approaches, in order to find the optimal method.
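One common way to form the joint observation for such a model is feature-level (early) fusion, i.e. concatenating the per-frame acoustic and visual vectors. The sketch below only illustrates the idea; the function name, feature shapes, and the nearest-neighbor upsampling of the visual stream are our assumptions, not the lab's actual implementation.

    import numpy as np

    def fuse_features(audio_feats, visual_feats):
        # audio_feats  : (T_a, D_a) array, e.g. acoustic frames
        # visual_feats : (T_v, D_v) array, e.g. lip template parameters
        #
        # The visual stream is sampled at the video frame rate, which is
        # lower than the audio frame rate, so upsample it to T_a frames
        # by nearest-neighbor interpolation.
        idx = np.linspace(0, len(visual_feats) - 1, num=len(audio_feats))
        visual_upsampled = visual_feats[np.round(idx).astype(int)]

        # The joint observation fed to the HMM is the per-frame
        # concatenation of the acoustic and visual vectors.
        return np.hstack([audio_feats, visual_upsampled])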

To test the performance of this approach, we add Gaussian noise to the acoustic signal at different SNRs (signal-to-noise ratios). Here we plot the recognition ratio versus SNR.

In the plot above, the blue curve shows how the recognition ratio changes with SNR when we use the audio-visual joint feature (the visual parameters concatenated with the acoustic features) as the input to the recognition system. The black curve marked with "o" shows the performance when we use only the acoustic signal as input, as most conventional speech recognition systems do. The flat black curve shows the performance of recognition using only the lip movement parameters.

We can see that the recognition ratio of the joint HMM system drops much more slowly than that of the acoustic-only system, which means the joint system is more robust to Gaussian noise corruption of the acoustic signal. We also apply this approach to speech corrupted by cross-talk noise.
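The Gaussian-noise corruption used in these experiments amounts to scaling white noise so that the corrupted waveform reaches a target SNR. A minimal sketch; the function name and the use of NumPy are our assumptions:

    import numpy as np

    def add_gaussian_noise(speech, snr_db):
        # Average power of the clean speech signal.
        signal_power = np.mean(speech.astype(float) ** 2)
        # Noise power chosen so that 10*log10(P_signal / P_noise) == snr_db.
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        noise = np.random.normal(0.0, np.sqrt(noise_power), size=speech.shape)
        return speech + noise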


Download

As the first step of this research, we collected an audio-visual data corpus, which is available to the public. The data set has two parts: lip parameters extracted from the video, and the corresponding audio waveforms.

First of all, for the video part, we keep only the mouth area, since it is the region of interest in lipreading research. We used a video editing tool (Adobe Premiere) to crop out the mouth area. The following figure shows how the whole face picture, of size 720*480, was cropped down to a picture of the mouth only, of size 216*264. Note that the face in the original QuickTime video frame lies horizontally because of the way the video was shot.

The four offsets of the mouth picture were recorded so that the positions of the lip parameters could be computed later. After that, we shrank the mouth picture from 216*264 to 144*176 and rotated it to obtain the final picture, of size 176*144, the standard QCIF format.
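The same cropping, shrinking, and rotation can also be scripted instead of done in a video editor. A minimal sketch using OpenCV, in which the offset argument names and the rotation direction are assumptions for illustration:

    import cv2

    def mouth_to_qcif(frame, left, top, right, bottom):
        # Crop the mouth region from the 720*480 frame using the four offsets.
        mouth = frame[top:bottom, left:right]
        # Shrink to 144*176 (width*height), matching the step described above.
        mouth = cv2.resize(mouth, (144, 176))
        # The original footage lies on its side, so rotate 90 degrees to get
        # a 176*144 picture in the standard QCIF orientation.
        return cv2.rotate(mouth, cv2.ROTATE_90_CLOCKWISE)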

We then use a lip-tracking program, based on a deformable template and color information, to extract the lip parameters from each QCIF-sized file. The template is defined by the left and right corners of the mouth, the height of the upper lip, and the height of the lower lip. The following figure shows an example of the lip-tracking result, with the template superimposed in black lines on the mouth area.

Click the image to view the sample video sequence. For more details about lip tracking and object tracking, please see our face tracking web page; the face tracking toolkit can be extended to track lip movements. The lip parameters are stored in text files, along with the offsets of the mouth picture.
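For readers working with those text files, the template parameters might be carried in code along the following lines; the field names, the pixel units, and the column order when parsing are assumptions, so check the data set's documentation for the actual layout:

    from dataclasses import dataclass

    @dataclass
    class LipTemplate:
        # Deformable template parameters as described above:
        # the two mouth corners and the two lip heights (pixels assumed).
        left_corner_x: float
        left_corner_y: float
        right_corner_x: float
        right_corner_y: float
        upper_lip_height: float
        lower_lip_height: float

    def parse_lip_line(line):
        # One whitespace-separated line per frame is assumed here.
        values = [float(v) for v in line.split()]
        return LipTemplate(*values[:6])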

Secondly, we also provide the waveform files extracted from the QuickTime video files. These waveform files contain the speech signals corresponding to the lip parameters in the text files mentioned above.
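Extracting audio from a QuickTime file can be reproduced with a tool such as ffmpeg, as in the sketch below; the 16 kHz sample rate and 16-bit PCM format are assumptions, not necessarily the settings used in this corpus:

    import subprocess

    def extract_wav(mov_path, wav_path, sample_rate=16000):
        subprocess.run(
            ["ffmpeg", "-y", "-i", mov_path,
             "-vn",                    # drop the video stream
             "-acodec", "pcm_s16le",   # 16-bit PCM audio
             "-ar", str(sample_rate),  # resample to the target rate
             wav_path],
            check=True,
        )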

 

Download our data now.



Contact

Any suggestions or comments are welcome. Please send them to Wende Zhang.
