

Project - Profile-Frontal Audio-Visual Speech Recognition


Team Member

Kshitiz Kumar

kshitizk@ece.cmu.edu


Goal

We aim to introduce Profile View lipreading and compare its recognition accuracy with that of Front View lipreading. We also plan to extend this to audio-visual speech recognition, where information from the visual modality can enhance audio-only speech recognition in noisy environments.

Lipreading is the process of combining the audio and visual modalities to obtain better recognition accuracy than either modality provides alone. Front View lipreading has long been established as an aid to speech recognition in noisy environments. Lipreading also enables simultaneous communication among different speakers and listeners. Consider a real-life example of a round-table conference: only one person can speak at a time while the rest listen, and with multiple simultaneous speakers the communication easily gets garbled. With lipreading we can support many more communication channels, provided each speaker and listener are in each other's line of sight. Lipreading has also been shown to be a boon for hard-of-hearing people. Finally, lipreading gives us an opportunity to read speech when the audio modality is corrupted or unavailable, for example, in making out what players say on a game field.

Front View lipreading is a mature research field, and it is suitable for humans as well. Profile View lipreading may not be suitable for humans (perhaps because we are not accustomed to it), but it is still suitable for machines. Machines care only about features, so as long as we can extract useful lip features and train a model for lipreading, machines should do well irrespective of whether the view is Front or Profile. Profile View lipreading is especially motivated by mobile-phone lipreading applications. While talking on a mobile phone, we normally hold the handset against the side of the face. The frontal view of the mouth is then out of sight of the phone's camera, making Front View lipreading infeasible, but a camera near the bottom tip of the phone could track the profile face and perform Profile View lipreading. In general, Profile View lipreading can be used in any application where Front View lipreading has been used, provided a profile view is available.


System Description

We compare lipreading accuracy for Profile View and Front View. Lip features are extracted from both views. In Profile View, we obtain lip-height and lip-protrusion features, whereas in Front View, we obtain lip-height and lip-width features. Notice that lip protrusion is specific to Profile View and lip width is specific to Front View.
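As a rough illustration, the sketch below computes these geometric features from tracked lip landmarks. It is a minimal sketch under assumed inputs: the landmark points and the facial reference coordinate are hypothetical placeholders for whatever the lip tracker provides, not our actual feature-extraction code.

    import numpy as np

    # Minimal sketch, assuming a lip tracker that returns 2-D pixel
    # coordinates per video frame. All landmark names are hypothetical
    # placeholders, not the tracker's actual output.

    def front_view_features(upper_lip, lower_lip, left_corner, right_corner):
        """Front View features: lip height (mouth opening) and lip width."""
        height = np.linalg.norm(np.subtract(lower_lip, upper_lip))
        width = np.linalg.norm(np.subtract(right_corner, left_corner))
        return np.array([height, width])

    def profile_view_features(upper_lip, lower_lip, lip_front_x, face_ref_x):
        """Profile View features: lip height and lip protrusion, taken
        here as how far the frontmost lip point extends beyond a fixed
        facial reference point visible in profile."""
        height = np.linalg.norm(np.subtract(lower_lip, upper_lip))
        protrusion = lip_front_x - face_ref_x
        return np.array([height, protrusion])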

We train a 2-class Gaussian Mixture Model (GMM) for the observation probabilities and a Hidden Markov Model (HMM) for the state-transition and prior probabilities, where the states are triphones. The Viterbi algorithm, together with the trained model, then decodes the incoming features into words. We performed speaker-dependent isolated-word lipreading separately for Profile and Front View, and then fused the features from both views for joint Profile-Front View lipreading. When fusing features, we leave out the Front View lip-height feature, as lip height is already present among the Profile View features.
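To make the decoding step concrete, below is a minimal log-domain Viterbi sketch with GMM-scored observations, written in Python/NumPy. This is a generic illustration of the standard GMM-HMM recipe under placeholder model parameters, not our implementation.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_loglik(x, weights, means, covs):
        """Log-likelihood of one feature vector under a Gaussian mixture."""
        comps = [np.log(w) + multivariate_normal.logpdf(x, m, c)
                 for w, m, c in zip(weights, means, covs)]
        return np.logaddexp.reduce(comps)

    def viterbi(features, log_prior, log_trans, gmms):
        """Most likely state sequence for a (T, D) feature sequence.
        log_prior: (S,) log state priors; log_trans: (S, S) log
        transition matrix; gmms: list of S (weights, means, covs)."""
        T, S = len(features), len(log_prior)
        delta = np.full((T, S), -np.inf)   # best log score ending in state s
        psi = np.zeros((T, S), dtype=int)  # best predecessor state
        for s in range(S):
            delta[0, s] = log_prior[s] + gmm_loglik(features[0], *gmms[s])
        for t in range(1, T):
            for s in range(S):
                scores = delta[t - 1] + log_trans[:, s]
                psi[t, s] = np.argmax(scores)
                delta[t, s] = scores[psi[t, s]] + gmm_loglik(features[t], *gmms[s])
        # Backtrack the highest-scoring path.
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1]

For the fused Profile-Front View system, the per-frame feature vector is then simply the concatenation of the Profile View height and protrusion features with the Front View width feature, with the redundant Front View height dropped as described above.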

Our lipreading results are plotted below.

From the plot, we observe that Profile View lipreading has an average word error rate of 55.5%, whereas for Front View it is 67.3%; that is, Profile View does better than Front View. This is not surprising: the protrusion parameters in Profile View appear to be more informative than the width parameter in Front View. We did a further analysis of recognition performance with the individual lip features of height, protrusion, and width. We observed that the Profile View protrusion features are almost as good as the height features, but the Front View width parameter is quite indiscriminative. Combining Profile and Front View features further reduced the word error rate to 54.2%. Later, we plan to improve speech recognition in noisy environments with the visual modality.


Download

As the first step of this research, we collected a profile-frontal audio-visual data corpus, which is available to the public. The data set contains the following isolated words:

Went Sent Bent Dent Tent Rent
Hold Cold Told Fold Sold Gold
Pat Pad Pan Path Pack Pass
Lane Lay Late Lake Lace Lame
Kit Bit Fit Hit Wit Sit
Must Bust Gust Rust Dust Just
Teak Team Teal Teach Tear Tease
Din Dill Dim Dig Dip Did
Bed Led Fed Red Wed Shed
Pin Sin Tin Fin Din Win
Dug Dung Duck Dud Dub Dun
Sum Sun Sung Sup Sub Sud
Seep Seen Seethe Seek Seem Seed
Same Name Game Tame Came Fame
Peel Reel Feel Eel Keel Heel
Hark Dark Mark Bark Park Lark
Heave Hear Heat Heal Heap Heath
Cup Cut Cud Cuff Cuss Cud
Thaw Law Raw Paw Jaw Saw
Pen Hen Men Then Den Ten
Puff Puck Pub Pus Pup Pun
Bean Beach Beat Beak Bead Beam
Heat Neat Feat Seat Meat Beat
Dip Sip Hip Tip Lip Rip
Kill Kin Kit Kick King Kid

 

Download our data now.


Contact

Any suggestions or comments are welcome. Please send them to Kshitiz Kumar (kshitizk@ece.cmu.edu).
