We aim to introduce Profile View lipreading and compare its recognition accuracy with that of Front View lipreading. We also plan to work on audio-visual speech recognition, in which visual modality information enhances audio-only speech recognition in noisy environments.
Lipreading recognizes speech from the visual modality, and combining the audio and visual modalities gives better recognition accuracy than either modality alone. Front View lipreading has long been established as an aid to speech recognition in noisy environments. Lipreading also allows simultaneous communication among several speakers and listeners. As a real-life example, in a round-table conference only 1 person can speak at a time while the rest listen; with multiple speakers the communication easily gets garbled. With lipreading we can support many more communication channels, provided that each speaker and listener are in each other's line of sight. Lipreading has also been shown to be a boon for hard-of-hearing people. Finally, lipreading lets us read speech when the audio modality is corrupted or unavailable, for example to find out what players say on a game field.
Front View lipreading is a fairly mature research field, and it is suitable for humans too. Profile View lipreading may not be suitable for humans (perhaps because we are not accustomed to it), but it is still suitable for machines. Machines only care about features, so as long as we can extract useful lip features and train a model for lipreading, machines should do well whether the view is Front or Profile. Profile View lipreading is motivated especially by mobile phone applications. While talking on a mobile phone, we normally hold the handset against the side of the face. The front of the mouth is then out of sight of the phone camera, which makes Front View lipreading infeasible, but a camera near the bottom tip of the phone could track the profile face for Profile View lipreading. In general, Profile View lipreading can be used in the same applications as Front View lipreading, provided that a profile view of the face is available.
We compare lipreading accuracy for Profile View and Front View. Lip features are extracted from both views. In Profile View we obtain lip height and lip protrusion features, whereas in Front View we obtain lip height and lip width features. Note that lip protrusion is specific to Profile View and lip width is specific to Front View.
We train a 2-class Gaussian Mixture Model (GMM) for the observation probabilities and a Hidden Markov Model (HMM) for the state transition and prior probabilities, where the states are triphones. The Viterbi algorithm then uses the trained model to decode the incoming features into words. We performed speaker-dependent isolated-word lipreading separately for Profile View and Front View, and then fused the features from both views for joint Profile-Front View lipreading. When fusing features, we leave out the Front View lip height features, because lip height is already present in the Profile View features.
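The decoding step can be illustrated with a small Viterbi sketch. This is not the code used in the project; it is a minimal log-domain version that assumes the prior, transition and per-frame observation probabilities are already available as plain Python lists. The function name `viterbi`, the helper `safe_log` and the 2-state toy numbers are hypothetical and chosen only for illustration.

```python
import math


def safe_log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(log_prior, log_trans, log_obs):
    """Return the most likely state path and its log probability.

    log_prior[s]    : log P(state s at the first frame)
    log_trans[r][s] : log P(state s at frame t+1 | state r at frame t)
    log_obs[t][s]   : log P(features of frame t | state s), e.g. from a GMM
    """
    n_states = len(log_prior)
    score = [log_prior[s] + log_obs[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(log_obs)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best)
            new_score.append(score[best] + log_trans[best][s] + log_obs[t][s])
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Toy left-to-right model with 2 states and 4 frames (made-up numbers).
log_prior = [safe_log(p) for p in (1.0, 0.0)]
log_trans = [[safe_log(p) for p in row] for row in ((0.6, 0.4), (0.0, 1.0))]
log_obs = [[safe_log(p) for p in row]
           for row in ((0.9, 0.1), (0.8, 0.2), (0.2, 0.8), (0.1, 0.9))]
path, log_prob = viterbi(log_prior, log_trans, log_obs)
print(path, math.exp(log_prob))
```

For isolated-word recognition, one common arrangement is to score the same feature sequence against each word model in this way and pick the word with the highest log probability; whether the project used exactly this arrangement is not stated here.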
Our lipreading results are shown below.
From the plot, we observe that Profile View lipreading has an average recognition accuracy of 55.5%, whereas for Front View it is 67.3%. Not to our surprise, Profile View does better than Front View. This is expected because the protrusion parameters in Profile View appear to be more informative than the width parameter in Front View. We further analyzed recognition accuracy with the individual lip features of height, protrusion and width. We observed that the Profile View protrusion features are almost as good as the height features, but the Front View width parameter is not very discriminative. With both Profile and Front View features, we further improved the recognition accuracy to 54.2%. Later, we plan to use the visual modality to improve speech recognition in noisy environments.
As the first step of this research, we collected a profile-frontal audio-visual data corpus, which is available to the public. In this data set, we have:
3 subjects (2 males and 1 female). This data set will be expanded to 10 subjects.
The vocabulary includes 150 words from the Modified Rhyme Test (MRT). Each word is repeated 10 times.
The database was collected in a soundproof IAC studio. We used extra lights for good illumination and a blue screen as the background. For the Front View recording, we used a SONY DCR TRV 900 digital camcorder for images and a high-quality Sennheiser microphone for audio. For Profile View, we used a SONY DCR TRV 25 digital camcorder for both images and audio. The images were recorded at VGA (640*480) resolution and 30 fps. Audio was initially recorded at 32 kHz, 32 bit, stereo, and was converted to 16 kHz, 16 bit, mono in a later data-preprocessing step. The data recorded on miniDV tapes was transferred to a PC in Windows Media Video (WMV) format using Windows Movie Maker.
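The audio conversion in the preprocessing step (32 kHz, 32 bit, stereo to 16 kHz, 16 bit, mono) can be sketched as follows. The tool actually used for the corpus is not stated, so this is only an illustration. It assumes the samples are signed 32-bit integer PCM held in two Python lists, one per channel, and all function names are hypothetical. The 3-tap smoothing kernel is a deliberately crude anti-aliasing filter; a real pipeline would use a proper resampler.

```python
def stereo_to_mono(left, right):
    """Average the two channels sample by sample."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


def halve_sample_rate(samples):
    """Go from 32 kHz to 16 kHz: smooth with a [1/4, 1/2, 1/4] kernel
    (a crude low-pass), then keep every second sample."""
    n = len(samples)
    out = []
    for i in range(0, n, 2):
        prev = samples[i - 1] if i > 0 else samples[i]
        nxt = samples[i + 1] if i + 1 < n else samples[i]
        out.append(0.25 * prev + 0.5 * samples[i] + 0.25 * nxt)
    return out


def to_int16(samples, in_bits=32):
    """Requantize signed in_bits-wide PCM values to the 16-bit range,
    clipping anything that rounds outside [-32768, 32767]."""
    scale = 2 ** (in_bits - 16)
    return [max(-32768, min(32767, int(round(s / scale)))) for s in samples]


def convert(left, right):
    """32 kHz / 32 bit / stereo -> 16 kHz / 16 bit / mono."""
    return to_int16(halve_sample_rate(stereo_to_mono(left, right)))


# Made-up constant signal: 8 stereo samples become 4 mono samples.
print(convert([100 * 65536] * 8, [300 * 65536] * 8))
```

Averaging the channels before downsampling keeps the filter work to a single channel, which is a design choice of this sketch rather than a requirement.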
Both the miniDV raw data and the WMV files are available upon request. Each WMV file is around 600 MB.
Data Download Formats
Text files with the lip features. The text files contain the lip features for each set of utterances by the subjects. Remember that each set of utterances consists of 150 words. In Profile View, we extracted a total of 4 features: 2 for lip height and 2 for lip protrusion. The first column in the text file is the frame number, the next 2 columns are the upper and lower lip height respectively, and the next 2 are the upper and lower lip protrusion respectively. In Front View feature files, the first column is the frame number, the next 2 columns are the upper and lower lip height, and the last column is the lip width parameter.
The waveform files. These files contain audio in WAV format at 16 kHz, 16 bit, mono.
Segmentation files. These text files contain video frame segmentation information corresponding to each word in the vocabulary. The segmentation is required for isolated word training and testing.
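As an illustration of the feature file layout described above, the sketch below parses Profile View and Front View feature rows and builds the fused feature vector: the 4 Profile View features plus the Front View lip width, leaving out the Front View lip height. It rests on assumptions that are not stated on this page: that columns are separated by whitespace, and that the two views can be matched by frame number. The function names and the sample values are hypothetical.

```python
def parse_feature_rows(text, n_features):
    """Parse lines of 'frame f1 f2 ...' into {frame_number: [features]}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != n_features + 1:
            raise ValueError("expected %d columns, got %d"
                             % (n_features + 1, len(fields)))
        rows[int(fields[0])] = [float(x) for x in fields[1:]]
    return rows


def fuse_views(profile_rows, front_rows):
    """Fuse per-frame features from both views.

    profile features: [upper height, lower height,
                       upper protrusion, lower protrusion]
    front features:   [upper height, lower height, width]
    fused features:   the 4 profile features followed by the front width.
    """
    fused = {}
    for frame in sorted(profile_rows.keys() & front_rows.keys()):
        fused[frame] = profile_rows[frame] + [front_rows[frame][2]]
    return fused


# Made-up sample rows in the assumed whitespace-separated layout.
profile_text = "1 10.0 12.0 3.0 2.5\n2 11.0 13.0 3.5 2.0\n"
front_text = "1 10.5 12.5 40.0\n2 11.5 13.5 42.0\n3 11.0 13.0 41.0\n"

profile = parse_feature_rows(profile_text, n_features=4)
front = parse_feature_rows(front_text, n_features=3)
for frame, features in fuse_views(profile, front).items():
    print(frame, features)
```

Frames that appear in only one view are dropped here; how the original experiments aligned the two recordings is not described.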
Vocabulary
Went | Sent | Bent | Dent | Tent | Rent |
Hold | Cold | Told | Fold | Sold | Gold |
Pat | Pad | Pan | Path | Pack | Pass |
Lane | Lay | Late | Lake | Lace | Lame |
Kit | Bit | Fit | Hit | Wit | Sit |
Must | Bust | Gust | Rust | Dust | Just |
Teak | Team | Teal | Teach | Tear | Tease |
Din | Dill | Dim | Dig | Dip | Did |
Bed | Led | Fed | Red | Wed | Shed |
Pin | Sin | Tin | Fin | Din | Win |
Dug | Dung | Duck | Dud | Dub | Dun |
Sum | Sun | Sung | Sup | Sub | Sud |
Seep | Seen | Seethe | Seek | Seem | Seed |
Same | Name | Game | Tame | Came | Fame |
Peel | Reel | Feel | Eel | Keel | Heel |
Hark | Dark | Mark | Bark | Park | Lark |
Heave | Hear | Heat | Heal | Heap | Heath |
Cup | Cut | Cud | Cuff | Cuss | Cub |
Thaw | Law | Raw | Paw | Jaw | Saw |
Pen | Hen | Men | Then | Den | Ten |
Puff | Puck | Pub | Pus | Pup | Pun |
Bean | Beach | Beat | Beak | Bead | Beam |
Heat | Neat | Feat | Seat | Meat | Beat |
Dip | Sip | Hip | Tip | Lip | Rip |
Kill | Kin | Kit | Kick | King | Kid |
Any suggestions or comments are welcome. Please send them to Kshitiz Kumar.