4.3 SpeakingFaces LipReading (SFLR)
4.3.2 Data Preparation
The raw data required cleaning and adjustment. In particular, the visual and thermal recordings had to be aligned: although the speaker sat in a single position and the two cameras were rigidly mounted, they still had slightly different viewing angles. In addition, the thermal camera's autofocus would occasionally shift the frame. The corresponding visual and thermal images therefore had to be matched manually as part of the data preprocessing stage. Because the recordings were collected over an interval of nearly two months, it became apparent that the thermal camera's autofocus settled differently on each recording day, so more than one alignment was needed.
Unfortunately, the data collection and occasional re-shoots introduced minor complications in session identification and alignment. The issue was resolved by manually classifying each session; as a result, I distinguished 12 alignment classes.
Figure 4-3: An example of the alignment process for the SFLR dataset's subject
The alignment was performed by detecting the lip landmarks in a visual frame and matching them with the lips in the corresponding thermal image, so that when cropping the region of interest (ROI) in the visual image, the program could use the same coordinates, offset by the per-session shift, to crop the ROI in the thermal image (Fig. 4-3). After verifying that the alignment was correct for several random frames of a given session, the vertical and horizontal shift values were recorded and applied to all frames in that session.
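The following sketch illustrates this cropping step. It is an illustrative reconstruction rather than the actual preprocessing script: it assumes dlib's 68-point landmark predictor and OpenCV, the function name and padding value are placeholders, and dx, dy stand for the manually verified per-session shift values.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_roi(visual_frame, thermal_frame, dx, dy, pad=10):
    """Crop the lip ROI from the visual frame and reuse the same coordinates,
    offset by the session shift (dx, dy), on the thermal frame."""
    gray = cv2.cvtColor(visual_frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None, None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 outline the outer and inner lip contours.
    lips = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    x0, y0 = lips.min(axis=0) - pad
    x1, y1 = lips.max(axis=0) + pad
    visual_roi = visual_frame[y0:y1, x0:x1]
    thermal_roi = thermal_frame[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
    return visual_roi, thermal_roi
```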
The artifacts common to the SF dataset, such as "freezing" of the thermal camera's stream, frame flickering, and image blur, were detected during data collection; the corresponding sections were deleted and re-shot on the spot, so there was no need to search for them afterwards.
Additionally, preliminary experiments with the dataset showed that the audio volume was uneven from one video to another, as the volume of the synthesized audio sometimes differed noticeably from the original. Therefore, all audio tracks were normalized with the ffmpeg-normalize tool in accordance with the EBU R128 loudness normalization standard.
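A batch pass of this kind can be scripted as in the sketch below; this assumes the ffmpeg-normalize command-line tool is installed (EBU R128 is its default mode, targeting -23 LUFS) and is not the exact command used for the dataset.

```python
import subprocess
from pathlib import Path

def normalize_audio(in_dir: str, out_dir: str) -> None:
    """Loudness-normalize every WAV file in in_dir into out_dir (EBU R128)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(in_dir).glob("*.wav")):
        subprocess.run(
            ["ffmpeg-normalize", str(wav),
             "-nt", "ebu",                       # EBU R128 loudness normalization
             "-o", str(Path(out_dir) / wav.name),
             "-f"],                              # overwrite existing output files
            check=True,
        )
```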
Chapter 5
Results and Analysis
The goal of this chapter is to present and discuss the results of my thesis work using the metrics described above: STOI, ESTOI, PESQ, and WRR. Based on the literature review, I identified the current state-of-the-art model for lip2speech systems, Lip2Wav, and downloaded and installed it locally on the DGX computational environment of the Institute for Smart Systems and Artificial Intelligence. I configured the system, conducted test runs, and then adapted it to include the thermal data stream. Finally, I trained the system on the SFLR dataset.
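For reference, the objective metrics for a single reference/synthesized pair can be computed as in the sketch below, assuming the pystoi and pesq Python packages and 16 kHz audio; it mirrors the evaluation protocol only in outline, not the exact scripts used.

```python
import librosa
from pystoi import stoi
from pesq import pesq

def evaluate_pair(ref_path: str, synth_path: str, sr: int = 16000) -> dict:
    """Compute STOI, ESTOI, and wide-band PESQ for one utterance pair."""
    ref, _ = librosa.load(ref_path, sr=sr)
    synth, _ = librosa.load(synth_path, sr=sr)
    n = min(len(ref), len(synth))            # compare equal-length signals
    ref, synth = ref[:n], synth[:n]
    return {
        "STOI": stoi(ref, synth, sr, extended=False),
        "ESTOI": stoi(ref, synth, sr, extended=True),
        "PESQ": pesq(sr, ref, synth, "wb"),  # wide-band PESQ at 16 kHz
    }
```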
Table 5.1 presents the results of the trained model on the visual images of the SpeakingFaces LipReading data and compares them with those reported in the original Lip2Wav paper. As shown in the table, the performance metrics for the visual stream are lower when the system is run on the SFLR dataset than on the Lip2Wav dataset, but they are comparable to the overall results in the field.
Table 5.2 shows the results of training the Lip2Wav model on the SFLR dataset's
Table 5.1: The results of training Lip2Wav on the original dataset and on the SFLR visual images
Dataset STOI ESTOI PESQ
Lip2Wav [29] 0.282 0.183 1.671
SpeakingFaces LipReading 0.134 0.041 1.395
Table 5.2: The results of training Lip2Wav on different inputs from the SFLR dataset
Channels STOI ESTOI PESQ WRR
Visual 0.134 0.041 1.395 14.2%
Thermal 0.045 0.002 1.141 0.00%
Both 0.125 0.031 1.372 14.3%
different types of data: visual images only, thermal images only, and both streams simultaneously. The metrics of the thermal-only model are significantly worse than those of the visual-only model. This can be attributed to the fact that a thermal image contains less information about facial features than the corresponding visual image. Taking into account the small values of the metrics for thermal-only training and the insignificant difference between the visual-only and combined-input models, I conclude that the thermal images, at their current resolution, do not contribute any significant information for lipreading in this model.
Additionally, it should be pointed out that the word accuracy scores do not meet the state-of-the-art standards. Because the metric is derived by running a speech recognition model on the synthesized audio, the limited intelligibility of that audio accounts for the low scores. Suggestions for improvement are enumerated in the conclusion.
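For clarity, the per-utterance WRR can be obtained as sketched below, treating WRR as word accuracy (one minus the word error rate); the transcribe() call is a hypothetical placeholder for whichever speech recognition model is used, and only the jiwer-based scoring is concrete.

```python
from jiwer import wer

def word_recognition_rate(reference: str, asr_hypothesis: str) -> float:
    """WRR = 1 - WER, clipped at zero for heavily garbled hypotheses."""
    return max(0.0, 1.0 - wer(reference, asr_hypothesis))

# Example usage (transcribe() is a placeholder for the ASR model):
# hypothesis = transcribe("synthesized_utterance.wav")
# print(word_recognition_rate(ground_truth_transcript, hypothesis))
```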
The training was performed on a DGX-2 server. All of the preprocessing and training procedures were conducted using a set of Python programs adapted from the authors' original source code (available at https://github.com/Rudrabha/Lip2Wav).
The environment was set up in accordance with the directions of the authors of Lip2Wav [29].
As shown in the literature review section, lip2text models achieve higher performance than lip2speech models, as measured by the Word Recognition Rate. Taking this into account, if we compare the achieved results directly against the published results of other lip2speech models, the implemented system is comparable to the others but does not improve on them. In short, I was able to configure and adapt a state-of-the-art system and replicate comparable results, but not yet to improve on them.
Chapter 6
Conclusion
The level of interest in lipreading systems has increased in recent years, due to rapid improvements in system performance and the potential utility of lipreading in applications ranging from human-computer interaction to speech2text systems for hearing-impaired people. However, the challenge of lipreading has not yet been fully met: the results obtained from silent video rarely exceed a Word Recognition Rate (WRR) of 85%, leaving substantial room for improvement.
This thesis examines the conjecture that the recognition rate could be improved by augmenting visual image data with aligned thermal image data. Recent improvements in the resolution of thermal cameras provide an increased level of facial feature granularity that could contribute additional information to the machine learning process and thereby potentially improve lipreading accuracy.
Upon reviewing the recent literature and assessing the current state of the art, I chose to base my work on the Lip2Wav model, as described in the Methodology, and to adapt the system to incorporate the thermal data.
There are few existing datasets that include aligned thermal data, as noted in the Literature Review. One of the largest such datasets is known as SpeakingFaces, with which I began my initial investigations by conducting data preprocessing and preliminary analytics. However, I determined that for the purpose of this study it was necessary to have extended data collected from individual speakers, beyond the approximately 20 minutes of utterances per speaker available in the SF dataset.
To this end, the ISSAI team designed an extended version known as SpeakingFaces LipReading (SFLR), consisting of approximately two hours of recordings of a single speaker, collected under the same conditions as the original SF dataset.
I obtained the code for the open-source Lip2Wav system, configured it for local execution, and then adapted the system to take into account the thermal data provided in the novel SFLR dataset. I conducted experiments on three variations of the data streams: the visual image stream alone, the thermal image stream alone, and the two combined. As shown in the Results, I was able to replicate the system and generate comparable results for the PESQ measure on the visual stream, but the PESQ results were lower on the thermal and combined streams.
Upon reflection, the system can be further enhanced by enlarging and improving the dataset. First, as the SFLR dataset's transcripts consist of rather complex phrases, collecting additional data could improve the training results. Second, the thermal and visual camera images were not matched pixel-by-pixel, i.e., there is still a minor shift in the viewing angle, which affects the precision of the alignment; refinements of the recording setup could have a positive impact on the performance of the model. Additionally, an extended dataset could include variations in head posture, as Kumar et al. [16] pointed out that multi-view data gives better results than single-view data.
Changes to the model may also improve the results. Apart from further fine-tuning of the Lip2Wav system, one can try alternative fusion approaches for the combined model, such as first encoding each image separately and then concatenating the features [15], as well as more complex architectures [9, 13].
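To make the distinction concrete, the sketch below contrasts early fusion (stacking the thermal frame onto the visual channels before a single encoder) with late fusion (separate encoders whose features are concatenated). It is an illustrative PyTorch example with arbitrary layer sizes, not the architecture used in these experiments.

```python
import torch
import torch.nn as nn

def small_encoder(in_channels: int) -> nn.Module:
    """A toy 3D-CNN encoder producing a 32-dimensional clip embedding."""
    return nn.Sequential(
        nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool3d(1),   # (B, 32, 1, 1, 1)
        nn.Flatten(),              # (B, 32)
    )

class EarlyFusion(nn.Module):
    """Concatenate streams along the channel axis, then encode jointly."""
    def __init__(self):
        super().__init__()
        self.encoder = small_encoder(in_channels=4)   # 3 RGB + 1 thermal

    def forward(self, visual, thermal):
        # visual: (B, 3, T, H, W); thermal: (B, 1, T, H, W), spatially aligned
        return self.encoder(torch.cat([visual, thermal], dim=1))

class LateFusion(nn.Module):
    """Encode each stream separately, then concatenate the feature vectors."""
    def __init__(self):
        super().__init__()
        self.visual_enc = small_encoder(in_channels=3)
        self.thermal_enc = small_encoder(in_channels=1)

    def forward(self, visual, thermal):
        return torch.cat([self.visual_enc(visual),
                          self.thermal_enc(thermal)], dim=1)
```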
Another option is to adapt other lipreading models, not necessarily lip2speech or speaker-specific ones, to test the hypothesis. Furthermore, it is recommended to use the POLQA metric, an improved successor of PESQ, for assessing synthesized audio quality once its implementation becomes publicly available.
As future work, since the Lip2Wav system as implemented produces synthesized audio tracks, such output could feasibly be used as input to a similar lip2text system, following Kaldi- or ESPnet-based model recipes, so as to facilitate the association of the ROI with specific audio outputs during deep learning. Possible findings include the detection of movement patterns unique to thermal images, higher lipreading performance from adding thermal video on top of the visual images, increased robustness of audio-visual speech recognition in adverse environments, and an investigation of how the inclusion of audio input affects each of these methods.
Appendix A
Tables
Table A.1: Literature Review: Datasets
Name Year Type Classes Speakers Resolution Duration
TIMIT 1989 Sentences 6300 360 - 30 hours
IBM ViaVoice 2000 Sentences 10,500* 290 704 × 480, 30 fps 50 hours
VIDTIMIT [39] 2002 Sentences 346 43 512 × 384, 25 fps 30 minutes
AVICAR [3] 2004 Sentences 1317 86 720 × 480, 30 fps 33 hours
Tottori [32] 2006 Words 5 3 720 × 480, 30 fps 4 minutes
GRID [38] 2006 Phrases 1000 34 720 × 576, 25 fps 28 hours
OuluVS1 [24] 2009 Phrases 10 20 720 × 576, 25 fps 16 minutes
MIRACL-VC1 [22] 2014 Words, Phrases 10 15 640 × 480, 15 fps 3 hours
OuluVS2 [25] 2015 Phrases, Sentences 10 52 1920 × 1080, 30 fps 2 hours
TCD-TIMIT [37] 2015 Sentences 6913 62 1920 × 1080, 30 fps 6 hours
LRW [19] 2016 Words 500 1000+ 256 × 256, 25 fps 111 hours
LRS [19] 2017 Sentences 17,428* 1000+ 160 × 160, 25 fps 328 hours
MV-LRS [19] 2017 Sentences 14,960 1000+ 160 × 160, 25 fps 207 hours
LRW-1000 [20] 2019 Syllables 1000 2000+ 1024 × 576, 25 fps 57 hours
Lip2Wav [29] 2020 Sentences 5000 5 various, 25-30 fps 120 hours
SpeakingFaces [1] 2020 Phrases 1800 142 768 × 512 (visual), 464 × 348 (thermal), 28 fps 45 hours
Table A.2: Literature Review: Papers
Year Reference Database Extractor Classifier Accuracy
2006 Saitoh and Konishi [32] Tottori LDA Eigenimage waveform + DP matching RGB: 76.00%, Thr: 44.00%, Both: 80.00%
2016 Wand et al. [40] GRID Eigenlips SVM V: 70.60%
2016 Wand et al. [40] GRID HOG SVM V: 71.30%
2016 Wand et al. [40] GRID Feed-forward LSTM V: 79.60%
2016 Assael et al. [2] GRID CNN Bi-GRU V: 95.20%
2016 Chung and Zisserman [6] LRW CNN CNN V: 61.10%
2016 Chung and Zisserman [6] OuluVS1 CNN CNN V: 91.40%
2016 Chung and Zisserman [6] OuluVS2 CNN CNN V: 93.20%
2016 Petridis and Pantic [26] OuluVS1 DBNF + DCT LSTM V: 81.80%
2017 Chung and Zisserman [7] OuluVS2 CNN LSTM+attention V: 88.90%
2017 Chung and Zisserman [7] MV-LRS CNN LSTM+attention V: 37.20%
2017 Chung et al. [5] LRS CNN LSTM+attention V: 49.80%, A: 37.10%, AV: 58.00%
2017 Petridis et al. [28] OuluVS2 Autoencoder Bi-LSTM V: 96.90%
2017 Stafylakis and Tzimiropoulos [35] LRW 3D-CNN + ResNet Bi-LSTM V: 83.00%, A: 97.72%
2017 Le Cornu and Milner [18] GRID AAM RNN V: 33%, ESTOI: 0.434, PESQ: 1.686
2017 Ephrat and Peleg [11] GRID CNN CNN STOI: 0.584, PESQ: 1.190
2017 Ephrat et al. [10] GRID CNN CNN STOI: 0.7, ESTOI: 0.462, PESQ: 1.922
2017 Ephrat et al. [10] TCD-TIMIT CNN CNN STOI: 0.63, ESTOI: 0.447, PESQ: 1.612
2018 Chung and Zisserman [8] LRW CNN LSTM V: 66.00%
2018 Chung and Zisserman [8] OuluVS1 CNN LSTM V: 94.10%
2018 Petridis et al. [27] LRW CNN ResNet + Bi-GRU V: 83.39%, A: 97.72%, AV: 98.38%
2019 Kumar et al. [16] OuluVS2 VGG-16 + STCNN Bi-GRU V: 97.00%, PESQ: 2.002
2019 Yang et al. [41] LRW CNN 3D-DenseNet V: 78.00%
2019 Yang et al. [41] LRW-1000 CNN 3D-DenseNet V: 34.76%
2020 Martinez et al. [21] LRW CNN ResNet+MS-TCN V: 85.30%, A: 98.46%, AV: 98.96%
2020 Martinez et al. [21] LRW-1000 CNN ResNet+MS-TCN V: 41.10%
2020 Prajwal et al. [29] GRID 3D-CNN LSTM + attention (Tacotron 2) V: 85.92%, STOI: 0.731, ESTOI: 0.535, PESQ: 1.772
2020 Prajwal et al. [29] TCD-TIMIT 3D-CNN LSTM + attention (Tacotron 2) V: 68.74%, STOI: 0.558, ESTOI: 0.365, PESQ: 1.350
2020 Prajwal et al. [29] LRW 3D-CNN LSTM + attention (Tacotron 2) V: 65.80%, STOI: 0.543, ESTOI: 0.344, PESQ: 1.197
2020 Prajwal et al. [29] Lip2Wav 3D-CNN LSTM + attention (Tacotron 2) STOI: 0.416, ESTOI: 0.284, PESQ: 1.300
Bibliography
[1] M. Abdrakhmanova, A. Kuzdeuov, S. Jarju, Y. Khassanov, M. Lewis, and H.A. Varol. Speakingfaces: A large-scale multimodal dataset of voice commands with visual and thermal video streams. Sensors, 21(10):3465, January 2021.
[2] Y.M. Assael, B. Shillingford, S. Whiteson, and N. De Freitas. Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016.
[3] AVICAR corpus. Available at: http://www.isle.illinois.edu/sst/AVICAR/.
[4] J.G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl. Perceptual objective listening quality assessment (polqa), the third generation itu-t standard for end-to-end speech quality measurement part i—temporal alignment.
[5] J.S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3444–3453. IEEE, July 2017.
[6] J.S. Chung and A. Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision, pages 87–103. Springer, Cham, November 2016.
[7] J.S. Chung and A.P. Zisserman. Lip reading in profile. British Machine Vision Association and Society for Pattern Recognition, 2017.
[8] J.S. Chung and A.P. Zisserman. Learning to lip read words by watching videos.
Computer Vision and Image Understanding Journal, 173:76–85, 2018.
[9] L. Ding, Y. Wang, R. Laganiere, D. Huang, and S. Fu. Convolutional neural networks for multispectral pedestrian detection.
[10] A. Ephrat, T. Halperin, and S. Peleg. Improved speech reconstruction from silent video. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 455–462. IEEE, 2017.
[11] A. Ephrat and S. Peleg. Vid2speech: speech reconstruction from silent video. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5095–5099. IEEE, 2017.
[12] A. Fernandez-Lopez and F.M. Sukno. Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing Journal, 78:53–72, 2018.
[13] H. Choi, S. Kim, K. Park, and K. Sohn. Multi-spectral pedestrian detection based on accumulated object proposal with fully convolutional networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), page 621, 2016.
[14] J. Jensen and C.H. Taal. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers.
[15] B. Khalid, A. M. Khan, M. U. Akram, and S. Batool. Person detection by fusion of visible and thermal images using convolutional neural network. In 2019 2nd International Conference on Communication, Computing and Digital systems (C-CODE), page 143, 2019.
[16] Y. Kumar, R. Jain, K.M. Salik, R.R. Shah, Y. Yin, and R. Zimmermann. Lipper:
Synthesizing thy speech using multi-view lipreading. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2588–2595. IEEE, 2019.
[17] L.F. Lamel, R.H. Kassel, and S. Seneff. Speech database development: Design and analysis of the acoustic-phonetic corpus. January 1989.
[18] T. Le Cornu and B. Milner. Generating intelligible audio speech from visual speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(9):1751–1761, June 2017.
[19] Lip Reading in the Wild and Lip Reading Sentences in the Wild Datasets. Avail- able at: https://www.bbc.co.uk/rd/projects/lip-reading-datasets.
[20] LRW-1000: Lip Reading database. Available at: http://vipl.ict.ac.cn/en/view_database.php?id=13.
[21] B. Martinez, P. Ma, S. Petridis, and M. Pantic. Lipreading using temporal convolutional networks. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6319–6323. IEEE, May 2020.
[22] MIRACL-VC1 dataset. Available at: https://sites.google.com/site/achrafbenhamadou/-datasets/miracl-vc1.
[23] N. Harte and E. Gillen. Tcd-timit: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 17(5):603–615, February 2015.
[24] OuluVS database. Available at: https://www.oulu.fi/cmvs/node/41315.
[25] OuluVS2 database. Available at: http://www.ee.oulu.fi/research/imag/OuluVS2/.
[26] S. Petridis and M. Pantic. Deep complementary bottleneck features for visual speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2304–2308. IEEE, March 2016.
[27] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic. End-to-end audiovisual speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6548–6552. IEEE, April 2018.
[28] S. Petridis, Y. Wang, Z. Li, and M. Pantic. End-to-end multi-view lipreading. arXiv preprint arXiv:1709.00443, 2017.
[29] K.R. Prajwal, R. Mukhopadhyay, V.P. Namboodiri, and C.V. Jawahar. Learning individual speaking styles for accurate lip to speech synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13796–13805. IEEE, 2020.
[30] K. Pujar, S. Chickerur, and M.S. Patil. Combining rgb and depth images for in- door scene classification using deep learning. In 2017 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), pages 1–8. IEEE, December 2017.
[31] A.W. Rix, J.G. Beerends, M.P. Hollier, and A.P. Hekstra. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 749–752. IEEE, 2001.
[32] T. Saitoh and R. Konishi. Lip reading using video and thermal images. In 2006 SICE-ICASE International Joint Conference, pages 5011–5015. IEEE, October 2006.
[33] J. Shen, R. Pang, R. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. Saurous, Y. Agiomvrgiannakis, and Y. Wu. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783, 2018.
[34] I. Shopovska, L. Jovanov, and W. Philips. Deep visible and thermal image fusion for enhanced pedestrian visibility.
[35] T. Stafylakis and G. Tzimiropoulos. Combining residual networks with lstms for lipreading. arXiv preprint arXiv:1703.04105, 2017.
[36] C.H. Taal, R.C. Hendriks, R. Heusdens, and J. Jensen. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4214–4217. IEEE, 2010.
[37] TCD-TIMIT corpus. Available at: https://sigmedia.tcd.ie/TCDTIMIT/.
[38] The GRID audiovisual sentence corpus. Available at: http://spandh.dcs.shef.ac.uk/gridcorpus/.
[39] VidTIMIT Audio-Video Dataset. Available at: http://conradsanderson.id.au/vidtimit/.