Naturalness of Synthesized Speech: A Subjective-Comparative Study Utilizing ACR Listening-Opinion Tests between Siri and Google Translate Using News Content

Authors

  • Phisit Pornpongtechavanich
  • Therdpong Daengsi

Abstract

Text-to-Speech (TTS) synthesis is currently one of the most important technologies based on human language processing. However, the naturalness of synthesized speech remains an interesting open issue. Therefore, this article applies a speech quality assessment method to assess the naturalness of Thai synthesized speech from two popular TTS systems, Siri and Google Translate. For the methodology, Thai synthesized speech of two royal news items and two general news items (COVID-19), generated by Siri and Google Translate, was assessed by sixteen Thai male volunteers and sixteen Thai female volunteers. Overall, the naturalness Mean Opinion Score (MOS) of Thai synthesized speech from Google Translate is 3.53 ± 0.67, which is higher than the 3.16 ± 0.77 obtained for Siri. Furthermore, statistical analysis using a t-test gives a p-value of 0.037. In conclusion, the speech synthesis engine in Google Translate provides significantly better naturalness than the one in Siri. The methodology in this article can also be applied to assess the naturalness of other applications, services, and systems in order to improve synthesized speech quality.

Keywords: TTS; MOS; naturalness; synthesized speech
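As an illustration of the arithmetic behind the reported figures, the following minimal sketch computes a MOS with its standard deviation from ACR ratings on the five-point scale, and a t statistic from the summary values quoted in the abstract. Several points are assumptions rather than details taken from the article: the ratings list is hypothetical, the function names `mos_summary` and `welch_t` are illustrative, the sample size of 32 per system is inferred from the thirty-two listeners, and Welch's unequal-variance form is used because the abstract does not state which t-test variant was applied. The sketch is therefore not expected to reproduce the reported p-value exactly.

```python
from math import sqrt
from statistics import mean, stdev


def mos_summary(ratings):
    """Return (MOS, sample standard deviation) for ACR ratings on the 1-5 scale."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ACR ratings must lie on the 1-5 scale")
    return mean(ratings), stdev(ratings)


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and approximate degrees of freedom from summary statistics."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group A
    var_b = sd_b ** 2 / n_b  # squared standard error of group B
    t = (mean_a - mean_b) / sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical ratings from eight listeners (NOT data from the study).
hypothetical_ratings = [4, 3, 4, 5, 3, 4, 2, 4]
mos, sd = mos_summary(hypothetical_ratings)

# Summary values quoted in the abstract; n = 32 per system is an assumption.
t_stat, dof = welch_t(3.53, 0.67, 32, 3.16, 0.77, 32)

print(f"Hypothetical MOS: {mos:.2f} +/- {sd:.2f}")
print(f"Welch t = {t_stat:.2f}, df = {dof:.1f}")
```

If the original analysis paired each listener's ratings of the two systems, a paired t-test on the per-listener differences would be the more appropriate form; that requires the raw scores, which are not given here.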

Published

2022-01-10