Naturalness of Synthesized Speech: A Subjective-Comparative Study Utilizing ACR Listening-Opinion Tests between Siri and Google Translate Using News Content
Abstract
Text-to-speech (TTS) synthesis is currently one of the most important technologies based on human language processing. However, the naturalness of synthesized speech remains an interesting open issue. This article therefore applies a speech quality assessment method to evaluate the naturalness of Thai synthesized speech from two popular TTS systems, Siri and Google Translate. For the methodology, Thai synthesized speech associated with two royal news items and two general news items (COVID-19), produced by Siri and Google Translate, was assessed by sixteen Thai male volunteers and sixteen Thai female volunteers. Overall, the naturalness Mean Opinion Score (MOS) of the Thai synthesized speech produced by Google Translate was 3.53 ± 0.67, higher than the 3.16 ± 0.77 obtained for Siri. Furthermore, statistical analysis using a t-test gave a p-value of 0.037, which is below the conventional 0.05 significance level. In conclusion, the speech synthesis engine in Google Translate provides significantly better naturalness than the one in Siri. The methodology in this article can therefore be applied to assess the naturalness of other applications, services, or systems in order to improve synthesized speech quality.
Keywords: TTS; MOS; naturalness; synthesized speech
References
Capes, T., Coles, P., Conkie, A., Golipour, L., Hadjitarkhani, A., Hu, Q., Huddleston, N., Hunt, M., Li, J., Neeracher, M., Prahallad, K., Raitio, T., Rasipuram, R., Townsend, G., Williamson, B., Winarsky, D., Wu, Z., & Zhang, H. (2017). Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System. In Proc. of INTERSPEECH. (pp. 4011-4015). Stockholm, Sweden. Retrieved February 9, 2021, from: https://pdfs.semanticscholar.org/702e/aa99bcb366d08d7f450ed7e354f9f6920b23.pdf
Cardoso, W., Smith, G., & Fuentes, C. G. (2015). Evaluating text-to-speech synthesizers. In Proc. of EUROCALL. (pp. 108-113). Padova, Italy. Retrieved February 9, 2021, from: https://files.eric.ed.gov/fulltext/ED564181.pdf
Csapo, T. G. (2020). Increasing the Naturalness of Synthesized Speech. Retrieved February 9, 2021, from: http://smartlab.tmit.bme.hu/csapo/downloads/Csapo-phonetician2012-paper.pdf
Daengsi, T., & Pornpongtechavanich, P. (2021). Quality of Experience: Comparison of Synthesized Speech Naturalness Between Apple’s Siri and Google Translate Referring to Thai Language. In Proc. of ICCCI 2021. Coimbatore, India.
Daengsi, T., Preechayasomboon, A., Sukparungsee, S., & Wutiwiwatchai, C. (2012). Thai Text Resource: A Recommended Thai Text Set for Voice Quality Measurements and Its Comparative Study. KKU Science Journal, 40(4), 1114-1127. Retrieved February 9, 2021, from: http://scijournal.kku.ac.th/files/Vol_40_No_4_P_1114-1127.pdf
Daengsi, T., Wutiwiwatchai, C., Preechayasomboon, A., & Sukparungsee, S. (2014). IP Telephony: Comparison of Subjective Assessment Methods for Voice Quality Evaluation. Walailak Journal of Science and Technology, 11(2), 87-92. Retrieved February 9, 2021, from: https://wjst.wu.ac.th/index.php/wjst/article/view/577/353
Daengsi, T., & Wuttidittachotti, P. (2019). QoE Modeling for Voice over IP: Simplified E-model Enhancement Utilizing the Subjective MOS Prediction Model – A Case of G.729 and Thai Users. Journal of Network and Systems Management, 27(4), 837–859. Retrieved February 9, 2021, from: https://link.springer.com/article/10.1007/s10922-018-09487-4
Daengsi, T., Yochanang, K., & Wuttidittachotti, P. (2013). A Study of Perceptual VoIP Quality Evaluation with Thai Users and Codec Selection Using Voice Quality - Bandwidth Tradeoff Analysis. In Proc. of 4th ICTC. (pp. 691-696). Jeju Island, Korea. Retrieved February 9, 2021, from: http://www2.it.kmutnb.ac.th/teacher/FileDL/Kiattisak1712255614465.pdf
Dall, R., Yamagishi, J., & King, S. (2014). Rating Naturalness in Speech Synthesis: The Effect of Style and Expectation. In Proc. of Conference contribution. (pp. 1-5). Dublin, Ireland. Retrieved February 9, 2021, from: https://core.ac.uk/download/pdf/24060899.pdf
Dinh, T., Kain, A., Samlan, R., Cao, B., & Wang, J. (2020). Increasing the Intelligibility and Naturalness of Alaryngeal Speech Using Voice Conversion and Synthetic Fundamental Frequency. In Proc. of INTERSPEECH. (pp. 4781-4785). Shanghai, China. Retrieved February 9, 2021, from: https://isca-speech.org/archive/Interspeech_2020/pdfs/1196.pdf
Google Play. (2020). Google Translate. Retrieved February 9, 2021, from: https://play.google.com/store/apps/details?id=com.google.android.apps.translate&hl=en
ITU-T Recommendation P.800. (1996). Methods for subjective determination of transmission quality. Retrieved February 9, 2021, from: http://www.itu.int/rec/T-REC-P.800-199608-I
ITU-T Recommendation P.800.2. (2016). Mean opinion score interpretation and reporting. Retrieved February 9, 2021, from: https://www.itu.int/rec/T-REC-P.800.2/en
ITU-T Recommendation P.805. (2007). Subjective evaluation of conversational quality. Retrieved February 9, 2021, from: https://www.itu.int/rec/T-REC-P.805/en
Kertkeidkachorn, N., Chanjaradwichai, S., Punyabukkana, P., & Suchato, A. (2014). CHULA TTS: A modularized text-to-speech framework. In Proc. of PACLIC. (pp. 414–421). Phuket, Thailand. Retrieved February 9, 2021, from: https://www.aclweb.org/anthology/Y14-1048.pdf
Martin, A. F., Malfaz, M., Castro-González, A., Castillo, C. J., & Salichs, A. M. (2020). Four-Features Evaluation of Text to Speech Systems for Three Social Robots. Electronics, 9(2), 1-23. Retrieved February 9, 2021, from: https://www.mdpi.com/2079-9292/9/2/267/pdf
Martín, B. S. (2017). Translation Quality Assessment of Google Translate and Microsoft Bing Translator. Thesis, Universidad de Valladolid, Spain. Retrieved February 9, 2021, from: http://uvadoc.uva.es/bitstream/handle/10324/22596/TFG_F_2017_7.pdf?sequence=1&isAllowed=y
Pornpongtechavanich, P., & Daengsi, T. (2019). Video Telephony - Quality of Experience: A Simple QoE Model to Assess Video Calls Using Subjective Approach. Multimedia Tools and Applications, 78(22), 31987-32006. Retrieved February 9, 2021, from: https://link.springer.com/article/10.1007/s11042-019-07928-z
ReadSpeaker. (2020). TTS Software Use Cases. Retrieved February 9, 2021, from: https://www.readspeaker.com/tts-software-use-cases/
Shirali-Shahreza, S., & Penn, G. (2018). MOS Naturalness and the quest for human-like speech. In Proc. of IEEE SLT Workshop. (pp. 346-352). Athens, Greece. Retrieved February 9, 2021, from: https://doi.org/10.1109/SLT.2018.8639599
Siri Team. (2017). Deep Learning for Siri’s Voice: On-device Deep Mixture Density Networks for Hybrid Unit Selection Synthesis. Retrieved May 10, 2020, from: https://machinelearning.apple.com/research/siri-voices
Sriwongchai, S., Setthee, P., & Prasongsook, S. (2017). Study on Behavior of Participation in Solid Waste Management of Burapha University Sakaeo Campus’s Students and Personnel. Burapha Science Journal, 22(2), 288-299. Retrieved February 9, 2021, from: http://science.buu.ac.th/ojs246/index.php/sci/article/download/1505/1448
Sornlertlamvanich, V., Potipiti, T., Wutiwiwatchai, C., & Mittrapiyanuruk, P. (2020). The State of the Art in Thai Language Processing. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. (pp. 1-2). Bangkok, Thailand. Retrieved February 9, 2021, from: https://dl.acm.org/doi/pdf/10.3115/1075218.1075296
Wutiwiwatchai, C., Hansakunbuntheung, C., Rugchatjaroen, A., Saychum, S., Kasuriya, S., & Chootrakool, P. (2017). Thai Text-to-Speech Synthesis: A Review. Journal of Intelligent Informatics and Smart Technology, 2, 1-8. Retrieved February 9, 2021, from: https://jiist.aiat.or.th/assets/upload
Published
2022-01-10
Section
Research Article