e-ISSN 2231-8526
ISSN 0128-7680
Emmaryna Irie, Sarah Samson Juan and Suhaila Saee
Pertanika Journal of Science & Technology, Volume 31, Issue 3, April 2023
DOI: https://doi.org/10.47836/pjst.31.3.10
Keywords: Automatic speech recognition, G2P technique, grapheme-to-phoneme, pronunciation variants, under-resourced language
Published on: 7 April 2023
A pronunciation dictionary (PD) is one of the components of an Automatic Speech Recognition (ASR) system, which converts speech to text. The dictionary consists of word-phoneme pairs that map each word to its phoneme sequence for modelling and prediction. Research has shown that words can be transcribed to phoneme sequences using grapheme-to-phoneme (G2P) models, which could expedite building PDs. G2P models can be trained on seed PD data using statistical approaches, which require large amounts of data. Consequently, building a PD for an under-resourced language is a great challenge, owing to the limited grapheme and phoneme resources available for these languages. Moreover, some PDs must include pronunciation variants, including the regional accents that native speakers use. For example, a pronunciation dictionary for an ASR system in Iban, an under-resourced language from Malaysia, was recently built through a bootstrapping G2P method. However, the current Iban pronunciation dictionary has yet to include the pronunciation variants that the Ibans use. Recent studies have examined Iban pronunciation variants, but no computational methods for generating the variants are available yet. Thus, this paper reviews G2P algorithms and processes that we would use to generate pronunciation variants automatically. Specifically, we discuss data-driven techniques such as conditional random fields (CRF), joint-sequence models (JSM), and joint multigram models (JMM). These methods have been used to build PDs for Thai, Arabic, Tunisian, and Swiss German. This paper also highlights the importance of pronunciation variants and how they can affect ASR performance.
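To make the idea of word-phoneme pairs with pronunciation variants concrete, the following minimal Python sketch stores several phoneme sequences per word. It is an illustration only, not the authors' implementation: the English entries use CMUdict-style ARPAbet symbols without stress markers as stand-ins for Iban data, and the names `ENTRIES`, `build_dictionary`, and `lookup` are hypothetical.

```python
from collections import defaultdict

# Toy entries (assumption: English words in CMUdict-style ARPAbet notation,
# stress markers omitted). A real PD would hold entries for the target language.
ENTRIES = [
    ("tomato", "T AH M EY T OW"),
    ("tomato", "T AH M AA T OW"),  # second pronunciation variant
    ("either", "IY DH ER"),
    ("either", "AY DH ER"),        # second pronunciation variant
    ("speech", "S P IY CH"),
]


def build_dictionary(entries):
    """Group phoneme sequences by word, keeping each distinct variant once."""
    pron_dict = defaultdict(list)
    for word, phonemes in entries:
        sequence = tuple(phonemes.split())
        if sequence not in pron_dict[word]:
            pron_dict[word].append(sequence)
    return dict(pron_dict)


def lookup(pron_dict, word):
    """Return all known pronunciation variants of a word (empty list if unknown)."""
    return pron_dict.get(word.lower(), [])


pron_dict = build_dictionary(ENTRIES)
for variant in lookup(pron_dict, "tomato"):
    print("tomato", " ".join(variant))
```

A word missing from the dictionary returns an empty list here; in practice, that is the case a trained G2P model would handle by predicting one or more phoneme sequences for the unseen word.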