A Review on Grapheme-to-Phoneme Modelling Techniques to Transcribe Pronunciation Variants for Under-Resourced Language

Emmaryna Irie, Sarah Samson Juan and Suhaila Saee

Pertanika Journal of Science & Technology, Volume 31, Issue 3, April 2023

DOI: https://doi.org/10.47836/pjst.31.3.10

Keywords: Automatic speech recognition, G2P technique, grapheme-to-phoneme, pronunciation variants, under-resourced language

Published on: 7 April 2023

A pronunciation dictionary (PD) is one of the components of an automatic speech recognition (ASR) system, which converts speech to text. The dictionary consists of word-phoneme pairs that map each word to its phonetic units for modelling and prediction. Research has shown that words can be transcribed to phoneme sequences using grapheme-to-phoneme (G2P) models, which could expedite building PDs. G2P models can be trained on seed PD data using statistical approaches, but these approaches require large amounts of data. Consequently, building a PD for an under-resourced language is a great challenge, because the grapheme and phoneme systems of these languages are poorly resourced. Moreover, some PDs must include pronunciation variants, such as the regional accents that native speakers use. For example, a recent pronunciation dictionary for ASR in Iban, an under-resourced language from Malaysia, was built through a bootstrapping G2P method. However, the current Iban pronunciation dictionary does not yet include the pronunciation variants that Iban speakers use. Recent studies have examined Iban pronunciation variants, but no computational method for generating them is available yet. Thus, this paper reviews G2P algorithms and processes that could be used to generate pronunciation variants automatically. Specifically, we discuss data-driven techniques such as conditional random fields (CRF), joint-sequence models (JSM), and joint multigram models (JMM). These methods have been used to build PDs for Thai, Arabic, the Tunisian dialect, and Swiss German. This paper also highlights the importance of pronunciation variants and how they can affect ASR performance.
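To make the idea of a PD with pronunciation variants concrete, the following is a minimal Python sketch, not the method of the paper. It assumes a dictionary that stores several phoneme sequences per word and falls back to a naive one-grapheme-to-one-phoneme table for unseen words. The class name `PronunciationDictionary`, the table `TOY_G2P`, and the word forms and phoneme strings are all hypothetical placeholders and are not Iban data. A real G2P model (CRF, JSM, or JMM) would replace the table lookup with a trained statistical model.

```python
# Hypothetical one-to-one grapheme-to-phoneme table (illustrative only).
TOY_G2P = {"a": "a", "b": "b", "i": "i", "k": "k", "n": "n", "u": "u"}


class PronunciationDictionary:
    """Toy PD: each word maps to one or more phoneme sequences (variants)."""

    def __init__(self, g2p_table):
        self._entries = {}           # word -> list of phoneme tuples
        self._g2p_table = g2p_table  # fallback table for unseen words

    def add(self, word, phonemes):
        """Store a pronunciation variant, skipping exact duplicates."""
        variants = self._entries.setdefault(word, [])
        variant = tuple(phonemes)
        if variant not in variants:
            variants.append(variant)

    def lookup(self, word):
        """Return all known variants, or a naive G2P guess for unseen words."""
        if word in self._entries:
            return list(self._entries[word])
        # Unknown graphemes are passed through unchanged in this sketch.
        return [tuple(self._g2p_table.get(g, g) for g in word)]


lexicon = PronunciationDictionary(TOY_G2P)
lexicon.add("kabu", ["k", "a", "b", "u"])  # placeholder canonical form
lexicon.add("kabu", ["k", "a", "b", "o"])  # placeholder regional variant
print(lexicon.lookup("kabu"))  # both stored variants
print(lexicon.lookup("bina"))  # unseen word, naive fallback
```

Storing variants as a list per word keeps the canonical form first and lets an ASR decoder consider every listed pronunciation; the fallback only illustrates where a trained G2P model would plug in.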
