Transcript Outline
Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios
Tim Schlippe, Wolf Quaschningk, Tanja Schultz
SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages, St. Petersburg, Russia, 15-May-2014
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association, www.kit.edu

Outline
1. Motivation and Goals
2. Experimental Setup
   1. Grapheme-to-phoneme converters
   2. Data
3. Experiments and Results
   1. Single grapheme-to-phoneme converters' performance
   2. Phoneme-level combination scheme
   3. Adding web-driven grapheme-to-phoneme converters
   4. Automatic speech recognition experiments
4. Conclusion and Future Work

Motivation
• About 7,100 languages exist in the world (www.ethnologue.com), but only a few of them have speech processing systems
• Pronunciation dictionaries are needed for text-to-speech and automatic speech recognition (ASR)
• Manual production of pronunciations is slow and costly: 19.2–30 s per word for Afrikaans (Davel and Barnard, 2004)
• Automatic grapheme-to-phoneme (G2P) conversion
• But: consistent pronunciations are reached only at about 3.7k word-pronunciation pairs for training (30k phoneme tokens)
• Methods to reduce manual effort

Goals
• Common approaches use a single favorite G2P conversion tool
• Idea: use synergy effects of multiple G2P converters that
    are close in performance but differ in their errors
    and therefore provide complementary information
• Achieve pronunciations of higher quality through combination of G2P converter outputs
• Reduce manual effort in semi-automatic methods
• Impact on ASR performance
Grapheme-to-phoneme converters
G2P converters (according to (Bisani and Ney, 2008)):
• Knowledge-based
    Manual
    Rule-based: handcrafted rules, e.g. c a r s → K AX 9r S
• Data-driven
    Local classification: CART-based "t2p" (Lenzo, 1998)
    Probabilistic: graphone-based "Sequitur" (Bisani & Ney, 2008), WFST-based "Phonetisaurus" (Novak 2011), SMT-based "Moses" (Koehn, 2005)

Data
• Languages: English, German, French, Spanish (different degrees of G2P relationship)
• Dictionaries:
    English: CMU dictionary
    German, Spanish: GlobalPhone
    French: Quaero Project
• Data sets (randomly chosen): several small training data sizes to simulate low resources
    Training: 200, 500, 1k, 5k, 10k word-pronunciation pairs
    Development / test set: 10k word-pronunciation pairs (disjoint)

Analysis of Single G2P Converter Outputs
Edit distance to reference pronunciations at phoneme level (phoneme error rate, PER)
• PERs decrease with an increasing amount of training data
• The lowest PERs are achieved with Sequitur and Phonetisaurus for all languages and data sizes; for de, Moses is also very close
• For 200 en and fr W-P pairs, Rules outperforms Moses
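The PER used in this analysis can be illustrated with a minimal sketch. It assumes that PER is the summed phoneme-level edit distance divided by the total number of reference phonemes; the slides do not state the normalization, and the function names `edit_distance` and `phoneme_error_rate` are hypothetical. The sample pronunciations are the "cars" outputs from the combination example in this talk.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference phoneme
                            curr[j - 1] + 1,      # insertion of a hypothesis phoneme
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(pairs):
    """PER in percent over (reference, hypothesis) pairs of phoneme lists.

    Assumption: errors are normalized by the total reference length.
    """
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    ref_len = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / ref_len


reference = "K AA 9r ZH".split()
print(phoneme_error_rate([(reference, "K AA ZH".split())]))    # one deletion in four phonemes
print(phoneme_error_rate([(reference, "K AX 9r S".split())]))  # two substitutions in four phonemes
```

With these inputs the sketch reproduces the 25% and 50% figures listed for Phonetisaurus and the 1:1 rules in the example.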
Phoneme-level combination scheme
• Based on ROVER (Recognizer Output Voting Error Reduction) (Fiscus, 1997), which is traditionally applied at word level
• Voting module: by frequency of occurrence, since G2P confidence scores are not reliable

Example (trained with 200 W-P pairs), reference: cars K AA 9r ZH

Converter          Output        PER
Sequitur G2P       K EH 9r ZH    25%
Phonetisaurus      K AA ZH       25%
CART               K AE ZH       50%
Moses              K AA 9r S     25%
1:1 G2P (Rules)    K AX 9r S     50%
PLC output         K AA 9r ZH     0%

Phoneme-level combination
Relative PER change compared to best single converter output
• In 10 of 16 cases, the combination is equal or better
• Most improvement for de and en (the languages used in the ASR experiments)
• es (most regular G2P relationship): never any improvement

Wiktionary
• 39 Wiktionary editions with more than 1k IPA pronunciations (June 2012)
• Growth of Wiktionary entries over several years (meta.wikimedia.org/wiki/List_of_Wiktionaries)
T. Schlippe, S. Ochs, T. Schultz: Web-based tools and methods for rapid pronunciation dictionary creation, Speech Communication, vol. 56, pp. 101–118, January 2014

Wiktionary
• Additional G2P converters based on word-pronunciation pairs in Wiktionary
[Figure: internal consistency (PER %) of the Wiktionary data; data sizes shown: 4.6k, 1.5k, 3.8k, and 3.3k W-P pairs]

Data
Filtered web-derived pronunciations (WDP)
• Fully automatic methods from (Schlippe, 2012a, 2012b, 2014)
• ~15% with each filtering method

Language        Best method    unfiltWDP (PER)    filtWDP (PER)    Rel. change
English (en)    M2NAlign       33.18%             26.13%           +21.25%
French (fr)     Eps            14.96%             13.97%            +6.62%
German (de)     G2PLen         16.74%             14.17%           +15.35%
Spanish (es)    M2NAlign       10.25%             10.90%            -6.34%

Phoneme-level combination
Relative PER change compared to best single converter output
• PLC-unfiltWDP is already better than PLC-w/oWDP
• 23.1% rel. PER reduction
• Filtering web-derived pronunciations helps

ASR experiments
• Replace dictionaries in de & en recognizers with pronunciations generated with G2P converters
• Train and decode the systems
Word Error Rate (WER)
• As in the PER evaluation: Sequitur and Phonetisaurus are very good in most cases
• However: Rules results in the lowest WERs for most scenarios
• In only 1 case is PLC-w/oWDP better than or equal to the best single converter
• Filtering web-derived word-pronunciation pairs helps
• Confusion Network Combination (CNC) outperforms PLC
• In 9 cases, adding the system with PLC helps in CNC

Conclusion and Future Work
• In most cases, PLC comes closer to validated reference pronunciations than the single converters
• Web-derived word-pronunciation pairs can further improve quality (filtering the web data is helpful)
• Weighting the single G2P converters' outputs gave no improvement
    according to performance on the dev set
    according to the converters' confidences
• Potential to enhance semi-automatic pronunciation dictionary creation by reducing the human editing effort
• The positive impact of the combination in terms of lower PERs had only little influence on the WERs of our ASR systems
• Including systems with pronunciation dictionaries that have been built with PLC in CNC can lead to improvements
• Future work:
    Embedding PLC and web-derived pronunciations into the semi-automatic pronunciation dictionary creation
    Further languages and further G2P converters

Thank you for your attention!

ASR experiments
• In 6 cases, the system with PLC is better than or equal to the best single converter

Data
Filtered web-derived pronunciations
• The threshold for each filtering method depends on the mean µ and standard deviation σ of the measure in focus
• 1st-stage filtering (Len / Eps / M2NAlign) (Black et al., 1998) (Martirosian and Davel, 2007) (Schlippe, 2012a, 2012b):
    word-pronunciation pairs → prefiltering → filtered word-pronunciation pairs
• 2nd-stage filtering (G2P):
    train a "reliable" G2P model on the filtered word-pronunciation pairs → apply the G2P model to the words → edit distance < threshold → remaining word-pronunciation pairs

Phoneme-level combination scheme
Example (trained with 200 W-P pairs), reference: cars K AA 9r ZH

Alignment module (@ marks a gap):
Sequitur (25% PER)         K  EH  9r  ZH
Phonetisaurus (25% PER)    K  AA  @   ZH
CART (50% PER)             K  AE  @   ZH
Moses (25% PER)            K  AA  9r  S
1:1 G2P (50% PER)          K  AX  9r  S

Voting module (by frequency of occurrence, since G2P confidence scores are not reliable):
K (1)  AA (0.4)  9r (0.6)  ZH (0.6)

Data
Filtered WDPs, German: G2PLen
1. Len: Remove a pronunciation if the ratio of grapheme and phoneme tokens is smaller than µLen - σLen or larger than µLen + σLen.
2.1. Train G2P models with the remaining, more "reliable" W-P pairs.
2.2. Apply the G2P models to convert a grapheme string into a most likely phoneme string.
2.3. Remove a pronunciation if the edit distance between the synthesized phoneme string and the pronunciation in question is smaller than µG2P - σG2P or larger than µG2P + σG2P.
PER reduction: from 16.74 to 14.17

Data
Filtered WDPs, English and Spanish: M2NAlign
1. Perform an m-n G2P alignment (Black et al., 1998).
2. Remove a pronunciation if the alignment score is smaller than µG2P - σG2P or larger than µG2P + σG2P.
English PER: from 33.18 to 26.13 (reduction)
Spanish PER: from 10.25 to 10.90 (increase)

Data
Filtered WDPs, French: Eps (according to (Martirosian and Davel, 2007))
1. Perform a 1-1 G2P alignment (Black et al., 1998). The alignment process involves the insertion of graphemic and phonemic nulls (epsilons) into the lexical entries of words.
2. Remove a pronunciation if the proportion of graphemic and phonemic nulls is smaller than µG2P - σG2P or larger than µG2P + σG2P.
PER reduction: from 14.96 to 13.97
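The voting step of the phoneme-level combination can be sketched as follows. This is only an illustration under stated assumptions: the converter outputs are taken as already aligned, with "@" as the gap symbol as in the alignment example of the talk; the function name `vote` is hypothetical; and ties are resolved in favor of the first converter listed, which the slides do not specify.

```python
from collections import Counter

GAP = "@"  # gap symbol, as used in the alignment example


def vote(aligned):
    """Majority vote per aligned position.

    Returns the winning phonemes and their relative frequencies.
    Assumption: on a tie, the symbol seen first (earliest converter) wins.
    """
    n = len(aligned)
    phonemes, scores = [], []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        if symbol != GAP:  # a winning gap means no phoneme is emitted here
            phonemes.append(symbol)
            scores.append(count / n)
    return phonemes, scores


aligned = [
    "K EH 9r ZH".split(),  # Sequitur
    "K AA @ ZH".split(),   # Phonetisaurus
    "K AE @ ZH".split(),   # CART
    "K AA 9r S".split(),   # Moses
    "K AX 9r S".split(),   # 1:1 G2P (Rules)
]
phonemes, scores = vote(aligned)
print(" ".join(phonemes), scores)  # K AA 9r ZH [1.0, 0.4, 0.6, 0.6]
```

The frequencies match the values (1), (0.4), (0.6), (0.6) given for the voting module. The alignment itself, which ROVER builds iteratively, is not covered by this sketch.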
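The mean and standard deviation thresholds shared by the filtering methods can be sketched in a few lines, here with the Len measure. The sketch assumes the population standard deviation and an inclusive interval, neither of which is stated on the slides; the names `filter_by_mean_std` and `length_ratio` are hypothetical, and the word-pronunciation pairs are toy data invented for illustration.

```python
from statistics import mean, pstdev


def filter_by_mean_std(pairs, measure):
    """Keep the pairs whose measure lies within [mu - sigma, mu + sigma].

    Assumption: sigma is the population standard deviation over all pairs.
    """
    values = [measure(word, pron) for word, pron in pairs]
    mu, sigma = mean(values), pstdev(values)
    return [pair for pair, v in zip(pairs, values)
            if mu - sigma <= v <= mu + sigma]


def length_ratio(word, pron):
    """Len measure: number of graphemes divided by number of phonemes."""
    return len(word) / len(pron)


# Toy data, invented for illustration; the last entry is deliberately implausible.
pairs = [
    ("cars", "K AA 9r ZH".split()),
    ("car", "K AA 9r".split()),
    ("bar", "B AA 9r".split()),
    ("tar", "T AA 9r".split()),
    ("cat", "K AE T S AX N D".split()),
]
kept = filter_by_mean_std(pairs, length_ratio)
print([word for word, _ in kept])
```

The same function could be reused for the Eps, M2NAlign, and G2P stages by passing a different `measure`, for example the edit distance to the pronunciation produced by a G2P model trained on the pairs that survive the first stage.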