References

[1]   Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering. Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, and Bill Byrne. To appear at NeurIPS 2023. https://arxiv.org/abs/2309.17133.

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA’s retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transforms using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents. FLMR significantly improves the original RA-VQA retriever’s PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve 61% VQAscore in the OK-VQA dataset.
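
The contrast the abstract draws between single-embedding relevance scores and finer-grained multi-embedding scores can be illustrated with a minimal sketch. It is not the FLMR implementation: the ColBERT-style "MaxSim" sum is used here as a generic example of late interaction, mean pooling stands in for a single-vector encoder, and the toy 2-d token embeddings and all function names are assumptions made up for illustration.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(tokens: List[Vector]) -> Vector:
    """Collapse token embeddings into one vector (illustrative stand-in
    for a single-vector encoder; an assumption, not the paper's method)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def single_vector_score(query: Vector, doc: Vector) -> float:
    """Relevance from one embedding per query and one per document."""
    return dot(query, doc)


def late_interaction_score(query_tokens: List[Vector],
                           doc_tokens: List[Vector]) -> float:
    """MaxSim-style late interaction: match each query token embedding to
    its best document token embedding, then sum the maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical toy embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # has a close match for each query token
doc_b = [[0.5, 0.5], [0.5, 0.5]]  # only a blurred average of both

pooled_query = mean_pool(query_tokens)
score_single_a = single_vector_score(pooled_query, mean_pool(doc_a))
score_single_b = single_vector_score(pooled_query, mean_pool(doc_b))
score_late_a = late_interaction_score(query_tokens, doc_a)
score_late_b = late_interaction_score(query_tokens, doc_b)

print(score_single_a, score_single_b)  # the pooled scores are identical
print(score_late_a, score_late_b)      # late interaction ranks doc_a higher
```

In this toy case the pooled vectors of the two documents coincide, so a single-vector score cannot separate them, while the token-level score can.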

[2]   Uniform training and marginal decoding for multi-reference question-answer generation. Svitlana Vakulenko, Bill Byrne, and Adrià de Gispert. In 26th European Conference on Artificial Intelligence (ECAI 2023), October 2023. https://assets.amazon.science/d0/56/b686559b4c16a4075b2c3a4e4804/uniform-training-and-marginal-decoding-for-multi-reference-question-answer-generation.pdf.

Question generation is an important task that helps to improve question answering performance and augment search interfaces with possible suggested questions. While multiple approaches have been proposed for this task, none addresses the goal of generating a diverse set of questions given the same input context. The main reason for this is the lack of multi-reference datasets for training such models. We propose to bridge this gap by seeding a baseline question generation model with named entities as candidate answers. This allows us to automatically synthesize an unlimited number of question-answer pairs. We then propose an approach designed to leverage such multi-reference annotations, and demonstrate its advantages over the standard training and decoding strategies used in question generation. An experimental evaluation on synthetic, as well as manually annotated data shows that our approach can be used in creating a single generative model that produces a diverse set of question-answer pairs per input sentence.

[3]   Grounding description-driven dialogue state trackers with knowledge-seeking turns. Alexandru Coca, Bo-Hsiang Tseng, Jinghong Chen, Weizhe Lin, Weixuan Zhang, Tisha Anders, and Bill Byrne. In Proc. Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 444–456. Association for Computational Linguistics, September 2023. Best Long Paper Award. https://aclanthology.org/2023.sigdial-1.42.

Schema-guided dialogue state trackers can generalise to new domains without further training, yet they are sensitive to the writing style of the schemata. Augmenting the training set with human or synthetic schema paraphrases improves the model robustness to these variations but can be either costly or difficult to control. We propose to circumvent these issues by grounding the state tracking model in knowledge-seeking turns collected from the dialogue corpus as well as the schema. Including these turns in prompts during finetuning and inference leads to marked improvements in model robustness, as demonstrated by large average joint goal accuracy and schema sensitivity improvements on SGD and SGD-X.

[4]   An inner table retriever for robust table question answering. Weizhe Lin, Rexhina Blloshmi, Bill Byrne, Adria de Gispert, and Gonzalo Iglesias. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9909–9926, Toronto, Canada, July 2023. Association for Computational Linguistics. https://aclanthology.org/2023.acl-long.551.

Recent years have witnessed the thriving of pretrained Transformer-based language models for understanding semi-structured tables, with several applications, such as Table Question Answering (TableQA). These models are typically trained on joint tables and surrounding natural language text, by linearizing table content into sequences comprising special tokens and cell information. This yields very long sequences which increase system inefficiency, and moreover, simply truncating long sequences results in information loss for downstream tasks. We propose Inner Table Retriever (ITR), a general-purpose approach for handling long tables in TableQA that extracts sub-tables to preserve the most relevant information for a question. We show that ITR can be easily integrated into existing systems to improve their accuracy with up to 1.3-4.8% and achieve state-of-the-art results in two benchmarks, i.e., 63.4% in WikiTableQuestions and 92.1% in WikiSQL. Additionally, we show that ITR makes TableQA systems more robust to reduced model capacity and to different ordering of columns and rows. We make our code available at: https://github.com/amazon-science/robust-tableqa.

[5]   LI-RAGE: Late interaction retrieval augmented generation with explicit signals for open-domain table question answering. Weizhe Lin, Rexhina Blloshmi, Bill Byrne, Adria de Gispert, and Gonzalo Iglesias. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1557–1566, Toronto, Canada, July 2023. Association for Computational Linguistics. https://aclanthology.org/2023.acl-short.133.

Recent open-domain TableQA models are typically implemented as retriever-reader pipelines. The retriever component is usually a variant of the Dense Passage Retriever, which computes the similarities between questions and tables based on a single representation of each. These fixed vectors can be insufficient to capture fine-grained features of potentially very big tables with heterogeneous row/column information. We address this limitation by 1) applying late interaction models which enforce a finer-grained interaction between question and table embeddings at retrieval time. In addition, we 2) incorporate a joint training scheme of the retriever and reader with explicit table-level signals, and 3) embed a binary relevance token as a prefix to the answer generated by the reader, so we can determine at inference time whether the table used to answer the question is reliable and filter accordingly. The combined strategies set a new state-of-the-art performance on two public open-domain TableQA datasets.

[6]   xPQA: Cross-lingual product question answering in 12 languages. Xiaoyu Shen, Akari Asai, Bill Byrne, and Adria De Gispert. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 103–115, Toronto, Canada, July 2023. Association for Computational Linguistics. https://aclanthology.org/2023.acl-industry.12.

Product Question Answering (PQA) systems are key in e-commerce applications as they provide responses to customers’ questions as they shop for products. While existing work on PQA focuses mainly on English, in practice there is need to support multiple customer languages while leveraging product information available in English. To study this practical industrial task, we present xPQA, a large-scale annotated cross-lingual PQA dataset in 12 languages, and report results in (1) candidate ranking, to select the best English candidate containing the information to answer a non-English question; and (2) answer generation, to generate a natural-sounding non-English answer based on the selected English candidate. We evaluate various approaches involving machine translation at runtime or offline, leveraging multilingual pre-trained LMs, and including or excluding xPQA training data. We find that in-domain data is essential as cross-lingual rankers trained on other domains perform poorly on the PQA task, and that translation-based approaches are most effective for candidate ranking while multilingual finetuning works best for answer generation. Still, there remains a significant performance gap between the English and the cross-lingual test sets.

[7]   More robust schema-guided dialogue state tracking via tree-based paraphrase ranking. Alexandru Coca, Bo-Hsiang Tseng, Weizhe Lin, and Bill Byrne. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1443–1454, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. https://aclanthology.org/2023.findings-eacl.106.

The schema-guided paradigm overcomes scalability issues inherent in building task-oriented dialogue (TOD) agents with static ontologies. Rather than operating on dialogue context alone, agents have access to hierarchical schemas containing task-relevant natural language descriptions. Fine-tuned language models excel at schema-guided dialogue state tracking (DST) but are sensitive to the writing style of the schemas. We explore methods for improving the robustness of DST models. We propose a framework for generating synthetic schemas which uses tree-based ranking to jointly optimise lexical diversity and semantic faithfulness. The robust generalisation of strong baselines is improved when augmenting their training data with prompts generated by our framework, as demonstrated by marked improvements in average Joint Goal Accuracy (JGA) and schema sensitivity (SS) on the SGD-X benchmark.

[8]   FVQA 2.0: Introducing adversarial samples into fact-based visual question answering. Weizhe Lin, Zhilin Wang, and Bill Byrne. In Findings of the Association for Computational Linguistics: EACL 2023, pages 149–157, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. https://aclanthology.org/2023.findings-eacl.11.

The widely used Fact-based Visual Question Answering (FVQA) dataset contains visually-grounded questions that require information retrieval using common sense knowledge graphs to answer. It has been observed that the original dataset is highly imbalanced and concentrated on a small portion of its associated knowledge graph. We introduce FVQA 2.0 which contains adversarial variants of test questions to address this imbalance. We show that systems trained with the original FVQA train sets can be vulnerable to adversarial samples and we demonstrate an augmentation scheme to reduce this vulnerability without human annotations.

[9]   Neural ranking with weak supervision for open-domain question answering: A survey. Xiaoyu Shen, Svitlana Vakulenko, Marco del Tredici, Gianni Barlacchi, Bill Byrne, and Adria de Gispert. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1736–1750, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. https://aclanthology.org/2023.findings-eacl.129.

Neural ranking (NR) has become a key component for open-domain question-answering in order to access external knowledge. However, training a good NR model requires substantial amounts of relevance annotations, which is very costly to scale. To address this, a growing body of research works have been proposed to reduce the annotation cost by training the NR model with weak supervision (WS) instead. These works differ in what resources they require and employ a diverse set of WS signals to train the model. Understanding such differences is crucial for choosing the right WS technique. To facilitate this understanding, we provide a structured overview of standard WS signals used for training a NR model. Based on their required resources, we divide them into three main categories: (1) only documents are needed; (2) documents and questions are needed; and (3) documents and question-answer pairs are needed. For every WS signal, we review its general idea and choices. Promising directions are outlined for future research.

[10]   FocusQA: Open-domain question answering with a context in focus. Gianni Barlacchi, Ivano Lauriola, Marco Del Tredici, Xiaoyu Shen, Thuy Vu, Bill Byrne, Adrià de Gispert, and Alessandro Moschitti. In Findings of EMNLP 2022, 2022. https://www.amazon.science/publications/focusqa-open-domain-question-answering-with-a-context-in-focus.

We introduce question answering with a context in focus, a task that simulates a free interaction with a QA system. The user reads on a screen some information about a topic and they can follow-up with questions that can be either related or not to the topic; and the answer can be found in the document containing the screen content or from other pages. We call such information context. To study the task, we construct FocusQA, a dataset for answer sentence selection (AS2) with 12,165 unique (question, context) pairs and a total of 109,940 answers. To build the dataset, we developed a novel methodology that takes existing questions and pairs them with relevant contexts. To show the benefits of this approach, we present a comparative analysis with a set of questions written by humans after reading the context, showing that our approach greatly helps in eliciting more realistic (question, context) pairs. Finally, we show that the task poses several challenges for incorporating contextual information. In this respect, we introduce strong baselines for answer sentence selection that outperform the precision of state-of-the-art models for AS2 up to 21.3

[11]   The teacher-student chatroom corpus version 2: more lessons, new annotation, automatic detection of sequence shifts. Andrew Caines, Helen Yannakoudakis, Helen Allen, Pascual Pérez-Paredes, Bill Byrne, and Paula Buttery. In Proceedings of Natural Language Processing for Computer-Assisted Language Learning (NLP4CALL), 2022.

The first version of the Teacher-Student Chatroom Corpus (TSCC) was released in 2020 and contained 102 chatroom dialogues between 2 teachers and 8 learners of English, amounting to 13.5K conversational turns and 133K word tokens. In this second version of the corpus, we release an additional 158 chatroom dialogues, amounting to an extra 27.9K conversational turns and 230K word tokens. In total there are now 260 chatroom lessons, 41.4K conversational turns and 363K word tokens, involving 2 teachers and 13 students with seven different first languages. The content of the lessons was, as before, guided by the teacher, and the proficiency level of the learners is judged to range from B1 to C2 on the CEFR scale. Annotation of the dialogues continued with conversational analysis of sequence types, pedagogical focus, and correction of grammatical errors. In addition, we have annotated fifty of the dialogues using the Self-Evaluation of Teacher Talk framework which is intended for self-reflection on interactional aspects of language teaching. Finally, we conducted machine learning experiments to automatically detect shifts in discourse sequences from turn to turn, using modern transfer learning methods with large pre-trained language models. The TSCC v2 is freely available for research use.

[12]   Retrieval augmented visual question answering with outside knowledge. Weizhe Lin and Bill Byrne. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. https://arxiv.org/abs/2210.03809.

Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA task that requires retrieval of external knowledge to answer questions about images. Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve documents from external knowledge bases, such as Wikipedia, but with DPR trained separately from answer generation, introducing a potential limit on the overall system performance. Instead, we propose a joint training scheme which includes differentiable DPR integrated with answer generation so that the system can be trained in an end-to-end fashion. Our experiments show that our scheme outperforms recent OK-VQA systems with strong DPR for retrieval. We also introduce new diagnostic metrics to analyze how retrieval and generation interact. The strong retrieval ability of our model significantly reduces the number of retrieved documents needed in training, yielding significant benefits in answer quality and computation required for training.

[13]   Combining unstructured content and knowledge graphs into recommendation datasets. Weizhe Lin, Linjun Shou, Ming Gong, Jian Pei, Zhilin Wang, Bill Byrne, and Daxin Jiang. In Proceedings of the RecSys 2022: Fourth Knowledge-aware and Conversational Recommender Systems Workshop (KaRS), 2022. https://ceur-ws.org/Vol-3294/short5.pdf.

Popular book and movie recommendation datasets can be associated with Knowledge Graphs (KG) that enable the development of KG-based recommender systems. However, most of these approaches are based on Collaborative Filtering, leaving Content-based Filtering approaches unexploited. This is partially due to the lack of content-based information (e.g. summary texts of movies and books) in datasets. To facilitate the research in achieving both KG-aware and content-aware recommender systems, we contribute to public domain resources through the creation of a large-scale Movie-KG dataset and an extension of the already public Amazon-Book dataset through incorporation of text descriptions crawled from external sources. Both datasets provide descriptive texts that enable recommendations based on unstructured content. We provide benchmark results as well as showing the value of the content-based information in making recommendations.

[14]   Transformer-empowered content-aware collaborative filtering. Weizhe Lin, Linjun Shou, Ming Gong, Jian Pei, Zhilin Wang, Bill Byrne, and Daxin Jiang. In Proceedings of the RecSys 2022: Fourth Knowledge-aware and Conversational Recommender Systems Workshop (KaRS), 2022. https://ceur-ws.org/Vol-3294/long3.pdf.

Knowledge graph (KG) based Collaborative Filtering (CF) is an effective approach to personalize recommender systems for relatively static domains such as movies and books, by leveraging structured information from KG to enrich both item and user representations. This paper investigates the complementary power of unstructured content information (e.g. rich summary texts of items) in KG-based CF recommender systems. We introduce Content-aware KG-enhanced Meta-preference Networks that enhances the CF recommendation based on both structured information from KG as well as unstructured content features based on Transformer-empowered content-based filtering (CBF). Within this modeling framework, we demonstrate a powerful KG-based CF model and a CBF model (a variant of the well-known NRMS system) and employ a novel training scheme, Cross-System Contrastive Learning, to address the inconsistency of the two very different systems in fusing information. We present experimental results showing that enhancing collaborative filtering with Transformer-based features derived from content-based filtering offers new improvements relative to strong baseline systems, improving the ability of KG-based CF systems to exploit item content information.

[15]   The devil is in the details: On the pitfalls of vocabulary selection in neural machine translation. Tobias Domhan, Eva Hasler, Ke Tran, Sony Trenous, Bill Byrne, and Felix Hieber. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1861–1874, Seattle, United States, July 2022. Association for Computational Linguistics. https://aclanthology.org/2022.naacl-main.136.

Vocabulary selection, or lexical shortlisting, is a well-known technique to improve latency of Neural Machine Translation models by constraining the set of allowed output words during inference. The chosen set is typically determined by separately trained alignment model parameters, independent of the source-sentence context at inference time. While vocabulary selection appears competitive with respect to automatic quality metrics in prior work, we show that it can fail to select the right set of output words, particularly for semantically non-compositional linguistic phenomena such as idiomatic expressions, leading to reduced translation quality as perceived by humans. Trading off latency for quality by increasing the size of the allowed set is often not an option in real-world scenarios. We propose a model of vocabulary selection, integrated into the neural translation model, that predicts the set of allowed output words from contextualized encoder representations. This restores translation quality of an unconstrained system, as measured by human evaluations on WMT newstest2020 and idiomatic expressions, at an inference latency competitive with alignment-based selection using aggressive thresholds, thereby removing the dependency on separately trained alignment models.

[16]   uFACT: Unfaithful alien-corpora training for semantically consistent data-to-text generation. Tisha Anders, Alexandru Coca, and Bill Byrne. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2836–2841, Dublin, Ireland, May 2022. Association for Computational Linguistics. https://aclanthology.org/2022.findings-acl.223.

We propose uFACT (Un-Faithful Alien Corpora Training), a training corpus construction method for data-to-text (d2t) generation models. We show that d2t models trained on uFACT datasets generate utterances which represent the semantic content of the data sources more accurately compared to models trained on the target corpus alone. Our approach is to augment the training set of a given target corpus with alien corpora which have different semantic representations. We show that while it is important to have faithful data from the target corpus, the faithfulness of additional corpora only plays a minor role. Consequently, uFACT datasets can be constructed with large quantities of unfaithful data. We show how uFACT can be leveraged to obtain state-of-the-art results on the WebNLG benchmark using METEOR as our performance metric. Furthermore, we investigate the sensitivity of the generation faithfulness to the training corpus structure using the PARENT metric, and provide a baseline for this metric on the WebNLG (Gardent et al., 2017) benchmark to facilitate comparisons with future work.

[17]   From rewriting to remembering: Common ground for conversational QA models. Marco Del Tredici, Xiaoyu Shen, Gianni Barlacchi, Bill Byrne, and Adrià de Gispert. In Proceedings of the 4th Workshop on NLP for Conversational AI, pages 70–76, Dublin, Ireland, May 2022. Association for Computational Linguistics. https://aclanthology.org/2022.nlp4convai-1.7.

In conversational QA, models have to leverage information in previous turns to answer upcoming questions. Current approaches, such as Question Rewriting, struggle to extract relevant information as the conversation unwinds. We introduce the Common Ground (CG), an approach to accumulate conversational information as it emerges and select the relevant information at every turn. We show that CG offers a more efficient and human-like way to exploit conversational information compared to existing approaches, leading to improvements on Open Domain Conversational QA.

[18]   First the worst: Finding better gender translations during beam search. Danielle Saunders, Rosie Sallis, and Bill Byrne. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3814–3823, Dublin, Ireland, May 2022. Association for Computational Linguistics. https://aclanthology.org/2022.findings-acl.301.

Generating machine translations via beam search seeks the most likely output under a model. However, beam search has been shown to amplify demographic biases exhibited by a model. We aim to address this, focusing on gender bias resulting from systematic errors in grammatical gender translation. Almost all prior work on this problem adjusts the training data or the model itself. By contrast, our approach changes only the inference procedure. We constrain beam search to improve gender diversity in n-best lists, and rerank n-best lists using gender features obtained from the source sentence. Combining these strongly improves WinoMT gender translation accuracy for three language pairs without additional bilingual data or retraining. We also demonstrate our approach’s utility for consistently gendering named entities, and its flexibility to handle new gendered language beyond the binary.

[19]   Product answer generation from heterogeneous sources: A new benchmark and best practices. Xiaoyu Shen, Gianni Barlacchi, Marco Del Tredici, Weiwei Cheng, Bill Byrne, and Adrià de Gispert. In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5), pages 99–110, Dublin, Ireland, May 2022. Association for Computational Linguistics. https://aclanthology.org/2022.ecnlp-1.13.

It is of great value to answer product questions based on heterogeneous information sources available on web product pages, e.g., semi-structured attributes, text descriptions, user-provided contents, etc. However, these sources have different structures and writing styles, which poses challenges for (1) evidence ranking, (2) source selection, and (3) answer generation. In this paper, we build a benchmark with annotations for both evidence selection and answer generation covering 6 information sources. Based on this benchmark, we conduct a comprehensive study and present a set of best practices. We show that all sources are important and contribute to answering questions. Handling all sources within one single model can produce comparable confidence scores across sources and combining multiple sources for training always helps, even for sources with totally different structures. We further propose a novel data augmentation method to iteratively create training samples for answer generation, which achieves close-to-human performance with only a few thousand annotations. Finally, we perform an in-depth error analysis of model predictions and highlight the challenges for future research.

[20]   semiPQA: A study on product question answering over semi-structured data. Xiaoyu Shen, Gianni Barlacchi, Marco Del Tredici, Weiwei Cheng, and Adrià de Gispert. In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5), pages 111–120, Dublin, Ireland, May 2022. Association for Computational Linguistics. https://aclanthology.org/2022.ecnlp-1.14.

Product question answering (PQA) aims to automatically address customer questions to improve their online shopping experience. Current research mainly focuses on finding answers from either unstructured text, like product descriptions and user reviews, or structured knowledge bases with pre-defined schemas. Apart from the above two sources, a lot of product information is represented in a semi-structured way, e.g., key-value pairs, lists, tables, json and xml files, etc. These semi-structured data can be a valuable answer source since they are better organized than free text, while being easier to construct than structured knowledge bases. However, little attention has been paid to them. To fill in this blank, here we study how to effectively incorporate semi-structured answer sources for PQA and focus on presenting answers in a natural, fluent sentence. To this end, we present semiPQA: a dataset to benchmark PQA over semi-structured data. It contains 11,243 written questions about json-formatted data covering 320 unique attribute types. Each data point is paired with manually-annotated text that describes its contents, so that we can train a neural answer presenter to present the data in a natural way. We provide baseline results and a deep analysis on the successes and challenges of leveraging semi-structured data for PQA. In general, state-of-the-art neural models can perform remarkably well when dealing with seen attribute types. For unseen attribute types, however, a noticeable drop is observed for both answer presentation and attribute ranking.

[21]   Improving the quality trade-off for neural machine translation multi-domain adaptation. Eva Hasler, Tobias Domhan, Jonay Trenous, Ke Tran, Bill Byrne, and Felix Hieber. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, page (9 pages). Association for Computational Linguistics, 2021. https://aclanthology.org/2021.emnlp-main.666.

Building neural machine translation systems to perform well on a specific target domain is a well-studied problem. Optimizing system performance for multiple, diverse target domains however remains a challenge. We study this problem in an adaptation setting where the goal is to preserve the existing system quality while incorporating data for domains that were not the focus of the original translation system. We find that we can improve over the performance trade-off offered by Elastic Weight Consolidation with a relatively simple data mixing strategy. At comparable performance on the new domains, catastrophic forgetting is mitigated significantly on strong WMT baselines. Combining both approaches improves the Pareto frontier on this task.

[22]   Knowledge-aware graph-enhanced GPT-2 for dialogue state tracking. Weizhe Lin, Bo-Hsiang Tseng, and Bill Byrne. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, page (11 pages), 2021. https://aclanthology.org/2021.emnlp-main.620.

Dialogue State Tracking is central to multi-domain task-oriented dialogue systems, responsible for extracting information from user utterances. We present a novel hybrid architecture that augments GPT-2 with representations derived from Graph Attention Networks in such a way to allow causal, sequential prediction of slot values. The model architecture captures inter-slot relationships and dependencies across domains that otherwise can be lost in sequential prediction. We report improvements in state tracking performance in MultiWOZ 2.0 against a strong GPT-2 baseline and investigate a simplified sparse training scenario in which DST models are trained only on session-level annotations but evaluated at the turn level. We further report detailed analyses to demonstrate the effectiveness of graph models in DST by showing that the proposed graph modules capture inter-slot dependencies and improve the predictions of values that are common to multiple domains.

[23]   Transferable dialogue systems and user simulators. Bo-Hsiang Tseng, Yinpei Dai, Florian Kreyssig, and Bill Byrne. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page (14 pages), 2021. https://aclanthology.org/2021.acl-long.13.

One of the difficulties in training dialogue systems is the lack of training data. We explore the possibility of creating dialogue data through the interaction between a dialogue system and a user simulator. Our goal is to develop a modelling framework that can incorporate new dialogue scenarios through self-play between the two agents. In this framework, we first pre-train the two agents on a collection of source domain dialogues, which equips the agents to converse with each other via natural language. With further fine-tuning on a small amount of target domain data, the agents continue to interact with the aim of improving their behaviors using reinforcement learning with structured reward functions. In experiments on the MultiWOZ dataset, two practical transfer learning problems are investigated: 1) domain adaptation and 2) single-to-multiple domain transfer. We demonstrate that the proposed framework is highly effective in bootstrapping the performance of the two agents in transfer learning. We also show that our method leads to improvements in dialogue system performance on complete datasets.

[24]   GCDF1: A goal- and context- driven F-score for evaluating user models. Alexandru Coca, Bo-Hsiang Tseng, and Bill Byrne. In The First Workshop on Evaluations and Assessments of Neural Conversation Systems, pages 7–14. Association for Computational Linguistics, November 2021. https://aclanthology.org/2021.eancs-1.2.

The evaluation of dialogue systems in interaction with simulated users has been proposed to improve turn-level, corpus-based metrics which can only evaluate test cases encountered in a corpus and cannot measure a system’s ability to sustain multi-turn interactions. Recently, little emphasis was put on automatically assessing the quality of the user model itself, so unless correlations with human studies are measured, the reliability of user model based evaluation is unknown. We propose GCDF1, a simple but effective measure of the quality of semantic-level conversations between a goal-driven user agent and a system agent. In contrast with previous approaches we measure the F-score at dialogue level and consider user and system behaviours to improve recall and precision estimation. We facilitate score interpretation by providing a rich hierarchical structure with information about conversational patterns present in the test data and tools to efficiently query the conversations generated. We apply our framework to assess the performance and weaknesses of a Convlab2 user model.

[25]   The practical ethics of bias reduction in machine translation: why domain adaptation is better than data debiasing. M. Tomalin, B. Byrne, S. Concannon, D. Saunders, and S. Ullman. Ethics and Information Technology, March 2021. Published online 6 March 2021 (15 pages). https://link.springer.com/article/10.1007/s10676-021-09583-1.

This article probes the practical ethical implications of AI system design by reconsidering the important topic of bias in the datasets used to train autonomous intelligent systems. The discussion draws on recent work concerning behaviour-guiding technologies, and it adopts a cautious form of technological utopianism by assuming it is potentially beneficial for society at large if AI systems are designed to be comparatively free from the biases that characterise human behaviour. However, the argument presented here critiques the common well-intentioned requirement that, in order to achieve this, all such datasets must be debiased prior to training. By focusing specifically on gender-bias in Neural Machine Translation (NMT) systems, three automated strategies for the removal of bias are considered - downsampling, upsampling, and counterfactual augmentation - and it is shown that systems trained on datasets debiased using these approaches all achieve general translation performance that is much worse than a baseline system. In addition, most of them also achieve worse performance in relation to metrics that quantify the degree of gender bias in the system outputs. By contrast, it is shown that the technique of domain adaptation can be effectively deployed to debias existing NMT systems after they have been fully trained. This enables them to produce translations that are quantitatively far less biased when analysed using gender-based metrics, but which also achieve state-of-the-art general performance. It is hoped that the discussion presented here will reinvigorate ongoing debates about how and why bias can be most effectively reduced in state-of-the-art AI systems.

[26]   The teacher-student chatroom corpus. A. Caines, H. Yannakoudakis, H. Edmondson, H. Allen, Pascual Pérez-Paredes, W. Byrne, and P. Buttery. In Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning, page (11 pages), 2020. https://www.aclweb.org/anthology/2020.nlp4call-1.2.

The Teacher-Student Chatroom Corpus (TSCC) is a collection of written conversations captured during one-to-one lessons between teachers and learners of English. The lessons took place in an online chatroom and therefore involve more interactive, immediate and informal language than might be found in asynchronous exchanges such as email correspondence. The fact that the lessons were one-to-one means that the teacher was able to focus exclusively on the linguistic abilities and errors of the student, and to offer personalised exercises, scaffolding and correction. The TSCC contains more than one hundred lessons between two teachers and eight students, amounting to 13.5K conversational turns and 133K words: it is freely available for research use. We describe the corpus design, data collection procedure and annotations added to the text. We perform some preliminary descriptive analyses of the data and consider possible uses of the TSCC.

[27]   Semi-supervised pre-training and back-translation fine-tuning for translation of formal and informal text. T. Domhan, J. Trenous, and W. Byrne. In Amazon Machine Learning Conference Workshop on Semi-supervised Learning, page (4 pages), 2020.

[28]   Reducing gender bias in neural machine translation as a domain adaptation problem. D. Saunders and B. Byrne. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, page (9 pages), 2020. https://www.aclweb.org/anthology/2020.acl-main.690.

Training data for NLP tasks often exhibits gender bias in that fewer sentences refer to women than to men. In Neural Machine Translation (NMT) gender bias has been shown to reduce translation quality, particularly when the target language has grammatical gender. The recent WinoMT challenge set allows us to measure this effect directly (Stanovsky et al., 2019). Ideally we would reduce system bias by simply debiasing all data prior to training, but achieving this effectively is itself a challenge. Rather than attempt to create a ‘balanced’ dataset, we use transfer learning on a small set of trusted, gender-balanced examples. This approach gives strong and consistent improvements in gender debiasing with much less computational cost than training from scratch. A known pitfall of transfer learning on new domains is ‘catastrophic forgetting’, which we address at adaptation and inference time. During adaptation we show that Elastic Weight Consolidation allows a performance trade-off between general translation quality and bias reduction. At inference time we propose a lattice-rescoring scheme which outperforms all systems evaluated in Stanovsky et al., 2019 on WinoMT with no degradation of general test set BLEU. We demonstrate our approach translating from English into three languages with varied linguistic properties and data availability.

[29]   Addressing exposure bias with document minimum risk training: Cambridge at the WMT20 biomedical translation task. D. Saunders and W. Byrne. In Proceedings of the Fifth Conference on Machine Translation, page (8 pages), 2020. https://www.aclweb.org/anthology/2020.wmt-1.94.

The 2020 WMT Biomedical translation task evaluated Medline abstract translations. This is a small-domain translation task, meaning limited relevant training data with very distinct style and vocabulary. Models trained on such data are susceptible to exposure bias effects, particularly when training sentence pairs are imperfect translations of each other. This can result in poor behaviour during inference if the model learns to neglect the source sentence. The UNICAM entry addresses this problem during fine-tuning using a robust variant on Minimum Risk Training. We contrast this approach with data-filtering to remove ‘problem’ training examples. Under MRT fine-tuning we obtain good results for both directions of English-German and English-Spanish biomedical translation. In particular we achieve the best English-to-Spanish translation result and second-best Spanish-to-English result, despite using only single models with no ensembling.

[30]   Inference-only sub-character decomposition improves translation of unseen logographic characters. D. Saunders, W. Feely, and W. Byrne. In Proceedings of the 7th Workshop on Asian Translation, page (8 pages), 2020. https://www.aclweb.org/anthology/2020.wat-1.21.

Neural Machine Translation (NMT) on logographic source languages struggles when translating ‘unseen’ characters, which never appear in the training data. One possible approach to this problem uses sub-character decomposition for training and test sentences. However, this approach involves complete retraining, and its effectiveness for unseen character translation to non-logographic languages has not been fully explored. We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT, for both high-resource and low-resource domains. For each language pair and domain we construct a test set where all source sentences contain at least one unseen logographic character. We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally. We offer a simple alternative based on decomposition before inference for unseen characters only. Our approach allows flexible application, achieving translation adequacy improvements and requiring no additional models or training.

[31]   Neural machine translation doesn’t translate gender coreference right unless you make it. D. Saunders, R. Sallis, and B. Byrne. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, page (9 pages), 2020. https://www.aclweb.org/anthology/2020.gebnlp-1.4.

Neural Machine Translation (NMT) has been shown to struggle with grammatical gender that is dependent on the gender of human referents, which can cause gender bias effects. Many existing approaches to this problem seek to control gender inflection in the target language by explicitly or implicitly adding a gender feature to the source sentence, usually at the sentence level. In this paper we propose schemes for incorporating explicit word-level gender inflection tags into NMT. We explore the potential of this gender-inflection controlled translation when the gender feature can be determined from a human reference, or when a test sentence can be automatically gender-tagged, assessing on English-to-Spanish and English-to-German translation. We find that simple existing approaches can over-generalize a gender-feature to multiple entities in a sentence, and suggest effective alternatives in the form of tagged coreference adaptation data. We also propose an extension to assess translations of gender-neutral entities from English given a corresponding linguistic convention, such as a non-binary inflection, in the target language.

[32]   Using context in neural machine translation training objectives. D. Saunders, F. Stahlberg, and B. Byrne. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, page (5 pages), 2020. https://www.aclweb.org/anthology/2020.acl-main.693.

We present Neural Machine Translation (NMT) training using document-level metrics with batch-level documents. Previous sequence-objective approaches to NMT training focus exclusively on sentence-level metrics like sentence BLEU which do not correspond to the desired evaluation metric, typically document BLEU. Meanwhile research into document-level NMT training focuses on data or model architecture rather than training procedure. We find that each of these lines of research has a clear space in it for the other, and propose merging them with a scheme that allows a document-level evaluation metric to be used in the NMT training objective. We first sample pseudo-documents from sentence samples. We then approximate the expected document BLEU gradient with Monte Carlo sampling for use as a cost function in Minimum Risk Training (MRT). This two-level sampling procedure gives NMT performance gains over sequence MRT and maximum-likelihood training. We demonstrate that training is more robust for document-level metrics than with sequence metrics. We further demonstrate improvements on NMT with TER and Grammatical Error Correction (GEC) using GLEU, both metrics used at the document level for evaluations.

[33]   Domain adaptive inference for neural machine translation. D. Saunders, F. Stahlberg, A. de Gispert, and W. Byrne. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, page (8 pages), 2019. https://www.aclweb.org/anthology/P19-1022.

We investigate adaptive ensemble weighting for Neural Machine Translation, addressing the case of improving performance on a new and potentially unknown domain without sacrificing performance on the original domain. We adapt sequentially across two Spanish-English and three English-German tasks, comparing unregularized fine-tuning, L2 and Elastic Weight Consolidation. We then report a novel scheme for adaptive NMT ensemble decoding by extending Bayesian Interpolation with source information, and report strong improvements across test domains without access to the domain label.

[34]   UCAM biomedical translation at WMT19: Transfer learning multi-domain ensembles. D. Saunders, F. Stahlberg, A. de Gispert, and W. Byrne. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), page (5 pages), 2019. https://www.aclweb.org/anthology/W19-5421.

The 2019 WMT Biomedical translation task involved translating Medline abstracts. We approached this using transfer learning to obtain a series of strong neural models on distinct domains, and combining them into multi-domain ensembles. We further experimented with an adaptive language-model ensemble weighting scheme. Our submission achieved the best submitted results on both directions of English-Spanish.

[35]   Neural grammatical error correction with finite state transducers. F. Stahlberg, C. Bryant, and W. Byrne. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), page (5 pages), 2019. https://www.aclweb.org/anthology/N19-1406.

Grammatical error correction (GEC) is one of the areas in natural language processing in which purely neural models have not yet superseded more traditional symbolic models. Hybrid systems combining phrase-based statistical machine translation (SMT) and neural sequence models are currently among the most effective approaches to GEC. However, both SMT and neural sequence-to-sequence models require large amounts of annotated data. Language model based GEC (LM-GEC) is a promising alternative which does not rely on annotated training data. We show how to improve LM-GEC by applying modelling techniques based on finite state transducers. We report further gains by rescoring with neural language models. We show that our methods developed for LM-GEC can also be used with SMT systems if annotated training data is available. Our best system outperforms the best published result on the CoNLL-2014 test set, and achieves far better relative improvements over the SMT baselines than previous hybrid systems.

[36]   The CUED’s grammatical error correction systems for BEA-2019. F. Stahlberg and W. Byrne. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, page (8 pages). Association for Computational Linguistics, 2019. https://www.aclweb.org/anthology/W19-4417.

We describe two entries from the Cambridge University Engineering Department to the BEA 2019 Shared Task on grammatical error correction. Our submission to the low-resource track is based on prior work on using finite state transducers together with strong neural language models. Our system for the restricted track is a purely neural system consisting of neural language models and neural machine translation models trained with back-translation and a combination of checkpoint averaging and fine-tuning – without the help of any additional tools like spell checkers. The latter system has been used inside a separate system combination entry in cooperation with the Cambridge University Computer Lab.

[37]   On NMT search errors and model errors: Cat got your tongue? F. Stahlberg and W. Byrne. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), page (7 pages), 2019. https://www.aclweb.org/anthology/D19-1331.

We report on search errors and model errors in neural machine translation (NMT). We present an exact inference procedure for neural sequence models based on a combination of beam search and depth-first search. We use our exact search to find the global best model scores under a Transformer base model for the entire WMT15 English-German test set. Surprisingly, beam search fails to find these global best model scores in most cases, even with a very large beam size of 100. For more than 50% of the sentences, the model in fact assigns its global best score to the empty translation, revealing a massive failure of neural models in properly accounting for adequacy. We show by constraining search with a minimum translation length that at the root of the problem of empty translations lies an inherent bias towards shorter translations. We conclude that vanilla NMT in its current form requires just the right amount of beam search errors, which, from a modelling perspective, is a highly unsatisfactory conclusion indeed, as the model often prefers an empty translation.

[38]   CUED@WMT19:EWC&LMs. F. Stahlberg, D. Saunders, A. de Gispert, and W. Byrne. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), page (9 pages). Association for Computational Linguistics, 2019. https://www.aclweb.org/anthology/W19-5340.

Two techniques provide the fabric of the Cambridge University Engineering Department’s (CUED) entry to the WMT19 evaluation campaign: elastic weight consolidation (EWC) and different forms of language modelling (LMs). We report substantial gains by fine-tuning very strong baselines on former WMT test sets using a combination of checkpoint averaging and EWC. A sentence-level Transformer LM and a document-level LM based on a modified Transformer architecture yield further gains. As in previous years, we also extract n-gram probabilities from SMT lattices which can be seen as a source-conditioned n-gram LM.

[39]   Semi-supervised bootstrapping of dialogue state trackers for task-oriented modelling. B.-H. Tseng, M. Rei, P. Budzianowski, R. Turner, B. Byrne, and A. Korhonen. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), page (8 pages), 2019. https://www.aclweb.org/anthology/D19-1125.

Dialogue systems benefit greatly from optimizing on detailed annotations, such as transcribed utterances, internal dialogue state representations and dialogue act labels. However, collecting these annotations is expensive and time-consuming, holding back development in the area of dialogue modelling. In this paper, we investigate semi-supervised learning methods that are able to reduce the amount of required intermediate labelling. We find that by leveraging un-annotated data instead, the amount of turn-level annotations of dialogue state can be significantly reduced when building a neural dialogue system. Our analysis on the MultiWOZ corpus, covering a range of domains and topics, finds that annotations can be reduced by up to 30% while maintaining equivalent system performance. We also describe and evaluate the first end-to-end dialogue model created for the MultiWOZ corpus.

[40]   Neural and FST-based approaches to grammatical error correction. Z. Yuan, F. Stahlberg, M. Rei, W. Byrne, and H. Yannakoudakis. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, page (11 pages). Association for Computational Linguistics, 2019. https://www.aclweb.org/anthology/W19-4424.

In this paper, we describe our submission to the BEA 2019 shared task on grammatical error correction. We present a system pipeline that utilises both error detection and correction models. The input text is first corrected by two complementary neural machine translation systems: one using convolutional networks and multi-task learning, and another using a neural Transformer-based system. Training is performed on publicly available data, along with artificial examples generated through back-translation. The n-best lists of these two machine translation systems are then combined and scored using a finite state transducer (FST). Finally, an unsupervised re-ranking system is applied to the n-best output of the FST. The re-ranker uses a number of error detection features to re-rank the FST n-best list and identify the final 1-best correction hypothesis. Our system achieves 66.75% F0.5 on error correction (ranking 4th), and 82.52% F0.5 on token-level error detection (ranking 2nd) in the restricted track of the shared task.

[41]   Neural machine translation decoding with terminology constraints. E. Hasler, A. de Gispert, G. Iglesias, and W. Byrne. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2018), page (7 pages), 2018. https://www.aclweb.org/anthology/N18-2081.

Despite the impressive quality improvements yielded by neural machine translation (NMT) systems, controlling their translation output to adhere to user-provided terminology constraints remains an open problem. We describe our approach to constrained neural decoding based on finite-state machines and multi-stack decoding which supports target-side constraints as well as constraints with corresponding aligned input text spans. We demonstrate the performance of our framework on multiple translation tasks and motivate the need for constrained decoding with attentions as a means of reducing misplacement and duplication when translating user constraints.

[42]   Accelerating NMT batched beam decoding with LMBR posteriors for deployment. G. Iglesias, W. Tambellini, A. de Gispert, and W. Byrne. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2018), page (8 pages), 2018. https://arxiv.org/abs/1804.11324.

We describe a batched beam decoding algorithm for NMT with LMBR n-gram posteriors, showing that LMBR techniques still yield gains on top of the best recently reported results with Transformers. We also discuss acceleration strategies for deployment, and the effect of the beam size and batching on memory and speed.

[43]   Multi-representation ensembles and delayed SGD updates improve syntax-based NMT. D. Saunders, F. Stahlberg, A. de Gispert, and W. Byrne. In 56th Annual Meeting of the Association for Computational Linguistics, page (7 pages), 2018. https://arxiv.org/abs/1805.00456.

We explore strategies for incorporating target syntax into Neural Machine Translation. We specifically focus on syntax in ensembles containing multiple sentence representations. We formulate beam search over such ensembles using WFSTs, and describe a delayed SGD update training procedure that is especially effective for long representations like linearized syntax. Our approach gives state-of-the-art performance on a difficult Japanese-English task.

[44]   An operation sequence model for explainable neural machine translation. F. Stahlberg, D. Saunders, and W. Byrne. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, page (11 pages), Brussels, Belgium, November 2018. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-5420.

We propose to achieve explainable neural machine translation (NMT) by changing the output representation to explain itself. We present a novel approach to NMT which generates the target sentence by monotonically walking through the source sentence. Word reordering is modeled by operations which allow setting markers in the target sentence and move a target-side write head between those markers. In contrast to many modern neural models, our system emits explicit word alignment information which is often crucial to practical machine translation as it improves explainability. Our technique can outperform a plain text system in terms of BLEU score under the recent Transformer architecture on Japanese-English and Portuguese-English, and is within 0.5 BLEU difference on Spanish-English.

[45]   The University of Cambridge’s machine translation systems for WMT18. F. Stahlberg, A. de Gispert, and W. Byrne. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, page (10 pages), Brussels, Belgium, October 2018. Association for Computational Linguistics. http://www.aclweb.org/anthology/W18-6427.

The University of Cambridge submission to the WMT18 news translation task focuses on the combination of diverse models of translation. We compare recurrent, convolutional, and self-attention-based neural models on German-English, English-German, and Chinese-English. Our final system combines all neural models together with a phrase-based SMT system in an MBR-based scheme. We report small but consistent gains on top of strong Transformer ensembles.

[46]   Why not be versatile? Applications of the SGNMT decoder for machine translation. F. Stahlberg, D. Saunders, G. Iglesias, and W. Byrne. In Proceedings of the Association of Machine Translation in the Americas, page (9 pages), March 2018. http://arxiv.org/abs/1803.07204.

SGNMT is a decoding platform for machine translation which allows pairing various modern neural models of translation with different kinds of constraints and symbolic models. In this paper, we describe three use cases in which SGNMT is currently playing an active role: (1) teaching, as SGNMT is being used for course work and student theses in the MPhil in Machine Learning, Speech and Language Technology at the University of Cambridge, (2) research, as most of the research work of the Cambridge MT group is based on SGNMT, and (3) technology transfer, as we show how SGNMT is helping to transfer research findings from the laboratory to the industry, e.g. into a product of SDL plc.

[47]   A comparison of neural models for word ordering. E. Hasler, F. Stahlberg, M. Tomalin, A. de Gispert, and W. Byrne. In Proceedings of the 10th International Conference on Natural Language Generation, page (5 pages), 2017. https://www.aclweb.org/anthology/W17-3531.

We compare several language models for the word-ordering task and propose a new bag-to-sequence neural model based on attention-based sequence-to-sequence models. We evaluate the model on a large German WMT data set where it significantly outperforms existing models. We also describe a novel search strategy for LM-based word ordering and report results on the English Penn Treebank. Our best model setup outperforms prior work both in terms of speed and quality.

[48]   Unfolding and shrinking neural machine translation ensembles. F. Stahlberg and W. Byrne. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, page (9 pages), 2017. https://arxiv.org/pdf/1704.03279.

Ensembling is a well-known technique in neural machine translation (NMT). Instead of a single neural net, multiple neural nets with the same topology are trained separately, and the decoder generates predictions by averaging over the individual models. Ensembling often improves the quality of the generated translations drastically. However, it is not suitable for production systems because it is cumbersome and slow. This work aims to reduce the runtime to be on par with a single system without compromising the translation quality. First, we show that the ensemble can be unfolded into a single large neural network which imitates the output of the ensemble system. We show that unfolding can already improve the runtime in practice since more work can be done on the GPU. We proceed by describing a set of techniques to shrink the unfolded network by reducing the dimensionality of layers. On Japanese-English we report that the resulting network has the size and decoding speed of a single NMT network but performs on the level of a 3-ensemble system.

[49]   Neural machine translation by minimising the Bayes-risk with respect to syntactic translation lattices. F. Stahlberg, A. de Gispert, E. Hasler, and W. Byrne. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, page (7 pages), 2017. https://arxiv.org/abs/1612.03791.

We present a novel scheme to combine neural machine translation (NMT) with traditional statistical machine translation (SMT). Our approach borrows ideas from linearised lattice minimum Bayes-risk decoding for SMT. The NMT score is combined with the Bayes-risk of the translation according to the SMT lattice. This makes our approach much more flexible than n-best list or lattice rescoring as the neural decoder is not restricted to the SMT search space. We show an efficient and simple way to integrate risk estimation into the NMT decoder which is suitable for word-level as well as subword-unit-level NMT. We test our method on English-German and Japanese-English and report significant gains over lattice rescoring on several data sets for both single and ensembled NMT. The MBR decoder produces entirely new hypotheses far beyond simply rescoring the SMT search space or fixing UNKs in the NMT output.

[50]   SGNMT – A flexible NMT decoding platform for quick prototyping of new models and search strategies. Felix Stahlberg, Eva Hasler, Danielle Saunders, and Bill Byrne. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, page (6 pages), 2017. https://www.aclweb.org/anthology/D17-2005.

This paper introduces SGNMT, our experimental platform for machine translation research. SGNMT provides a generic interface to neural and symbolic scoring modules (predictors) with left-to-right semantics, such as translation models like NMT, language models, translation lattices, n-best lists or other kinds of scores and constraints. Predictors can be combined with other predictors to form complex decoding tasks. SGNMT implements a number of search strategies for traversing the space spanned by the predictors which are appropriate for different predictor constellations. Adding new predictors or decoding strategies is particularly easy, making it a very efficient tool for prototyping new research ideas. SGNMT is actively being used by students in the MPhil program in Machine Learning, Speech and Language Technology at the University of Cambridge for course work and theses, as well as for most of the research work in our group.

[51]   Break it down for me: A study in automated lyric annotation. L. Sterckx, J. Naradowsky, T. Demeester, W. Byrne, and C. Develder. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, page (7 pages), 2017. https://www.aclweb.org/anthology/D17-1220.

Comprehending lyrics, as found in songs and poems, can pose a challenge to human and machine readers alike. This motivates the need for systems that can understand the ambiguity and jargon found in such creative texts, and provide commentary to aid readers in reaching the correct interpretation. We introduce the task of automated lyric annotation (ALA). Like text simplification, a goal of ALA is to rephrase the original text in a more easily understandable manner. However, in ALA the system must often include additional information to clarify niche terminology and abstract concepts. To stimulate research on this task, we release a large collection of crowdsourced annotations for song lyrics. We analyze the performance of translation and retrieval models on this task, measuring performance with both automated and human evaluation. We find that each model captures a unique type of information important to the task.

[52]   Source sentence simplification for statistical machine translation. E. Hasler, A. de Gispert, F. Stahlberg, A. Waite, and W. Byrne. Computer Speech & Language, 45:221–235 (15 pages), September 2017. http://dx.doi.org/10.1016/j.csl.2016.12.001.

Long sentences with complex syntax and long-distance dependencies pose difficulties for machine translation systems. Short sentences, on the other hand, are usually easier to translate. We study the potential of addressing this mismatch using text simplification: given a simplified version of the full input sentence, can we use it in addition to the full input to improve translation? We show that the spaces of original and simplified translations can be effectively combined using translation lattices and compare two decoding approaches to process both inputs at different levels of integration. We demonstrate on source-annotated portions of WMT test sets and on top of strong baseline systems combining hierarchical and neural translation for two language pairs that source simplification can help to improve translation quality.

[53]   Speed-constrained tuning for statistical machine translation using Bayesian optimization. D. Beck, A. de Gispert, A. Waite, and W. Byrne. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2016), page (8 pages), 2016. https://www.aclweb.org/anthology/N16-1100.

We address the problem of automatically finding the parameters of a statistical machine translation system that maximize BLEU scores while ensuring that decoding speed exceeds a minimum value. We propose the use of Bayesian Optimization to efficiently tune the speed-related decoding parameters by easily incorporating speed as a noisy constraint function. The obtained parameter values are guaranteed to satisfy the speed constraint with an associated confidence margin. Across three language pairs and two speed constraint values, we report overall optimization time reduction compared to grid and random search. We also show that Bayesian Optimization can decouple speed and BLEU measurements, resulting in a further reduction of overall optimization time as speed is measured over a small subset of sentences.

[54]   The edit distance transducer in action: The University of Cambridge English-German System at WMT16. F. Stahlberg, E. Hasler, and W. Byrne. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 377–384 (8 pages), 2016. http://aclweb.org/anthology/W16-2324.

This paper presents the University of Cambridge submission to WMT16. Motivated by the complementary nature of syntactical machine translation and neural machine translation (NMT), we exploit the synergies of Hiero and NMT in different combination schemes. Starting out with a simple neural lattice rescoring approach, we show that the Hiero lattices are often too narrow for NMT ensembles. Therefore, instead of a hard restriction of the NMT search space to the lattice, we propose to loosely couple NMT and Hiero by composition with a modified version of the edit distance transducer. The loose combination outperforms lattice rescoring, especially when using multiple NMT systems in an ensemble.

[55]   Syntactically guided neural machine translation. F. Stahlberg, E. Hasler, Aurelien Waite, and W. Byrne. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 299–305 (7 pages), 2016. http://aclweb.org/anthology/P16-2049.

We investigate the use of hierarchical phrase-based SMT lattices in end-to-end neural machine translation (NMT). Weight pushing transforms the Hiero scores for complete translation hypotheses, with the full translation grammar score and full n-gram language model score, into posteriors compatible with NMT predictive probabilities. With a slightly modified NMT beam-search decoder we find gains over both Hiero and NMT decoding alone, with practical advantages in extending NMT to very large input and output vocabularies.

[56]   Transducer disambiguation with sparse topological features. G. Iglesias, A. de Gispert, and W. Byrne. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, page (8 pages), September 2015. http://www.aclweb.org/anthology/D/D15/D15-1273.pdf.

We describe a simple and efficient algorithm to disambiguate non-functional weighted finite state transducers (WFST), i.e. to generate a new WFST that contains a unique, best-scoring path for each hypothesis in the input labels along with the best output labels. The algorithm uses topological features with the use of a novel tropical sparse tuple vector semiring. We empirically prove that our algorithm is more efficient than previous work in a PoS-tagging disambiguation task. Also, we use our method to rescore very large translation lattices with a bilingual neural network language model, obtaining gains in line with the literature.

[57]   Fast and accurate preordering for SMT using neural networks. A. de Gispert, G. Iglesias, and W. Byrne. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2015), page (8 pages), June 2015. http://www.aclweb.org/anthology/N/N15/N15-1105.pdf.

We propose the use of neural networks to model source-side preordering for faster and better statistical machine translation. The neural network trains a logistic regression model to predict whether two sibling nodes of the source-side parse tree should be swapped in order to obtain a more monotonic parallel corpus, based on samples extracted from the word-aligned parallel corpus. For multiple language pairs and domains, we show that this yields the best reordering performance against other state-of-the-art techniques, resulting in improved translation quality and very fast decoding.

[58]   The geometry of statistical machine translation. A. Waite and W. Byrne. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2015), page (8 pages), June 2015. http://www.aclweb.org/anthology/N/N15/N15-1041.pdf.

Most modern statistical machine translation systems are based on linear statistical models. One extremely effective method for estimating the model parameters is minimum error rate training (MERT), which is an efficient form of line optimisation adapted to the highly non-linear objective functions used in machine translation. We will show that MERT can be represented using convex geometry, which is the mathematics of polytopes and their faces. Using this geometric representation of MERT we investigate whether the optimisation of linear models is tractable in general. It has been believed that the number of feasible solutions of a linear model is exponential with respect to the number of sentences used for parameter estimation, however we show that the exponential complexity is instead due to the feature dimension. This result has important ramifications because it suggests that the current trend in building statistical machine translation systems by introducing very large number of sparse features is inherently not robust.

[59]   Hierarchical statistical semantic realization for minimal recursion semantics. M. Horvat, A. Copestake, and W. Byrne. In Proceedings of the International Conference on Computational Semantics (IWCS 2015), page (11 pages), April 2015. http://anthology.aclweb.org/W/W15/W15-0116.pdf.

We introduce a robust statistical approach to realization from Minimal Recursion Semantics representations. The approach treats realization as a translation problem, transforming the Dependency MRS graph representation to a surface string. Translation is based on a Synchronous Context-Free Grammar that is automatically extracted from a large corpus of parsed sentences. We have evaluated the new approach on the Wikiwoods corpus, where it shows promising results.

[60]   Pushdown automata in statistical machine translation. C. Allauzen, W. Byrne, A. de Gispert, G. Iglesias, and M. Riley. Computational Linguistics, pages 687–723 (38 pages), 2014. http://www.aclweb.org/anthology/J/J14/J14-3008.pdf.

This paper describes the use of pushdown automata (PDA) in the context of statistical machine translation and alignment under a synchronous context-free grammar. We use PDAs to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence. General-purpose PDA algorithms for replacement, composition, shortest path, and expansion are presented. We describe HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms. We contrast the complexity of this decoder with a decoder based on a finite state automata (FSA) representation, showing that PDAs provide a more suitable framework to achieve exact decoding for larger SCFGs and smaller language models. We assess this experimentally on a large-scale Chinese-to-English alignment and translation task. In translation, we propose a two-pass decoding strategy involving a weaker language model in the first-pass to address the results of PDA complexity analysis. We study in depth the experimental conditions and tradeoffs in which HiPDT can achieve state-of-the-art performance for large-scale SMT.

[61]   Investigating automatic and human filled pause insertion for synthetic speech. R. Dall, M. Tomalin, M. Wester, W. Byrne, and S. King. In Proceedings of INTERSPEECH, page (4 pages), September 2014.

Filled Pauses are pervasive in conversational speech and have been shown to serve a range of psychological and structural purposes. Despite this, they are seldom modelled overtly by state-of-the-art speech synthesis systems. This paper seeks to motivate the incorporation of filled pauses into speech synthesis systems by exploring their use in conversational speech, and by comparing the performance of several automatic systems that insert filled pauses into fluent texts. Two initial experiments are described which seek to determine whether people’s predictions about appropriate insertion points for filled pauses are consistent with actual practice and/or with each other. The experiments also investigate whether there are ‘right’ and ‘wrong’ places to insert filled pauses in a given sentence. The results summarised in this paper show good consistency between people’s predictions of usage and their actual practice, as well as a perceptual preference for the ‘right’ placement. The third experiment contrasts the performance of several automatic systems that insert filled pauses into fluent sentences. The best performance (as determined by precision, recall and F-measure) was produced by interpolating a Recurrent Neural Network and a 4-gram Language Model. The research presented in this paper offers new insights into the way in which filled pauses are used and perceived by humans, and how automatic systems can be used to predict the locations of filled pauses in fluent input text.

[62]   Effective incorporation of source syntax into hierarchical phrase-based translation. T. Xiao, A. de Gispert, J. Zhu, and W. Byrne. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2064–2074 (11 pages), Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics. http://www.aclweb.org/anthology/C14-1195.

In this paper we explicitly consider source language syntactic information in both rule extraction and decoding for hierarchical phrase-based translation. We obtain tree-to-string rules by the GHKM method and use them to complement Hiero-style rules. All these rules are then employed to decode new sentences with source language parse trees. We experiment with our approach in a state-of-the-art Chinese-English system and demonstrate +1.2 and +0.8 BLEU improvements on the NIST newswire and web evaluation data of MT08 and MT12.

[63]   Word ordering with phrase-based grammars. A. de Gispert, M. Tomalin, and W. Byrne. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 259–268 (12 pages), Gothenburg, Sweden, April 2014. Association for Computational Linguistics. http://www.aclweb.org/anthology/E14-1028.

We describe an approach to word ordering using modelling techniques from statistical machine translation. The system incorporates a phrase-based model of string generation that aims to take unordered bags of words and produce fluent, grammatical sentences. We describe the generation grammars and introduce parsing procedures that address the computational complexity of generation under permutation of phrases. Against the best previous results reported on this task, obtained using syntax driven models, we report huge quality improvements, with BLEU score gains of 20+ which we confirm with human fluency judgements. Our system incorporates dependency language models, large n-gram language models, and minimum Bayes risk decoding.

[64]   A graph-based approach to string regeneration. M. Horvat and W. Byrne. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 85–95 (11 pages), Gothenburg, Sweden, April 2014. Association for Computational Linguistics. http://www.aclweb.org/anthology/E14-3010.

The string regeneration problem is the problem of generating a fluent sentence from a bag of words. We explore the N-gram language model approach to string regeneration. The approach computes the highest probability permutation of the input bag of words under an N-gram language model. We describe a graph-based approach for finding the optimal permutation. The evaluation of the approach on a number of datasets yielded promising results, which were confirmed by conducting a manual evaluation study.

[65]   Source-side preordering for translation using logistic regression and depth-first branch-and-bound search. L. Jehl, A. de Gispert, M. Hopkins, and W. Byrne. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 239–248 (12 pages), Gothenburg, Sweden, April 2014. Association for Computational Linguistics. http://www.aclweb.org/anthology/E14-1026.

We present a simple preordering approach for machine translation based on a feature-rich logistic regression model to predict whether two children of the same node in the source-side parse tree should be swapped or not. Given the pair-wise children regression scores we conduct an efficient depth-first branch-and-bound search through the space of possible children permutations, avoiding using a cascade of classifiers or limiting the list of possible ordering outcomes. We report experiments in translating English to Japanese and Korean, demonstrating superior performance as (a) the number of crossing links drops by more than 10% absolute with respect to other state-of-the-art preordering approaches, (b) BLEU scores improve on 2.2 points over the baseline with lexicalised reordering model, and (c) decoding can be carried out 80 times faster.

[66]   The University of Cambridge Russian-English system at WMT13. J. Pino, A. Waite, T. Xiao, A. de Gispert, F. Flego, and W. Byrne. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 200–205 (6 pages), Sofia, Bulgaria, August 2013. Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-2225.

This paper describes the University of Cambridge submission to the Eighth Workshop on Statistical Machine Translation. We report results for the Russian-English translation task. We use multiple segmentations for the Russian input language. We employ the Hadoop framework to extract rules. The decoder is HiFST, a hierarchical phrase-based decoder implemented using weighted finite-state transducers. Lattices are rescored with a higher order language model and minimum Bayes-risk objective.

[67]   Fast, low-artifact speech synthesis considering global variance. M. Shannon and W. Byrne. In Proceedings of IEEE Conference on Acoustics, Speech and Signal Processing, page (5 pages), June 2013. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6639196.

Speech parameter generation considering global variance (GV generation) is widely acknowledged to dramatically improve the quality of synthetic speech generated by HMM-based systems. However it is slower and has higher latency than the standard speech parameter generation algorithm. In addition it is known to produce artifacts, though existing approaches to prevent artifacts are effective. In this paper we present a simple new mathematical analysis of speech parameter generation considering global variance based on Lagrange multipliers. This analysis sheds light on one source of artifacts and suggests a way to reduce their occurrence. It also suggests an approximation to exact GV generation that allows fast, low latency synthesis. In a subjective evaluation the naturalness of our fast approximate algorithm is as good as conventional GV generation.

[68]   N-gram posterior probability confidence measures for statistical machine translation: an empirical study. A. de Gispert, G. Blackwood, G. Iglesias, and W. Byrne. Machine Translation, pages 1–30 (31 pages), 2012. Published online 1 September 2012. http://dx.doi.org/10.1007/s10590-012-9132-2.

We report an empirical study of n-gram posterior probability confidence measures for statistical machine translation (SMT). We first describe an efficient and practical algorithm for rapidly computing n-gram posterior probabilities from large translation word lattices. These probabilities are shown to be a good predictor of whether or not the n-gram is found in human reference translations, motivating their use as a confidence measure for SMT. Comprehensive n-gram precision and word coverage measurements are presented for a variety of different language pairs, domains and conditions. We analyze the effect on reference precision of using single or multiple references, and compare the precision of posteriors computed from k-best lists to those computed over the full evidence space of the lattice. We also demonstrate improved confidence by combining multiple lattices in a multi-source translation framework.

[69]   Simple and efficient model filtering in statistical machine translation. J. Pino, A. Waite, and W. Byrne. The Prague Bulletin of Mathematical Linguistics, (98):5–24 (20 pages), 2012. Published online 6 September 2012. https://ufal.mff.cuni.cz/pbml/98/art-pino-waite-byrne.pdf.

Data availability and distributed computing techniques have allowed statistical machine translation (SMT) researchers to build larger models. However, decoders need to be able to retrieve information efficiently from these models to be able to translate an input sentence or a set of input sentences. We introduce an easy to implement and general purpose solution to tackle this problem: we store SMT models as a set of key-value pairs in an HFile. We apply this strategy to two specific tasks: test set hierarchical phrase-based rule filtering and n-gram count filtering for language model lattice rescoring. We compare our approach to alternative strategies and show that its trade offs in terms of speed, memory and simplicity are competitive.

[70]   Autoregressive models for statistical parametric speech synthesis. M. Shannon, H. Zen, and W. Byrne. IEEE Transactions on Audio, Speech and Language Processing, 21(3):587–597 (11 pages), 2012. http://dx.doi.org/10.1109/TASL.2012.2227740.

We propose using the autoregressive hidden Markov model (HMM) for speech synthesis. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard approach to statistical parametric speech synthesis. It supports easy and efficient parameter estimation using expectation maximization, in contrast to the trajectory HMM. At the same time its similarities to the standard approach allow use of established high quality synthesis algorithms such as speech parameter generation considering global variance. The autoregressive HMM also supports a speech parameter generation algorithm not available for the standard approach or the trajectory HMM and which has particular advantages in the domain of real-time, low latency synthesis. We show how to do efficient parameter estimation and synthesis with the autoregressive HMM and look at some of the similarities and differences between the standard approach, the trajectory HMM and the autoregressive HMM. We compare the three approaches in subjective and objective evaluations. We also systematically investigate which choices of parameters such as autoregressive order and number of states are optimal for the autoregressive HMM.

[71]   Impacts of machine translation and speech synthesis on speech-to-speech translation. K. Hashimoto, J. Yamagishi, W. Byrne, S. King, and K. Tokuda. Speech Communication, 54(7):857–866 (10 pages), September 2012. http://www.sciencedirect.com/science/article/pii/S0167639312000283.

This paper analyzes the impacts of machine translation and speech synthesis on speech-to-speech translation systems. A typical speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques have been proposed for integration of speech recognition and machine translation. However, corresponding techniques have not yet been considered for speech synthesis. The focus of the current work is machine translation and speech synthesis, and we present a subjective evaluation designed to analyze their impact on speech-to-speech translation. The results of these analyses show that the naturalness and intelligibility of the synthesized speech are strongly affected by the fluency of the translated sentences. In addition, various features were found to correlate well with the average fluency of the translated sentences and the average naturalness of the synthesized speech.

[72]   Lattice-based minimum error rate training using weighted finite-state transducers with tropical polynomial weights. A. Waite, G. Blackwood, and W. Byrne. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP 2012), page (11 pages), Donostia-San Sebastian, Spain, July 2012. http://aclweb.org/anthology-new/W/W12/W12-6219.pdf.

Minimum Error Rate Training (MERT) is a method for training the parameters of a log-linear model. One advantage of this method of training is that it can use the large number of hypotheses encoded in a translation lattice as training data. We demonstrate that the MERT line optimisation can be modelled as computing the shortest distance in a weighted finite-state transducer using a tropical polynomial semiring.

[73]   Preprocessing Arabic for Arabic-English statistical machine translation. A. de Gispert, W. Byrne, J. Xu, R. Zbib, J. Makhoul, A. Chalabi, H. Nader, N. Habash, and F. Sadat. In J. Olive, C. Christianson, and J. McCary, editors, Handbook of natural language processing and machine translation. DARPA Global Autonomous Language Exploitation, pages 135–145 (11 pages). Springer, 2011.

[74]   Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. J. Dines, H. Liang, L. Saheer, M. Gibson, W. Byrne, K. Oura, K. Tokuda, J. Yamagishi, S. King, M. Wester, T. Hirsimäki, R. Karhila, and M. Kurimo. Computer Speech and Language, page (18 pages), 2011. Special Issue on Speech Translation. Published online 17 September 2011. http://www.sciencedirect.com/science/article/pii/S0885230811000441.

In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation in which we employ a HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions including the need for better evaluation metrics.

[75]   Unsupervised intra-lingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction. M. Gibson and W. Byrne. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):895–904 (10 pages), 2011. http://dx.doi.org/10.1109/TASL.2010.2066968.

Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to estimate the transcription of the adaptation data. This paper first presents an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for such supplementary acoustic models. This is achieved by defining a mapping between HMM-based synthesis models and ASR-style models, via a two-pass decision tree construction process. Second, it is shown that this mapping also enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data. Third, this paper demonstrates how this technique lends itself to the task of unsupervised cross-lingual adaptation of HMM-based speech synthesis models, and explains the advantages of such an approach. Finally, listener evaluations reveal that the proposed unsupervised adaptation methods deliver performance approaching that of supervised adaptation.

[76]   An analysis of machine translation and speech synthesis in speech-to-speech translation system. K. Hashimoto, J. Yamagishi, W. Byrne, S. King, and K. Tokuda. In Proceedings of IEEE Conference on Acoustics, Speech and Signal Processing, pages 5108–5111 (4 pages), 2011. http://dx.doi.org/10.1109/ICASSP.2011.5946361.

This paper provides an analysis of the impacts of machine translation and speech synthesis on speech-to-speech translation systems. The speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Recently, many techniques for integration of speech recognition and machine translation have been proposed. However, speech synthesis has not yet been considered. The quality of synthesized speech is important, since users will not understand what the system said if the quality of synthesized speech is bad. Therefore, in this paper, we focus on the machine translation and speech synthesis components, and report a subjective evaluation to analyze the impact of each component. The results of these analyses show that the machine translation component affects the performance of speech-to-speech translation greatly, and that fluent sentences lead to higher naturalness and lower word error rate of synthesized speech.

[77]   The effect of using normalized models in statistical speech synthesis. M. Shannon, H. Zen, and W. Byrne. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, page (4 pages), 2011.

The standard approach to HMM-based speech synthesis is inconsistent in the enforcement of the deterministic constraints between static and dynamic features. The trajectory HMM and autoregressive HMM have been proposed as normalized models which rectify this inconsistency. This paper investigates the practical effects of using these normalized models, and examines the strengths and weaknesses of the different models as probabilistic models of speech. The most striking difference observed is that the standard approach greatly underestimates predictive variance. We argue that the normalized models have better predictive distributions than the standard approach, but that all the models we consider are still far from satisfactory probabilistic models of speech. We also present evidence that better intra-frame correlation modelling goes some way towards improving existing normalized models.

[78]   Hierarchical phrase-based translation representations. G. Iglesias, C. Allauzen, W. Byrne, A. de Gispert, and M. Riley. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1373–1383 (11 pages), Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. http://www.aclweb.org/anthology/D11-1127.

This paper compares several translation representations for a synchronous context-free grammar parse including CFGs/hypergraphs, finite-state automata (FSA), and pushdown automata (PDA). The representation choice is shown to determine the form and complexity of target LM intersection and shortest-path algorithms that follow. Intersection, shortest path, FSA expansion and RTN replacement algorithms are presented for PDAs. Chinese-to-English translation experiments using HiFST and HiPDT, FSA and PDA-based decoders, are presented using admissible (or exact) search, possible for HiFST with compact SCFG rulesets and HiPDT with compact LMs. For large rulesets with large LMs, we introduce a two-pass search strategy which we then analyze in terms of search errors and translation performance.

[79]   Efficient path counting transducers for minimum Bayes-risk decoding of statistical machine translation lattices. G. Blackwood, A. de Gispert, and W. Byrne. In Proceedings of the Annual Meeting of the Association for Computational Linguistics – Short Papers, pages 27–32 (6 pages), 2010. http://www.aclweb.org/anthology/P/P10/P10-2006.pdf.

This paper presents an efficient implementation of linearised lattice minimum Bayes-risk decoding using weighted finite state transducers. We introduce transducers to efficiently count lattice paths containing n-grams and use these to gather the required statistics. We show that these procedures can be implemented exactly through simple transformations of word sequences to sequences of n-grams. This yields a novel implementation of lattice minimum Bayes-risk decoding which is fast and exact even for very large lattices.

[80]   Fluency constraints for minimum Bayes-risk decoding of statistical machine translation lattices. G. Blackwood, A. de Gispert, and W. Byrne. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 71–79 (9 pages), 2010. http://www.aclweb.org/anthology/C/C10/C10-1009.pdf.

A novel and robust approach to incorporating natural language generation into statistical machine translation is developed within a minimum Bayes-risk decoding framework. Segmentation of translation lattices is guided by confidence measures over the maximum likelihood translation hypothesis in order to focus on regions with potential translation errors. Modeling techniques intended to improve fluency in low confidence regions are introduced so as to improve overall translation fluency.

[81]   Hierarchical phrase-based translation grammars extracted from alignment posterior probabilities. A. de Gispert, J. Pino, and W. Byrne. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 545–554 (10 pages), Cambridge, MA, 2010. http://www.aclweb.org/anthology/D/D10/D10-1053.pdf.

We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignment model. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteriors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-to-target and target-to-source alignment models is to build two separate systems and combine their output translation lattices.

[82]   Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction. M. Gibson, T. Hirsimäki, R. Karhila, M. Kurimo, and W. Byrne. In Proceedings of IEEE Conference on Acoustics, Speech and Signal Processing, pages 4642–4645 (4 pages), 2010.

This paper demonstrates how unsupervised cross-lingual adaptation of HMM-based speech synthesis models may be performed without explicit knowledge of the adaptation data language. A two-pass decision tree construction technique is deployed for this purpose. Using parallel translated datasets, cross-lingual and intralingual adaptation are compared in a controlled manner. Listener evaluations reveal that the proposed method delivers performance approaching that of unsupervised intralingual adaptation.

[83]   Personalising speech-to-speech translation in the EMIME project. M. Kurimo, W. Byrne, J. Dines, P. Garner, M. Gibson, Y. Guan, T. Hirsimäki, R. Karhila, S. King, H. Liang, K. Oura, L. Saheer, M. Shannon, S. Shiota, J. Tian, K. Tokuda, M. Wester, Y.-J. Wu, and J. Yamagishi. In Proceedings of the Annual Meeting of the Association for Computational Linguistics – Demonstration Systems, pages 48–53 (6 pages), 2010.

[84]   Overview and results of Morpho Challenge 2009. M. Kurimo, S. Virpioja, V. T. Turunen, G. W. Blackwood, and W. Byrne. In C. Peters et al., editor, Multilingual Information Access Evaluation, 10th Workshop of the Cross-Language Evaluation Forum - CLEF 2009, volume 1 of Revised Selected Papers, Lecture Notes in Computer Science, LNCS 6241, pages 579–598 (20 pages). Springer, 2010.

The goal of Morpho Challenge 2009 was to evaluate unsupervised algorithms that provide morpheme analyses for words in different languages and in various practical applications. Morpheme analysis is particularly useful in speech recognition, information retrieval and machine translation for morphologically rich languages where the amount of different word forms is very large. The evaluations consisted of: 1. a comparison to grammatical morphemes, 2. using morphemes instead of words in information retrieval tasks, and 3. combining morpheme and word based systems in statistical machine translation tasks. The evaluation languages were: Finnish, Turkish, German, English and Arabic. This paper describes the tasks, evaluation methods, and obtained results. The Morpho Challenge was part of the EU Network of Excellence PASCAL Challenge Program and organized in collaboration with CLEF.

[85]   Overview and results of Morpho Challenge 2009. M. Kurimo, S. Virpioja, V.T. Turunen, G.W. Blackwood, and W. Byrne. In Multilingual Information Access Evaluation, 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, volume 1 of Lecture Notes in Computer Science, pages 578–597 (20 pages). Springer, 2010. http://eprints.pascal-network.org/archive/00006052.

In the Morpho Challenge 2009 unsupervised algorithms that provide morpheme analyses for words in different languages were evaluated in various practical applications. Morpheme analysis is particularly useful in speech recognition, information retrieval and machine translation for morphologically rich languages where the amount of different word forms is very large. The evaluations consisted of: 1. a comparison to grammatical morphemes, 2. using morphemes instead of words in information retrieval tasks, and 3. combining morpheme and word based systems in statistical machine translation tasks. The evaluation languages in 2009 were: Finnish, Turkish, German, English and Arabic. This overview paper describes the tasks, evaluation methods, and obtained results. The Morpho Challenge is part of the EU Network of Excellence PASCAL Challenge Program and organized in collaboration with CLEF.

[86]   The CUED HiFST system for the WMT10 translation shared task. J. Pino, G. Iglesias, A. de Gispert, G. Blackwood, J. Brunning, and W. Byrne. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 155–160 (6 pages), 2010. http://www.statmt.org/wmt10/pdf/WMT23.pdf.

This paper describes the Cambridge University Engineering Department submission to the Fifth Workshop on Statistical Machine Translation. We report results for the French-English and Spanish-English shared translation tasks in both directions. The CUED system is based on HiFST, a hierarchical phrase-based decoder implemented using weighted finite-state transducers. In the French-English task, we investigate the use of context-dependent alignment models. We also show that lattice minimum Bayes-risk decoding is an effective framework for multi-source translation, leading to large gains in BLEU score.

[87]   Autoregressive clustering for HMM speech synthesis. M. Shannon and W. Byrne. In Proceedings of INTERSPEECH, page (4 pages), 2010.

The autoregressive HMM has been shown to provide efficient parameter estimation and high-quality synthesis, but in previous experiments decision trees derived from a non-autoregressive system were used. In this paper we investigate the use of autoregressive clustering for autoregressive HMM-based speech synthesis. We describe decision tree clustering for the autoregressive HMM and highlight differences to the standard clustering procedure. Subjective listening evaluation results suggest that autoregressive clustering improves the naturalness of the resulting speech. We find that the standard minimum description length (MDL) criterion for selecting model complexity is inappropriate for the autoregressive HMM. Investigating the effect of model complexity on naturalness, we find that a large degree of overfitting is tolerated without a substantial decrease in naturalness.

[88]   Hierarchical phrase-based translation with weighted finite state transducers and shallow-N grammars. A. de Gispert, G. Iglesias, G. Blackwood, E. R. Banga, and W. Byrne. Computational Linguistics, 36(3):505–533 (29 pages), September 2010. http://www.aclweb.org/anthology/J/J10/J10-3008.pdf.

In this paper we describe HiFST, a lattice-based decoder for hierarchical phrase-based translation and alignment. The decoder is implemented with standard Weighted Finite-State Transducer (WFST) operations as an alternative to the well-known cube pruning procedure. We find that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, better parameter optimization, and improved translation performance. The direct generation of translation lattices in the target language can improve subsequent rescoring procedures, yielding further gains when applying long-span language models and Minimum Bayes Risk decoding. We also give insight as to how to control the size of the search space defined by hierarchical rules. We show that shallow-N grammars, low-level rule catenation and other search constraints can help to match the power of the translation system to specific language pairs.

[89]   Context-dependent alignment models for statistical machine translation. J. Brunning, A. de Gispert, and W. Byrne. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 110–118 (9 pages), 2009. http://www.aclweb.org/anthology/N/N09/N09-1013.pdf.

We introduce alignment models for Machine Translation that take into account the context of a source word when determining its translation. Since the use of these contexts alone causes data sparsity problems, we develop a decision tree algorithm for clustering the contexts based on optimisation of the EM auxiliary function. We show that our context-dependent models lead to an improvement in alignment quality, and an increase in translation quality when the alignments are used to build a machine translation system.

[90]   Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions. A. de Gispert, S. Virpioja, M. Kurimo, and W. Byrne. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 73–76 (4 pages), 2009. http://www.aclweb.org/anthology/N/N09/N09-2019.pdf.

We describe a simple strategy to achieve translation performance improvements by combining output from identical statistical machine translation systems trained on alternative morphological decompositions of the source language. Combination is done by means of Minimum Bayes Risk decoding over a shared N-best list. When translating into English from two highly inflected languages such as Arabic and Finnish we obtain significant improvements over simply selecting the best morphological decomposition.

[91]   The HiFST system for the Europarl Spanish-to-English task. G. Iglesias, A. de Gispert, E. Banga, and W. Byrne. In Proceedings of SEPLN, pages 207–214 (8 pages), 2009.

In this paper we present results for the Europarl Spanish-to-English translation task. We use HiFST, a novel hierarchical phrase-based translation system implemented with finite-state technology that creates target lattices rather than k-best lists.

[92]   Hierarchical phrase-based translation with weighted finite state transducers. G. Iglesias, A. de Gispert, E. R. Banga, and W. Byrne. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 433–441 (9 pages), 2009. http://www.aclweb.org/anthology/N/N09/N09-1049.pdf.

This paper describes a lattice-based decoder for hierarchical phrase-based translation. The decoder is implemented with standard WFST operations as an alternative to the well-known cube pruning procedure. We find that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, direct generation of translation lattices in the target language, better parameter optimization, and improved translation performance when rescoring with long-span language models and MBR decoding. We report translation experiments for the Arabic-to-English and Chinese-to-English NIST translation tasks and contrast the WFST-based hierarchical decoder with hierarchical translation under cube pruning.

[93]   Rule filtering by pattern for efficient hierarchical translation. G. Iglesias, A. de Gispert, E. R. Banga, and W. Byrne. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), pages 380–388 (9 pages), 2009. http://www.aclweb.org/anthology/E/E09/E09-1044.pdf.

We describe refinements to hierarchical translation search procedures intended to reduce both search errors and memory usage through modifications to hypothesis expansion in cube pruning and reductions in the size of the rule sets used in translation. Rules are put into syntactic classes based on the number of non-terminals and the pattern, and various filtering strategies are then applied to assess the impact on translation speed and quality. Results are reported on the 2008 NIST Arabic-to-English evaluation task.

[94]   Autoregressive HMMs for speech synthesis. M. Shannon and W. Byrne. In Proceedings of INTERSPEECH, page (4 pages), 2009.

We propose the autoregressive HMM for speech synthesis. We show that the autoregressive HMM supports efficient EM parameter estimation and that we can use established effective synthesis techniques such as synthesis considering global variance with minimal modification. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard HMM synthesis framework, and supports easy and efficient parameter estimation, in contrast to the trajectory HMM. We find that the autoregressive HMM gives performance comparable to the standard HMM synthesis framework on a Blizzard Challenge-style naturalness evaluation.

[95]   Large-scale statistical machine translation with weighted finite state transducers. G. Blackwood, A. de Gispert, J. Brunning, and W. Byrne. In Proceedings of FSMNLP 2008: Finite-State Methods and Natural Language Processing, page (12 pages), Ispra, Lago Maggiore, Italy, September 2008.

The Cambridge University Engineering Department phrase-based statistical machine translation system follows a generative model of translation and is implemented by the composition of component models realised as Weighted Finite State Transducers. Our flexible architecture requires no special purpose decoder and readily handles the large-scale natural language processing demands of state-of-the-art machine translation systems. In this paper we describe the CUED participation in the NIST 2008 Arabic-English machine translation evaluation task.

[96]   Phrasal segmentation models for statistical machine translation. G. Blackwood, A. de Gispert, and W. Byrne. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 19–22 (4 pages), Manchester, UK, August 2008. http://www.aclweb.org/anthology/C/C08/C08-2005.pdf.

Phrasal segmentation models define a mapping from the words of a sentence to sequences of translatable phrases. We discuss the estimation of these models from large quantities of monolingual training text and describe their realization as weighted finite state transducers for incorporation into phrase-based statistical machine translation systems. Results are reported on the NIST Arabic-English translation tasks showing significant complementary gains in BLEU score with large 5-gram and 6-gram language models.

[97]   European language translation with weighted finite state transducers: The CUED MT system for the 2008 ACL workshop on statistical machine translation. G. Blackwood, A. de Gispert, J. Brunning, and W. Byrne. In Proceedings of the ACL 2008 Third Workshop on Statistical Machine Translation, pages 131–134 (4 pages), June 2008. http://www.aclweb.org/anthology/W/W08/W08-0316.pdf.

We describe the Cambridge University Engineering Department phrase-based statistical machine translation system for Spanish-English and French-English translation in the ACL 2008 Third Workshop on Statistical Machine Translation Shared Task. The CUED system follows a generative model of translation and is implemented by composition of component models realised as Weighted Finite State Transducers, without the use of a special-purpose decoder. Details of system tuning for both Europarl and News translation tasks are provided.

[98]   HMM word and phrase alignment for statistical machine translation. Y. Deng and W. Byrne. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):494–507 (14 pages), March 2008. http://dx.doi.org/10.1109/TASL.2008.916056.

Efficient estimation and alignment procedures for word and phrase alignment HMMs are developed for the alignment of parallel text. The development of these models is motivated by an analysis of the desirable features of IBM Model 4, one of the original and most effective models for word alignment. These models are formulated to capture the desirable aspects of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and compared to human-generated reference alignments, and the ability of these models to capture different types of alignment phenomena is evaluated. In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model-4 even when models are trained over large parallel texts. In translation performance, phrase-based statistical machine translation systems based on these HMM alignments can equal and exceed systems based on Model-4 alignments, and this is shown in Arabic-English and Chinese-English translation. These alignment models can also be used to generate posterior statistics over collections of parallel text, and this is used to refine and extend phrase translation tables with a resulting improvement in translation quality.

[99]   Discriminative language model adaptation for Mandarin broadcast speech transcription and translation. X. A. Liu, W. J. Byrne, M. J. F. Gales, A. de Gispert, M. Tomalin, P. C. Woodland, and K. Yu. In Proc. IEEE Automatic Speech Recognition and Understanding (ASRU), pages 153–158 (6 pages), Kyoto, Japan, 2007.

[100]   Consensus network decoding for statistical machine translation system combination. K.-C. Sim, W. Byrne, M. Gales, H. Sahbi, and P.C. Woodland. In IEEE Conference on Acoustics, Speech and Signal Processing, page (4 pages), 2007. https://ieeexplore.ieee.org/abstract/document/4218048.

This paper presents a simple and robust consensus decoding approach for combining multiple Machine Translation (MT) system outputs. A consensus network is constructed from an N-best list by aligning the hypotheses against an alignment reference, where the alignment is based on minimising the translation edit rate (TER). The Minimum Bayes Risk (MBR) decoding technique is investigated for the selection of an appropriate alignment reference. Several alternative decoding strategies are proposed to retain coherent phrases in the original translations. Experimental results are presented primarily based on three-way combination of Chinese-English translation outputs, and results are also presented for six-way system combination. It is shown that worthwhile improvements in translation performance can be obtained using the methods discussed.

[101]   Gini support vector machines for segmental minimum Bayes risk decoding of continuous speech. V. Venkataramani, S. Chakrabartty, and W. Byrne. Computer Speech and Language, 21:423–442 (20 pages), 2007.

We describe the use of Support Vector Machines (SVMs) for continuous speech recognition by incorporating them in Segmental Minimum Bayes Risk decoding. Lattice cutting is used to convert the Automatic Speech Recognition search space into sequences of smaller recognition problems. SVMs are then trained as discriminative models over each of these problems and used in a rescoring framework. We pose the estimation of a posterior distribution over hypotheses in these regions of acoustic confusion as a logistic regression problem. We also show that GiniSVMs can be used as an approximation technique to estimate the parameters of the logistic regression problem. On a small vocabulary recognition task we show that the use of GiniSVMs can improve the performance of a well trained Hidden Markov Model system trained under the Maximum Mutual Information criterion. We also find that it is possible to derive reliable confidence scores over the GiniSVM hypotheses and that these can be used to good effect in hypothesis combination. We discuss the problems that we expect to encounter in extending this approach to Large Vocabulary Continuous Speech Recognition and describe initial investigation of constrained estimation techniques to derive feature spaces for SVMs.

[102]   Segmentation and alignment of parallel text for statistical machine translation. Y. Deng, S. Kumar, and W. Byrne. Journal of Natural Language Engineering, 13(3):235–260 (26 pages), 2006.

We address the problem of extracting bilingual chunk pairs from parallel text to create training sets for statistical machine translation. We formulate the problem in terms of a stochastic generative process over text translation pairs, and derive two different alignment procedures based on the underlying alignment model. The first procedure is a now-standard dynamic programming alignment model which we use to generate an initial coarse alignment of the parallel text. The second procedure is a divisive clustering parallel text alignment procedure which we use to refine the first-pass alignments. This latter procedure is novel in that it permits the segmentation of the parallel text into sub-sentence units which are allowed to be reordered to improve the chunk alignment. The quality of chunk pairs is measured by the performance of machine translation systems trained from them. We show practical benefits of divisive clustering as well as how system performance can be improved by exploiting portions of the parallel text that otherwise would have to be discarded. We also show that chunk alignment as a first step in word alignment can significantly reduce word alignment error rate.

[103]   Statistical phrase-based speech translation. L. Mathias and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing, page (4 pages), 2006.

A generative statistical model of speech-to-text translation is developed as an extension of existing models of phrase-based text translation. Speech is translated by mapping ASR word lattices to lattices of phrase sequences which are then translated using operations developed for text translation. Performance is reported on Chinese to English translation of Mandarin Broadcast News.

[104]   Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition. W. Byrne. Proceedings of the Institute of Electronics, Information, and Communication Engineers, Japan – Special Section on Statistical Modeling for Speech Processing, E89-D(3):900–907 (8 pages), March 2006. Invited paper.

[105]   A weighted finite state transducer translation template model for statistical machine translation. S. Kumar, Y. Deng, and W. Byrne. Journal of Natural Language Engineering, 12(1):35–75 (41 pages), March 2006.

We present a Weighted Finite State Transducer Translation Template Model for statistical machine translation. The approach we describe allows us to implement each constituent distribution of the model as a weighted finite state transducer or acceptor. We show that bitext word alignment and translation under the model can be performed with standard FSM operations involving these transducers. One of the benefits of using this framework is that it avoids the need to develop specialized search procedures, even for the generation of lattices or N-Best lists of bitext word alignments and translation hypotheses. We report and analyze bitext word alignment and translation performance of the model on French-English and Chinese-English tasks.

[106]   A dialectal Chinese speech recognition framework. J. Li, F. Zheng, W. Byrne, and D. Jurafsky. Journal of Computer Science and Technology (Science Press, Beijing, China), (1):106–115 (10 pages), January 2006.

A framework for dialectal Chinese speech recognition is proposed and studied, where a relatively small dialectal Chinese (or in other words Chinese influenced by the native dialect) speech corpus and the dialect-related knowledge are adopted to translate a standard Chinese (or Putonghua, abbreviated as PTH) speech recognizer into a dialectal Chinese speech recognizer. There are two kinds of knowledge sources: one is human experts and the other is a small dialectal Chinese corpus. This knowledge includes four levels: a phonetics level, lexicon level, language level, and the acoustic decoder level. This paper takes Wu dialectal Chinese (WDC) as an example target language with the goal of deriving an acceptable WDC speech recognizer from an existing PTH speech recognizer. Based on the Initial-Final structure of the Chinese language and a study of how dialectal Chinese speakers speak Putonghua, we proposed to use the knowledge of the context-independent PTH-IF mappings (where IF means either a Chinese Initial or a Chinese Final), the context-independent WDC-IF mappings, and the syllable-dependent WDC-IF mappings obtained from either experts or data, and then to combine these with the surface-form based maximum likelihood linear regression (MLLR) acoustic model adaptation method. To reduce the size of the multi-pronunciation lexicon introduced by the IF mappings which might entail confusion in the lexicon and hence lead to performance degradation, a Multi-Pronunciation Expansion (MPE) method based on an accumulated uni-gram probability (AUP) was proposed. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieved over 10% absolute Character Error Rate (CER) reduction when recognizing WDC with only 0.62% CER increase when recognizing PTH. The proposed framework and methods are intended to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and even other languages.

[107]   HMM word and phrase alignment for statistical machine translation. Y. Deng and W. Byrne. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 169–176 (8 pages), 2005. http://www.aclweb.org/anthology/H/H05/H05-1022.pdf.

HMM-based models are developed for the alignment of words and phrases in bitext. The models are formulated so that alignment and parameter estimation can be performed efficiently. We find that Chinese-English word alignment performance is comparable to that of IBM Model-4 even over large training bitexts. Phrase pairs extracted from word alignments generated under the model can also be used for phrase-based translation, and in Chinese to English and Arabic to English translation, performance is comparable to systems based on Model-4 alignments. Direct phrase pair induction under the model is described and shown to improve translation performance.

[108]   Lattice segmentation and minimum Bayes risk discriminative training for large vocabulary continuous speech recognition. V. Doumpiotis and W. Byrne. Speech Communication, (2):142–160 (19 pages), 2005. http://dx.doi.org/10.1016/j.specom.2005.07.002.

Lattice segmentation techniques developed for Minimum Bayes Risk decoding in large vocabulary speech recognition tasks are used to compute the statistics for discriminative training algorithms that estimate HMM parameters so as to reduce the overall risk over the training data. New estimation procedures are developed and evaluated for small vocabulary and large vocabulary recognition tasks, and additive performance improvements are shown relative to maximum mutual information estimation. These relative gains are explained through a detailed analysis of individual word recognition errors.

[109]   Local phrase reordering models for statistical machine translation. S. Kumar and W. Byrne. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 161–168 (8 pages), 2005. http://www.aclweb.org/anthology/H/H05/H05-1021.pdf.

We describe stochastic models of local phrase movement that can be incorporated into a Statistical Machine Translation (SMT) system. These models provide properly formulated, non-deficient, probability distributions over reordered phrase sequences. They are implemented by Weighted Finite State Transducers. We describe EM-style parameter re-estimation procedures based on phrase alignment under the complete translation model incorporating reordering. Our experiments show that the reordering model yields substantial improvements in translation performance on Arabic-to-English and Chinese-to-English MT tasks. We also show that the procedure scales as the bitext size is increased.

[110]   Automatic transcription of Czech, Russian, and Slovak spontaneous speech in the MALACH project. J. Psutka, P. Ircing, J.V. Psutka, J. Hajic, W. Byrne, and J. Mirovski. In Proceedings of EUROSPEECH, page (4 pages), 2005.

This paper describes the 3.5-year effort put into building LVCSR systems for recognition of spontaneous speech of Czech, Russian, and Slovak witnesses of the Holocaust in the MALACH project. For processing of colloquial, highly emotional and heavily accented speech of elderly people containing many non-speech events we have developed techniques that very effectively handle both non-speech events and colloquial and accented variants of uttered words. Manual transcripts as one of the main sources for language modeling were automatically “normalized” using a standardized lexicon, which brought about 2 to 3% reduction of the word error rate (WER). The subsequent interpolation of such LMs with models built from an additional collection (consisting of topically selected sentences from general text corpora) resulted in an additional improvement of performance of up to 3%.

[111]   Acoustic training from heterogeneous data sources: Experiments in Mandarin conversational telephone speech transcription. S. Tsakalidis and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing, page (4 pages), 2005.

In this paper we investigate the use of heterogeneous data sources for acoustic training. We describe an acoustic normalization procedure for enlarging an ASR acoustic training set with out-of-domain acoustic data. A larger in-domain training set is created by effectively transforming the out-of-domain data before incorporation in training. Baseline experimental results in Mandarin conversational telephone speech transcription show that a simple attempt to add out-of-domain data degrades performance. Preliminary experiments assess the effectiveness of the proposed cross-corpus acoustic normalization.

[112]   Lattice segmentation and support vector machines for large vocabulary continuous speech recognition. V. Venkataramani and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing, page (4 pages), 2005.

Lattice segmentation procedures are used to spot possible recognition errors in first-pass recognition hypotheses produced by a large vocabulary continuous speech recognition system. This approach is analyzed in terms of its ability to reliably identify, and provide good alternatives for, incorrectly hypothesized words. A procedure is described to train and apply Support Vector Machines to strengthen the first pass system where it was found to be weak, resulting in small but statistically significant recognition improvements on a large test set of conversational speech.

[113]   Convergence theorems for generalized alternating minimization procedures. A. Gunawardana and W. Byrne. Journal of Machine Learning Research, (6):2049–2073 (25 pages), December 2005. http://www.jmlr.org/papers/volume6/gunawardana05a/gunawardana05a.pdf.

The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-Step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-Steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under which these algorithms converge in likelihood. For variational EM, we show how the E-Step approximation prevents convergence to local maxima in likelihood.

[114]   Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation. V. Doumpiotis, S. Tsakalidis, and W. Byrne. IEEE Transactions on Speech and Audio Processing, 13(3):367–376 (10 pages), May 2005. http://dx.doi.org/10.1109/TSA.2005.845806.

Linear transforms have been used extensively for training and adaptation of HMM-based ASR systems. Recently procedures have been developed for the estimation of linear transforms under the Maximum Mutual Information (MMI) criterion. In this paper we introduce discriminative training procedures that employ linear transforms for feature normalization and for speaker adaptive training. We integrate these discriminative linear transforms into MMI estimation of HMM parameters for improvement of large vocabulary conversational speech recognition systems.

[115]   Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition. W. Byrne. In Proceedings of the ATR Workshop “Beyond HMMs”, page (6 pages), Kyoto, Japan, 2004.

Minimum risk estimation and decoding strategies based on lattice segmentation techniques can be used to refine large vocabulary continuous speech recognition systems through the estimation of the parameters of the underlying hidden Markov models and through the identification of smaller recognition tasks which provides the opportunity to incorporate novel modeling and decoding procedures in LVCSR. These techniques are discussed in the context of going beyond HMMs.

[116]   Pinched lattice minimum Bayes risk discriminative training for large vocabulary continuous speech recognition. V. Doumpiotis and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, page (4 pages), 2004.

Iterative estimation procedures that minimize empirical risk based on general loss functions such as the Levenshtein distance have been derived as extensions of the Extended Baum Welch algorithm. While reducing expected loss on training data is a desirable training criterion, these algorithms can be difficult to apply. They are unlike MMI estimation in that they require an explicit listing of the hypotheses to be considered and in complex problems such lists tend to be prohibitively large. To overcome this difficulty, modeling techniques originally developed to improve search efficiency in Minimum Bayes Risk decoding can be used to transform these estimation algorithms so that exact update, risk minimization procedures can be used for complex recognition problems. Experimental results in two large vocabulary speech recognition tasks show improvements over conventionally trained MMIE models.

[117]   Minimum Bayes-risk decoding for statistical machine translation. S. Kumar and W. Byrne. In Proceedings of HLT-NAACL, pages 169–176 (8 pages), 2004. http://www.aclweb.org/anthology/N/N04/N04-1022.pdf.

We present Minimum Bayes-Risk (MBR) decoding for statistical machine translation. This statistical approach aims to minimize expected loss of translation errors under loss functions that measure translation performance. We describe a hierarchy of loss functions that incorporate different levels of linguistic information from word strings, word-to-word alignments from an MT system, and syntactic structure from parse-trees of source and target language sentences. We report the performance of the MBR decoders on a Chinese-to-English translation task. Our results show that MBR decoding can be used to tune statistical MT performance for specific loss functions.
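As an illustration of the decision rule described in this abstract, the following is a minimal sketch of MBR decoding over an N-best list: it selects the hypothesis with the lowest expected loss under the posterior distribution. The function names, the toy N-best list, and the use of word-level edit distance as the loss are assumptions made for this example; the paper's own loss functions (based on alignments and parse trees) are not reproduced here.

```python
from typing import Callable, Sequence, Tuple

Hyp = Tuple[str, ...]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance (unit costs), used here as the loss."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def mbr_decode(hyps: Sequence[Hyp],
               posteriors: Sequence[float],
               loss: Callable[[Hyp, Hyp], float] = edit_distance) -> Hyp:
    """Return the hypothesis minimizing expected loss over the N-best list.

    The N-best list serves both as the candidate set and as the evidence
    space; posteriors are renormalized over the list.
    """
    total = sum(posteriors)

    def expected_loss(cand: Hyp) -> float:
        return sum(p / total * loss(cand, other)
                   for other, p in zip(hyps, posteriors))

    return min(hyps, key=expected_loss)


# Toy (hypothetical) N-best list: the MAP hypothesis is the first entry,
# but the second has the lowest expected edit distance to the others.
nbest = [("the", "cat", "sat"), ("a", "cat", "sat"), ("a", "cat", "sits")]
post = [0.40, 0.35, 0.25]
print(mbr_decode(nbest, post))
```

In this toy case the MBR choice differs from the highest-posterior hypothesis, which is the situation in which the choice of loss function matters.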

[118]   Slavic languages in the MALACH project. J. Psutka, J. Hajic, and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 2004. Invited Paper in Special Session on Multilingual Speech Processing (4 pages).

The development of acoustic training material for Slavic languages within the MALACH project is described. Initial experience with the variety of speakers and the difficulties encountered in transcribing Czech, Slovak, and Russian language oral history are described along with ASR recognition results intended to investigate the effectiveness of different transcription conventions that address language specific phenomena within the task domain.

[119]   Issues in annotation of the Czech spontaneous speech corpus in the MALACH project. J. Psutka, P. Ircing, J. Hajic, V. Radova, J.V. Psutka, W. Byrne, and S. Gustman. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), page (4 pages), 2004.

The paper presents the issues encountered in processing spontaneous Czech speech in the MALACH project. Specific problems connected with a frequent occurrence of colloquial words in spontaneous Czech are analyzed; a partial solution is proposed and experimentally evaluated.

[120]   Task-specific minimum Bayes-risk decoding using learned edit distance. I. Shafran and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, page (4 pages), 2004.

This paper extends the minimum Bayes-risk framework to incorporate a loss function specific to the task and the ASR system. The errors are modeled as a noisy channel and the parameters are learned from the data. The resulting loss function is used in the risk criterion for decoding. Experiments on a large vocabulary conversational speech recognition system demonstrate significant gains of about 1% over MAP hypothesis and about 0.6% [...] approach is general enough to be applicable to other sequence recognition problems such as in Optical Character Recognition (OCR) and in analysis of biological sequences.
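To make the idea of a task-specific loss concrete, here is a minimal sketch of an edit distance whose substitution, insertion, and deletion costs are looked up in tables rather than fixed at one. The function name, the table layout, and the cost values are hypothetical; in the setting the abstract describes, such costs would be estimated from data, and that estimation step is not shown.

```python
from typing import Dict, Optional, Sequence, Tuple


def weighted_edit_distance(
    ref: Sequence[str],
    hyp: Sequence[str],
    sub_cost: Optional[Dict[Tuple[str, str], float]] = None,
    ins_cost: Optional[Dict[str, float]] = None,
    del_cost: Optional[Dict[str, float]] = None,
    default: float = 1.0,
) -> float:
    """Edit distance with per-word costs; unlisted edits cost `default`."""
    sub_cost = sub_cost or {}
    ins_cost = ins_cost or {}
    del_cost = del_cost or {}

    # First row: build the hypothesis prefix by insertions only.
    prev = [0.0]
    for h in hyp:
        prev.append(prev[-1] + ins_cost.get(h, default))

    for r in ref:
        curr = [prev[0] + del_cost.get(r, default)]
        for j, h in enumerate(hyp, start=1):
            sub = 0.0 if r == h else sub_cost.get((r, h), default)
            curr.append(min(prev[j] + del_cost.get(r, default),      # delete r
                            curr[j - 1] + ins_cost.get(h, default),  # insert h
                            prev[j - 1] + sub))                      # substitute
        prev = curr
    return prev[-1]


# Hypothetical cost table: confusing "the" with "a" is penalized lightly.
costs = {("the", "a"): 0.3}
print(weighted_edit_distance(["the", "cat"], ["a", "cat"], sub_cost=costs))
print(weighted_edit_distance(["the", "cat"], ["a", "cat"]))
```

With empty tables the function reduces to the ordinary unit-cost Levenshtein distance, so it can stand in for the loss in an MBR decoder without other changes.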

[121]   Automatic recognition of spontaneous speech for access to multilingual oral history archives. W. Byrne, D. Doermann, M. Franz, S. Gustman, J. Hajič, D. Oard, M. Picheny, J. Psutka, B. Ramabhadran, D. Soergel, T. Ward, and W.-J. Zhu. IEEE Transactions on Speech and Audio Processing, Special Issue on Spontaneous Speech Processing, pages 420–435 (16 pages), July 2004. http://dx.doi.org/10.1109/TSA.2004.828702.

The MALACH project has the goal of developing the technologies needed to facilitate access to large collections of spontaneous speech. Its aim is to dramatically improve the state of the art in key Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) technologies for use in large-scale retrieval systems. The project leverages a unique collection of oral history interviews with survivors of the Holocaust that has been assembled and extensively annotated by the Survivors of the Shoah Visual History Foundation. This paper describes the collection, 116,000 hours of interviews in 32 languages, and the way in which system requirements have been discerned through user studies. It discusses ASR methods for very difficult speech (heavily accented, emotional, and elderly spontaneous speech), including transcription to create training data and methods for language modeling and speaker adaptation. Results are presented for English and Czech. NLP results are presented for named entity tagging, topic segmentation, and supervised topic classification, and the architecture of an integrated search system that uses these results is described.

[122]   Segmental minimum Bayes-risk decoding for automatic speech recognition. V. Goel, S. Kumar, and W. Byrne. IEEE Transactions on Speech and Audio Processing, 12:234–249 (16 pages), May 2004. http://dx.doi.org/10.1109/TSA.2004.825678.

Minimum Bayes-Risk (MBR) speech recognizers have been shown to yield improvements over the conventional maximum a-posteriori probability (MAP) decoders through N-best list rescoring and A-star search over word lattices. We present a Segmental Minimum Bayes-Risk decoding (SMBR) framework that simplifies the implementation of MBR recognizers through the segmentation of the N-best lists or lattices over which the recognition is to be performed. This paper presents lattice cutting procedures that underlie SMBR decoding. Two of these procedures are based on a risk minimization criterion while a third one is guided by word-level confidence scores. In conjunction with SMBR decoding, these lattice segmentation procedures give consistent improvements in recognition word error rate (WER) on the Switchboard corpus. We also discuss an application of risk-based lattice cutting to multiple-system SMBR decoding and show that it is related to other system combination techniques such as ROVER. This strategy combines lattices produced from multiple ASR systems and is found to give WER improvements in a Switchboard evaluation system.
Correction available: In our recently published paper, we presented a risk-based lattice cutting procedure to segment ASR word lattices into smaller sub-lattices as a means to improve the efficiency of Minimum Bayes-Risk (MBR) rescoring. In the experiments reported, some of the hypotheses in the original lattices were inadvertently discarded during segmentation, and this affected MBR performance adversely. This note gives the corrected results as well as experiments demonstrating that the segmentation process does not discard any paths from the original lattice.

[123]   The Johns Hopkins University 2003 Chinese-English Machine Translation System. W. Byrne, S. Khudanpur, W. Kim, S. Kumar, P. Pecina, P. Virga, P. Xu, and D. Yarowsky. In Machine Translation Summit IX, page (4 pages). The Association for Machine Translation in the Americas, 2003.

We describe a Chinese to English Machine Translation system developed at the Johns Hopkins University for the NIST 2003 MT evaluations. The system is based on a Weighted Finite State Transducer implementation of the alignment template translation model for statistical machine translation. The baseline MT system was trained using 100,000 sentence pairs selected from a static bitext training collection. Information retrieval techniques were then used to create specific training collections for each document to be translated. This document-specific training set included bitext and named entities that were then added to the baseline system by augmenting the library of alignment templates. We report translation performance of baseline and IR-based systems on two NIST MT evaluation test sets.

[124]   Discriminative training for segmental minimum Bayes-risk decoding. V. Doumpiotis, S. Tsakalidis, and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing, page (4 pages). IEEE, 2003.

A modeling approach is presented that incorporates discriminative training procedures within segmental Minimum Bayes-Risk decoding (SMBR). SMBR is used to segment lattices produced by a general automatic speech recognition (ASR) system into sequences of separate decision problems involving small sets of confusable words. Acoustic models specialized to discriminate between the competing words in these classes are then applied in subsequent SMBR rescoring passes. Refinement of the search space that allows the use of specialized discriminative models is shown to be an improvement over rescoring with conventionally trained discriminative models.

[125]   Lattice segmentation and minimum Bayes risk discriminative training. V. Doumpiotis, S. Tsakalidis, and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), page (4 pages), 2003.

Modeling approaches are presented that incorporate discriminative training procedures in segmental Minimum Bayes-Risk decoding (SMBR). SMBR is used to segment lattices produced by a general automatic speech recognition (ASR) system into sequences of separate decision problems involving small sets of confusable words. We discuss two approaches to incorporating these segmented lattices in discriminative training. We investigate the use of acoustic models specialized to discriminate between the competing words in these classes which are then applied in subsequent SMBR rescoring passes. Refinement of the search space that allows the use of specialized discriminative models is shown to be an improvement over rescoring with conventionally trained discriminative models.

[126]   Minimum Bayes-risk automatic speech recognition. V. Goel and W. Byrne. In W. Chou and B.-H. Juang, editors, Pattern Recognition in Speech and Language Processing, pages 51–77 (27 pages). CRC Press, 2003.

[127]   Issues in recognition of Spanish-accented spontaneous English. A. Ikeno, B. Pellom, D. Cer, A. Thornton, J. M. Brenier, D. Jurafsky, W. Ward, and W. Byrne. In Proceedings of the ISCA and IEEE workshop on Spontaneous Speech Processing and Recognition, page (4 pages), Tokyo Institute of Technology, Tokyo, Japan, 2003. ISCA and IEEE.

We describe a recognition experiment and two analytic experiments on a database of strongly Hispanic-accented English. We show the crucial importance of training on the Hispanic-accented data for acoustic model performance, and describe the tendency of Spanish-accented speakers to use longer, and presumably less-reduced, schwa vowels than native-English speakers.

[128]   A generative probabilistic OCR model for NLP applications. O. Kolak, W. Byrne, and P. Resnik. In Proceedings of HLT-NAACL, pages 55–62 (8 pages), 2003. http://www.aclweb.org/anthology/N/N03/N03-1018.pdf.

In this paper we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make them more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model’s ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.

[129]   A weighted finite state transducer implementation of the alignment template model for statistical machine translation. S. Kumar and W. Byrne. In Proceedings of HLT-NAACL, pages 63–70 (8 pages), 2003. http://www.aclweb.org/anthology/N/N03/N03-1019.pdf.

We present a derivation of the alignment template model for statistical machine translation and an implementation of the model using weighted finite state transducers. The approach we describe allows us to implement each constituent distribution of the model as a weighted finite state transducer or acceptor. We show that bitext word alignment and translation under the model can be performed with standard FSM operations involving these transducers. One of the benefits of using this framework is that it obviates the need to develop specialized search procedures, even for the generation of lattices or N-Best lists of bitext word alignments and translation hypotheses. We evaluate the implementation of the model on the French-to-English Hansards task and report alignment and translation performance.

[130]   Desperately seeking Cebuano. D. Oard, D. Doermann, B. Dorr, D. He, P. Resnik, W. Byrne, S. Khudanpur, D. Yarowsky, A. Leuski, P. Koehn, and K. Knight. In Proceedings of HLT-NAACL, page (3 pages), 2003.

This paper describes an effort to rapidly develop language resources and component technology to support searching Cebuano news stories using English queries. Results from the first 60 hours of the exercise are presented.

[131]   Building LVCSR systems for transcription of spontaneously produced Russian witnesses in the MALACH project: initial steps and first results. J. Psutka, I. Iljuchin, P. Ircing, J.V. Psutka, V. Trejbal, W. Byrne, J. Hajic, and S. Gustman. In Proceedings of the Text, Speech, and Dialog Workshop, pages 214–219 (6 pages), 2003.

The MALACH project uses the world’s largest digital archive of video oral histories collected by the Survivors of the Shoah Visual History Foundation (VHF) and attempts to access such archives by advancing the state-of-the-art in Automatic Speech Recognition and Information Retrieval. This paper discusses the initial steps and first results in building large vocabulary continuous speech recognition (LVCSR) systems for the transcription of Russian witnesses. As the third language processed in the MALACH project (following English and Czech), Russian has posed new ASR challenges, especially in phonetic modeling. Although most of the Russian testimonies were provided by native Russian survivors, the speakers come from many different regions and countries resulting in a diverse collection of accented spontaneous Russian speech.

[132]   Towards automatic transcription of spontaneous Czech speech in the MALACH project. J. Psutka, P. Ircing, J. V. Psutka, V. Radova, W. Byrne, J. Hajic, and S. Gustman. In Proceedings of the Text, Speech, and Dialog Workshop, pages 327–332 (6 pages), 2003.

Our paper discusses the progress achieved during a one-year effort in building the Czech LVCSR system for the automatic transcription of spontaneously produced testimonies in the MALACH project. The difficulty of this task stems from the highly inflectional nature of the Czech language and is further multiplied by the presence of many colloquial words in spontaneous Czech speech as well as by the need to handle emotional speech filled with disfluencies, heavy accents, age-related coarticulation and language switching. In this paper we concentrate mainly on the acoustic modeling issues: the proper choice of front-end parameterization, the handling of non-speech events in acoustic modeling, and unsupervised acoustic adaptation via MLLR. A method for selecting suitable language modeling data is also briefly discussed.

[133]   Large vocabulary ASR for spontaneous Czech in the MALACH project. J. Psutka, P. Ircing, J.V. Psutka, V. Radova, W. Byrne, J. Hajic, J. Mirovsky, and S. Gustman. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), page (4 pages), 2003.

This paper describes LVCSR research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. This project attempts to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automated speech recognition. We describe a baseline ASR system and discuss the problems in language modeling that arise from the nature of Czech as a highly inflectional language that also exhibits diglossia between its written and spontaneous forms. The difficulties of this task are compounded by heavily accented, emotional and disfluent speech along with frequent switching between languages. To overcome the limited amount of relevant language model data we use statistical techniques for selecting an appropriate training corpus from a large unstructured text collection resulting in significant reductions in word error rate. [...] recognition and retrieval techniques to improve cataloging efficiency and eventually to provide direct access into the archive itself.

[134]   Support vector machines for segmental minimum Bayes risk decoding of continuous speech. V. Venkataramani, S. Chakrabartty, and W. Byrne. In IEEE Automatic Speech Recognition and Understanding Workshop, page (6 pages), 2003.

Segmental Minimum Bayes Risk (SMBR) Decoding involves the refinement of the search space into manageable confusion sets, i.e., smaller sets of confusable words. We describe the application of Support Vector Machines (SVMs) as discriminative models for the refined search space. We show that SVMs, which in their basic formulation are binary classifiers of fixed dimensional observations, can be used for continuous speech recognition. We also study the use of GiniSVMs, which are a variant of the basic SVM. On a small vocabulary task, we show this two-pass scheme outperforms MMI trained HMMs. Using system combination we also obtain further improvements over discriminatively trained HMMs.

[135]   Supporting access to large digital oral history archives. S. Gustman, D. Soergel, D. Oard, W. Byrne, M. Picheny, B. Ramabhadran, and D. Greenberg. In Proceedings of the Joint Conference on Digital Libraries, page (10 pages), 2002.

This paper describes our experience with the creation, indexing, and provision of access to a very large archive of videotaped oral histories - 116,000 hours of digitized interviews in 32 languages from 52,000 survivors, liberators, rescuers, and witnesses of the Nazi Holocaust. It goes on to identify a set of critical research issues that must be addressed if we are to provide full and detailed access to collections of this size: issues in user requirement studies, automatic speech recognition, automatic classification, segmentation, summarization, retrieval, and user interfaces. The paper ends by inviting others to discuss use of these materials in their own research.

[136]   Minimum Bayes-risk alignment of bilingual texts. S. Kumar and W. Byrne. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 140–147 (8 pages), Philadelphia, PA, USA, 2002. http://www.aclweb.org/anthology/W/W02/W02-1019.pdf.

We present Minimum Bayes-Risk word alignment for machine translation. This statistical, model-based approach attempts to minimize the expected risk of alignment errors under loss functions that measure alignment quality. We describe various loss functions, including some that incorporate linguistic analysis as can be obtained from parse trees, and show that these approaches can improve alignments of the English-French Hansards.

[137]   Risk based lattice cutting for segmental minimum Bayes-risk decoding. S. Kumar and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, page (4 pages), Denver, Colorado, USA, 2002.

Minimum Bayes-Risk (MBR) speech recognizers have been shown to give improvements over the conventional maximum a-posteriori probability (MAP) decoders through N-best list rescoring and A-star search over word lattices. Segmental MBR (SMBR) decoders simplify the implementation of MBR recognizers by segmenting the N-best lists or lattices over which the recognition is performed. We present a lattice cutting procedure that attempts to minimize the total Bayes-Risk of all word strings in the segmented lattice. We provide experimental results on the Switchboard conversational speech corpus showing that this segmentation procedure, in conjunction with SMBR decoding, gives modest but significant improvements over MAP decoders as well as MBR decoders on unsegmented lattices.

[138]   Cross-language access to recorded speech in the MALACH project. D. Oard, D. Demner-Fushman, J. Hajic, B. Ramabhadran, S. Gustman, W. Byrne, D. Soergel, B. Dorr, P. Resnik, and M. Picheny. In Proceedings of the Text, Speech, and Dialog Workshop, page (8 pages), 2002.

The MALACH project seeks to help users find information in a vast multilingual collection of untranscribed oral history interviews. This paper introduces the goals of the project and focuses on supporting access by users who are unfamiliar with the interview language. It begins with a review of the state of the art in cross-language speech retrieval; approaches that will be investigated in the project are then described. Czech was selected as the first non-English language to be supported; results of an initial experiment with Czech/English cross-language retrieval are reported.

[139]   Automatic transcription of Czech language oral history in the MALACH project: Resources and initial experiments. J. Psutka, P. Ircing, J. Psutka, V. Radova, W. Byrne, J. Hajic, S. Gustman, and B. Ramabhadran. In Proceedings of the Text, Speech, and Dialog Workshop, page (8 pages), 2002.

In this paper we describe the initial stages of the ASR component of the MALACH project. This project will attempt to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation by advancing the state of the art in automated speech recognition. In order to train the ASR system, it is necessary to manually transcribe a large amount of speech data, identify the appropriate vocabulary, and obtain relevant text for language modeling. We give a detailed description of the speech annotation process; show the specific properties of the spontaneous speech contained in the archives; and present baseline speech recognition results.

[140]   Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation. S. Tsakalidis, V. Doumpiotis, and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, page (5 pages), Denver, Colorado, USA, 2002.

Linear transforms have been used extensively for training and adaptation of HMM-based ASR systems. Recently procedures have been developed for the estimation of linear transforms under the Maximum Mutual Information (MMI) criterion. In this paper we introduce discriminative training procedures that employ linear transforms for feature normalization and for speaker adaptive training. We integrate these discriminative linear transforms into MMI estimation of HMM parameters for improvement of large vocabulary conversational speech recognition systems.

[141]   Lexicon adaptation for LVCSR: speaker idiosyncracies, non-native speakers, and pronunciation choice. W. Ward, H. Krech, X. Yu, K. Herold, G. Figgs, A. Ikeno, D. Jurafsky, and W. Byrne. In ISCA ITR Workshop on Pronunciation Modeling and Lexicon Adaptation, page (4 pages), 2002.

We report on our preliminary experiments on building dynamic lexicons for native-speaker conversational speech and for foreign-accented conversational speech. Our goal is to build a lexicon with a set of pronunciations for each word, in which the probability distribution over pronunciations is dynamically computed. The set of pronunciations is derived from hand-written rules (for foreign accent) or clustering (for phonetically-transcribed Switchboard data). The dynamic pronunciation-probability will take into account specific characteristics of the speaker as well as factors such as language-model probability, disfluencies, sentence position, and phonetic context.

[142]   Mandarin pronunciation modeling based on the CASS corpus. F. Zheng, Z. Song, P. Fung, and W. Byrne. Journal of Computer Science and Technology (Science Press, Beijing, China), 17(3), May 2002. (16 pages).

Pronunciation variability is an important issue that must be faced when developing practical automatic spontaneous speech recognition systems. In this paper, the factors that may affect the recognition performance are analyzed, including those specific to the Chinese language. By studying the INITIAL/FINAL (IF) characteristics of Chinese language and developing the Bayesian equation, we propose the concepts of generalized INITIAL/FINAL (GIF) and generalized syllable (GS), the GIF modeling and the IF-GIF modeling, as well as the context-dependent pronunciation weighting, based on a well phonetically transcribed seed database. By using these methods, the Chinese syllable error rate (SER) was reduced by 6.3% and 4.2% compared with the GIF modeling and IF modeling respectively when the language model, such as syllable or word N-gram, is not used. The effectiveness of these methods is also proved when more data without the phonetic transcription is used to refine the acoustic model using the proposed iterative force-alignment based transcribing (IFABT) method, achieving a 5.7% SER reduction.

[143]   Automatic generation of pronunciation lexicons for Mandarin casual speech. W. Byrne, V. Venkataramani, T. Kamm, T.F. Zheng, Z. Song, P. Fung, Y. Lui, and U. Ruhi. In IEEE Conference on Acoustics, Speech and Signal Processing, volume 1, pages 569–572 (4 pages), Salt Lake City, Utah, 2001. IEEE.

Pronunciation modeling for large vocabulary speech recognition attempts to improve recognition accuracy by identifying and modeling pronunciations that are not in the ASR system's pronunciation lexicon. Pronunciation variability in spontaneous Mandarin is studied using the newly created CASS corpus of phonetically annotated spontaneous speech. Pronunciation modeling techniques developed in English are applied to this corpus to train pronunciation models which are then applied in Mandarin Broadcast News transcription.

[144]   Confidence based lattice segmentation and minimum Bayes-risk decoding. V. Goel, S. Kumar, and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), volume 4, pages 2569–2572 (4 pages), Aalborg, Denmark, 2001.

Minimum Bayes Risk (MBR) speech recognizers have been shown to yield improvements over the conventional maximum a-posteriori probability (MAP) decoders in the context of N-best list rescoring and A-star search over recognition lattices. Segmental MBR (SMBR) procedures have been developed to simplify implementation of MBR recognizers, by segmenting the N-best list or lattice, to reduce the size of the search space over which MBR recognition is carried out. In this paper we describe lattice cutting as a method to segment recognition word lattices into regions of low confidence and high confidence. We present two SMBR decoding procedures that can be applied on low confidence segment sets. Results obtained on the Switchboard conversational telephone speech corpus show modest but significant improvements relative to MAP decoders.

[145]   Convergence of DLLR rapid speaker adaptation algorithms. A. Gunawardana and W. Byrne. In ISCA ITR-Workshop on Adaptation Methods for Automatic Speech Recognition, page (4 pages), 2001.

Discounted Likelihood Linear Regression (DLLR) is a speaker adaptation technique for cases where there is insufficient data for MLLR adaptation. Here, we provide an alternative derivation of DLLR by using a censored EM formulation which postulates additional adaptation data which is hidden. This derivation shows that DLLR, if allowed to converge, provides maximum likelihood solutions. Thus the robustness of DLLR to small amounts of data is obtained by slowing down the convergence of the algorithm and by allowing termination of the algorithm before overtraining occurs. We then show that discounting the observed adaptation data by postulating additional hidden data can also be extended to MAP estimation of MLLR-type adaptation transformations.

[146]   Discriminative speaker adaptation with conditional maximum likelihood linear regression. A. Gunawardana and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), page (4 pages), 2001.

We present a simplified derivation of the extended Baum-Welch procedure, which shows that it can be used for Maximum Mutual Information (MMI) estimation of a large class of continuous emission density hidden Markov models (HMMs). We use the extended Baum-Welch procedure for discriminative estimation of MLLR-type speaker adaptation transformations. The resulting adaptation procedure, termed Conditional Maximum Likelihood Linear Regression (CMLLR), is used successfully for supervised and unsupervised adaptation tasks on the Switchboard corpus, yielding an improvement over MLLR. The interaction of unsupervised CMLLR with segmental minimum Bayes risk lattice voting procedures is also explored, showing that the two procedures are complementary.

[147]   On large vocabulary continuous speech recognition of highly inflectional language - Czech. P. Ircing, P. Krebc, J. Hajic, S. Khudanpur, F. Jelinek, J. Psutka, and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), page (4 pages), 2001.

[148]   MLLR adaptation techniques for pronunciation modeling. V. Venkataramani and W. Byrne. In IEEE Workshop on Automatic Speech Recognition and Understanding, page (4 pages), Madonna di Campiglio, Italy, 2001.

Multiple regression class MLLR transforms are investigated for use with pronunciation models that predict variation in the observed pronunciations given the phonetic context. Regression classes can be constructed so that MLLR transforms can be estimated and used to model specific acoustic changes associated with pronunciation variation. The effectiveness of this modeling approach is evaluated on the phonetically transcribed portion of the SWITCHBOARD conversational speech corpus.

[149]   Modeling pronunciation variation using context-dependent weighting and B/S refined acoustic modeling. F. Zheng, Z. Song, P. Fung, and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), page (4 pages), 2001.

Pronunciation variability is an important issue that must be faced when developing practical automatic spontaneous speech recognition systems. By studying the initial/final (IF) characteristics of the Chinese language and developing the Bayesian equation, we propose the concepts of generalized initial/final (GIF) and generalized syllable (GS), the GIF modeling method and the IF-GIF modeling method, as well as the context-dependent pronunciation weighting method. By using these approaches, the IF-GIF modeling reduces the Chinese syllable error rate (SER) by 6.3% and 4.2% compared with the GIF modeling and IF modeling respectively when language modeling, such as syllable or word N-gram, is not used.

[150]   Discounted likelihood linear regression for rapid speaker adaptation. A. Gunawardana and W. Byrne. Computer Speech and Language, 15(1):15–38 (24 pages), Jan 2001. http://dx.doi.org/10.1006/csla.2000.0151.

The widely used maximum likelihood linear regression speaker adaptation procedure suffers from overtraining when used for rapid adaptation tasks in which the amount of adaptation data is severely limited. This is a well known difficulty associated with the expectation maximization algorithm. We use an information geometric analysis of the expectation maximization algorithm as an alternating minimization of a Kullback-Leibler-type divergence to see the cause of this difficulty, and propose a more robust discounted likelihood estimation procedure. This gives rise to a discounted likelihood linear regression procedure, which is a variant of maximum likelihood linear regression suited for small adaptation sets. Our procedure is evaluated on an unsupervised rapid adaptation task defined on the Switchboard conversational telephone speech corpus, where our proposed procedure improves word error rate by 1.6% (absolute) with as little as five seconds of adaptation data, which is a situation in which maximum likelihood linear regression overtrains in the first iteration of adaptation. We compare several realizations of discounted likelihood linear regression with maximum likelihood linear regression and other simple maximum likelihood linear regression variants, and discuss issues that arise in implementing our discounted likelihood procedures.

[151]   Towards language independent acoustic modeling. W. Byrne, P. Beyerlein, J. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and W. Wang. In IEEE Conference on Acoustics, Speech and Signal Processing, pages 1029–1032 (4 pages), Istanbul, Turkey, 2000. IEEE.

We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge-based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and Discriminative Model Combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.

[152]   Morpheme based language models for speech recognition of Czech. W. Byrne, Jan Hajic, Pavel Krbec, Pavel Ircing, and Josef Psutka. In TDS ’00: Proceedings of the Third International Workshop on Text, Speech and Dialogue, pages 211–216 (6 pages), London, UK, 2000. Springer-Verlag.

[153]   Minimum Bayes-Risk automatic speech recognition. V. Goel and W. Byrne. Computer Speech and Language, 14(2):115–135 (21 pages), 2000. http://dx.doi.org/10.1006/csla.2000.0138.

In this paper we address the problem of efficient implementation of the minimum Bayes-risk classifiers for automatic speech recognition. Simplifying assumptions that allow computationally feasible approximations to these classifiers are proposed. Under these assumptions an approximate implementation as an A-star search algorithm over recognition lattices is constructed. This algorithm improves upon the previously proposed N-best list rescoring implementation of these classifiers. The minimum Bayes-risk classifiers are shown to outperform the most commonly used maximum a-posteriori probability (MAP) classifier on three speech recognition tasks: reduction of word error rate, reduction of content word error rate, and identification of Named Entities in speech. The A-star implementation is also contrasted with the N-best list rescoring implementation and is found to obtain modest but significant improvements in accuracy with little computational overhead.

[154]   Segmental minimum Bayes-risk ASR voting strategies. V. Goel, S. Kumar, and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, volume 3, pages 139–142 (4 pages), Beijing, China, 2000.

ROVER and its successor voting procedures have been shown to be quite effective in reducing the recognition word error rate (WER). The success of these methods has been attributed to their minimum Bayes-risk (MBR) nature: they produce the hypothesis with the least expected word error. In this paper we develop a general procedure within the MBR framework, called segmental MBR recognition, that encompasses current voting techniques and allows further extensions that yield lower expected WER. It also allows incorporation of loss functions other than the WER. We present a derivation of the voting procedure of N-best ROVER as an instance of segmental MBR recognition. We then present an extension, called e-ROVER, that alleviates some of the restrictions of N-best ROVER by better approximating the WER. e-ROVER is compared with N-best ROVER on a multi-lingual acoustic modeling task and is shown to yield modest yet significant and easily obtained improvements.

[155]   Robust estimation for rapid adaptation using discounted likelihood techniques. A. Gunawardana and W. Byrne. In International Conference on Acoustics, Speech, and Signal Processing, page (4 pages). IEEE, 2000.

The discounted likelihood procedure, which is a robust extension of the usual EM procedure, is presented, and two approximations which lead to two different variants of the usual MLLR adaptation scheme are introduced. These schemes are shown to robustly estimate speaker adaptation transforms with very little data. The evaluation is carried out on the Switchboard corpus.

[156]   CASS: A phonetically transcribed corpus of Mandarin spontaneous speech. A. Li, F. Zheng, W. Byrne, P. Fung, T. Kamm, Y. Liu, Z. Song, U. Ruhi, V. Venkataramani, and X. Chen. In Proc. of the International Conference on Spoken Language Processing, page (4 pages), 2000.

A corpus of Chinese spoken language has been collected and phonetically annotated to capture spontaneous speech and language effects. The Chinese Annotated Spontaneous Speech (CASS) corpus contains phonetically transcribed spontaneous speech. This corpus was created to begin to collect samples of most of the phonetic variations in Mandarin spontaneous speech due to pronunciation effects, including allophonic changes, phoneme reduction, phoneme deletion and insertion, as well as duration changes. It is intended for use in pronunciation modeling for improved automatic speech recognition and will be used at the 2000 Johns Hopkins University Language Engineering Workshop by the project on Pronunciation Modeling of Mandarin Casual Speech.

[157]   On the incremental addition of regression classes for speaker adaptation. J. McDonough and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing, page (4 pages). IEEE, 2000.

[158]   Minimum risk acoustic clustering for multilingual acoustic model combination. D. Vergyri, S. Tsakalidis, and W. Byrne. In International Conference on Spoken Language Processing, page (4 pages), 2000.

In this paper we describe procedures for combining multiple acoustic models, obtained using training corpora from different languages, in order to improve ASR performance in languages for which large amounts of training data are not available. We treat these models as multiple sources of information whose scores are combined in a log-linear model to compute the hypothesis likelihood. The model combination can either be performed in a static way, with constant combination weights, or in a dynamic way, with parameters that can vary for different segments of a hypothesis. The aim is to optimize the parameters so as to achieve minimum word error rate. In order to achieve robust parameter estimation in the dynamic combination case, the parameters are defined to be piecewise constant on different phonetic classes that form a partition of the space of hypothesis segments. The partition is defined, using phonological knowledge, on segments that correspond to hypothesized phones. We examine different ways to define such a partition, including an automatic approach that gives a binary tree structured partition which tries to achieve the minimum WER with the minimum number of classes.

[159]   Comments on ’Efficient training algorithms for HMM’s using incremental estimation’. W. Byrne and A. Gunawardana. IEEE Transactions on Speech and Audio Processing, 8(6):751–754 (4 pages), Nov 2000. http://dx.doi.org/10.1109/89.876315.

“Efficient Training Algorithms for HMM’s using Incremental Estimation” investigates EM procedures that increase training speed. The authors’ claim that these are GEM procedures is incorrect. We discuss why this is so, provide an example of non-monotonic convergence to a local maximum in likelihood, and outline conditions that guarantee such convergence.

[160]   Towards language independent acoustic modeling. W. Byrne, P. Beyerlein, J. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and W. Wang. In IEEE Workshop on Automatic Speech Recognition and Understanding, page (4 pages), Keystone, Colorado, 1999.

We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and Discriminative Model Combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.

[161]   Convergence of EM variants. W. Byrne and A. Gunawardana. In IEEE Information Theory Workshop on Detection, Estimation, Classification, and Imaging, page 64 (1 page), 1999.

[162]   Discounted likelihood linear regression for rapid adaptation. W. Byrne and A. Gunawardana. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), page (4 pages), 1999.

Rapid adaptation schemes that employ the EM algorithm may suffer from overtraining problems when used with small amounts of adaptation data. An algorithm to alleviate this problem is derived within the information geometric framework of Csiszár and Tusnády, and is used to improve MLLR adaptation on NAB and Switchboard adaptation tasks. It is shown how this algorithm approximately optimizes a discounted likelihood criterion.

[163]   Large vocabulary speech recognition for read and broadcast Czech. W. Byrne, J. Hajic, P. Ircing, F. Jelinek, S. Khudanpur, J. McDonough, N. Peterek, and J. Psutka. In Proceedings of the Text, Speech, and Dialog Workshop, page (6 pages), 1999.

We describe read speech and broadcast news corpora collected as part of a multi-year international collaboration for the development of large vocabulary speech recognition systems in the Czech language. Initial investigations into language modeling for Czech automatic speech recognition are described and preliminary recognition results on the read speech corpus are presented.

[164]   Rapid speech recognizer adaptation to new speakers. V. Digalakis, S. Berkowitz, E. Bochieri, C. Boulis, W. Byrne, H. Collier, A. Corduneanu, A. Kannan, S. Khudanpur, J. McDonough, and A. Sankar. In IEEE Conference on Acoustics, Speech and Signal Processing, page (4 pages). IEEE, 1999.

This paper summarizes the work of the “Rapid Speech Recognizer Adaptation” team in the workshop held at Johns Hopkins University in the summer of 1998. The project addressed the modeling of dependencies between units of speech with the goal of making more effective use of small amounts of data for speaker adaptation. A variety of methods were investigated and their effectiveness in a rapid adaptation task defined on the SWITCHBOARD conversational speech corpus is reported.

[165]   Task dependent loss functions in speech recognition: A-star search over recognition lattices. V. Goel and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), page (4 pages), 1999.

A recognition strategy that can be matched to specific system performance criteria has recently been found to yield improvements over the usual maximum a posteriori probability strategy. Some examples of different system performance criteria are word error rate (WER), F-measure for Named Entity extraction tasks, and word-specific errors for keyword spotting tasks. In the matched-to-the-task strategy the hypothesis is chosen to minimize the expected loss or the Bayes Risk under a loss function defined by the performance measure of interest. Due to the prohibitively expensive implementation of this strategy, only an approximate implementation as an N-best list rescoring scheme has been used so far. Our goal is to improve the performance of such risk-based decoders by developing search strategies that can incorporate more acoustic evidence. In this paper we present search algorithms to implement the risk-based recognition strategy over word lattices that contain acoustic and language model scores. These algorithms are extensions of the N-best list rescoring approximation and are formulated as A-star algorithms. We first present a single stack A-star search and show how to obtain an under-estimate and an over-estimate of the cost needed for the search. For loss functions that do not depend on time segmentation of hypotheses, a prefix-tree based simplification of the single stack algorithm is then derived. For yet a further subset of loss functions, including the usual Levenshtein distance based loss for WER reduction tasks, we describe a search organization that facilitates further efficiencies in computation and storage. Finally we present a path equivalence criterion for merging of prefix tree nodes during search to allow for a larger search space. We find that restricted loss functions yield the most efficient search procedures. However the general single stack search can be applied quite broadly even in principle to loss functions that measure semantic agreement between sentences. Preliminary experiments were performed for WER reduction task on the Switchboard corpus, dev-test set of the 1997 JHU-LVCSR workshop. We obtain an error rate reduction of 0.8-0.9% absolute over a baseline of 38.5% WER. The search speed is comparable to the N-best list rescoring procedure which is much more restrictive in the amount of hypotheses considered for search and produces slightly inferior results (0.5-0.6% absolute improvement). At the conference we will present the framework of task dependent recognition strategy, its implementation as A-star search, and the speed and accuracy comparison of the search with N-best list rescoring procedure.

[166]   Task dependent loss functions in speech recognition: Application to named entity extraction. V. Goel and W. Byrne. In ESCA-ETR Workshop on accessing information in spoken audio, page (4 pages), 1999.

[167]   Single-pass adapted training with all-pass transforms. J. McDonough and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), page (4 pages), 1999.

In recent work, the all-pass transform (APT) was proposed as the basis of a speaker adaptation scheme intended for use with a large vocabulary speech recognition system. It was shown that APT-based adaptation reduces to a linear transformation of cepstral means, much like the better known maximum likelihood linear regression (MLLR), but is specified by far fewer free parameters. Due to its linearity, APT-based adaptation can be used in conjunction with speaker-adapted training (SAT), an algorithm for performing maximum likelihood estimation of the parameters of an HMM when speaker adaptation is to be employed during both training and test. In this work, we propose a refinement of SAT called single-pass adapted training (SPAT) which achieves the same improvement in system performance as SAT but requires much less computation for HMM training. In a set of speech recognition experiments conducted on the Switchboard Corpus, we report a word error rate reduction of 5.3% absolute using a single, global APT.

[168]   Speaker adaptation with all-pass transforms. J. McDonough and W. Byrne. In International Conference on Acoustics, Speech, and Signal Processing, page (4 pages). IEEE, 1999.

In recent work, a class of transforms was proposed which achieves a remapping of the frequency axis much like conventional vocal tract length normalization. These mappings, known collectively as all-pass transforms (APT), were shown to produce substantial improvements in the performance of a large vocabulary speech recognition system when used to normalize incoming speech prior to recognition. In this application, the most advantageous characteristic of the APT was its cepstral-domain linearity; this linearity makes speaker normalization simple to implement, and provides for the robust estimation of the parameters characterizing individual speakers. In the current work, we exploit the APT to develop a speaker adaptation scheme in which the cepstral means of a speech recognition model are transformed to better match the speech of a given speaker. In a set of speech recognition experiments conducted on the Switchboard Corpus, we report reductions in word error rate of 3.7% absolute.

[169]   Stochastic pronunciation modeling from hand-labelled phonetic corpora. M. Riley, W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Saraclar, C. Wooters, and G. Zavaliagkos. Speech Communication, pages 109–116 (8 pages), November 1999. http://dx.doi.org/10.1016/S0167-6393(99)00037-0.

In the early ’90s, the availability of the TIMIT read-speech phonetically transcribed corpus led to work at AT&T on the automatic inference of pronunciation variation. This work, briefly summarized here, used stochastic decision trees trained on phonetic and linguistic features, and was applied to the DARPA North American Business News read-speech ASR task. More recently, the ICSI spontaneous-speech phonetically transcribed corpus was collected at the behest of the 1996 and 1997 LVCSR Summer Workshops held at Johns Hopkins University. A 1997 workshop (WS97) group focused on pronunciation inference from this corpus for application to the DoD Switchboard spontaneous telephone speech ASR task. We describe several approaches taken there. These include (1) one analogous to the AT&T approach, (2) one, inspired by work at WS96 and CMU, that involved adding pronunciation variants of a sequence of one or more words (‘multiwords’) in the corpus (with corpus-derived probabilities) into the ASR lexicon, and (1+2) a hybrid approach in which a decision-tree model was used to automatically phonetically transcribe a much larger speech corpus than ICSI and then the multiword approach was used to construct an ASR recognition pronunciation lexicon.

[170]   Stochastic pronunciation modeling from hand-labeled phonetic corpora. W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. In Proceedings of the Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, page (8 pages), 1998.

[171]   Pronunciation modelling using a hand-labelled corpus for conversational speech recognition. W. Byrne, M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. In IEEE International Conference on Acoustics, Speech and Signal Processing, page (4 pages). IEEE, 1998.

Accurately modelling pronunciation variability in conversational speech is an important component of an automatic speech recognition system. We describe some of the projects undertaken in this direction during and after WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in July-August, 1997. We first illustrate a use of hand-labelled phonetic transcriptions of a portion of the Switchboard corpus, in conjunction with statistical techniques, to learn alternatives to canonical pronunciations of words. We then describe the use of these alternate pronunciations in an automatic speech recognition system. We demonstrate that the improvement in recognition performance from pronunciation modelling persists as the system is enhanced with better acoustic and language models.

[172]   LVCSR rescoring with modified loss functions: a decision theoretic perspective. V. Goel, W. Byrne, and S. Khudanpur. In International Conference on Acoustics, Speech, and Signal Processing, page (4 pages). IEEE, 1998. https://ieeexplore.ieee.org/abstract/document/674458.

In this work, the problem of speech decoding is viewed in a Decision Theoretic framework. A modified speech decoding procedure to minimize the expected word error rate is formulated in this framework, and its implementation in N-best list rescoring is presented. Preliminary experiments on the Switchboard corpus show a small but statistically significant error rate improvement.

[173]   Speaker normalization with all-pass transforms. J. McDonough, W. Byrne, and X. Luo. In International Conference on Spoken Language Processing, page (4 pages), 1998. https://www.isca-speech.org/archive_v0/archive_papers/icslp_1998/i98_0869.pdf.

Speaker normalization is a process in which the short-time features of speech from a given speaker are transformed so as to better match some speaker independent model. Vocal tract length normalization (VTLN) is a popular speaker normalization scheme wherein the frequency axis of the short-time spectrum associated with a speaker’s speech is rescaled or warped prior to the extraction of cepstral features. In this work, we develop a novel speaker normalization scheme by exploiting the fact that frequency domain transformations similar to that inherent in VTLN can be accomplished entirely in the cepstral domain through the use of conformal maps. We propose a class of such maps, designated all-pass transforms for reasons given hereafter, and in a set of speech recognition experiments conducted on the Switchboard Corpus demonstrate their capacity to achieve word error rate reductions of 3.7% absolute.

[174]   Pronunciation modelling for conversational speech recognition: A status report from WS97. W. Byrne, M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. In IEEE Automatic Speech Recognition and Understanding Workshop, page (8 pages), 1997.

Accurately modelling pronunciation variability in conversational speech is an important component for automatic speech recognition. We describe some of the projects undertaken in this direction at WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in July-August, 1997. We first illustrate a use of hand-labelled phonetic transcriptions of a portion of the Switchboard corpus, in conjunction with statistical techniques, to learn alternatives to canonical pronunciations of words. We then describe the use of these alternate pronunciations in a recognition experiment as well as in the acoustic training of an automatic speech recognition system. Our results show a reduction of word error rate in both cases band 2.2% with acoustic retraining.

[175]   Is automatic speech recognition ready for non-native speech? A data collection effort and initial experiments in modeling conversational Hispanic English. W. Byrne, S. Khudanpur, E. Knodt, and J. Bernstein. In ESCA-ITR Workshop on speech technology in language learning, page (4 pages), 1997.

We describe the protocol used for collecting a corpus of conversational English speech from non-native speakers at several levels of proficiency, and report the results of preliminary automatic speech recognition (ASR) experiments on this corpus using HTK-based ASR systems. The speech corpus contains both read and conversational speech recorded simultaneously on wide-band and telephone channels, and has detailed time aligned transcriptions. The immediate goal of the ASR experiments is to assess the difficulty of the ASR problem in language learning exercises and thus to gauge how current ASR technology may be used in conversational computer assisted language learning (CALL) systems. The long-term goal of this research, of which the data collection and experiments are a first step, is to incorporate ASR into computer-based conversational language instruction systems.

[176]   Neurocontrol in sequence recognition. W. Byrne and S. Shamma. In O. Omidvar and D. Elliott, editors, Progress in Neural Networks: Neural Networks for Control, pages 31–56 (26 pages). Academic Press, 1997.

An artificial neural network intended for sequence modeling and recognition is described. The network is based on a lateral inhibitory network with controlled, oscillatory behavior so that it naturally models sequence generation. Dynamic programming algorithms can be used to transform the network into a sequence recognizer. Markov decision theory is used to develop novel and more “neural” recognition control strategies as alternatives to dynamic programming.

[177]   Information geometry and maximum likelihood criteria. W. Byrne. In Conference on Information Sciences and Systems, page (6 pages), Princeton, NJ, 1996.

This paper presents a brief comparison of two information geometries as they are used to describe the EM algorithm used in maximum likelihood estimation from incomplete data. The Alternating Minimization framework based on the I-Geometry developed by Csiszar is presented first, followed by the em-algorithm of Amari. Following a comparison of these algorithms, a discussion of a variation in likelihood criterion is presented. The EM algorithm is usually formulated so as to improve the marginal likelihood criterion. Closely related algorithms also exist which are intended to maximize different likelihood criteria. The 1-Best criterion, for example, leads to the Viterbi training algorithm used in Hidden Markov Modeling. This criterion has an information geometric description that results from a minor modification of the marginal likelihood formulation.

[178]   Modeling systematic variations in pronunciation via a language-dependent hidden speaking mode. M. Ostendorf, W. Byrne, M. Bacchiani, M. Finke, A. Gunawardana, K. Ross, S. Roweis, E. Shriberg, D. Talkin, A. Waibel, B. Wheatley, and T. Zeppenfeld. In Proceedings of the International Conference on Spoken Language Processing, page (4 pages), 1996. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=09306a9bf5518a561e3727c31ad1560ba2e1cdfd.

[179]   Spontaneous speech recognition for the credit card corpus using the HTK toolkit. S. Young, P. Woodland, and W. Byrne. IEEE Transactions on Speech and Audio Processing, pages 615–621 (6 pages), 1994. http://dx.doi.org/10.1109/89.326619.

This paper describes the speech recognition system which was provided as a baseline for the Summer Workshop on Robust Speech Processing held at the Rutgers CAIP Center in July/August 1993.

[180]   Generalization and maximum likelihood from small data sets. W. Byrne. In IEEE-SP Workshop on Neural Networks in Signal Processing, page (7 pages), 1993. https://ieeexplore.ieee.org/abstract/document/471869.

An often encountered learning problem is maximum likelihood training of exponential models. When the state is only partially specified by the training data, iterative training algorithms are used to produce a sequence of models that assign increasing likelihood to the training data. Although the performance as measured on the training set continues to improve as the algorithms progress, performance on related data sets may eventually begin to deteriorate. The cause of this behavior can be seen when the training problem is stated in the Alternating Minimization framework. A modified maximum likelihood training criterion is suggested to counter this behavior. It leads to a simple modification of the learning algorithms which relates generalization to learning speed. Training Boltzmann Machines and Hidden Markov Models is discussed under this modified criterion.

[181]   Noise robustness in the auditory representation of speech signals. K. Wang, S. Shamma, and W. Byrne. In International Conference on Acoustics, Speech, and Signal Processing, page (4 pages). IEEE, 1993. https://ieeexplore.ieee.org/abstract/document/319306.

A common sequence of operations in the early stages of most biological sensory systems is a wavelet transform followed by a compressive nonlinearity. The contribution of these operations to the formation of robust and perceptually significant representations in the auditory system is explored. It is demonstrated that the neural representation of a complex signal such as speech is derived from a highly reduced version of its wavelet transform, specifically, from the distribution of its locally averaged zero-crossing rates along the temporal and scale axes. It is shown analytically that such encoding of the wavelet transform results in mutual suppressive interactions across its different scale representations. Suppression in turn endows the representation with enhanced spectral peaks and superior robustness in noisy environments. Examples using natural speech vowels are presented to illustrate the results.

[182]   Alternating Minimization and Boltzmann Machine learning. W. Byrne. IEEE Transactions on Neural Networks, 3(4):612–620 (9 pages), 1992.

Training a Boltzmann machine with hidden units is appropriately treated in information geometry using the information divergence and the technique of alternating minimization. The resulting algorithm is shown to be closely related to gradient descent Boltzmann machine learning rules, and the close relationship of both to the EM algorithm is described. An iterative proportional fitting procedure is described and incorporated into the alternating minimization algorithm.

[183]   The auditory processing and recognition of speech. W. Byrne, J. Robinson, and S. Shamma. In Proceedings of the Speech and Natural Language Workshop, pages 325–331 (7 pages), October 1989. https://dl.acm.org/doi/abs/10.3115/1075434.1075490.

We are carrying out investigations into the application of biophysical and computational models to speech processing. Described here are studies of the robustness of a speech representation using a biophysical model of the cochlea; experimental results on the representation of speech and complex sounds in the mammalian auditory cortex; and descriptions of computational sequential processing networks capable of recognizing sequences of phonemes.

[184]   Adaptive filter processing in remote heart monitors. W. Byrne, R. Zapp, P. Flynn, and M. Siegel. IEEE Transactions on Biomedical Engineering, pages 717–722 (6 pages), 1986. http://dx.doi.org/10.1109/TBME.1986.325763.

This communication describes some current applications of adaptive filtering to the processing of microwave Doppler signals for heart rate monitoring. The problem has been approached in the past using signal processing techniques such as peak detection or autocorrelation. These methods either require large amounts of data or tend to be unreliable. This communication utilizes some recent techniques used in speech processing and applies them to heartbeat detection, thus allowing on-line processing of sampled microwave heart signals. The presentation includes a model for the signal, brief discussions of the algorithms evaluated, and qualitative analysis of performance compared to EKG measurements.

[185]   Adaptive filtering in microwave remote heart monitors. W. Byrne, R. Zapp, P. Flynn, and M. Siegel. In IEEE Engineering in Medicine and Biology Society, Seventh Annual Conference, page (4 pages), 1985.