New Perspectives on Cohesion and Coherence: Implications for Translation
Kerstin Kunz, University of Heidelberg, Germany
Ekaterina Lapshinova-Koltunski, Saarland University, Germany
Katrin Menzel, Saarland University, Germany
The panel will investigate textual relations of cohesion and coherence in translation and multilingual text production with a strong focus on innovative methods of empirical analysis, as well as technology and computation. Given the amount of multilingual computation that is taking place, this topic is important for both human and machine translation, and further multilingual studies. Cohesion refers to the text-internal relationship of linguistic elements that are overtly linked via lexical and grammatical devices across sentence boundaries to be understood as a text. The recognition of coherence in a text is more subjective as it involves text- and reader-based features and refers to the logical flow of interrelated ideas in a text, thus establishing a mental textual world. There is a connection between these two concepts in that relations of cohesion can be regarded as explicit indicators of meaning relations in a text and, hence, contribute to its overall coherence.
The aim of this panel is to bring together scholars analyzing cohesion and coherence from different research perspectives that cover translation-relevant topics: language contrast, translationese and machine translation. What these approaches share is that they investigate instantiations of discourse phenomena in a multilingual context and that their language comparison is based on empirical data. The challenges here can be identified with respect to the following methodological questions:
1. How to arrive at a cost-effective operationalization of the annotation process when dealing with a broader range of discourse phenomena?
2. Which statistical techniques are needed and are adequate for the analysis? And which methods can be combined for data interpretation?
3. Which applications of the knowledge acquired are possible in multilingual computation, especially in machine translation?
The contributions of the different research groups involved in our panel reflect these questions. Some contributions will concentrate on procedures to analyse cohesion and coherence from a corpus-linguistic perspective (M. Rysová, K. Rysová). Others will focus on textual cohesion in parallel corpora that include both originals and translated texts (K. Kerremans, K. Kunz/E. Lapshinova-Koltunski/S. Degaetano-Ortlieb, A. Kutuzov/M. Kunilovskaya). Finally, the panel will also include papers that discuss the nature of cohesion and coherence with implications for human and machine translation (E. Lapshinova-Koltunski, C. Scarton/L. Specia, K. S. Smith/L. Specia).
Targeting the questions raised above and addressing them together from different research angles, the present panel will contribute to moving empirical translation studies ahead.
For informal enquiries: [eDOTlapshinovaATmxDOTuni-saarlandDOTde]
Kerstin Kunz (University of Heidelberg) holds an interim professorship at the Institute of Translation and Interpreting. She finished her PhD on Nominal Coreference in English and German in 2009. Since then, she has been involved in empirical research projects dealing with properties of translations and English-German contrasts on the level of lexicogrammar and discourse. Together with Erich Steiner, she currently runs the GECCo project (http://www.gecco.uni-saarland.de/GECCo/Home.html) at the Department of Applied Linguistics, Translation and Interpreting (Saarland University), in which different types of cohesive relations in English and German are explored, contrasting languages, originals and translations as well as written and spoken registers.
Ekaterina Lapshinova-Koltunski (Saarland University) is a post-doctoral researcher at the Department of Applied Linguistics, Translation and Interpreting. She finished her PhD on semi-automatic extraction and classification of language data at the Institute for Natural Language Processing (Stuttgart) in 2010. Since then, she has been working in corpus-based projects related to language variation, language contrasts and translation, one of which is GECCo (http://www.gecco.uni-saarland.de/GECCo/Home.html). In 2012 she received a start-up research grant from Saarland University to build resources for the analysis of variation in translation caused by different dimensions (register, translation method) resulting in translation varieties (including both human and machine translation).
Katrin Menzel (Saarland University) studied Conference Interpreting and Translation Studies at Saarland University (Saarbrücken, Germany). She has been working as a teaching and research staff member at the Department of Applied Linguistics, Translation and Interpreting at Saarland University since 2011. Katrin is involved in the research project "GECCo" on cohesion in English and German and works on the case study of ellipses as cohesive ties for her PhD thesis.
SESSION PLAN
Each paper is allocated a 20-minute time slot plus 10 minutes of discussion.
Discussion takes place at the end of each paper.
Introduction (20 Minutes)
PART 1: TEXTUAL COHESION AND CONTRASTIVE ASPECTS
PAPER 1:
Title: Terminological variation in multilingual parallel corpora: a semi-automatic method involving co-referential analysis
Speaker: Koen Kerremans, Vrije Universiteit Brussel (Belgium)
PAPER 2:
Title: Cohesive chains in an English-German parallel corpus: Methodologies and challenges
Speaker: Kerstin Kunz, University of Heidelberg (Germany); Ekaterina Lapshinova-Koltunski, Saarland University (Germany) and Stefania Degaetano-Ortlieb, Saarland University (Germany)
PART 2: ASPECTS OF COHESION AND COHERENCE IN HUMAN VS MACHINE TRANSLATION
PAPER 3:
Title: Cohesion and Translation Variation: Corpus-based Analysis of Translation Varieties
Speaker: Ekaterina Lapshinova-Koltunski, Saarland University (Germany)
PAPER 4:
Title: Exploring Discourse in Machine Translation Quality Estimation
Speaker: Carolina Scarton, University of Sheffield (UK) and Lucia Specia, University of Sheffield (UK)
PAPER 5:
Title: Examining Lexical Coherence in a Multilingual Setting
Speaker: Karin Sim Smith, University of Sheffield (UK) and Lucia Specia, University of Sheffield (UK)
PAPER TITLES, ABSTRACTS AND BIONOTES
PART 1: TEXTUAL COHESION AND CONTRASTIVE ASPECTS
PAPER 1:
Title: Terminological variation in multilingual parallel corpora: a semi-automatic method involving co-referential analysis
Speaker: Koen Kerremans, Vrije Universiteit Brussel (Belgium)
Abstract: The work presented in this article is part of a research study that focused on how terms and equivalents recorded in multilingual terminological databases can be extended with terminological variants and their translations retrieved from English source texts and their corresponding French and Dutch target texts (Kerremans 2014). For this purpose, a novel type of translation resource is proposed, resulting from a method for identifying terminological variants and their translations in texts. In many terminology approaches, terminological variants within and across languages are identified on the basis of semantic and/or linguistic criteria (Carreño Cruz 2008; Fernández-Silva et al. 2008). Contrary to such approaches, three perspectives of analysis were combined in Kerremans (2014) in order to build up the translation resource comprised of terminological variants and their translations. The first perspective is the semantic perspective, which means that units of specialised knowledge – or units of understanding (Temmerman 2000) – form the starting point for the analysis of term variation in the English source texts. The second perspective of analysis is the textual perspective, which implies that terminological variants pointing to a particular unit of understanding in a text are identified on the basis of their 'co-referential ties'. In the third perspective of analysis, which is the contrastive perspective, the French and Dutch translations of the English terms are extracted from the target texts. This approach is motivated by the fact that translators need to acquire a profound insight into the unit of understanding expressed in a source text before they can decide which equivalent to choose in the target language. In the framework of text linguistics, it has been shown how this can be achieved through the analysis of texts. A translator analyses the unit of understanding based on how it is expressed in the source texts (i.e. the semantic perspective), how its meaning is developed through the use of cohesive ties (i.e. the textual perspective) and how it can be rendered into the target language (i.e. the contrastive perspective). In this article, we shall only focus on how co-referential analysis was applied to the analysis of terminological variants in the source texts, resulting in lexical chains. These are "cohesive ties sharing the same referent, lexically rather than grammatically expressed" (Rogers 2007: 17). The terminological variants in these chains – which in this study were limited to only single word nouns or nominal expressions – become part of a general cluster of variants that were encountered in a collection of source texts. Several semi-automated modules were created in order to reduce the manual effort in the analysis of co-referential chains while ensuring consistency and completeness in the data. We will explain how the semi-automatic modules work and how these contribute to the development of the envisaged translation resource (cf. supra). We will also discuss what results can be derived from a co-referential analysis of terms and how these results can be used to quantitatively and qualitatively compare term variation between source and target texts.
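As a rough illustration of the clustering step described in the abstract, the following sketch merges the term variants of annotated lexical chains into one cluster per unit of understanding. It is only a minimal sketch: the chain format, the unit labels and the example terms are hypothetical and are not taken from Kerremans (2014), and the semi-automatic modules mentioned above are not reproduced here.

```python
from collections import defaultdict

# Hypothetical input: one entry per lexical chain found in a source text.
# Each chain names the unit of understanding it refers to and lists the
# nominal expressions (term variants) that co-refer within that text.
chains = [
    {"text": "doc1", "unit": "INVASIVE_SPECIES",
     "variants": ["invasive alien species", "alien species", "invasive species"]},
    {"text": "doc2", "unit": "INVASIVE_SPECIES",
     "variants": ["invasive species", "non-native species"]},
    {"text": "doc2", "unit": "HABITAT_LOSS",
     "variants": ["habitat loss", "loss of habitat"]},
]

def cluster_variants(chains):
    """Merge chain variants across texts into one cluster per unit."""
    clusters = defaultdict(set)
    for chain in chains:
        clusters[chain["unit"]].update(v.lower() for v in chain["variants"])
    # Sort each cluster so that the output is stable and easy to compare.
    return {unit: sorted(variants) for unit, variants in clusters.items()}

clusters = cluster_variants(chains)
for unit, variants in clusters.items():
    print(unit, len(variants), variants)
```

Cluster sizes of this kind could then serve as one simple quantitative measure when comparing term variation between source and target texts.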
Bionote: Koen Kerremans obtained his Master's degree in Germanic Philology at Universiteit Antwerpen in 2001, his Master's degree in Language Sciences - with a major in computational linguistics - at Universiteit Gent in 2002 and his PhD degree in Applied Linguistics at Vrije Universiteit Brussel in 2014. His research interests pertain to applied linguistics, language technologies, ontologies, specialised communication, terminology (variation) and translation studies. He is currently appointed as doctor-assistant at the department of Applied Linguistics (Faculty of Arts and Philosophy) of Vrije Universiteit Brussel (VUB) where he teaches courses on applied linguistics, terminology and culture-specific communication.
PAPER 2:
Title: Cohesive chains in an English-German parallel corpus: Methodologies and challenges
Speaker: Kerstin Kunz, University of Heidelberg (Germany); Ekaterina Lapshinova-Koltunski, Saarland University (Germany) and Stefania Degaetano-Ortlieb, Saarland University (Germany)
Abstract: The current paper discusses methodological challenges in analyzing cohesive relations with corpus-based procedures. It is based on research aiming at the comparison of English and German cohesion in written and spoken language and in originals and translations. For this objective, methodologies are developed that enable a fine-grained and precise analysis of different cohesive aspects in a representative corpus and that yield results for data interpretation within the duration of the project. Thus, methodologies have to be elaborate and cost-effective at the same time.
We use an English-German comparable and parallel corpus which is pre-annotated on various grammatical levels and which has been enriched semi-automatically with information on cohesive devices of reference, conjunction, substitution and ellipsis. Our discussion will revolve around methodological challenges related to the current analysis of (1) co-reference and (2) lexical cohesion. The analysis of both types includes (a) identifying cohesive devices that function as explicit linguistic triggers, (b) setting up a relation to the linguistic items with which they tie up (antecedents) and (c) integrating these ties into (longer) cohesive chains.
The methodological steps involved are the following:
1) Designing an annotation scheme. Main challenges revolve around the conceptual distinction of relations between instantiated co-reference and sense relations (lexical cohesion), the definition of categories that fit for a bilingual analysis, the inter-relatedness of chains, the depth of the ontological hierarchy and the distance between chain elements.
2) Designing semi-automatic annotation procedures. The challenge is to combine automatic pre-annotation and manual revision in a cost-effective way. Our annotation of co-reference is based on the automatic extraction of reference devices, their manual revision and the manual annotation of chain relations (outputs of automatic co-reference tools were too error-prone for pre-annotation of co-reference chains). For the annotation of lexical cohesion, we intend to proceed in a similar way. Sense relations and chains are pre-annotated using existing resources, e.g. WordNet, and revised by human annotators to obtain the most precise results possible.
3) Extracting and analysing information. The challenge here is to extract data relevant for our research objective, i.e. information on chain length and on the distance between elements in chains, in combination with morpho-syntactic preferences of chain elements, as well as on the alignment of translational equivalents of cohesive relations. Moreover, appropriate statistical evaluation techniques have to be applied for interpretations in terms of language contrast and properties of translation. After demonstrating these methodologies on the basis of initial results, the presentation will end with a discussion of open questions. While our main aim is to design methodologies for a contrastive comparison of English and German on the level of text/discourse, we hope to lay the ground for new paths in NLP and in machine translation, in particular. Furthermore, available alignments provide an insight into shifts in cohesion between source and target texts and the translation strategies applied.
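The extraction step in 3) can be pictured with a small sketch that computes chain length and the distance between neighbouring chain elements. The input format (chains as lists of sentence and token positions) and the chain names are assumptions made for illustration only; they do not reflect the actual annotation format of the corpus.

```python
from statistics import mean

# Hypothetical annotation output: each cohesive chain is a list of
# (sentence_index, token_index) positions of its elements.
chains = {
    "chain_1": [(0, 2), (1, 0), (4, 5)],
    "chain_2": [(2, 1), (3, 3)],
}

def chain_stats(positions):
    """Return chain length and mean sentence distance between neighbours."""
    ordered = sorted(positions)
    # Distance is measured in sentences between consecutive chain elements.
    gaps = [b[0] - a[0] for a, b in zip(ordered, ordered[1:])]
    return {
        "length": len(ordered),
        "mean_sentence_distance": mean(gaps) if gaps else 0.0,
    }

stats = {name: chain_stats(pos) for name, pos in chains.items()}
for name, values in stats.items():
    print(name, values)
```

Measuring distance in sentences is only one possible choice; token-based distances could be computed in the same way from the second element of each position.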
Bionote: Kerstin Kunz holds an interim professorship at the Institute of Translation and Interpreting at Heidelberg University where she teaches in several BA and MA programs. She finished her PhD on English-German Nominal Coreference in 2009. She has been involved in various empirical research projects on properties of translations and English-German contrasts on the level of lexicogrammar and discourse. Together with Erich Steiner, she currently runs a corpus-based project at Saarland University. The GECCo project explores different types of cohesive relations in English and German, contrasting languages, originals and translations as well as written and spoken registers.
PART 2: ASPECTS OF COHESION AND COHERENCE IN HUMAN VS MACHINE TRANSLATION
PAPER 3:
Title: Cohesion and Translation Variation: Corpus-based Analysis of Translation Varieties
Speaker: Ekaterina Lapshinova-Koltunski, Saarland University (Germany)
Abstract: In this study, we analyse cohesion in 'translation varieties', i.e. translation types or classes which differ in the translation methods or knowledge involved, e.g. human vs. machine translation (MT) or professional vs. novice translation. We expect variation in the distribution of the different cohesive devices which occur in translations. Variation in translation can be caused by different factors, e.g. by systemic contrasts between source and target languages or different register settings, as well as by ambiguities in both source and target languages. For example, the conjunction 'while' in the original sentence in (1a) is ambiguous between the readings 'during' and 'although'. The ambiguity is resolved in (1b), but not in (1c), as the German 'während' is also ambiguous:
(1a) My father preferred to stay in a bathrobe and be waited on for a change while he read the stacks of newspapers [...]
(1b) Mein Vater ist lieber im Bademantel geblieben und hat sich zur Abwechslung mal bedienen lassen und dabei die Zeitungsstapel durchgelesen [...]
(1c) Mein Vater saß die ganze Zeit im Bademantel da und ließ sich zur Abwechslung bedienen, während er die Zeitungen laß [...]
English translations from German are less distinct and less register-dependent than German translations from English. The variation in English-to-German translations strongly depends on the register and the cohesive devices involved, reflecting either shining-through or normalisation phenomena. Therefore, for our analysis, we chose a corpus of English-to-German translation varieties containing five subcorpora: translations 1) by professionals, 2) by students, 3) with a rule-based MT system, 4) with a statistical MT system trained with big data, 5) with a statistical MT system trained with small data.
Our first observations show that translation varieties differ in the distribution of cohesive devices. For example, novice translations contain more personal reference than the other varieties, e.g. professional translations or rule-based MT. Moreover, registers also differ in their preferences for cohesive devices: popular-science texts and instructions, for example, use the conjunctions während and dabei equally in German original texts, whereas tourism texts and political essays make more use of während than of dabei. In professional translations, we observe the same tendency. In student translations, however, während is overused in most cases. The same tendency is observed for MT, where dabei sometimes does not occur at all.
We therefore want to show how cohesive devices reflect translation methods, the evidence of 'experience' (professional vs. novice or big data vs. small data), as well as the registers involved in the translation varieties under analysis. For this, we extract evidence for cohesive devices from the corpus and analyse the extracted data with statistical techniques, applying unsupervised analysis to see where the differences lie, and supervised techniques to find the features contributing to these differences. This knowledge is useful for both human translation and MT, e.g. in evaluation and MT improvement.
Bionote: Ekaterina Lapshinova-Koltunski (Saarland University) is a post-doctoral researcher at the Department of Applied Linguistics, Translation and Interpreting. She finished her PhD on semi-automatic extraction and classification of language data at the Institute for Natural Language Processing (Stuttgart) in 2010. Since then, she has been working in corpus-based projects related to language variation, language contrasts and translation, one of which is GECCo (http://www.gecco.uni-saarland.de/GECCo/Home.html). In 2012 she received a start-up research grant from Saarland University to build resources for the analysis of variation in translation caused by different dimensions (register, translation method) resulting in translation varieties (including both human and machine translation).
PAPER 4:
Title: Exploring Discourse in Machine Translation Quality Estimation
Speaker: Carolina Scarton, University of Sheffield (UK) and Lucia Specia, University of Sheffield (UK)
Abstract: Discourse covers linguistic phenomena that can go beyond sentence boundaries and are related to text cohesion and coherence. Suitable elementary discourse units (EDUs) are defined depending on the level of analysis (paragraphs, sentences or clauses). Cohesion can be defined as a phenomenon where EDUs are connected using linguistic markers (e.g. connectives). Coherence is related to the topic of the text and to the logical relationships among EDUs (e.g. causality). A few recent efforts have been made towards including discourse information into machine translation (MT) systems and MT evaluation metrics. In our work, we address quality estimation (QE) of MT. This challenging task focuses on evaluating translation quality without relying on human references. Features extracted from examples of source and translation texts, as well as the MT system, are used to train machine learning algorithms in order to predict the quality of new, unseen translations.
The motivation for using discourse information for QE is threefold: (i) on the source side: identifying discourse structures (such as connectives) or patterns of structures which are more complex to translate, and therefore will most likely lead to low quality translations; (ii) on the target side: identifying broken or incomplete discourse structures, which are more likely to be found in low quality translations; (iii) comparing discourse structures on both source and target sides to identify not only possible errors, but also language peculiarities which are not appropriately handled by the MT system.
Since discourse phenomena can occur at document level, we moved from the traditional sentence-level QE to document-level QE. Document-level QE is useful, for example, for evaluation in gisting scenarios, where the quality of the document as a whole is important so that the end-user can make sense of it. We have explored lexical cohesion for QE at document level for English-Portuguese, Spanish-English and English-Spanish translations in two ways: (i) considering repetitions of words, lemmas and nouns, in both source and target texts; (ii) considering Latent Semantic Analysis (LSA) cohesion. LSA is a method that can capture cohesive relations in a text, going beyond simple repetition counts. In our scenario, each sentence is represented by a word vector that takes into account all the words that appear in the document. Sentences are then compared based on their word vectors, and sentences showing high similarity with most others are considered cohesive. Since LSA is language independent, it was applied to both target and source texts. LSA cohesion features improved the results over a strong baseline.
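The sentence comparison step can be sketched as follows. Note that this is a deliberately simplified illustration and not the authors' feature extraction: it uses plain bag-of-words sentence vectors and cosine similarity and omits the dimensionality reduction (singular value decomposition) that LSA proper would apply to the word-by-sentence matrix. The function names and the toy document are invented for the example.

```python
import math
from collections import Counter

def sentence_vector(sentence, vocabulary):
    """Count vector of a sentence over the vocabulary of the document."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocabulary]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cohesion_scores(sentences):
    """Mean cosine similarity of each sentence to all other sentences."""
    vocabulary = sorted({w for s in sentences for w in s.lower().split()})
    vectors = [sentence_vector(s, vocabulary) for s in sentences]
    scores = []
    for i, u in enumerate(vectors):
        others = [cosine(u, v) for j, v in enumerate(vectors) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores

# Toy document, invented for illustration.
document = [
    "the committee approved the budget",
    "the budget covers research grants",
    "penguins live in the southern hemisphere",
]
scores = cohesion_scores(document)
for sentence, score in zip(document, scores):
    print(round(score, 3), sentence)
```

In this toy document the third sentence shares only a function word with the others and therefore receives the lowest score; a real setup would typically filter or down-weight such words before comparing sentences.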
Our next step is to move to Rhetorical Structure Theory (RST) to capture coherence phenomena. On the source side, RST trees will be extracted and we will correlate the presence or absence of discourse structures (e.g. Nucleus, Satellite or relation types, such as Attribution) with the quality labels. The same will be applied on the target side, where incorrect discourse units are expected to correlate better with low quality translations.
Bionote: Carolina Scarton is a PhD candidate and Marie Curie Early Stage Researcher (EXPERT project) at The University of Sheffield, working in the Natural Language Processing group in the Department of Computer Science, under the supervision of Dr. Lucia Specia. Her research focuses on the use of discursive information for quality estimation of machine translations. She received a master's degree from the University of São Paulo, Brazil, in 2013, where she worked at the Interinstitutional Center for Computational Linguistics (NILC).
PAPER 5:
Title: Examining Lexical Coherence in a Multilingual Setting
Speaker: Karin Sim Smith, University of Sheffield (UK) and Lucia Specia, University of Sheffield (UK)
Abstract: Discourse has long been recognised as a crucial part of translation, but when it comes to Statistical Machine Translation (SMT), discourse information has been mostly neglected to date, as the decoders in SMT tend to work on a sentence-by-sentence basis. Our research concerns a study of lexical coherence, an issue that has not yet been exploited in the context of SMT. We explore an entity-based discourse framework, applying it for the first time in a multilingual context, aiming to: (i) examine whether human-authored texts offer different patterns of entities compared to (potentially incorrect) machine translated texts, and a version of the latter fixed by humans, and (ii) understand how this discourse phenomenon is realised across languages.
Entity distribution patterns are derived from entity grids or entity graphs. Entity grids are constructed by identifying the discourse entities in the documents under consideration, and constructing a 2D grid whereby each column corresponds to an entity, i.e. a noun, being tracked, and each row represents a particular sentence in the document. Alternatively, these can be projected onto a bipartite graph where the sentences and entities form the nodes, and the connections between them are the edges.
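The grid and its bipartite projection can be pictured with a minimal sketch. It assumes that the entities (nouns) have already been identified and only records whether an entity is mentioned in a sentence; the function names and the example sentences are hypothetical, and richer grids that also record the syntactic role of each mention are not covered here.

```python
def build_entity_grid(sentences, entities):
    """Rows are sentences, columns are entities; 'X' marks a mention."""
    grid = []
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        grid.append(["X" if e in tokens else "-" for e in entities])
    return grid

def grid_to_edges(grid, entities):
    """Bipartite view: one (sentence_index, entity) edge per mention."""
    return [(i, entities[j])
            for i, row in enumerate(grid)
            for j, cell in enumerate(row) if cell == "X"]

# Toy document, invented for illustration.
sentences = [
    "the translator revised the draft",
    "the draft was sent to the editor",
    "the editor praised the translator",
]
entities = ["translator", "draft", "editor"]  # assumed to be given

grid = build_entity_grid(sentences, entities)
edges = grid_to_edges(grid, entities)

print("  ".join(entities))
for row in grid:
    print(row)
```

Each edge in the bipartite view links a sentence node to an entity node, so two sentences are indirectly connected whenever they mention the same entity.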
For the monolingual experiments, we use a corpus comprising three versions of the same documents: the human translation, the raw machine translation output and the post-edited version of the machine translation output, establishing whether any differences in lexical coherence may be due to the nature of the texts, as well as to potential errors in the machine translated version. We observed some trends in our monolingual comparative experiments on versions of translations, indicating that some patterns of differences between human translated and machine translated texts can be expected. We also applied the entity-based grid framework in a multilingual context, to parallel texts in English, French, and German. The goals are to understand differences in lexical coherence across languages, and in the future to establish whether this can be used as a means of ensuring that the same level of lexical coherence is transferred from the source to the machine translated documents.
We observed distinct patterns in our comparative multilingual approach: we discovered that the probabilities for different types of entity transitions varied, indicating a different coherence structure in the different languages. In this instance we are comparing the same texts, on a document by document basis, so the same genre and style, yet there is a clear and consistent difference in the probabilities. This would appear to indicate, amongst other things, that the manner in which lexical coherence is achieved varies from language to language. Besides establishing the worth of these features independently, we will also do so in the context of MT evaluation, and our ultimate goal is to then integrate them in an SMT model, in the hope that they will manage to exert influence in the decoding process and improve overall text coherence.
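To make the notion of entity transition probabilities more concrete, the following sketch counts, for every entity column of a small grid, how the entity's status changes between adjacent sentences, and normalises the counts. It is only an illustration under simplifying assumptions: the grid is a toy example, cells only distinguish mention ('X') from absence ('-'), and only transitions of length two are considered.

```python
from collections import Counter

def transition_probabilities(grid):
    """Relative frequency of each length-2 transition over all columns."""
    counts = Counter()
    # Pair each row (sentence) with the following one and compare
    # the cells of the same entity column.
    for upper, lower in zip(grid, grid[1:]):
        for a, b in zip(upper, lower):
            counts[(a, b)] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

# Toy grid, invented for illustration: rows are sentences, columns entities.
grid = [
    ["X", "X", "-"],
    ["-", "X", "X"],
    ["X", "-", "X"],
]
probs = transition_probabilities(grid)
for transition, p in sorted(probs.items()):
    print(transition, round(p, 3))
```

Computing such distributions separately for each language version of the same documents would yield the kind of profiles that can then be compared across languages.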
Bionote: Karin Sim Smith is currently in the second year of her PhD at the Computer Science Department of Sheffield University, where she is part of the Modist project (Modelling Discourse in Machine Translation), which aims to improve discourse in Machine Translation. Specifically, she is researching ways to improve the coherence of SMT output, hoping to learn the coherence patterns that can be transferred from source to target text.
WRAP-UP SESSION (20 Minutes)