Enhancing Machine Translation via Frame-Semantic Data
I’ve just finished my final assignment for the semester, a paper for LING 190. Click the title for the full text of the paper, read the abstract below, and see the cut at the bottom of this entry for a layperson’s explanation of the technical bits.
Enhancing Machine Translation via Frame-Semantic Data
Abstract
Frame semantics is an approach to examining meaning in natural language by considering clusters of related concepts. For instance, in the “commercial transaction” frame, there is a buyer, a seller, goods, and money; different predicates in this frame will place these participants in different syntactic roles, so, in English, the buyer will be the subject of “buy” while the seller will be the subject of “sell”.
Frame semantics presents a powerful aid to machine translation. Frame-semantic knowledge of an input phrase facilitates more precise word-sense disambiguation and allows greater flexibility in deciding which of multiple valid word orderings to emit in the target language. I have demonstrated this by creating a rudimentary system for translating from Spanish to English which can optionally take advantage of frame-annotated input, and then testing this system on a small corpus of phrases in the commercial transaction frame.
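To make the buy/sell example concrete, here is a minimal Python sketch of how one frame's roles might map onto different grammatical slots depending on the predicate. This is not the system from the paper: the names `COMMERCIAL_TRANSACTION`, `PAST_TENSE`, and `realize` are made up for illustration, and the “to” slot for the buyer under “sell” is my own assumption about English rather than something stated above.

```python
# Hypothetical illustration only; not the data model used in the paper.
# Each predicate maps grammatical slots to frame roles. Slots other than
# "subject" and "direct_object" are prepositions introducing optional phrases.
COMMERCIAL_TRANSACTION = {
    "buy": {"subject": "buyer", "direct_object": "goods",
            "from": "seller", "for": "money"},
    # Assumption: "sell" puts the buyer in an optional "to" phrase.
    "sell": {"subject": "seller", "direct_object": "goods",
             "to": "buyer", "for": "money"},
}

PAST_TENSE = {"buy": "bought", "sell": "sold"}


def realize(predicate, roles):
    """Build a crude English clause from frame roles for one predicate."""
    slots = COMMERCIAL_TRANSACTION[predicate]
    parts = [
        roles[slots["subject"]],
        PAST_TENSE[predicate],
        roles[slots["direct_object"]],
    ]
    for slot, role in slots.items():
        if slot in ("subject", "direct_object"):
            continue
        # Optional phrases appear only when the role was actually filled.
        if role in roles:
            parts.append(f"{slot} {roles[role]}")
    return " ".join(parts)


roles = {"buyer": "Ana", "seller": "Luis",
         "goods": "a book", "money": "ten dollars"}
print(realize("buy", roles))   # Ana bought a book from Luis for ten dollars
print(realize("sell", roles))  # Luis sold a book to Ana for ten dollars
```

The same set of filled roles yields two different sentences, which is the kind of flexibility in word ordering that frame annotation is supposed to buy you.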
Glossary, in order of appearance in paper:
- predicate
- verb
- word-sense disambiguation
- For a word that has multiple meanings, figuring out which meaning a particular occurrence of that word is referring to.
- corpus, corpora
- A bunch of arbitrary text gathered from real-world sources. “Corpora” is the plural.
- parsing
- Taking natural language input and arranging it into phrases, subphrases, clauses, etc.
- lemma
- The root form of a word. For verbs, the lemma would be the infinitive form.
- tokenization
- Splitting a stream of natural language into a series of tokens, which are basically the same thing as words and punctuation. So, splitting “Hello, world!” into the following series of four tokens: “Hello”, “,”, “world”, “!”
- tagging
- Annotating each token with its part of speech, so in the sentence “he ran”, ‘ran’ would be marked as (amongst other things) a past-tense third-person singular verb.
- lexicon
- Dictionary.
- gendered/neuter pronouns
- “He” and “she” are gendered pronouns in English. “It” is a neuter pronoun in English.
- syntactic distribution of frame roles
- The way that particular frame roles are assigned to particular grammatical components in a particular frame-predicate (a particular predicate in a particular frame). For instance, in English, “BUY” has the buyer as the subject, the seller in an optional “from” phrase, the goods as the direct object, and the money in an optional “for” phrase.
- area for further research
- I’m too lazy to look into it / give me more grant money.
- syntactically motivated
- It’s a result of syntax.
- collocations
- Two words are considered collocates when they appear near each other more often than you’d expect given their respective frequencies.
- anaphora resolution
- When you have a pronoun, figuring out which noun it refers to.
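Two of the glossary entries above, tokenization and collocations, are easy to show in code. The sketch below is my own illustration, not code from the paper. It assumes a simple regex tokenizer, treats “near each other” as “immediately adjacent”, and scores pairs with pointwise mutual information (PMI), which is one common measure of collocation and not necessarily the one used in the paper. The function names `tokenize` and `pmi` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def pmi(tokens, w1, w2):
    """Pointwise mutual information of w1 immediately followed by w2.

    Positive values mean the pair shows up together more often than the
    individual word frequencies alone would predict.
    """
    if len(tokens) < 2:
        return float("-inf")
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_pair = bigrams[(w1, w2)] / (len(tokens) - 1)
    if p_pair == 0:
        # The pair never occurs, so it is as far from a collocation as possible.
        return float("-inf")
    p1 = unigrams[w1] / len(tokens)
    p2 = unigrams[w2] / len(tokens)
    return math.log2(p_pair / (p1 * p2))


print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']

tokens = tokenize("strong tea and strong tea but weak tea")
print(pmi(tokens, "strong", "tea") > 0)  # True
```

On a corpus this tiny the scores mean very little; the point is only the shape of the calculation.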