From the TED Talk by Ioannis Papachimonas: How computers translate human language

Unscramble the Blue Letters

How is it that so many intergalactic species in moevis and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a plabtroe dvciee that can instantly translate between any languages.
So is a universal translator possible in real life? We already have many pgamrors that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or Ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.
A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all grammatical forms they can take, and a set of rules to recognize the basic ltiuiignsc elements in the input language. For a seimnegly simple sentence like, "The children eat the muffins," the program first prseas its syntax, or grammatical structure, by identifying the children as the subject, and the rest of the sentence as the picaerdte consisting of a verb "eat," and a direct object "the mfnufis." It then needs to recognize English morphology, or how the laggnaue can be broken down into its smallest meaningful utins, such as the word muffin and the suffix "s," used to indicate plural. Finally, it needs to understand the sincemats, what the different parts of the sentence actually mean.
To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows wdros to be angarerd in any order, while in others, doing so could make the miuffn eat the child.
molorpoghy can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are tlhclecnaiy correct, the program might miss their finer points, such as whether the chiredln "mangiano" the muffins, or "divorano" them.
Another method is statistical machine tairosnaltn, which analyzes a database of books, articles, and documents that have already been tlaartnesd by humans. By finding matches between source and translated text that are unlikely to ouccr by chance, the program can identify corresponding phrases and prettans, and use them for frtuue tnitornlaass. However, the quality of this type of translation dneepds on the size of the initial database and the availability of samples for certain languages or styles of writing.
The difficulty that computers have with the exceptions, irregularities and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most fmuoas fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy", is not a machine at all but a smlal creutare that translates the brian waves and nerve snigals of sentient secpeis through a form of telepathy.
For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of lgunaaegs in the wrlod, as well as the increasing interaction between the ploepe who speak them, will only ctinonue to spur greater advances in automatic translation.
Perhaps by the time we encounter ilcetrgtaainc life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that drictoaniy, after all.

Open Cloze

How is it that so many intergalactic species in ______ and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a ________ ______ that can instantly translate between any languages.
So is a universal translator possible in real life? We already have many ________ that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or Ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.
A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all grammatical forms they can take, and a set of rules to recognize the basic __________ elements in the input language. For a _________ simple sentence like, "The children eat the muffins," the program first ______ its syntax, or grammatical structure, by identifying the children as the subject, and the rest of the sentence as the _________ consisting of a verb "eat," and a direct object "the _______." It then needs to recognize English morphology, or how the ________ can be broken down into its smallest meaningful _____, such as the word muffin and the suffix "s," used to indicate plural. Finally, it needs to understand the _________, what the different parts of the sentence actually mean.
To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows _____ to be ________ in any order, while in others, doing so could make the ______ eat the child.
__________ can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are ___________ correct, the program might miss their finer points, such as whether the ________ "mangiano" the muffins, or "divorano" them.
Another method is statistical machine ___________, which analyzes a database of books, articles, and documents that have already been __________ by humans. By finding matches between source and translated text that are unlikely to _____ by chance, the program can identify corresponding phrases and ________, and use them for ______ ____________. However, the quality of this type of translation _______ on the size of the initial database and the availability of samples for certain languages or styles of writing.
The difficulty that computers have with the exceptions, irregularities and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most ______ fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy", is not a machine at all but a _____ ________ that translates the _____ waves and nerve _______ of sentient _______ through a form of telepathy.
For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of _________ in the _____, as well as the increasing interaction between the ______ who speak them, will only ________ to spur greater advances in automatic translation.
Perhaps by the time we encounter _____________ life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that __________, after all.

Solution

  1. language
  2. continue
  3. translation
  4. world
  5. small
  6. languages
  7. predicate
  8. patterns
  9. seemingly
  10. linguistic
  11. brain
  12. dictionary
  13. signals
  14. people
  15. future
  16. parses
  17. famous
  18. technically
  19. morphology
  20. units
  21. words
  22. portable
  23. children
  24. muffin
  25. arranged
  26. device
  27. depends
  28. translated
  29. species
  30. movies
  31. semantics
  32. programs
  33. muffins
  34. creature
  35. occur
  36. intergalactic
  37. translations

Original Text

How is it that so many intergalactic species in movies and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a portable device that can instantly translate between any languages.
So is a universal translator possible in real life? We already have many programs that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or Ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.
A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all grammatical forms they can take, and a set of rules to recognize the basic linguistic elements in the input language. For a seemingly simple sentence like, "The children eat the muffins," the program first parses its syntax, or grammatical structure, by identifying the children as the subject, and the rest of the sentence as the predicate consisting of a verb "eat," and a direct object "the muffins." It then needs to recognize English morphology, or how the language can be broken down into its smallest meaningful units, such as the word muffin and the suffix "s," used to indicate plural. Finally, it needs to understand the semantics, what the different parts of the sentence actually mean.
To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows words to be arranged in any order, while in others, doing so could make the muffin eat the child.
Morphology can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are technically correct, the program might miss their finer points, such as whether the children "mangiano" the muffins, or "divorano" them.
Another method is statistical machine translation, which analyzes a database of books, articles, and documents that have already been translated by humans. By finding matches between source and translated text that are unlikely to occur by chance, the program can identify corresponding phrases and patterns, and use them for future translations. However, the quality of this type of translation depends on the size of the initial database and the availability of samples for certain languages or styles of writing.
The difficulty that computers have with the exceptions, irregularities and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most famous fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy", is not a machine at all but a small creature that translates the brain waves and nerve signals of sentient species through a form of telepathy.
For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of languages in the world, as well as the increasing interaction between the people who speak them, will only continue to spur greater advances in automatic translation.
Perhaps by the time we encounter intergalactic life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that dictionary, after all.
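
As a rough illustration of the statistical idea in the talk (finding source and target words that appear together more often than chance would suggest), here is a minimal Python sketch. It is not taken from the talk. The four English-Italian sentence pairs are invented toy data, the Dice coefficient is just one simple association measure chosen for illustration, and the names `PARALLEL_CORPUS`, `dice_scores`, and `best_match` are hypothetical. Real statistical translation systems use far larger corpora and more elaborate alignment models.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus (invented for illustration): English sentences paired
# with Italian translations, standing in for text "already translated by humans".
PARALLEL_CORPUS = [
    ("the children eat the muffins", "i bambini mangiano i muffin"),
    ("the children sleep", "i bambini dormono"),
    ("the dogs eat", "i cani mangiano"),
    ("the dogs sleep", "i cani dormono"),
]


def dice_scores(corpus):
    """Score every (source word, target word) pair by the Dice coefficient.

    Dice = 2 * (sentence pairs containing both words)
           / (pairs containing the source word + pairs containing the target word).
    A score of 1.0 means the two words always appear together.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        # Use sets so each word counts once per sentence pair.
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_match(word, scores):
    """Return the target word most strongly associated with `word`, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None


scores = dice_scores(PARALLEL_CORPUS)
for english in ["children", "eat", "muffins", "sleep"]:
    print(english, "->", best_match(english, scores))
```

Even on this tiny sample, "children" pairs with "bambini" because the two always occur together, while a word like "i" also appears in sentences without "children" and scores lower. With so little data, though, any word the corpus never shows has no match at all, which mirrors the talk's point that quality depends on the size of the database.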

Important Words

  1. absent
  2. advances
  3. alien
  4. analyzes
  5. ancient
  6. answer
  7. arranged
  8. articles
  9. automatic
  10. availability
  11. babel
  12. basic
  13. biological
  14. bit
  15. book
  16. books
  17. brain
  18. broken
  19. chance
  20. child
  21. children
  22. circles
  23. claim
  24. communicate
  25. compiling
  26. complicated
  27. computer
  28. computers
  29. concept
  30. consistent
  31. consisting
  32. continue
  33. correct
  34. creators
  35. creature
  36. crew
  37. database
  38. definite
  39. depends
  40. device
  41. dictionary
  42. difficulty
  43. direct
  44. distinguishes
  45. documents
  46. dual
  47. easy
  48. eat
  49. eating
  50. element
  51. elements
  52. encounter
  53. english
  54. entire
  55. exceptions
  56. fact
  57. famous
  58. fashioned
  59. fictional
  60. finally
  61. find
  62. finding
  63. finer
  64. fish
  65. form
  66. forms
  67. future
  68. general
  69. give
  70. gizmo
  71. grammatical
  72. greater
  73. guide
  74. happen
  75. humans
  76. identify
  77. identifying
  78. includes
  79. increasing
  80. initial
  81. input
  82. instantly
  83. instinctively
  84. interaction
  85. intergalactic
  86. introduced
  87. irregularities
  88. lack
  89. language
  90. languages
  91. learning
  92. leave
  93. led
  94. lexical
  95. life
  96. linguistic
  97. machine
  98. matches
  99. matter
  100. meaning
  101. meaningful
  102. method
  103. modern
  104. morphology
  105. movies
  106. muffin
  107. muffins
  108. nerve
  109. number
  110. object
  111. occur
  112. order
  113. parses
  114. parts
  115. patterns
  116. people
  117. perfect
  118. phrases
  119. plural
  120. points
  121. portable
  122. pose
  123. predicate
  124. problem
  125. product
  126. program
  127. programs
  128. properly
  129. quality
  130. real
  131. reality
  132. recognize
  133. refer
  134. researchers
  135. rest
  136. results
  137. rules
  138. run
  139. samples
  140. sanskrit
  141. seemingly
  142. semantics
  143. sentence
  144. sentient
  145. set
  146. shades
  147. sheer
  148. short
  149. signals
  150. simple
  151. size
  152. slovene
  153. small
  154. smallest
  155. source
  156. speak
  157. species
  158. spend
  159. spur
  160. star
  161. starship
  162. start
  163. statistical
  164. structure
  165. styles
  166. subject
  167. suffix
  168. syntax
  169. target
  170. task
  171. technically
  172. telepathy
  173. text
  174. time
  175. tiny
  176. translate
  177. translated
  178. translates
  179. translating
  180. translation
  181. translations
  182. translator
  183. translators
  184. trek
  185. tricky
  186. tv
  187. type
  188. understand
  189. understanding
  190. unique
  191. units
  192. universal
  193. verb
  194. vocabulary
  195. watch
  196. waves
  197. wondering
  198. word
  199. words
  200. world
  201. worlds
  202. writing
  203. years