A quick preview

Yes, you read that correctly: we will be writing an English-to-French translator. Before machine learning, this might have been approached with a series of parsers and hand-written rules for translating words and phrases, but our approach will be far more elegant, generalizable, and quick. We will just use examples, many examples, to train our translator.

The game here will be finding a dataset with enough English sentences translated into French (actually, the same approach will work for any language pair). Translated essays and news articles will not be as helpful, as we won't necessarily be able to place specific sentences from the two languages side by side. So, we will need to be more creative. Fortunately, organizations such as the United Nations often need to do exactly this: they produce line-by-line translations to meet the needs of their diverse constituencies. How convenient for us!

The Workshop on Statistical Machine Translation held a conference in 2010 and released a nicely packaged training set that we can use. The full details are available at http://www.statmt.org/wmt10/.

We will be using specific files just for French, as follows:

The following is an excerpt of what the source data looks like on the English side:

  • Food, where European inflation slipped up
  • The skyward zoom in food prices is the dominant force behind the speed up in eurozone inflation
  • November price hikes were higher than expected in the 13 eurozone countries, with October's 2.6 percent yr/yr inflation rate followed by 3.1 percent in November, the EU's Luxembourg-based statistical office reported
  • Official forecasts predicted just three percent, Bloomberg said
  • As opposed to the US, UK, and Canadian central banks, the European Central Bank (ECB) did not cut interest rates, arguing that a rate drop combined with rising raw material prices and declining unemployment would trigger an inflationary spiral
  • The ECB wants to hold inflation to under two percent, or somewhere in that vicinity
  • According to one analyst, ECB has been caught in a Catch-22, and it needs to talk down inflation, to keep from having to take action to push it down later in the game

And, here is the French equivalent:

  • L'inflation, en Europe, a dérapé sur l'alimentation
  • L'inflation accélérée, mesurée dans la zone euro, est due principalement à l'augmentation rapide des prix de l'alimentation
  • En novembre, l'augmentation des prix, dans les 13 pays de la zone euro, a été plus importante par rapport aux prévisions, après un taux d'inflation de 2,6 pour cent en octobre, une inflation annuelle de 3,1 pour cent a été enregistrée, a indiqué le bureau des statistiques de la Communauté Européenne situé à Luxembourg
  • Les prévisions officielles n'ont indiqué que 3 pour cent, a communiqué Bloomberg
  • Contrairement aux banques centrales américaine, britannique et canadienne, la Banque centrale européenne (BCE) n'a pas baissé le taux d'intérêt directeur en disant que la diminution des intérêts, avec la croissance des prix des matières premières et la baisse du taux de chômage, conduirait à la génération d'une spirale inflationniste
  • La BCE souhaiterait maintenir le taux d'inflation au-dessous mais proche de deux pour cent
  • Selon un analyste, c'est le Catch 22 pour la BCE : "il faut dissuader" l'inflation afin de ne plus avoir à intervenir sur ce sujet ultérieurement

It is usually good to do a quick sanity check, when possible, to ensure that the files actually line up. We can see the Catch 22 phrase on line 7 of both files, which gives us some comfort.
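The same kind of spot check can be scripted. The following is only a minimal sketch: the helper names (`find_keyword_lines`, `lines_align`) and the two-line samples are made up for illustration, and hyphens are treated as spaces so that Catch-22 and Catch 22 match.

```python
def find_keyword_lines(lines, keyword):
    """Return 1-based line numbers whose text contains the keyword.

    Matching is case-insensitive and treats hyphens as spaces.
    """
    needle = keyword.lower().replace("-", " ")
    return [
        number
        for number, line in enumerate(lines, start=1)
        if needle in line.lower().replace("-", " ")
    ]


def lines_align(en_lines, fr_lines, keyword):
    """True if both sides have equal length and the keyword falls on the same lines."""
    if len(en_lines) != len(fr_lines):
        return False
    en_hits = find_keyword_lines(en_lines, keyword)
    fr_hits = find_keyword_lines(fr_lines, keyword)
    return bool(en_hits) and en_hits == fr_hits


# Shortened, illustrative stand-ins for the two excerpts above.
en_sample = [
    "Food, where European inflation slipped up",
    "According to one analyst, ECB has been caught in a Catch-22",
]
fr_sample = [
    "L'inflation, en Europe, a dérapé sur l'alimentation",
    "Selon un analyste, c'est le Catch 22 pour la BCE",
]

aligned = lines_align(en_sample, fr_sample, "Catch 22")
print("Files appear aligned:", aligned)
```

A check like this only catches gross misalignment. A shared keyword on the same line number is encouraging, but it is not proof that every line pair matches.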

Of course, seven lines are far from sufficient for a statistical approach. We will achieve an elegant, generalizable solution only with mounds of data. The training set we will use consists of 20 gigabytes of text, translated line by line very much like the preceding excerpts.

Just as we did with images, we'll use subsets for training, validation, and testing. We will also define a loss function and attempt to minimize that loss. Let's start with the data though.
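As a rough sketch of what such a split might look like for sentence pairs, the following shuffles the pairs and carves off validation and test subsets. The function name `split_pairs`, the 10 percent fractions, and the fixed seed are assumptions chosen for illustration, not values prescribed by this chapter.

```python
import random


def split_pairs(pairs, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle (english, french) pairs and split into train, validation, and test lists.

    The fractions and seed are illustrative defaults, not tuned values.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Placeholder pairs standing in for real aligned sentences.
pairs = [(f"english sentence {i}", f"phrase française {i}") for i in range(100)]
train, val, test = split_pairs(pairs)
print(len(train), len(val), len(test))
```

Keeping each English line together with its French counterpart as a tuple before shuffling ensures that the two sides stay aligned in every subset.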
