Achieve translation between two languages from unaligned image captioning datasets in the two languages. Hence, as far as translation is concerned, the setting is broadly unsupervised.
How do they solve it?
The authors propose a two-agent game that, as a by-product, yields a model that can translate between the given languages without using a parallel corpus. To achieve this, each agent is equipped with three modules: an image encoder, a native speaker module, and a foreign language encoder.
Native Speaker Module (NSM): This is an image captioning model tasked with describing an image as well as possible. It is a GRU that is fed the image as its first input, producing a hidden state that is then used to emit a token via a fully connected layer and ST-Gumbel-softmax sampling. Further unrolling of the GRU produces the full description.
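As a concrete illustration, below is a minimal sketch of such a module in PyTorch. The class name, sizes, and the choice to condition the GRU on a pre-extracted D-dimensional image feature (rather than the raw image) are assumptions for the sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NativeSpeakerModule(nn.Module):
    """GRU captioner that emits a discrete message via ST-Gumbel-softmax (illustrative sketch)."""
    def __init__(self, vocab_size, feat_dim=512, hid_dim=512, emb_dim=256, max_len=20):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, hid_dim)  # image feature initializes the GRU state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRUCell(emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)     # fully connected layer over the vocabulary
        self.max_len = max_len

    def forward(self, img_feat, bos_id=1, tau=1.0):
        batch = img_feat.size(0)
        h = torch.tanh(self.img_proj(img_feat))       # image acts as the GRU's first input
        x = self.embed(torch.full((batch,), bos_id, dtype=torch.long, device=img_feat.device))
        message = []
        for _ in range(self.max_len):                 # unroll to produce the full description
            h = self.gru(x, h)
            logits = self.out(h)
            # Straight-through Gumbel-softmax: one-hot tokens forward, soft gradients backward.
            token = F.gumbel_softmax(logits, tau=tau, hard=True)
            message.append(token)
            x = token @ self.embed.weight             # embed the sampled token for the next step
        return torch.stack(message, dim=1)            # (batch, max_len, vocab_size) one-hot message
```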
Foreign Language Encoder (FLE): This is also an RNN; it produces features of dimension D from a given text in the language that is foreign to the agent. This text is the description of the image produced by the other agent in its own language.
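A matching sketch of this encoder, under the same illustrative assumptions: it consumes the one-hot message produced by the other agent's NSM and summarizes it into a D-dimensional feature vector.

```python
import torch.nn as nn

class ForeignLanguageEncoder(nn.Module):
    """RNN that maps a foreign-language message to a feature vector of dimension D (sketch)."""
    def __init__(self, vocab_size, emb_dim=256, feat_dim=512):    # feat_dim plays the role of D
        super().__init__()
        self.embed = nn.Linear(vocab_size, emb_dim, bias=False)   # works on one-hot / soft tokens
        self.gru = nn.GRU(emb_dim, feat_dim, batch_first=True)

    def forward(self, message):                 # message: (batch, length, vocab_size)
        _, h = self.gru(self.embed(message))    # final hidden state summarizes the description
        return h.squeeze(0)                     # (batch, D)
```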
Image Encoder (IE): This is a CNN that encodes the image into features of the same dimension D as the FLE output. The aim is for these encodings to be as close as possible to the ones the agent obtains from its FLE.
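A sketch of this module as well, assuming a torchvision ResNet-18 backbone purely for illustration (the paper's actual CNN and whether it is fine-tuned are not specified here):

```python
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """CNN that maps an image into the same D-dimensional space as the FLE output (sketch)."""
    def __init__(self, feat_dim=512):           # feat_dim = D, matching the FLE
        super().__init__()
        resnet = models.resnet18(weights=None)  # illustrative backbone choice
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.proj = nn.Linear(resnet.fc.in_features, feat_dim)

    def forward(self, images):                  # images: (batch, 3, H, W)
        feats = self.backbone(images).flatten(1)
        return self.proj(feats)                 # (batch, D), comparable to FLE encodings
```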
The overall process is then as below (_1 and _2 denote the two agents):
Image -> NSM_1 -> FLE_2 -> Feature <- IE_2 <- (Set of images, one of them being the original image)
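Putting the sketched modules together, one round of the game might look like the following. Here `target_feat` is assumed to be the D-dimensional feature of the original image on the speaker's side, and `candidate_images` is the set containing the original image among distractors; both names are illustrative.

```python
def game_round(target_feat, candidate_images, nsm_1, fle_2, ie_2):
    """One round: Image -> NSM_1 -> FLE_2 -> feature, matched against IE_2 candidate embeddings."""
    message = nsm_1(target_feat)        # agent 1 describes the image in language 1
    msg_emb = fle_2(message)            # agent 2 encodes the foreign description: (batch, D)
    cand_embs = ie_2(candidate_images)  # (num_candidates, D) candidate image embeddings
    # Mean squared distance between the message embedding and every candidate image embedding.
    dists = ((msg_emb.unsqueeze(1) - cand_embs.unsqueeze(0)) ** 2).mean(-1)  # (batch, num_candidates)
    return message, dists
```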
From this, translation can be done by chaining two modules of the target-language agent; since the FLE output lives in the same D-dimensional space as the image features, it can be fed to the NSM in place of an image:
Text in Language 1 -> FLE_2 -> NSM_2 -> Text in Language 2
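Under this reading, test-time translation is just a composition of two trained modules. The sketch below assumes the input sentence is already one-hot encoded over language 1's vocabulary.

```python
def translate_1_to_2(sentence_lang1, fle_2, nsm_2):
    """Translate: Text in Language 1 -> FLE_2 -> feature -> NSM_2 -> Text in Language 2."""
    feat = fle_2(sentence_lang1)   # (batch, D), a stand-in for an image embedding
    return nsm_2(feat)             # one-hot message generated in language 2
```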
The different loss functions used are as follows:
* A cross-entropy loss on the NSM's output. This is the same as in an image captioning model and assesses the quality of the descriptions generated by the NSM.
* A cross-entropy loss on the inverse of the mean squared distance between the target image embedding and the embedding of the message passed by the other agent. See Eq. 1; both loss terms are sketched below the list.
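As referenced above, a sketch of both loss terms under the assumptions of the earlier code: the captioning term is ordinary token-level cross-entropy (given teacher-forced NSM logits and a ground-truth caption), and the communication term scores each candidate image by the inverse of its mean squared distance to the message embedding, as described; using the negative distance directly would be a common alternative.

```python
import torch.nn.functional as F

def captioning_loss(caption_logits, target_tokens):
    # caption_logits: (batch, length, vocab) teacher-forced NSM logits
    # target_tokens:  (batch, length) ground-truth caption indices
    return F.cross_entropy(caption_logits.flatten(0, 1), target_tokens.flatten())

def communication_loss(dists, target_index, eps=1e-8):
    # dists: (batch, num_candidates) mean squared distances from game_round;
    # the original image should have the smallest distance, hence the largest score.
    scores = 1.0 / (dists + eps)    # inverse of the mean squared distance, as in Eq. 1
    return F.cross_entropy(scores, target_index)
```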