Intro - Hugging Face

Hugging Face is an extremely popular Python library which provides state-of-the-art models for various NLP tasks like text classification, machine translation, etc. It enables us to quickly experiment with various NLP architectures using its modules, thereby helping us focus on research instead of the nitty-gritty plumbing.

One other big plus point is that it supports both the PyTorch and TensorFlow frameworks, and we can easily switch between the two. We can also convert a model into the ONNX format if needed for inference.
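
For example, here is a minimal sketch of the framework switch. It assumes the installed transformers version ships the TensorFlow Marian class (TFMarianMTModel) and that the checkpoint's PyTorch weights can be converted on the fly:

from transformers import MarianMTModel, TFMarianMTModel

model_name = 'Helsinki-NLP/opus-mt-en-dra'

pt_model = MarianMTModel.from_pretrained(model_name)  # PyTorch weights
# from_pt=True converts the PyTorch weights on the fly when no native
# TensorFlow weights are published for the checkpoint
tf_model = TFMarianMTModel.from_pretrained(model_name, from_pt=True)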

Hugging Face has released various translation models, which you can explore in this link. We will be using the MarianMT model, which has already been trained on parallel texts involving English and the Dravidian languages. MarianMT's main ideas come from the MarianNMT project, which is written mainly in C++. All the MarianMT models at Hugging Face are Transformer encoder-decoders with 6 layers in each component.

Intro - Translation

Machine Translation can be thought of as a seq2seq generation task which contains encoder and decoder blocks. To train the model, the encoder receives the sentences in the source language and the decoder is made to predict the sentences in the target language. You can check out this initial paper from Google for more information on how it is done.
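
To make the encoder-decoder split concrete, here is a minimal training-style sketch using the model we load later in this article. It assumes prepare_seq2seq_batch returns a labels field when tgt_texts is passed; the reference translation is the one the model produces further below:

from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-dra'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Source sentences feed the encoder; reference translations feed the decoder
batch = tokenizer.prepare_seq2seq_batch(
    src_texts=['>>tam<< How are you doing?'],
    tgt_texts=['நீ எப்படி இருக்கிறாய்?'])

# Passing labels makes the model compute the cross-entropy loss over the
# target tokens, which is the quantity minimized during training
outputs = model(input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['labels'])
print(outputs[0])  # the training loss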

Here in this article we will be using translation models trained on the Transformer architecture, and you can see how easy it is to create a translation pipeline using Hugging Face.

Code

!pip install transformers
Requirement already satisfied: transformers in /usr/local/lib/python3.6/dist-packages (3.4.0)
from transformers import MarianMTModel, MarianTokenizer # imports the MarianMT model architecture and the tokenizer

model_name = 'Helsinki-NLP/opus-mt-en-dra' # This model has been trained on parallel texts of English and the Dravidian languages.

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

print(tokenizer.supported_language_codes)
['>>tel<<', '>>kan<<', '>>mal<<', '>>tam<<']

Once you run the above code block you can see that the required tokenizer and model are downloaded from the Hugging Face model repository. The print statement prints out the languages supported by the translation engine. Since we are translating from English to the Dravidian languages, we can see the 4 language codes of the Dravidian languages.

All the 4 language codes which you see in the output cell are based on "ISO 639-2", a three-letter language classification system. There is also a commonly used two-letter language classification system called ISO 639-1. You can learn more about the different language codes from this Wikipedia link, which has a nice list of all the language codes in the various standards.

Now let's prepare some texts for the translation engine to translate.

text_to_be_translated = ['>>tam<< How are you doing?',
                         '>>kan<< How are you doing?',
                         '>>tel<< How are you doing?',
                         '>>mal<< How are you doing?']

You can see that I am creating a list of the same sentence for the model to translate, but I am prepending the language code of each Dravidian language inside the angle brackets. This addition of language codes at the beginning of the text is necessary because the translation model has been trained to predict multiple target languages with English as the source language.
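
If the manual prepending feels tedious, a small helper (my own convenience function, not part of transformers) produces the same list:

def with_lang_codes(sentence, lang_codes):
    """Pair one English sentence with each MarianMT target-language token."""
    return ['>>{}<< {}'.format(code, sentence) for code in lang_codes]

text_to_be_translated = with_lang_codes('How are you doing?',
                                        ['tam', 'kan', 'tel', 'mal'])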

batch_text = tokenizer.prepare_seq2seq_batch(text_to_be_translated)
print(batch_text)
{'input_ids': tensor([[ 14, 129,  43,  24, 713,  15,   0],
        [ 12, 129,  43,  24, 713,  15,   0],
        [ 11, 129,  43,  24, 713,  15,   0],
        [ 13, 129,  43,  24, 713,  15,   0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1]])}

In the print statement above you can observe that, after tokenization, only the tokens for the language codes differ, while the other tokens in the input ids are the same for all the sentences.
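
You can verify this programmatically with a quick sanity check on the batch (the values are PyTorch tensors, as the print above shows):

input_ids = batch_text['input_ids']

# Everything after the first position is identical across the four rows
print((input_ids[:, 1:] == input_ids[0, 1:]).all())  # tensor(True)

# Only the leading language-code token differs per sentence
print(input_ids[:, 0])  # tensor([14, 12, 11, 13])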

translated = model.generate(**batch_text)
print(translated)
tensor([[62951,  1796,  1381,  4547,  1629,    15,     0],
        [62951,   383, 13504,  9075,    15,     0, 62951],
        [62951,   934,   230,  6063,    15,     0, 62951],
        [62951,  6302, 11736,    15,     0, 62951, 62951]])

This step makes the model generate the output token ids for the input batch. The ids which you see in the tensor all have relevant mappings to tokens in the target language. The tokenization technique used here is based on SentencePiece tokenization, which splits words into subwords and creates a mapping dictionary. You can learn more about the SentencePiece tokenization technique in this paper.
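
Note that generate was called with its defaults here; it also accepts the usual decoding options if you want more control. A quick sketch with illustrative values:

# Beam search with a few common generation options (values are illustrative)
translated = model.generate(**batch_text,
                            num_beams=4,          # keep 4 candidate sequences
                            max_length=64,        # cap the output length
                            early_stopping=True)  # stop once all beams finish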

Now let's explore which word components some of the ids in the output tensor map to, for the sentence translated to Tamil.

print("Word for id 62951:", tokenizer.decode(token_ids=[62951]))
print("Word for id 1796:", tokenizer.decode(token_ids=[1796]))
print("Word for id 1381:", tokenizer.decode(token_ids=[1381]))
print("Word for id 4547:", tokenizer.decode(token_ids=[4547]))
print("Word for id 1629:", tokenizer.decode(token_ids=[1629]))
print("Word for id 15:", tokenizer.decode(token_ids=[15]))
print("Word for id 0:", tokenizer.decode(token_ids=[0]))
Word for id 62951: <pad>
Word for id 1796: நீ
Word for id 1381: எப்படி
Word for id 4547: இருக்கிற
Word for id 1629: ாய்
Word for id 15: ?
Word for id 0: 

We can observe that 62951, 15 and 0 are the token ids for <pad>, ? and the empty string (the end-of-sentence token) respectively. And since the model has been trained on parallel text for all the 4 languages combined, these ids map to the same tokens irrespective of the target language.

You may also have observed, if you know Tamil, that the tokens for ids 4547 and 1629 form a single word but are split into two subwords because of the SentencePiece tokenizer.
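
Decoding the two ids together should stitch the subwords back into the full word:

# The two subword ids decode jointly into the single Tamil word இருக்கிறாய்
print(tokenizer.decode(token_ids=[4547, 1629]))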

Now let's decode the list of sentences in the tensor using the tokenizer.

tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text)
['நீ எப்படி இருக்கிறாய்?', 'ನೀವು ಹೇಗಿದ್ದೀರಿ?', 'ఎలా మీరు చేస్తున్న?', 'സുഖമാണോ?']

So you have now created an English-to-Dravidian-languages translation setup in less than 10 steps using the Hugging Face package. You can also implement this translation activity using the pipeline feature of Hugging Face, which abstracts the entire process. So let's take a look at how that works.

from transformers import pipeline, MarianTokenizer, MarianMTModel

model_name = 'Helsinki-NLP/opus-mt-en-dra' # This model has been trained on parallel texts of English and the Dravidian languages.

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

translation_engine = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

text_to_translate = input('Please enter the English text to translate:\n') # built-in input() takes the prompt positionally

lang_select = input('Please enter one of the following languages: 1) Tamil, 2) Telugu, 3) Kannada and 4) Malayalam:\n')
if lang_select == "Tamil":
  text_to_translate = ">>tam<< " + text_to_translate
elif lang_select == "Kannada":
  text_to_translate = ">>kan<< " + text_to_translate
elif lang_select == "Telugu":
  text_to_translate = ">>tel<< " + text_to_translate
elif lang_select == "Malayalam":
  text_to_translate = ">>mal<< " + text_to_translate

translated_text = translation_engine(text_to_translate)
print("The translated text is: {}".format(translated_text[0]["generated_text"]))
Please enter the English text to translate:
hello, how are you doing?
Please enter one of the following languages: 1) Tamil, 2) Telugu, 3) Kannada and 4) Malayalam:
Tamil
The translated text is: ஹலோ, நீ எப்படி இருக்கிறாய்?

As you can see, this abstracts the majority of the technical know-how and creates an easy-to-use pipeline which enables us to build products faster.
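
One more usage note: the pipeline also accepts a list of sentences, so you can batch several translations in one call. A small sketch reusing the engine built above:

# Translate one sentence into two target languages in a single call
results = translation_engine(['>>tam<< How are you doing?',
                              '>>mal<< How are you doing?'])
for result in results:
    print(result['generated_text'])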
