Mono and Multi-Lingual Embeddings


Some of the examples below use the multilingual embedding model paraphrase-multilingual-mpnet-base-v2. This means that the representations you are going to use are multilingual. However, you might need broader coverage of languages. In that case, you can check SBERT to find a model that fits your use case.
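To make the choice concrete, here is a small, hypothetical helper that maps a language to one of the model names mentioned in this document (the checkpoint names are real; the helper itself is only an illustration and not part of the package):

```python
# Model names taken from this document; the helper function is a
# hypothetical convenience, not part of contextualized-topic-models.
EMBEDDING_MODELS = {
    "english": "paraphrase-distilroberta-base-v2",
    "multilingual": "paraphrase-multilingual-mpnet-base-v2",
    "italian": "Musixmatch/umberto-commoncrawl-cased-v1",
}

def select_embedding_model(language: str) -> str:
    """Return an embedding model name for the given language,
    falling back to the multilingual model for anything unlisted."""
    return EMBEDDING_MODELS.get(language.lower(), EMBEDDING_MODELS["multilingual"])

print(select_embedding_model("English"))  # paraphrase-distilroberta-base-v2
print(select_embedding_model("German"))   # paraphrase-multilingual-mpnet-base-v2
```

Any language without a dedicated entry falls back to the multilingual model, which is a reasonable default when SBERT has no monolingual checkpoint for it.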


If you are doing topic modeling in English, you should use an English SBERT model, for example paraphrase-distilroberta-base-v2. In that case, updating the code to support monolingual English topic modeling is straightforward. If you need other models, you can check SBERT.

from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

qt = TopicModelDataPreparation("paraphrase-distilroberta-base-v2")


In general, our package should be able to support all the models available in the sentence-transformers package and on HuggingFace. Take a look at the HuggingFace models and find the one for your language. For example, for Italian you can use UmBERTo. How do you use it in the model? Just pass the name of the model you want instead of the English/multilingual one:

qt = TopicModelDataPreparation("Musixmatch/umberto-commoncrawl-cased-v1")
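To see where this preparation step fits, the sketch below assumes the standard CombinedTM workflow from this package; `documents` and `preprocessed_documents` are placeholders you would replace with your own corpus, and the hyperparameter values are illustrative only:

```python
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# Placeholder corpora (assumptions): raw texts feed the contextual
# embedding model; preprocessed texts build the bag of words.
documents = ["..."]               # raw Italian documents
preprocessed_documents = ["..."]  # the same documents, preprocessed

qt = TopicModelDataPreparation("Musixmatch/umberto-commoncrawl-cased-v1")
training_dataset = qt.fit(
    text_for_contextual=documents,
    text_for_bow=preprocessed_documents,
)

# contextual_size must match the embedding dimension of the chosen model;
# 768 is typical for BERT-base-sized models like UmBERTo.
ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50)
ctm.fit(training_dataset)
```

The only change needed to switch languages is the model name passed to TopicModelDataPreparation; the rest of the pipeline stays the same.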