Contextualized Topic Models¶
Contextualized Topic Models (CTM) are a family of topic models that use pre-trained representations of language (e.g., BERT) to support topic modeling. See the papers for details:
Bianchi, F., Terragni, S., & Hovy, D. (2021). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. ACL. https://aclanthology.org/2021.acl-short.96/
Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. EACL. https://www.aclweb.org/anthology/2021.eacl-main.143/

Topic Modeling with Contextualized Embeddings¶
Our new topic modeling family supports many different languages (i.e., those supported by HuggingFace models) and comes in two versions: CombinedTM combines contextual embeddings with the good old bag of words to produce more coherent topics; ZeroShotTM is the ideal topic model for tasks in which you might have missing words in the test data and, if trained with multilingual embeddings, it also inherits the property of being a multilingual topic model!
The big advantage is that you can use different embeddings with CTMs: when a new embedding method comes out, you can plug it into the code and improve your results. We are no longer limited by the BoW.
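As a rough sketch of how this looks in practice (a minimal example assuming the v2.x API, where TopicModelDataPreparation builds both the contextual embeddings and the bag of words; the embedding model name, the placeholder documents, and the hyperparameters below are illustrative choices, not prescriptions):

```python
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# raw text goes to the contextual encoder, preprocessed text to the bag of words
raw_docs = ["The quick brown fox jumps over the lazy dog.", "Another raw document."]   # placeholder data
preprocessed_docs = ["quick brown fox jumps lazy dog", "another raw document"]          # placeholder data

# any SentenceTransformers model name should work here; this multilingual one is an example choice
qt = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")
training_dataset = qt.fit(text_for_contextual=raw_docs, text_for_bow=preprocessed_docs)

# contextual_size must match the embedding dimension of the chosen model (768 here)
ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50)
ctm.fit(training_dataset)

print(ctm.get_topic_lists(5))  # top-5 words for each topic
```

Swapping in a different embedding method is then just a matter of changing the model name passed to TopicModelDataPreparation (and the contextual_size, if the dimension differs).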
We also have Kitty! A new submodule that can be used to quickly create a human-in-the-loop classifier to label your documents and build named clusters.
Tutorials¶
You can look at our Medium blog post or start from one of our Colab tutorials:
- Combined TM on Wikipedia Data (Preproc+Saving+Viz) (stable v2.2.0)
- Zero-Shot Cross-lingual Topic Modeling (Preproc+Viz) (stable v2.2.0)
- Kitty: Human in the loop Classifier (High-level usage) (stable v2.2.0)
- SuperCTM and β-CTM (High-level usage) (stable v2.2.0)
Overview¶
TL;DR¶
In CTMs we have two models, CombinedTM and ZeroShotTM, which have different use cases.
CTMs work better when the size of the bag of words is restricted to no more than 2000 terms. This is because a neural model has to reconstruct the input bag of words; moreover, in CombinedTM we project the contextualized embedding to the vocabulary space, so the bigger the vocabulary, the more parameters the model has, which makes training more difficult and prone to bad fitting. This is NOT a strict limit, but consider preprocessing your dataset. We have a preprocessing pipeline that can help you deal with this.
Check the contextual model you are using: a multilingual model applied to English data might not give results as good as a model trained purely on English.
Preprocessing is key. If you give a contextual model like BERT preprocessed text, it might be difficult for it to produce a good representation. What we usually do is use the preprocessed text for the bag-of-words creation and the NOT preprocessed text for the BERT embeddings. Our preprocessing class can take care of this for you.
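A minimal sketch of that preprocessing step (assuming the WhiteSpacePreprocessing class shipped with the library; the exact values returned by preprocess() may differ slightly between versions, so treat the unpacking below as illustrative):

```python
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

documents = ["The quick brown fox jumps over the lazy dog.", "Another raw document."]  # placeholder data

# keep the vocabulary small (e.g., 2000 terms), as suggested above
sp = WhiteSpacePreprocessing(documents, stopwords_language="english", vocabulary_size=2000)

# the first three returned values are the preprocessed docs (for the BoW),
# the original unpreprocessed docs (for BERT), and the vocabulary
results = sp.preprocess()
preprocessed_docs, unpreprocessed_docs, vocab = results[0], results[1], results[2]
```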
Features¶
An important aspect to take into account is which network you want to use: the one that combines contextualized embeddings and the BoW (CombinedTM) or the one that uses only contextualized embeddings (ZeroShotTM). Remember that zero-shot cross-lingual topic modeling is possible only with the ZeroShotTM model. See cross-lingual-topic-modeling.
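As a rough sketch of the cross-lingual workflow (again assuming the v2.x API; the multilingual embedding model, the placeholder documents, and the hyperparameters are illustrative assumptions):

```python
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

english_raw = ["The documents used for training the model.", "Another training document."]  # placeholder data
english_bow = ["documents used training model", "another training document"]                # preprocessed counterpart
italian_raw = ["Un documento di prova in italiano."]                                        # placeholder test data

# a multilingual SentenceTransformers model is what enables the zero-shot transfer
qt = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")
training_dataset = qt.fit(text_for_contextual=english_raw, text_for_bow=english_bow)

ctm = ZeroShotTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50)
ctm.fit(training_dataset)

# at test time only the contextual embeddings are needed, so missing (or Italian) words are fine
testing_dataset = qt.transform(text_for_contextual=italian_raw)
topic_distributions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=20)
```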
We also have Kitty: a utility you can use to do a simpler human-in-the-loop classification of your documents. This can be very useful for document filtering. It also works in a cross-lingual setting, so you might be able to filter documents in a language you don't know!
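A rough sketch of the Kitty workflow (the method names follow Kitty's high-level API as we describe it in the tutorials, but exact signatures may vary by version; the documents and cluster labels below are illustrative):

```python
from contextualized_topic_models.models.kitty_classifier import Kitty

documents = ["a document about birds and forests", "a document about wrestling matches"]  # placeholder data

kt = Kitty()
kt.train(documents, 5)  # fits a topic model with 5 topics under the hood

# inspect the word clusters and assign a human-readable label to each one
print(kt.pretty_print_word_classes())
kt.assigned_classes = {0: "nature", 1: "wrestling"}  # illustrative labels; fill in the rest yourself

# classify new documents using the named clusters
print(kt.predict(["the sky is blue today"]))
```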
References¶
If you find this useful you can cite the following papers :)
ZeroShotTM
@inproceedings{bianchi-etal-2021-cross,
title = "Cross-lingual Contextualized Topic Models with Zero-shot Learning",
author = "Bianchi, Federico and Terragni, Silvia and Hovy, Dirk and
Nozza, Debora and Fersini, Elisabetta",
booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.eacl-main.143",
pages = "1676--1683",
}
CombinedTM
@inproceedings{bianchi-etal-2021-pre,
title = "Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence",
author = "Bianchi, Federico and
Terragni, Silvia and
Hovy, Dirk",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-short.96",
doi = "10.18653/v1/2021.acl-short.96",
pages = "759--766",
}
Development Team¶
Federico Bianchi <f.bianchi@unibocconi.it> Bocconi University
Silvia Terragni <s.terragni4@campus.unimib.it> University of Milan-Bicocca
Dirk Hovy <dirk.hovy@unibocconi.it> Bocconi University
Software Details¶
Free software: MIT license
Documentation: https://contextualized-topic-models.readthedocs.io.
Super big shout-out to Stephen Carrow for creating the awesome https://github.com/estebandito22/PyTorchAVITM package, on which we built the foundations of this package. We are happy to redistribute this software again under the MIT License.
Credits¶
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template. To ease the use of the library we have also included the rbo package; all rights reserved to the author of that package.
Note¶
Remember that this is a research tool :)