CombinedTM: Coherent Topic Models

CombinedTM combines the bag-of-words (BoW) document representation with contextualized SBERT embeddings, an approach that the paper reports increases the coherence of the predicted topics (https://aclanthology.org/2021.acl-short.96/).

Usage

Here is how you can use CombinedTM. It is a standard topic model that also uses contextualized embeddings, which makes the resulting topics more coherent (see the paper: https://arxiv.org/abs/2004.03974). In the code that follows, n_components=50 sets the number of topics.

from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

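# Name of the SBERT model used to build the contextualized embeddings
qt = TopicModelDataPreparation("paraphrase-distilroberta-base-v2")

# Unpreprocessed text is used for the contextual embeddings, preprocessed text for the BoW
training_dataset = qt.fit(text_for_contextual=list_of_unpreprocessed_documents, text_for_bow=list_of_preprocessed_documents)

# contextual_size has to match the embedding dimension of the SBERT model (768 here)
ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50) # 50 topics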

ctm.fit(training_dataset) # run the model
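
The qt.fit call above takes two versions of each document: the unpreprocessed text for the contextual embeddings and a preprocessed version for the BoW. The following is a minimal sketch of how the two lists might be built with the standard library only. The simple_preprocess helper and the tiny STOPWORDS set are hypothetical and only meant for illustration; they are not part of the library, and a real pipeline would likely use a fuller stop-word list and vocabulary pruning.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "on"}


def simple_preprocess(document):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


# The raw documents go to the contextual model as they are.
list_of_unpreprocessed_documents = [
    "The cat sat on the mat.",
    "Topic models are a way to explore large collections of text!",
]

# The preprocessed documents are built in the same order, one per raw document.
list_of_preprocessed_documents = [
    simple_preprocess(d) for d in list_of_unpreprocessed_documents
]

print(list_of_preprocessed_documents)
```

The two lists have to stay aligned: the same length and the same order. If a document becomes empty after preprocessing, it would need to be dropped from both lists.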

Once the model is trained, it is easy to get the topics:

ctm.get_topics()
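
To read the topics more comfortably, the returned words can be printed one topic per line. This is a small sketch that assumes get_topics returns either a mapping from topic id to a list of words or a list of word lists; the exact return type may depend on the library version, so the helper accepts both. The format_topics function and the example_topics values are hypothetical stand-ins, not library output.

```python
def format_topics(topics, top_n=5):
    """Return one printable line per topic from a dict or a list of word lists."""
    # Accept either {topic_id: [words]} or [[words], ...]; which one the
    # library returns is an assumption that may depend on its version.
    items = topics.items() if isinstance(topics, dict) else enumerate(topics)
    return [
        f"Topic {topic_id}: {', '.join(words[:top_n])}"
        for topic_id, words in items
    ]


# Hypothetical stand-in for the value returned by ctm.get_topics()
example_topics = {
    0: ["game", "team", "season", "player", "league", "coach"],
    1: ["film", "director", "actor", "scene", "movie", "cast"],
}

for line in format_topics(example_topics, top_n=3):
    print(line)
```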

Creating the Test Set

The transform method will take care of most things for you, for example the generation of a corresponding BoW by considering only the words that the model has seen in training.
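
The idea of building a test BoW from only the words seen in training can be illustrated with a few lines of plain Python. This is a conceptual sketch, not the library's implementation: fit_vocabulary and to_bow are hypothetical helpers, and the documents are made up.

```python
from collections import Counter


def fit_vocabulary(preprocessed_documents):
    """Collect the sorted set of words seen in the training documents."""
    return sorted({w for doc in preprocessed_documents for w in doc.split()})


def to_bow(document, vocabulary):
    """Count only words that are in the training vocabulary; others are ignored."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


train_docs = ["cat sat mat", "dog sat log"]
vocabulary = fit_vocabulary(train_docs)  # ['cat', 'dog', 'log', 'mat', 'sat']

test_doc = "cat sat sofa cat"  # 'sofa' was never seen in training
print(to_bow(test_doc, vocabulary))  # [2, 0, 0, 0, 1]
```

Because unseen words are ignored, the test BoW always has the same size as the training vocabulary, which is what the trained model expects as input.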

If you use CombinedTM, you need to include the test text for the BoW:

testing_dataset = qt.transform(text_for_contextual=testing_text_for_contextual, text_for_bow=testing_text_for_bow)

# n_samples: how many times to sample the distribution (see the documentation)
ctm.get_doc_topic_distribution(testing_dataset, n_samples=20) # returns a (n_documents, n_topics) matrix with the topic distribution of each document
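
A common next step is to pick the most probable topic for each document from the (n_documents, n_topics) matrix. The sketch below does this in plain Python on a made-up matrix; dominant_topics and example_distribution are hypothetical, and the values are not real model output. The helper only iterates over rows, so it should also work on the array returned by the model.

```python
def dominant_topics(doc_topic_matrix):
    """Return (topic_index, probability) of the most probable topic per document."""
    result = []
    for row in doc_topic_matrix:
        row = list(row)
        # Index of the largest probability in this document's distribution
        best = max(range(len(row)), key=lambda i: row[i])
        result.append((best, row[best]))
    return result


# Hypothetical stand-in for the returned matrix: 3 documents, 4 topics
example_distribution = [
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.60, 0.20, 0.10, 0.10],
]

print(dominant_topics(example_distribution))  # [(1, 0.7), (2, 0.4), (0, 0.6)]
```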

Warning

Note that the transform method is used differently here than for ZeroShotTM: with CombinedTM, both text_for_contextual and text_for_bow must be passed at test time. This is very important!

Tutorial

You can find a tutorial here: Open In Colab. It shows how to use CombinedTM.