BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. It even supports visualizations similar to LDAvis!

Corresponding Medium posts can be found here and here.

Installation

Installation, with sentence-transformers, can be done using PyPI:

pip install bertopic

You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:

pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]

To install all backends:

pip install bertopic[all]

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with one of the examples below:

Name                                           | Link
Topic Modeling with BERTopic                   | Open In Colab
(Custom) Embedding Models in BERTopic          | Open In Colab
Advanced Customization in BERTopic             | Open In Colab
(semi-)Supervised Topic Modeling with BERTopic | Open In Colab
Dynamic Topic Modeling with Trump's Tweets     | Open In Colab

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

After generating topics, we can access the frequent topics that were generated:

>>> topic_model.get_topic_info()

Topic  Count  Name
-1     4630   -1_can_your_will_any
49     693    49_windows_drive_dos_file
32     466    32_jesus_bible_christian_faith
2      441    2_space_launch_orbit_lunar
22     381    22_key_encryption_keys_encrypted

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> topic_model.get_topic(49)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
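Once fitted, the model can also assign topics to unseen documents with transform (see the Overview below). A minimal sketch, where the example sentence is made up for illustration:

new_doc = "NASA plans another shuttle launch into orbit"  # hypothetical example document
new_topics, new_probs = topic_model.transform([new_doc])
print(new_topics)  # topic id(s) assigned to the new document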

NOTE: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.
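For example, a minimal sketch of fitting a multilingual model, reusing the docs variable from the quick start:

topic_model = BERTopic(language="multilingual")
topics, _ = topic_model.fit_transform(docs)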

Visualize Topics

After having trained our BERTopic model, we could iteratively go through perhaps a hundred topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()
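The result is an interactive figure. If you want to keep it outside a notebook, a minimal sketch that saves it to an HTML file, assuming the returned object is a Plotly figure (the filename is just an example):

fig = topic_model.visualize_topics()
fig.write_html("topic_visualization.html")  # example output path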

Embedding Models

BERTopic supports many embedding models that can be used to embed the documents and words:

  • Sentence-Transformers
  • Flair
  • spaCy
  • Gensim
  • USE

Click here for a full overview of all supported embedding models.

Sentence-Transformers

You can select any model from sentence-transformers here and pass it to BERTopic:

topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:

from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model)

Flair

Flair allows you to choose almost any embedding model that is publicly available. Flair can be used as follows:

from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)

You can select any 🤗 transformers model here.

Custom Embeddings

You can also use previously generated embeddings by passing them to fit_transform():

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)
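As an illustration of where such embeddings could come from, here is a minimal sketch that precomputes them with sentence-transformers (the model name mirrors the earlier example; any array of shape (n_documents, embedding_dim) should work in the same way):

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Precompute document embeddings once, e.g. to reuse them across several BERTopic runs
sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# Pass the precomputed embeddings alongside the documents
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)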

Dynamic Topic Modeling

Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. These methods allow you to understand how a topic is represented across different times. Here, we will be using all of Donald Trump's tweets to see how he talked about certain topics over time:

import re
import pandas as pd

trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x: x[0] != "@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

topic_model = BERTopic(verbose=True)
topics, _ = topic_model.fit_transform(tweets)

From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this by simply calling topics_over_time and passing in his tweets, the corresponding timestamps, and the related topics:

topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps)

Finally, we can visualize the topics by simply calling visualize_topics_over_time():

topic_model.visualize_topics_over_time(topics_over_time, top_n=6)

Overview

For quick access to common functions, here is an overview of BERTopic's main methods:

Method                                    | Code
Fit the model                             | BERTopic().fit(docs)
Fit the model and predict documents       | BERTopic().fit_transform(docs)
Predict new documents                     | BERTopic().transform([new_doc])
Access single topic                       | BERTopic().get_topic(topic=12)
Access all topics                         | BERTopic().get_topics()
Get topic frequencies                     | BERTopic().get_topic_freq()
Get all topic information                 | BERTopic().get_topic_info()
Get topics per class                      | BERTopic().topics_per_class(docs, topics, classes)
Dynamic Topic Modeling                    | BERTopic().topics_over_time(docs, topics, timestamps)
Visualize Topics                          | BERTopic().visualize_topics()
Visualize Topic Probability Distribution  | BERTopic().visualize_distribution(probs[0])
Visualize Topics over Time                | BERTopic().visualize_topics_over_time(topics_over_time)
Visualize Topics per Class                | BERTopic().visualize_topics_per_class(topics_per_class)
Update topic representation               | BERTopic().update_topics(docs, topics, n_gram_range=(1, 3))
Reduce the number of topics               | BERTopic().reduce_topics(docs, topics, nr_topics=30)
Find topics                               | BERTopic().find_topics("vehicle")
Save model                                | BERTopic().save("my_model")
Load model                                | BERTopic.load("my_model")
Get parameters                            | BERTopic().get_params()
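As a rough illustration of how a few of these methods fit together, a minimal sketch that continues from the fitted quick-start model (the search term and file name are just examples, and it assumes find_topics returns the matching topic ids together with their similarity scores):

# Search for topics semantically related to a term
similar_topics, similarity = topic_model.find_topics("vehicle")

# Reduce the number of topics after training
topic_model.reduce_topics(docs, topics, nr_topics=30)

# Persist the model and load it back later
topic_model.save("my_model")
loaded_model = BERTopic.load("my_model")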

Citation

To cite BERTopic in your work, please use the following BibTeX reference:

@misc{grootendorst2020bertopic,
  author    = {Maarten Grootendorst},
  title     = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year      = 2020,
  publisher = {Zenodo},
  version   = {v0.7.0},
  doi       = {10.5281/zenodo.4381785},
  url       = {https://doi.org/10.5281/zenodo.4381785}
}
