diff --git a/getting_started/representation/llm.html b/getting_started/representation/llm.html
index 27d638c5..43ff09f3 100755
--- a/getting_started/representation/llm.html
+++ b/getting_started/representation/llm.html
@@ -1150,13 +1150,6 @@ Truncating Documents
-  • Document Truncation
@@ -2707,13 +2700,6 @@ Truncating Documents
-  • Document Truncation
@@ -2852,32 +2838,6 @@

    Selecting Documents

    Truncating Documents


    If you increase the number of documents passed to [DOCUMENTS], the token limit can quickly be exceeded. Instead, we can pass only part of each document by truncating it to a specific length. For that, we use two parameters that are accessible in all LLMs on this page, namely doc_length and tokenizer.


    Let's start with the tokenizer. It is used to split a document up into tokens/segments, which are then counted to calculate the document's length. The methods for tokenization are as follows:

    - If tokenizer is 'char', the document is split up into individual characters and its length is counted in characters.
    - If tokenizer is 'whitespace', the document is split up into words separated by whitespace and its length is counted in words.
    - If tokenizer is 'vectorizer', the internal vectorizer is used to tokenize the document and count its length.
    - If tokenizer is a callable, that callable is used to tokenize the document, for example a tiktoken encoding.

    After having tokenized the document according to one of the strategies above, doc_length is used to truncate the document to the specified number of tokens.


    For example, if the tokenizer is whitespace, a document is split up into individual words and the length of the document is counted as the total number of words. In contrast, if the tokenizer is a callable, we can use any callable that has .encode and .decode functions. If we were to use tiktoken, the document would be split up into tokens and the length of the document counted as the total number of tokens.
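    To make the counting concrete, here is a minimal sketch (not BERTopic's internal code, just plain Python) of how two of the strategies above measure and truncate the same document:

    doc = "Topic models describe large collections of documents with a small set of topics"

    # tokenizer="char": the document length is the number of characters
    truncated_by_chars = doc[:50]  # keep the first 50 characters

    # tokenizer="whitespace": the document length is the number of words
    words = doc.split()
    truncated_by_words = " ".join(words[:5])  # keep the first 5 words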


    To give an example, using tiktoken would work as follows:

    import openai
    import tiktoken
    from bertopic.representation import OpenAI
    from bertopic import BERTopic

    # Create tokenizer
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Create your representation model
    client = openai.OpenAI(api_key="sk-...")
    representation_model = OpenAI(client, tokenizer=tokenizer, doc_length=50)

    In this example, each document will be at most 50 tokens; anything longer will be truncated.
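    As a usage sketch (assuming you already have a list of strings named docs), the representation model is then passed to BERTopic like any other representation model:

    # Use the truncation-aware representation model inside BERTopic
    topic_model = BERTopic(representation_model=representation_model)
    topics, probs = topic_model.fit_transform(docs)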


    Document Truncation

    We can truncate the input documents in [DOCUMENTS] in order to reduce the number of tokens that we have in our input prompt. To do so, all text generation modules have two parameters that we can tweak: