diff --git a/src/content/config.ts b/src/content/config.ts index f74d50a..74d19af 100644 --- a/src/content/config.ts +++ b/src/content/config.ts @@ -10,7 +10,7 @@ import { const notebooks = defineCollection({ type: "content", // v2.5.0 and later schema: z.object({ - title: z.string(), + title: z.string().optional(), url: z.string().url().optional(), githubUrl: z.string().url().optional(), googleColabUrl: z.string().url().optional(), diff --git a/src/content/notebooks/search-multilingual-docs-impresso-hf.mdx b/src/content/notebooks/search-multilingual-docs-impresso-hf.mdx new file mode 100644 index 0000000..b688c9b --- /dev/null +++ b/src/content/notebooks/search-multilingual-docs-impresso-hf.mdx @@ -0,0 +1,498 @@ +--- +githubUrl: https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/search_multilingual_docs-ImpressoHF.ipynb +authors: + - impresso-team +title: Searching Relevant texts within an Embedding spaces +sha: 1b1cb5c32ccd6a4f594ddaedc5b8bb2003031de5 +date: 2024-10-29T10:20:36Z +links: [] +googleColabUrl: https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/search_multilingual_docs-ImpressoHF.ipynb +--- + +{/* cell:0 cell_type:markdown */} + +This notebook demonstrates how to use a pre-trained multilingual embedding model downloaded from Hugging Face to search for relevant texts across languages. + +We'll load the model, embed the texts and demonstrate use cases on how to find relevant texts across the German/French language pair within Impresso. Through these use cases we will get familiar with how to use the utilities functions we provide so that you can extend this to your user case of interest. + +Reccomended Hardware: GPU support, the colab free one (T4) is sufficient. Alternatively, calculations with CPU are possible but much slower. + + + + +{/* cell:1 cell_type:markdown */} +## 1. Install Dependencies + +First, we need to install `sentence-transformers` + + +{/* cell:2 cell_type:code */} +```python +!pip install sentence-transformers +``` + +{/* cell:3 cell_type:markdown */} +## 2. Model Information + +In this example, we are using an off the shelf multilingual embedding model hosted on Huggingface: `gte-multilingual-base'. + +Note: Newer impresso version of the model is in the works. + +This model predicts an embedding representation (list of numbers that stores the "meaning") of a given text (sentence, paragraph, article) that can be used to measure similarity between two texts. + + +{/* cell:4 cell_type:markdown */} +## 3. Loading the embedding model + +This class downloads the model from Hugging Face and loads it ready for prediction. We use the SentenceTransformers library to benefit from their functionality and documentation. + +{/* cell:5 cell_type:code */} +```python +from sentence_transformers import SentenceTransformer +from scipy.spatial.distance import cosine + +embedding_model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True) +``` + +{/* cell:6 cell_type:markdown */} +### Simple Test + +{/* cell:7 cell_type:code */} +```python +sentence1_en = "This is an example test sentence" +sentence2_en = "This constitutes a sample sentence" +``` + +{/* cell:8 cell_type:code */} +```python +embedding1_en = embedding_model.encode(sentence1_en) +embedding2_en = embedding_model.encode(sentence2_en) + +print("Embedding Representation of sentence1 starts with ... " + str(embedding1_en[:5])) +print("Embedding Representation of sentence2 starts with ... " + str(embedding2_en[:5])) +``` + +{/* cell:9 cell_type:markdown */} +Those numbers look intriguing, but do they really mean something? + +Answer: Yes, they can show us the similarity of the two texts + +{/* cell:10 cell_type:code */} +```python +similarity_value = round(1 - cosine(embedding1_en, embedding2_en),2) +print("Sentence1 and Sentence2 have a cosine similarity of " + str(similarity_value)) +``` + +{/* cell:11 cell_type:markdown */} +FAQ: + +Cosine similarity is not the percentage of similarity between the two texts. + +The higher the cosine similarity, the more similar the two texts are. Range of what is considered "high" varies per model and domain. + +Based on our experiments on contemporary texts, cosine similarity of 0.85+ means the two texts are mostly equivalent + +{/* cell:12 cell_type:markdown */} +### Simple Test Across Languages + +{/* cell:13 cell_type:code */} +```python +sentence1_de = "Das ist ein Beispieltestsatz" +``` + +{/* cell:14 cell_type:code */} +```python +embedding1_de = embedding_model.encode(sentence1_de) +``` + +{/* cell:15 cell_type:code */} +```python +similarity_value = round(1 - cosine(embedding1_en, embedding1_de),2) +print("Sentence1 in English and Sentence1 in German have a cosine similarity of " + str(similarity_value)) +``` + +{/* cell:16 cell_type:markdown */} +### Try your own similarity calculation + +{/* cell:17 cell_type:code */} +```python +input1 = input() +``` + +{/* cell:18 cell_type:code */} +```python +input2 = input() +``` + +{/* cell:19 cell_type:code */} +```python +embedding1 = embedding_model.encode(input1) +embedding2 = embedding_model.encode(input2) +``` + +{/* cell:20 cell_type:code */} +```python +similarity_value = round(1 - cosine(embedding1, embedding2),2) +print("Input1 and Input2 have a cosine similarity of " + str(similarity_value)) +``` + +{/* cell:21 cell_type:markdown */} +Note: You can also calculate the similarity of inputs from different languages + +{/* cell:22 cell_type:markdown */} +## 4. Finding similar texts within collections using the embedding model + +Now that we have seen how the model creates a representation and how we can use it to get the similarity of two texts, let's apply it to a couple of collections. + +{/* cell:23 cell_type:markdown */} +## Setting up utility functions + +Here we setup utilities functions that we use for later. You can safely ignore details of the implementation. You simply need to have the code cell executed + +For these functions we just need to have a high level understanding of the following key information: + +***create_embedding_collection(texts, embedding_model)*** + +This function creates a collection of embeddings from a list of sentences, texts, using the embedding model which we already loaded. It outputs a list of tuples pairing each text with its embedding + +***find_best_match_in_collection(source_collection, target_collection)*** + +This function finds the most similar text in a target_collection for each text in a source_collection based on their precomputed embeddings. Each collection should be a list of tuples (direct output of the create_embedding_collection function) where each tuple contains a sentence and its embedding. + +***print_matches_formatted(matches, link=False, threshold=0)*** + +This function formats and prints each match from the matches output of find_best_match_in_collection. It takes an optional threshold value to display only matches with a similarity above a certain value. + + +Example usage can be found on Section 4.1 + +{/* cell:24 cell_type:code */} +```python +from scipy.spatial.distance import cosine + +def create_embedding_collection(texts, embedding_model, uids=None): + # Encode the sentences using the model + embeddings = embedding_model.encode(texts) + + # Zip the sentences with their corresponding embeddings + if uids: + texts_embedding_collection = list(zip(texts, embeddings, uids)) + else: + texts_embedding_collection = list(zip(texts, embeddings)) + + return texts_embedding_collection + +def find_best_match_in_collection(source_collection, target_collection, matches_sorted=False, link=False): + """ + Finds the most similar sentence in the target collection for each sentence in the source collection + based on cosine similarity of their embeddings. + + Args: + source_collection (list of tuples): A list of tuples where each tuple contains a source sentence and its corresponding embedding. + target_collection (list of tuples): A list of tuples where each tuple contains a target sentence and its corresponding embedding. + + Returns: + list of dict: A list of dictionaries, each containing: + - 'source_text' (str): The original sentence from the source collection. + - 'best_match' (str): The most similar sentence from the target collection. + - 'similarity' (float): The cosine similarity score between the source sentence and the best match, rounded to two decimal places. + + Example: + source = [("sentence in German", embedding_vector)] + target = [("sentence in French", embedding_vector)] + matches = find_best_match_in_collection(source, target) + + Note: + The function assumes the embeddings are already precomputed and provided in the source and target collections. + """ + + matches = [] # List to store the best matches + for source in source_collection: + source_embedding = source[1] # Get the embedding of the source sentence + best_match_text = "" # Initialize the best match text + best_match_similarity = 0 # Initialize the best match similarity score + best_match_content_id = "" + # Iterate through the target collection to find the best match + for target in target_collection: + if source[0] == target[0]: # Skip if comparing the same sentence + continue + target_embedding = target[1] # Get the embedding of the target sentence + # Calculate cosine similarity between source and target embeddings + similarity_value = 1 - cosine(source_embedding, target_embedding) + # Update if the current similarity is higher than the previous best + if similarity_value > best_match_similarity: + best_match_similarity = similarity_value + best_match_text = target[0] + if link: + best_match_content_id = target[2] + + # Append the source sentence, best match, and similarity score to the results list + if link: + matches.append({ + "source_text": source[0], + "best_match": best_match_text, + "similarity": round(best_match_similarity, 2), # Round to 2 decimal places + "Content Item Source": f"https://impresso-project.ch/app/article/{source[2]}", # Corrected f-string for URL + "Content Item Matched": f"https://impresso-project.ch/app/article/{best_match_content_id}" # Corrected f-string for URL + }) + else: + matches.append({ + "source_text": source[0], + "best_match": best_match_text, + "similarity": round(best_match_similarity, 2) # Round to 2 decimal places + }) + + + if matches_sorted: + # Sort the matches by similarity in descending order if specified + matches = sorted(matches, key=lambda x: x['similarity'], reverse=True) + + return matches + +# Function to print the matches in a nicely formatted way +def print_matches_formatted(matches, link=False, threshold=0): + for match_dict in matches: + if threshold > match_dict['similarity']: # if not a good match, skip the match + continue + print("Source Text:") + print(f" {match_dict['source_text']}\n") + + print("Best Match:") + print(f" {match_dict['best_match']}\n") + + print(f"Similarity: {match_dict['similarity']}\n") + + # If 'link' is True, print the URLs for the source and matched content + if link and 'Content Item Source' in match_dict and 'Content Item Matched' in match_dict: + print("Content Item Source:") + print(f" {match_dict['Content Item Source']}\n") + + print("Content Item Matched:") + print(f" {match_dict['Content Item Matched']}\n") + + print("-" * 50) # Print a separator line for readability +``` + +{/* cell:25 cell_type:markdown */} +## 4.1 Searching in a Dummy Sentence Level Collection + +{/* cell:26 cell_type:markdown */} +Here we create a sample sentence text collection to see what the model matches as most similar using minimal additional code. + +{/* cell:27 cell_type:code */} +```python +german_sentences = [ + "Mit diesen drei Kernkraftwerken wird die Schweiz 1972 die höchste installierte nukleare Kapazität pro Kopf der Bevölkerung aller kontinentaleuropäischer Länder aufweisen .", + "In anderen Gegenden wiederum wirbt man gegen den Bau von Wasserkraftwerken aus Gründen des Natur- und Heimatschutzes und wäre mit der Aufstellung thermischer Kraftwerke mittlerer Leistung einverstanden , vorausgesetzt , dass deren Abgase keine unzulässige Verschmutzung der Luft verursachen", + "In Baden beabsichtigt , mit einem Kostenaufwand von 480 Millionen Franken iii Kaiseraugst ein Kernkraftwerk mit einer Leistung von 500 Megawatt zu errichten", + "Der Bürger akzeptiert das Prinzip des doppelten politischen Programms nicht mehr, bei dem das erste dazu dient, gewählt zu werden, und das zweite zum Regieren verwendet wird.", +] + +french_sentences = [ + "Indemnités pour Kaiseraugst Contestées par une minorité au National Le Conseil national a entamé hier de débat sur l'abandon du projet de la centrale nucléaire de Kaiseraugst mais n'a pas encore voté l'entrée enmatière.", + "Certains milieux voués à la protection de l'environnement s'élèvent non seulement contre la construction de centrales mais même contre l'accroissement de la consommation d'énergie.", + "En revanche, le projet retenu pour Kaiseraugst, soit un réacteur à eau légère de 500 MW, coûterait 480 millions de francs et produirait de l'électricité au prix de 2,5 centimes par kwh. ", + "La Suisse disposera en 1972, par habitant, de la capacité nucléaire installée la plus élevée de tous les pays d'Europe continentale." +] + +german_collection = create_embedding_collection(german_sentences, embedding_model) +french_collection = create_embedding_collection(french_sentences, embedding_model) +``` + +{/* cell:28 cell_type:code */} +```python +# Example of finding and printing the best matches +matches = find_best_match_in_collection(source_collection=german_collection, target_collection=french_collection) # Find best matches +print_matches_formatted(matches) # Print the matches +``` + +{/* cell:29 cell_type:markdown */} +## 4.2 Searching in an Article collection exported from the interface + +{/* cell:30 cell_type:markdown */} +Here we are working with collections exported directly from the Impresso Interface. We create a new utility function to transform the csv files into the collection format we used before to use the same functions we have already seen. For this new utility, you only need to provide the file loaded through pandas as well as the embedding model. + +In this example, we work on article level and filter for articles with a minimum length of 2000 characters. You can customise this filter by changing the parameter minimum_characters_in_article or pre-filter your dataframe in any way you please before providing it to the function. + + +{/* cell:31 cell_type:code */} +```python +def interface_exported_csv_to_collection(df, embedding_model, batch_size=16, minimum_characters_in_article=2000): + """ + Converts a DataFrame into a collection of (text, embedding, uid) tuples, + filtering rows where the 'content' column has at least 2000 characters. + The encoding is done in batches to handle large datasets efficiently. + + Args: + df (pd.DataFrame): Input DataFrame containing 'content' (text) and 'uid' (unique identifier). + embedding_model (object): Embedding model with an `encode()` method to generate text embeddings. + batch_size (int, optional): The size of batches for the encoding process. Defaults to 32. + + Returns: + list of tuples: Each tuple contains (source text, embedding, uid) for rows with 'content' >= 2000 characters. + """ + + # Filter rows where 'content' has at least 2000 characters + df_filtered = df[df['content'].apply(lambda x: len(str(x)) >= 2000)] + + # Extract the 'content' and 'uid' columns + texts = df_filtered['content'].tolist() + uids = df_filtered['uid'].tolist() + + # Initialize an empty list to hold the embeddings + embeddings = [] + + # Process the texts in batches + for i in range(0, len(texts), batch_size): + batch_texts = texts[i:i + batch_size] + batch_embeddings = embedding_model.encode(batch_texts) + embeddings.extend(batch_embeddings) + + # Create a collection of tuples (source text, embedding, uid) + collection = list(zip(texts, embeddings, uids)) + + return collection + +``` + +{/* cell:32 cell_type:markdown */} +I setup the file on my google drive so you can also replicate with these files using this code. For another collection, you can simply drag and drop the files into the file interface of the notebook + +{/* cell:33 cell_type:code */} +```python +!pip install gdown +import gdown +# Step 1: Set the file IDs +file_id_french = '1-xHR5Oxlo1iVBpAKnwsxYxPGBZ0r86-3' +file_id_german = '1LGdbdvJVl1tIouYoPJPcL0u45_GftwoB' + +# Step 2: Download the files using gdown +gdown.download(f"https://drive.google.com/uc?export=download&id={file_id_french}", "mariecurie_french.csv", quiet=True) +gdown.download(f"https://drive.google.com/uc?export=download&id={file_id_german}", "mariecurie_german.csv", quiet=True) +``` + +{/* cell:34 cell_type:code */} +```python +import pandas as pd + + +marie_curie_df_german = pd.read_csv("mariecurie_german.csv", sep=";") +marie_curie_df_french = pd.read_csv("mariecurie_french.csv", sep=";") +``` + +{/* cell:35 cell_type:code */} +```python +marie_curie_german_collection = interface_exported_csv_to_collection(marie_curie_df_german, embedding_model, minimum_characters_in_article=2000) +print("German articles prepared: " + str(len(marie_curie_german_collection))) +marie_curie_french_collection = interface_exported_csv_to_collection(marie_curie_df_french[:15], embedding_model, minimum_characters_in_article=2000) # I only prepare the first 15 for compute reasons. +print("French articles prepared: " + str(len(marie_curie_french_collection))) +``` + +{/* cell:36 cell_type:markdown */} +Now we have reached the same data format and so we can use the same utility functions as before. An addition to earlier I add a hyperlink to the interface so we can browse all the details. To do so, we set the value of "link" to True. + +{/* cell:37 cell_type:code */} +```python +# Example of finding and printing the best matches +matches = find_best_match_in_collection(source_collection=marie_curie_french_collection, target_collection=marie_curie_german_collection, link=True) # Find best matches +print_matches_formatted(matches, link=True, threshold=0.70) # Print the matches +``` + +{/* cell:38 cell_type:markdown */} +U might want to save the results into a csv file, here is a utility function to do that. Within Google colab, just download the resulting file as an extra step. + +{/* cell:39 cell_type:code */} +```python +def save_matches_to_csv(matches, filename, link=False, threshold=0): + """ + Saves the match results to a CSV file with specified formatting using pandas. + + Args: + matches (list of dict): The match results from `find_best_match_in_collection`. + filename (str): The name of the CSV file to save results. + link (bool): If True, includes URLs for the source and matched content in the file. + threshold (float): Minimum similarity score to include a match. + """ + # Filter matches based on the threshold + filtered_matches = [ + match for match in matches if match['similarity'] >= threshold + ] + + # Create a DataFrame from the filtered matches + df = pd.DataFrame(filtered_matches) + + # Drop URL columns if link is False + if not link: + df = df.drop(columns=['Content Item Source', 'Content Item Matched'], errors='ignore') + + # Save the DataFrame to a CSV file + df.to_csv(filename, index=False, encoding='utf-8') + +# Call this function after generating matches +save_matches_to_csv(matches, "marie_curie_first10french_matches.csv", link=True, threshold=0) +``` + +{/* cell:40 cell_type:markdown */} +## 4.3 Searching in an Article collection sourced from DataLab (Current Substitute API) + +{/* cell:41 cell_type:markdown */} +This is not how it will look within Datalab. Details about that go to Daniele, but let's just showcase the programmatic access through our API + +{/* cell:42 cell_type:code */} +```python +%pip install --upgrade --force-reinstall impresso +import impresso +impresso_session = impresso.connect() +``` + +{/* cell:43 cell_type:code */} +```python +# some search and get data +fr_result = impresso_session.search.find( + q="recette", + order_by="-date") + +fr_texts = [] +fr_uids = [] + +for uri in fr_result.df.index[:40]: + article = impresso_session.content_items.get(uri) + if len(article.df.content[0]) > 400: # 400 characters minimum length + fr_texts.append(article.df.content[0]) + fr_uids.append(article.df.uid[0]) +``` + +{/* cell:44 cell_type:code */} +```python +# some search and get data +de_result = impresso_session.search.find( + q="rezept", + order_by="-date") + +de_texts = [] +de_uids = [] + +for uri in de_result.df.index[:400]: + article = impresso_session.content_items.get(uri) + if len(article.df.content[0]) > 400: # 400 characters minimum length + de_texts.append(article.df.content[0]) + de_uids.append(article.df.uid[0]) +``` + +{/* cell:45 cell_type:code */} +```python +recipes_fr_collection = create_embedding_collection(fr_texts, embedding_model, uids=fr_uids) +recipes_de_collection = create_embedding_collection(de_texts, embedding_model, uids=de_uids) +``` + +{/* cell:46 cell_type:code */} +```python +# Example of finding and printing the best matches +matches = find_best_match_in_collection(source_collection=recipes_fr_collection, target_collection=recipes_de_collection, link=True) # Find best matches +print_matches_formatted(matches, link=True, threshold=0.60) # Print the matches +``` + +{/* cell:47 cell_type:markdown */} +## 5. Summary and Next Steps + +The pipeline, models and codes we provide is not the only method to find similar texts. You can always experiment with different models. pipelines and data filtering methods. Feel free to re-use our code!