In this project the team intends to build an event suggestion system using Information Retrieval (IR) techniques to provide good search results over all kinds of real-world events, so we implemented an LSA model that works on events with the following format:
- Entry Time
- Exit Time
- Description
- Event Type
- Location
- Offers
- Host
- Entry Cost
- Title
In the rest of this README we explain the main characteristics of the model in use and the method the system applies to suggest events.
We give the user the opportunity to send feedback to the system through a like button, which improves search results. When the suggestion button is clicked, it performs an automatic query search.
To fill up the database, uncomment the commented lines at the end of event_generator.py and run `py event_generator.py` on Windows (or `python3 event_generator.py` on Linux-like systems). Afterwards, comment those lines again and run `py visual.py` (or `python3 visual.py`).
We use Python Faker library to generate synthetic data for events. This synthetic data generation is a common practice in software development for testing purposes, especially when real data is not available or when privacy concerns prevent the use of actual data. The code is structured to generate a list of 100 events, each with various attributes such as event type, title, entry and finish times, entry cost, location, brief description, host name, and offer. These events are then written to individual text files.
- Faker Library: The code imports the Faker library, which is a powerful tool for generating fake data. It is used here to generate realistic event details like names, addresses, and dates.
- Event Generation: The generate_event function is designed to create a single event with a mix of predefined and randomly generated attributes. It uses Faker to generate realistic data for each attribute, such as event types, titles, dates, costs, locations, descriptions, host names, and offers (a sketch of this pattern appears after the list below).
- Data Variety: The code includes a variety of data types, including strings (for event types, titles, descriptions, and offers), dates (for entry and finish times), and numerical values (for entry costs). This diversity ensures that the generated data closely mimics real-world data.
- Purpose: The primary purpose of generating synthetic data is to create a dataset that closely resembles real-world data without compromising privacy or data integrity. This is particularly useful in testing environments where real data cannot be used.
- Applications: Synthetic data can be used for a wide range of applications, including database testing, application performance testing, and machine learning model training. It helps in identifying potential issues in the system and ensuring that the application can handle real-world data effectively.
- Data Quality: While Faker is excellent for generating synthetic data, it's important to note that the data it generates may not always be of high quality. For example, names generated by Faker may not always match email addresses or domain names. This is a common limitation of synthetic data generation tools.
- Customization: Faker allows for customization of the generated data, but this may require additional time and effort to perfect the system. Developers can create custom providers or use existing ones to generate data that closely matches their specific needs.
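For reference, below is a minimal sketch of this kind of generator. The field names follow the list at the top of this README, but the event type values, cost range, and output file naming are illustrative assumptions and may not match event_generator.py exactly.

```python
# A minimal sketch of the generator described above (not the actual script).
import random
from datetime import timedelta
from faker import Faker

fake = Faker()
EVENT_TYPES = ["Concert", "Conference", "Workshop", "Festival", "Meetup"]  # illustrative values

def generate_event():
    """Build one event as a dict keyed by the field names listed in this README."""
    entry_time = fake.date_time_between(start_date="now", end_date="+30d")
    return {
        "Event Type": random.choice(EVENT_TYPES),
        "Title": fake.sentence(nb_words=4).rstrip("."),
        "Entry Time": entry_time,
        "Exit Time": entry_time + timedelta(hours=random.randint(1, 8)),
        "Entry Cost": random.randint(0, 100),
        "Location": fake.city(),
        "Description": fake.paragraph(nb_sentences=2),
        "Host": fake.name(),
        "Offers": fake.sentence(nb_words=6),
    }

# Write 100 events to individual text files, mirroring the behaviour described above.
for i in range(100):
    with open(f"event_{i}.txt", "w", encoding="utf-8") as f:
        for field, value in generate_event().items():
            f.write(f"{field}: {value}\n")
```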
Several classes and functions work together to manage events, perform searches, and extract topics from a collection of documents. The system also interacts with a database to store and retrieve historical data.
- Models: The SemanticLatentModel is imported but not defined in the provided code. It is likely used for semantic search or topic modeling.
- Topic Extraction: The TopicExtractor class is used to extract topics from a collection of documents. It takes a list of documents, the number of topics, and the number of words per topic as input (a rough sketch of such a class appears after this list).
- Search and Event Management: The SearchItem and SearchResult classes are used to manage search results. The Search function performs a search query using the SemanticLatentModel, and the AddEvent function adds a new event to the historical data.
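As a rough illustration, a TopicExtractor along these lines could be built on scikit-learn as sketched below. The constructor arguments mirror the description above, but the internals (TF-IDF weighting, top-word selection) are assumptions, not the repository's actual implementation.

```python
# A hypothetical TopicExtractor sketch using scikit-learn; not the repository's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

class TopicExtractor:
    def __init__(self, documents, n_topics, n_words_per_topic):
        self.documents = documents
        self.n_topics = n_topics
        self.n_words_per_topic = n_words_per_topic

    def extract(self):
        # Build a TF-IDF term-document representation of the corpus.
        vectorizer = TfidfVectorizer(stop_words="english")
        doc_term = vectorizer.fit_transform(self.documents)
        # Decompose into latent topics with truncated SVD (the core LSA step).
        svd = TruncatedSVD(n_components=self.n_topics, algorithm="randomized")
        svd.fit(doc_term)
        terms = vectorizer.get_feature_names_out()
        # For each topic, keep the words with the largest component weights.
        topics = []
        for component in svd.components_:
            top = component.argsort()[::-1][: self.n_words_per_topic]
            topics.append([terms[i] for i in top])
        return topics

# Example: the three most descriptive words for each of two topics.
docs = ["rock concert downtown with live bands",
        "tech conference on machine learning",
        "open air music festival with food offers"]
print(TopicExtractor(docs, n_topics=2, n_words_per_topic=3).extract())
```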
Brief explanation:
- Event Management
  - The SearchItem class represents an event with various attributes such as event type, title, entry time, finish time, entry cost, location, description, host name, and offer.
  - The SearchResult class is a collection of SearchItem objects, representing the results of a search query (a self-contained sketch of both classes appears after this list).
- Search Functionality
  - The Search function uses the SemanticLatentModel to perform a search query based on a given query string.
  - The AddEvent function adds a new event to the historical data, which is stored in the 'events' table in the database.
- Topic Extraction
  - The GetHistorialTopics function uses the TopicExtractor to extract the most relevant topics from the historical data.
- Data Management 📈
  - The GetHistorial function retrieves the last 30 queries made to the model from the database.
  - The UpdateData function updates the data in the database with the current historical data.
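To make the shape of these containers concrete, here is a minimal, self-contained sketch using dataclasses. The attribute names follow this README, but the actual classes in the repository may be structured differently.

```python
# Illustrative sketch of the two result containers described above; assumptions only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchItem:
    event_type: str
    title: str
    entry_time: str
    finish_time: str
    entry_cost: float
    location: str
    description: str
    host_name: str
    offer: str

@dataclass
class SearchResult:
    items: List[SearchItem] = field(default_factory=list)

    def add(self, item: SearchItem) -> None:
        self.items.append(item)

    def __iter__(self):
        return iter(self.items)

# Example: a single-item result set.
result = SearchResult()
result.add(SearchItem("Concert", "Jazz Night", "2024-06-01 20:00",
                      "2024-06-01 23:00", 15.0, "Old Town Hall",
                      "An evening of live jazz.", "City Arts Club", "2x1 tickets"))
for item in result:
    print(item.title, item.location)
```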
The SemanticLatentModel class is initialized with a name and optionally a root directory. It sets up a dataset, an event processor, and initializes parameters for query processing and data scaling.
The class provides methods to add individual events or a list of events to the dataset. It also includes functionality to update the Inverse Document Frequency (IDF) values of the dataset.
The class includes methods to process queries, calculate Term Frequency (TF) and Inverse Document Frequency (IDF) for query terms, and generate a query vector.
The SearchQuery method performs a semantic search by tokenizing, tagging, and lemmatizing the query, generating a query vector, and then using Truncated Singular Value Decomposition (TruncatedSVD) to rank documents based on their relevance to the query.
The AddEvent and AddEvents methods allow for the addition of new events to the dataset. The AddEvents method also updates the IDF values of the dataset, which is crucial for semantic search.
The getQueryTFS, getQueryIDFS, and getQueryVector methods are used to process the query. They calculate the TF and IDF for each term in the query and generate a weighted vector representing the query.
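A minimal sketch of that TF/IDF weighting is shown below. The helper names echo the methods mentioned above, while the bodies (including the smoothed IDF formula) are illustrative assumptions rather than the repository's code.

```python
# Sketch: turning a query into a TF-IDF weighted vector over the dataset vocabulary.
import math
from collections import Counter

def get_query_tfs(query_terms):
    """Term frequency of each term within the query itself."""
    counts = Counter(query_terms)
    total = len(query_terms)
    return {term: count / total for term, count in counts.items()}

def get_query_idfs(query_terms, documents):
    """Inverse document frequency of each query term across the dataset."""
    n_docs = len(documents)
    idfs = {}
    for term in set(query_terms):
        df = sum(1 for doc in documents if term in doc)
        idfs[term] = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumption)
    return idfs

def get_query_vector(query_terms, vocabulary, documents):
    """Weighted vector aligned with the dataset vocabulary."""
    tfs = get_query_tfs(query_terms)
    idfs = get_query_idfs(query_terms, documents)
    return [tfs.get(term, 0.0) * idfs.get(term, 0.0) for term in vocabulary]

# Tiny example: three tokenized documents and a two-term query.
docs = [["live", "jazz", "concert"], ["tech", "conference"], ["food", "festival"]]
vocab = sorted({t for d in docs for t in d})
print(get_query_vector(["jazz", "concert"], vocab, docs))
```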
The SearchQuery method is the core of the semantic search functionality. It tokenizes, tags, and lemmatizes the query, generates a query vector, and then uses TruncatedSVD to reduce the dimensionality of the document vectors and the query vector. The method then ranks the documents based on their relevance to the query.
The StandardScaler from scikit-learn is used to standardize the data, which is a common preprocessing step in machine learning to ensure that all features have the same scale. This can improve the performance of the model.
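The tokenize/tag/lemmatize part of that pipeline could look like the sketch below, assuming NLTK; the actual SearchQuery may use different resources, and the vectorization, scaling, and ranking steps are covered by the other sketches in this README.

```python
# Sketch of query preprocessing (tokenize, POS-tag, lemmatize) with NLTK; an assumption,
# not necessarily the repository's exact pipeline.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# One-time resource downloads (names may vary slightly across NLTK versions):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

def to_wordnet_pos(treebank_tag):
    """Map Penn Treebank tags to the coarse POS tags WordNet expects."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess_query(query):
    tokens = nltk.word_tokenize(query.lower())
    tagged = nltk.pos_tag(tokens)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in tagged]

print(preprocess_query("Concerts happening near the old town"))
# e.g. ['concert', 'happen', 'near', 'the', 'old', 'town']
```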
The getRank method in the SemanticLatentModel class is designed to perform a Latent Semantic Analysis (LSA) on a given query-document vector to rank documents based on their relevance to the query. Here's a detailed report on its functionality and steps:
The method takes two parameters:
- query_document_vector: a matrix representing the query-document vectors. Each row corresponds to a document, and each column corresponds to a term in the vocabulary. The value at a given row and column indicates the term frequency of that term in that document.
- components: the number of components (or dimensions) to reduce the data to using LSA.
Latent Semantic Analysis (LSA): The method initializes a TruncatedSVD object from sklearn.decomposition with the specified number of components and the 'randomized' algorithm. This object performs LSA, a dimensionality reduction technique that can improve the performance of text classification tasks. The fit_transform method is called on the query_document_vector; it fits the model to the data and then applies the dimensionality reduction. The result is a new matrix (lsa_matrix) in which each row corresponds to a document and each column corresponds to a principal component.
- The explained_variance_ratio_ attribute of the TruncatedSVD object is accessed to get the proportion of the dataset's variance that lies along each principal component. This indicates how much information (variance) can be attributed to each principal component.
- The cumulative sum of explained_variance_ratio_ (amount) is calculated using np.cumsum. This gives the total variance explained by each component together with all the components before it.
- The number of components to keep is chosen as np.argmax(amount >= context_value) + 1, i.e. the smallest number of components whose cumulative explained variance reaches the threshold. This decides how many principal components to keep for further analysis.
- A new TruncatedSVD object is initialized with this optimal number of components, and its fit_transform method is called again on the query_document_vector to perform LSA with the optimal number of components.
- The transposed version of the lsa_matrix is calculated.
- The rank of the documents is computed by multiplying the lsa_matrix by its transpose. This results in a similarity matrix in which each entry indicates the similarity between two rows (documents).
- The method returns the first row of the similarity matrix, which represents the rank of the documents based on their relevance to the query.
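Putting those steps together, a sketch of getRank might look like the following. The variable names follow the description above; context_value is assumed to be a cumulative-variance threshold (for example 0.95), and the first row of query_document_vector is assumed to hold the query, since the README does not state either detail explicitly.

```python
# Sketch of the getRank logic described above; assumptions noted in the comments.
import numpy as np
from sklearn.decomposition import TruncatedSVD

def get_rank(query_document_vector, components, context_value=0.95):
    # First pass: reduce to `components` dimensions and inspect the explained variance.
    svd = TruncatedSVD(n_components=components, algorithm="randomized")
    svd.fit_transform(query_document_vector)
    amount = np.cumsum(svd.explained_variance_ratio_)

    # Keep only as many components as needed to reach the variance threshold.
    optimal = np.argmax(amount >= context_value) + 1

    # Second pass: redo the decomposition with the optimal number of components.
    svd = TruncatedSVD(n_components=optimal, algorithm="randomized")
    lsa_matrix = svd.fit_transform(query_document_vector)

    # Similarity between every pair of rows; row 0 is assumed to hold the query,
    # so the first row of the product scores each document against the query.
    similarity = lsa_matrix @ lsa_matrix.T
    return similarity[0]
```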
Term-Document Matrix: Traditional search engines use a term-document matrix to match search queries with documents. This matrix represents the frequency of terms in documents, but it lacks the ability to understand the semantic relationships between terms and documents.
Latent Semantic Analysis (LSA): LSA addresses this limitation by decomposing the term-document matrix into a set of latent factors or topics. These topics capture the underlying themes or concepts present in the documents. The decomposition process involves Singular Value Decomposition (SVD), which breaks down the matrix into simpler components that represent the main ideas or topics in the corpus.
Semantic Relationships: By analyzing the relationships between terms and topics, LSA can identify synonyms, group terms that are semantically related, and cluster documents based on their topics. This allows the search engine to return more relevant results by understanding the context and meaning of the search query, rather than just matching keywords.
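As a toy illustration of this idea, the snippet below projects three short documents onto two latent topics; documents about the same theme tend to end up close in the reduced space even when their literal wording differs. The example corpus and the choice of two components are purely illustrative.

```python
# Toy LSA demo: build a term-document matrix, reduce it with SVD, compare documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "live concert with local bands",          # music theme
    "music festival with live performances",  # music theme, different wording
    "machine learning conference talks",      # tech theme
]
vectorizer = CountVectorizer(stop_words="english")
term_doc = vectorizer.fit_transform(docs)      # plain term-document matrix

lsa = TruncatedSVD(n_components=2, algorithm="randomized")
topic_space = lsa.fit_transform(term_doc)      # documents projected onto 2 latent topics

# Pairwise similarity of the documents in the reduced topic space.
print(cosine_similarity(topic_space))
```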
- Improved Relevance: LSA can improve search relevance by detecting synonyms, clustering terms and documents, and automatically tagging documents with relevant topics. This helps organize information more effectively and provide users with more accurate search results.
- Efficiency: By reducing the term-document matrix to a smaller set of topics, LSA can compress the data, saving computational resources and improving the performance of search operations.
- Handling Sparse Data: Term-document matrices are often sparse, with many zero entries. LSA can help identify and focus on the most relevant topics, thereby ignoring unimportant topics and improving the efficiency of search operations.
- Data Cleaning and Feature Modeling: Implementing LSA requires careful data cleaning and feature modeling. The quality of the term-document matrix and the selection of topics are crucial for the success of LSA.
- Tuning and Experimentation: LSA is sensitive to tuning and may require experimentation with different approaches, such as adjusting term frequency calculations, using skipgrams for document representation, or limiting LSA to fields that are more conducive to its application.
Thanks for visiting our repository⚡