This project is a comprehensive analysis of a network of impressions derived from a dataset. It includes various algorithms for shortest path analysis, missing link prediction, and ranking nodes using random walk-based PageRank. The implementation is written in Python and utilizes the NetworkX library for graph-related operations.
-
Random Walk-Based PageRank
- Implements a random walk algorithm to rank nodes based on their importance.
- Identifies the top leaders in the network based on normalized ratings.
-
Missing Link Prediction
- Uses a matrix-based method to predict missing edges in the network.
- Adds predicted missing links to the graph and recalculates PageRank scores for updated rankings.
-
Shortest Path Analysis
- Computes the shortest distances between all pairs of nodes using Dijkstra's algorithm.
- Provides metrics such as the maximum distance, unreachable node pairs, and node absence rates.
requirements.txt
: Specifies the required libraries for the project.shortest_path_analysis.py
: Contains the implementation for shortest path calculations and related statistics.missing_link_prediction.py
: Implements the missing link prediction algorithm and recalculates PageRank for the updated network.random_walk_pagerank.py
: Performs node ranking using a random walk algorithm.dataset.csv
: The dataset file representing the impression network.Report.pdf
: The report of the final project, containing detailed statistics and results of the project.
- Python 3.7 or higher
- Install dependencies using the following command:
pip install -r requirements.txt
-
Random Walk-Based PageRank
- Execute
random_walk_pagerank.py
to compute the PageRank scores using a random walk method. - Example:
python random_walk_pagerank.py
- Execute
-
Missing Link Prediction
- Run
missing_link_prediction.py
to identify and add missing links, followed by recalculating rankings. - Example:
python missing_link_prediction.py
- Run
-
Shortest Path Analysis
- Run
shortest_path_analysis.py
to compute shortest distances between nodes and related metrics. - Example:
python shortest_path_analysis.py
- Run
The dataset should be in CSV format. The columns are interpreted as follows:
- Parent Node: The second column specifies the source node.
- Child Nodes: Subsequent columns represent target nodes connected to the source node.
- Random Walk-Based PageRank
- Top 10 nodes with the highest ratings
- Missing Link Prediction
- List of predicted links
- Updated PageRank rankings
- Shortest Path Analysis
- Distance matrix
- Maximum distance
- Unreachable node statistics
- NetworkX: For graph creation and manipulation.
- NumPy: For numerical computations.
- CSV: For handling the dataset file.