
s4696681: Implementation of GCN Model for Node Classification on the page-page Facebook dataset #164

Open
wants to merge 27 commits into base: topic-recognition

Conversation

kaia-santosha

This is my pull request to merge the Facebook page-page node classification problem for the COMP3710 Report Assessment.

My implementation is a Graph Convolutional Network (GCN) for classifying nodes in the dataset.

My implementation does the following (a minimal sketch of the key steps follows the list):

  • Imports and preprocesses the Facebook page-page dataset:
    • Creates an adjacency matrix from the data provided in musae_facebook_edges.csv
    • Normalises this adjacency matrix as required by the GCN
    • Creates the feature vectors for each node; since nodes have an inconsistent number of features, bag-of-words feature vectors were created
    • Encodes the class labels as integers instead of strings so they can be used in the model
    • Creates the tensors required by PyTorch to be fed into the model
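As a rough illustration only, the adjacency-matrix construction and label encoding might look like the sketch below. The target file name and the column names (id_1, id_2, page_type) are assumptions about the MUSAE CSV layout, not taken from this repository:

    import numpy as np
    import pandas as pd
    import torch

    # Build a dense adjacency matrix from the edge list (undirected graph).
    edges = pd.read_csv("musae_facebook_edges.csv")        # assumed columns: id_1, id_2
    num_nodes = int(edges[["id_1", "id_2"]].to_numpy().max()) + 1
    adj = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    adj[edges["id_1"], edges["id_2"]] = 1.0
    adj[edges["id_2"], edges["id_1"]] = 1.0

    # Encode the string class labels as integers.
    targets = pd.read_csv("musae_facebook_target.csv")     # assumed file and column name
    classes = {c: i for i, c in enumerate(sorted(targets["page_type"].unique()))}
    labels = torch.tensor(targets["page_type"].map(classes).to_numpy(), dtype=torch.long)

    adj_tensor = torch.from_numpy(adj)   # bag-of-words features are built separately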

The model itself consists of two graph convolutional layers, and it is tuned and trained using nested cross-validation (a sketch of such a two-layer model is shown below).
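A minimal sketch of what a two-layer GCN of this shape could look like in PyTorch; this is not necessarily the exact architecture in modules.py, and the hidden size and dropout are placeholder values:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GCNLayer(nn.Module):
        """One graph convolution: A_hat @ X @ W, with A_hat pre-normalised."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, x, adj_norm):
            return adj_norm @ self.linear(x)

    class GCN(nn.Module):
        """Two-layer GCN for node classification."""
        def __init__(self, in_dim, hidden_dim, num_classes, dropout=0.5):
            super().__init__()
            self.gc1 = GCNLayer(in_dim, hidden_dim)
            self.gc2 = GCNLayer(hidden_dim, num_classes)
            self.dropout = dropout

        def forward(self, x, adj_norm):
            x = F.relu(self.gc1(x, adj_norm))
            x = F.dropout(x, self.dropout, training=self.training)
            return self.gc2(x, adj_norm)   # logits; pair with CrossEntropyLoss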

This pull request contains the required files to satisfy the Task Sheet:
modules.py (contains model architecture)
dataset.py (contains preprocessing of data)
train.py (contains the training for the model)
predict.py (the test driver, through which all the other files are called)
README.md (contains information about the project and how to run the model)

kaia-santosha and others added 26 commits October 12, 2023 17:32
Added Title and overview explanation (description, what problem it solves, etc.)

Added headings for future work on README
All these files are empty but I have created them just so I can go in and edit each one as I progress through the project. My first goal will be to preprocess the dataset in dataset.py
Added a forewarning so markers know why there may be .ipynb files in my future commits
…ataset

I have created a function that returns an adjacency matrix built by analysing the edge links in musae_facebook_edges.csv.
Added a normalisation function to get the adjacency matrix in the correct form for the GCN algorithm to work
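For reference, the standard symmetric normalisation from Kipf & Welling's GCN paper is Â = D^(-1/2)(A + I)D^(-1/2). A sketch of one way to implement it, assuming a dense NumPy adjacency matrix (not necessarily the function in this repository):

    import numpy as np

    def normalise_adjacency(adj: np.ndarray) -> np.ndarray:
        """Symmetric GCN normalisation: A_hat = D^(-1/2) (A + I) D^(-1/2)."""
        adj_hat = adj + np.eye(adj.shape[0], dtype=adj.dtype)   # add self-loops
        deg = adj_hat.sum(axis=1)
        d_inv_sqrt = np.zeros_like(deg)
        nonzero = deg > 0
        d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
        # Scale rows and columns by D^(-1/2) on each side.
        return adj_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]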
The features for each node are inconsistent in quantity, so I have made a function that converts each node's feature list into a bag-of-words vector, giving all nodes an equal number of features. If a word is part of a node's feature list it is indicated by a 1, and if it is not it is indicated by a 0.
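A sketch of how such a bag-of-words conversion might look; the node_features dict (node ID → list of feature tokens) is assumed to have been loaded elsewhere:

    import numpy as np

    def bag_of_words(node_features, num_nodes):
        """Turn variable-length per-node feature lists into fixed-size 0/1 vectors."""
        vocab = sorted({tok for feats in node_features.values() for tok in feats})
        index = {tok: i for i, tok in enumerate(vocab)}
        X = np.zeros((num_nodes, len(vocab)), dtype=np.float32)
        for node_id, feats in node_features.items():
            for tok in feats:
                X[node_id, index[tok]] = 1.0   # 1 if the word is present, 0 otherwise
        return X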
Since I have separated the features and labels, I have created a train/test split over a node-ID list, so the train and test features and labels can be retrieved by the IDs in train_ids and test_ids.
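Roughly, the split over node IDs could be as simple as the following; the 90/10 ratio matches the later cross-validation commit, and the seed is a placeholder:

    import numpy as np

    rng = np.random.default_rng(42)            # fixed seed for reproducibility
    node_ids = rng.permutation(num_nodes)      # shuffled node-ID list
    split = int(0.9 * num_nodes)
    train_ids, test_ids = node_ids[:split], node_ids[split:]

    # Features/labels for each partition are then looked up by ID.
    train_labels, test_labels = labels[train_ids], labels[test_ids]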
t-SNE plot created to visualise the initial, high-dimensional node features in a 2D space, giving insight into their structure and relationships prior to any transformation by the GCN.
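A sketch of such a plot with scikit-learn and matplotlib, assuming X and labels from the sketches above (for a large graph, a random subsample of nodes keeps t-SNE tractable):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Project the raw bag-of-words features to 2D and colour by class label.
    embedding = TSNE(n_components=2, random_state=42).fit_transform(X)
    plt.figure(figsize=(8, 6))
    plt.scatter(embedding[:, 0], embedding[:, 1], c=np.asarray(labels), s=3, cmap="tab10")
    plt.title("t-SNE of raw node features (before the GCN)")
    plt.savefig("tsne_raw_features.png")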
This function will be called from train.py in order to access the tensors of the preprocessed data required as input to the model.
…ed to research and practicing making my own.

It has been a while since the last commit; I spent the time researching quite a few GCN projects online, and I have attempted to make my own. It is simple for the sake of getting a baseline. I may increase model complexity later on.
Needed to import torch and set the device for some functionality to work.
No hyperparameter tuning yet; just setting some values to gauge whether the model actually trains. Hyperparameter tuning will come later when I do cross-validation.
Created code to tune hyperparameters via 10-fold nested cross-validation. Ten folds were chosen because the dataset is large, so we can afford to train with 90% train and only 10% test sets.
This took an immense amount of time but it was worthwhile, as I got an indication of the possible best hyperparameters. Since I intend to modify my model to align more closely with the GCN code shown in the model exhibition lecture, my hyperparameters may change. However, I still think testing the nested cross-validation method was worthwhile, as I can reuse it to tune the hyperparameters of my final model.
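In outline, nested cross-validation looks like the sketch below: an inner loop picks hyperparameters per outer fold, and the outer loop scores that choice on held-out data. The grid values and the train_and_eval helper are hypothetical stand-ins for the actual training code:

    from itertools import product
    import numpy as np
    from sklearn.model_selection import KFold

    # Hypothetical search grid; the actual values searched may differ.
    grid = list(product([1e-3, 1e-2], [16, 32]))   # (learning rate, hidden dim)

    outer = KFold(n_splits=10, shuffle=True, random_state=42)
    inner = KFold(n_splits=5, shuffle=True, random_state=42)
    outer_scores = []
    for train_idx, test_idx in outer.split(node_ids):
        outer_train = node_ids[train_idx]
        # Inner loop: choose the hyperparameters with the best mean validation score.
        best_hp = max(grid, key=lambda hp: np.mean([
            train_and_eval(hp, outer_train[tr], outer_train[va])   # hypothetical helper
            for tr, va in inner.split(outer_train)
        ]))
        # Outer loop: retrain with the chosen hyperparameters, score the held-out fold.
        outer_scores.append(train_and_eval(best_hp, outer_train, node_ids[test_idx]))
    print(f"Nested-CV accuracy: {np.mean(outer_scores):.3f}")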
… Exhibition CON session

Though my previous model had good performance, I wanted to experiment and hopefully settle on a new model architecture that better incorporates the techniques learned in the model exhibition CON session
Added the evaluation loop for the model in predict.py. I also added t-SNE visualisation after the model is trained, to see how well it has performed. I still have to integrate it properly with train.py by making some of the train.py code modular (in functions) so it can be called by the predict script.
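A minimal sketch of such an evaluation loop, assuming the model and tensor names from the sketches above: the forward pass runs over the whole graph, and accuracy is read off the test nodes only.

    import torch

    @torch.no_grad()
    def evaluate(model, features, adj_norm, labels, test_ids):
        """Test-set accuracy for transductive node classification."""
        model.eval()
        logits = model(features, adj_norm)        # forward pass over the full graph
        preds = logits[test_ids].argmax(dim=1)    # score only the held-out nodes
        return (preds == labels[test_ids]).float().mean().item()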
I attempted to commit a local copy of the adjacency_matrix.npy file, but since it is 3.9 GB it is too large for GitHub. Instead, I have added a link in the README to the downloadable file on Google Drive.
Fixed some errors in the code, namely the return from the train script to the predict script. Also added more detail to the README.
Wasn't sure which folder should contain the README, so I've put it in both just to be sure.
Commented files and double-checked that everything is in order.
@gayanku

gayanku commented Nov 10, 2023

This is an initial inspection; no action is required at this point.

Difficulty: Normal

Readme Overall: Good

  • Project Overview: Good
  • Model: Good
  • Data + Preprocessing: Good
  • Training / Loss Curve(s): None
  • Result Demonstration: Good
  • References: Good

Functionality:

  • Multi-layer GCN: Good, 2 layers.
  • Semi supervised: Good
  • Accuracy: OK. Not mentioned whether it is test or train accuracy.
  • TSNE or UMAP: Good, though I would have expected to see a bit more separation at the mentioned (91%) accuracy.
  • Discussion: Good

Code: Has early stopping in the code, but also uses the number of epochs as a searchable hyperparameter, which is redundant.

  • Consistent with Results: OK.
  • File Structure: Good
  • Commenting: Good
  • Commit frequency: Good
  • Commit messages: Good

Other Comments:
Epochs should not need to be in the hyperparameter search; you do have early stopping in the code anyway. You could have used the number of GCN layers as a hyperparameter.
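To illustrate the point: with patience-based early stopping, a generous fixed epoch cap is enough, and the epoch count drops out of the search. A sketch (train_one_epoch and the tensor names are hypothetical):

    import torch

    best_val, patience, wait = 0.0, 20, 0
    for epoch in range(500):                  # generous fixed cap, not a searched value
        train_one_epoch(model, optimiser)     # hypothetical helper
        val_acc = evaluate(model, features, adj_norm, labels, val_ids)
        if val_acc > best_val:
            best_val, wait = val_acc, 0
            torch.save(model.state_dict(), "best.pt")   # keep the best checkpoint
        else:
            wait += 1
            if wait >= patience:              # stop once validation stops improving
                break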

@gayanku gayanku mentioned this pull request Nov 10, 2023
@shakes76
Owner

Marking

Good Practice (Design/Commenting, TF/Torch Usage)

Adequate design and implementation
Good spacing and comments
Header blocks missing -1

Recognition Problem

Solves problem
Driver Script present
File structure present
Shows Usage & Demo & Visualisation & Data usage, no training plots -2
Module present
Commenting minimal -1
No Data leakage
Difficulty: Normal -5

Commit Log

Meaningful commit messages
Progressive commits used

Documentation

ReadMe acceptable, no refs -1, no usage -1
Model/technical explanation
Good Description and Comments
Markdown used and PDF submitted

Pull Request

Successful Pull Request (Working Algorithm Delivered on Time in Correct Branch)
No Feedback required
Request Description good

@shakes76
Owner

Restore the repo README to allow the merge; this does not affect the grade, only the merge.

@shakes76 shakes76 added the question Further information is requested label Nov 20, 2023
@wangzhaomxy
Collaborator

No feedback attempt and no feedback marks lost.
