Identifying highly cited scholarly literature at an early stage is vital to the academic research community and to other stakeholders, such as technology companies and government bodies. Given the sheer volume of published research and the growth of ever-changing interdisciplinary areas, researchers need an effective way to identify important studies, since they cannot read or even skim everything newly published in their fields. The number of citations a publication has accrued has traditionally served this purpose; however, citations take time to appear and even longer to accumulate. In this article, we used altmetrics to predict the citations a scholarly publication could receive. We built various classification and regression models and evaluated their performance. We found that tree-based models performed best in classification, and that Mendeley readership, publication age, post length, maximum followers, and academic status were the most important factors in predicting citations.
The dataset used for the experiments comprises social media and scholarly indicators for 130,745 scientific articles. For all experiments, 70 percent of the data was used for training and 30 percent for testing; for the neural network models, a further 20 percent of the training data was held out for validation. The dataset has three target variables, one for each of the three experiments:
target_exp_1 : Binary label indicating whether the article has received any citations.
target_exp_2 : Binary label indicating whether the article's citation count exceeds the median citation count.
target_exp_3 : Value of log(1 + citations), used as the regression target.
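Assuming the raw data carries a citation count per article (the column names below are hypothetical), the three targets could be derived along these lines:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the article-level data; 'citations' is the raw count.
df = pd.DataFrame({"citations": [0, 3, 12, 0, 7, 150]})

df["target_exp_1"] = (df["citations"] > 0).astype(int)       # any citations at all?
median = df["citations"].median()
df["target_exp_2"] = (df["citations"] > median).astype(int)  # above the median?
df["target_exp_3"] = np.log1p(df["citations"])               # log(1 + citations)
```

The log(1 + citations) transform keeps zero-citation articles at 0 while compressing the long right tail of highly cited papers.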
The project comprises three experiments: the first two are classification problems, while the third is a regression problem. A combination of approaches was used to solve them: neural networks, supervised learning algorithms, and support vector machines were trained on the data, with the same features used for all experiments. For three of the supervised learning algorithms, randomized search and grid search were used to obtain the best hyperparameters, and all supervised learning models were trained with 10-fold cross-validation. The neural network models were implemented using TensorFlow; the supervised learning models and the support vector machines were implemented using scikit-learn.
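The overall pipeline — a 70/30 split, randomized hyperparameter search, and 10-fold cross-validation — could be sketched with scikit-learn roughly as follows; the data and search grid here are placeholders, not the actual features or grids used in the experiments:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the 130,745-article feature matrix.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 70/30 train/test split, as used in all experiments.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Randomized search over a small illustrative grid, with 10-fold CV.
param_dist = {"n_estimators": [2, 16, 100], "max_depth": [4, 9, 30]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=4, cv=10, random_state=42)
search.fit(X_train, y_train)

best = search.best_estimator_
test_acc = best.score(X_test, y_test)
```

`GridSearchCV` follows the same pattern but enumerates every combination in the grid instead of sampling `n_iter` of them.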
Hyperparameter | Value |
---|---|
Epochs | 10 |
Batch size | 64 |
Loss Function | binary cross-entropy |
Hidden Layers | 1 layer with 512 neurons |
Optimization function | RMSprop with learning rate 0.001
Activation function(s) | SeLU for the hidden layer, Softmax for the output layer
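The configuration above could be expressed in TensorFlow roughly as follows; the input width of 18 features is an assumption, as the table does not state it:

```python
import tensorflow as tf

# Sketch of the experiment-1 network: one SeLU hidden layer of 512 neurons,
# softmax output, RMSprop at learning rate 0.001, binary cross-entropy loss.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="selu"),   # single hidden layer
    tf.keras.layers.Dense(2, activation="softmax"),  # softmax output layer
])
model.build(input_shape=(None, 18))  # 18 input features (assumed)
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# Training would then be: model.fit(X_train, y_train, epochs=10,
#                                   batch_size=64, validation_split=0.2)
```

Note that binary cross-entropy with a two-unit softmax output expects one-hot encoded labels; a single sigmoid unit with integer labels is the more common pairing.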
Metric | Value |
---|---|
Training Loss | 2.07495 |
Training Accuracy | 0.8646 |
Validation Loss | 2.0891 |
Validation Accuracy | 0.86368 |
Test Accuracy | 0.8655 |
Precision | 0.866 |
Recall | 1.0 |
F-1 | 0.9279 |
Model | Train Accuracy | Test Accuracy | Precision | Recall | F-1 |
---|---|---|---|---|---|
Random Forest | 0.865 | 0.862 | 0.863 | 1.0 | 0.927 |
Decision Tree | 0.865 | 0.863 | 0.863 | 1.0 | 0.927 |
Gradient Boosting | 0.865 | 0.863 | 0.863 | 1.0 | 0.927 |
AdaBoost | 0.87 | 0.866 | 0.87 | 0.993 | 0.928 |
Bernoulli NB | 0.84 | 0.836 | 0.876 | 0.943 | 0.908
KNN | 0.85 | 0.851 | 0.883 | 0.953 | 0.917 |
Model | Optimum hyperparameters |
---|---|
Random Forest | n_estimators: 2, min_samples_split: 0.9, min_samples_leaf: 0.3, max_features: 18, max_depth: 9, criterion: gini
Decision Tree | min_samples_split: 0.5, min_samples_leaf: 0.3, max_features: 10, max_depth: 32, criterion: gini-index |
Gradient Boosting | n_estimators: 200, min_samples_split: 0.6, min_samples_leaf: 0.1, max_features: 9, max_depth: 4, learning rate: 0.001 |
Parameter | Value |
---|---|
Kernel | Sigmoid |
Degree of the kernel | 3 |
Tolerance | 0.001 |
Gamma | 0.045 |
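A minimal scikit-learn sketch of this SVM configuration follows, on synthetic data; note that the `degree` parameter is only used by the polynomial kernel, so the tabulated value of 3 (scikit-learn's default) has no effect with a sigmoid kernel:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the article features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Sigmoid kernel with gamma and tolerance as tabulated above.
clf = SVC(kernel="sigmoid", gamma=0.045, tol=0.001)
clf.fit(X, y)
acc = clf.score(X, y)
```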
Metric | Value |
---|---|
Training Accuracy | 0.86 (+/- 0.00) |
Validation Accuracy | 0.86 (+/- 0.01) |
Test Accuracy | 0.861 |
Precision | 0.864 |
Recall | 0.997 |
F-1 | 0.925 |
Hyperparameter | Value |
---|---|
Epochs | 10 |
Batch size | 64 |
Loss Function | binary cross-entropy |
Hidden Layers | 3 layers with 64, 128, and 64 neurons respectively
Optimization function | RMSprop with learning rate 0.001
Activation function(s) | SeLU for the second hidden layer, Sigmoid for the remaining layers
Metric | Value |
---|---|
Training Loss | 0.4710 |
Training Accuracy | 0.78006 |
Validation Loss | 0.4726 |
Validation Accuracy | 0.7797 |
Test Accuracy | 0.7794 |
Precision | 0.81918 |
Recall | 0.69692 |
F-1 | 0.75312 |
Model | Train Accuracy | Test Accuracy | Precision | Recall | F-1 |
---|---|---|---|---|---|
Random Forest | 0.778 | 0.784 | 0.799 | 0.737 | 0.767 |
Decision Tree | 0.773 | 0.768 | 0.725 | 0.834 | 0.776 |
Gradient Boosting | 0.802 | 0.80 | 0.810 | 0.767 | 0.788 |
AdaBoost | 0.80 | 0.797 | 0.806 | 0.760 | 0.782 |
Bernoulli NB | 0.67 | 0.674 | 0.742 | 0.495 | 0.594
KNN | 0.75 | 0.752 | 0.777 | 0.680 | 0.725 |
Model | Optimum hyperparameters |
---|---|
Random Forest | n_estimators: 100, min_samples_split: 0.1, min_samples_leaf: 0.1, max_features: 3, max_depth: 30, criterion: entropy
Decision Tree | min_samples_split: 0.4, min_samples_leaf: 0.1, max_features: 15, max_depth: 32, criterion: entropy |
Gradient Boosting | n_estimators: 200, min_samples_split: 0.1, min_samples_leaf: 0.1, max_features: 12, max_depth: 20, learning rate: 0.005 |
Parameter | Value |
---|---|
Kernel | Sigmoid |
Degree of the kernel | 3 |
Tolerance | 0.001 |
Gamma | 0.045 |
Metric | Value |
---|---|
Training Accuracy | 0.52 (+/- 0.00) |
Validation Accuracy | 0.52 (+/- 0.01) |
Test Accuracy | 0.519 |
Precision | 0.478 |
Recall | 0.007 |
F-1 | 0.014 |
Hyperparameter | Value |
---|---|
Epochs | 500 |
Batch size | 128 |
Loss Function | mean squared error |
Hidden Layers | 7 layers with 32, 64, 64, 128, 64, 64, and 32 neurons respectively
Optimization function | RMSprop with learning rate 0.001
Activation function(s) | ReLU for all hidden layers, linear activation for the output layer
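The regression network could be sketched in TensorFlow as follows; the input width is again an assumption, as the table does not state it:

```python
import tensorflow as tf

# Sketch of the regression network: seven ReLU hidden layers of the widths
# tabulated above, a linear output unit, MSE loss, RMSprop at 0.001.
widths = [32, 64, 64, 128, 64, 64, 32]
layers = [tf.keras.layers.Dense(w, activation="relu") for w in widths]
layers.append(tf.keras.layers.Dense(1, activation="linear"))  # predicts log(1 + citations)

model = tf.keras.Sequential(layers)
model.build(input_shape=(None, 18))  # 18 input features (assumed)
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
    loss="mse",
    metrics=["mae"],
)
# Training would then be: model.fit(X_train, y_train, epochs=500,
#                                   batch_size=128, validation_split=0.2)
```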
Metric | Value |
---|---|
Training MSE | 1.24965 |
Training MAE | 0.84157 |
Test MSE | 1.29756 |
Test MAE | 0.85583 |
Test R-squared | 0.52284
Model | Train MSE | Test MSE | R-squared |
---|---|---|---|
Random Forest | 0.26 | 1.32 | 0.512 |
Decision Tree | 1.647 | 1.663 | 0.389 |
Linear | 1.75 | 1.758 | 0.354 |
Model | Optimum hyperparameters |
---|---|
Random Forest | n_estimators: 16, min_samples_split: 0.4, min_samples_leaf: 0.2, max_features: 16, max_depth: 24, criterion: mse
Decision Tree | min_samples_split: 0.4, min_samples_leaf: 0.1, max_features: 13, max_depth: 32, criterion: friedman mse |