To run this code, execute main.py in the project root directory.
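The sections below refer to three entry-point functions in main.py: gold(), w2v(), and eval(). As a rough orientation, a driver like main.py could be wired as in the sketch below; only the function names come from this README, their bodies and ordering are assumptions.

    # main.py -- hypothetical driver sketch; only gold(), w2v(), and eval()
    # are named in this README, their bodies and ordering are assumptions.

    def gold():
        """Build the golden answer lists from SimLex-999 (evaluation/gold_list)."""
        ...

    def w2v():
        """Train the word2vec models on the Brown news and editorial genres."""
        ...

    def eval():
        """Run the cosine-similarity ranking and the nDCG evaluation."""
        ...

    if __name__ == "__main__":
        gold()
        w2v()
        eval()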
Codebases:
src/ : This folder contains the code files for evaluation, creating the gold lists, and preprocessing the data.
+---src
| | create_golden_lists.py
| | preprocess_data.py
| +---eval
| | evaluation.py
| | evaluation_tf_idf.py
| | pytrec_eval_per_word.py
| | pytrec_eval_perword_tfidf.py
| | pytrec_avg.py
models/train_w2v.py : Trains the word2vec models (a hedged sketch follows the layout below).
+---models
| | train_w2v.py
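A minimal sketch of what train_w2v.py might look like, assuming gensim's Word2Vec and NLTK's Brown corpus are used; the output paths match this README, while the hyperparameters are placeholders.

    # train_w2v.py -- hedged sketch; assumes gensim and nltk (with the Brown corpus) are installed.
    import os

    from gensim.models import Word2Vec
    from nltk.corpus import brown

    def train_w2v(genre, out_path):
        # Lower-cased sentences of the requested Brown genre ("news" or "editorial").
        sentences = [[w.lower() for w in sent] for sent in brown.sents(categories=genre)]
        model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
        model.save(out_path)
        return model

    if __name__ == "__main__":
        os.makedirs("output/models/w2v_models", exist_ok=True)
        train_w2v("news", "output/models/w2v_models/w2v_brown_news")
        train_w2v("editorial", "output/models/w2v_models/w2v_brown_editorial")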
Source Folder:
dataset/ : This folder includes our gold standard, the SimLex-999 dataset.
+---dataset
| | SimLex-999.txt
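SimLex-999.txt is a tab-separated file whose first columns include word1, word2, POS, and the SimLex999 similarity score. A hedged sketch of loading it with pandas (column names assumed to match the official release):

    # Hedged sketch: load the SimLex-999 gold standard.
    import pandas as pd

    simlex = pd.read_csv("dataset/SimLex-999.txt", sep="\t")
    # Keep only the columns needed to build the per-word gold lists.
    pairs = simlex[["word1", "word2", "SimLex999"]]
    print(pairs.head())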
Target Folders: The target folders hold the outputs: the trained models, the gold standard datasets, and the cosine similarity and nDCG results.
evaluation/gold_list : This is the output of the gold() function in main.py in the project root directory. It contains the golden answers for each word in our SimLex-999 gold standard dataset.
+---evaluation
| +---gold_list
| | gold_list.json
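A hedged sketch of what create_golden_lists.py / the gold() step might do: for every word that appears in SimLex-999, collect the words it is paired with, ranked by their SimLex999 score, and dump the result to gold_list.json. The project's actual ranking logic may differ.

    # Hedged sketch: build a per-word gold list from the SimLex-999 pairs.
    import json
    import os
    from collections import defaultdict

    import pandas as pd

    simlex = pd.read_csv("dataset/SimLex-999.txt", sep="\t")

    gold = defaultdict(list)
    for _, row in simlex.iterrows():
        # Each pair contributes to the gold list of both of its words.
        score = float(row["SimLex999"])
        gold[row["word1"]].append((row["word2"], score))
        gold[row["word2"]].append((row["word1"], score))

    # Sort each word's candidates from most to least similar.
    gold_list = {w: sorted(c, key=lambda x: -x[1]) for w, c in gold.items()}

    os.makedirs("evaluation/gold_list", exist_ok=True)
    with open("evaluation/gold_list/gold_list.json", "w") as f:
        json.dump(gold_list, f, indent=2)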
output/cosine_similarity : This is the output of the top-k cosine similarity search for each word, based on the word2vec and tf-idf models. The brown_news and brown_editorial folders inside each of the w2v and tfidf folders contain the results for the model trained on the Brown corpus news and editorial genres, respectively. The layout below shows the result files for tfidf and w2v, each with brown_editorial and brown_news subfolders; a hedged sketch of this step follows the layout.
+---output
| +---cosine_similarity
| | +---tfidf
| | +---brown_editorial
| | | | result_(word).xlsx
| | +---brown_news
| | | result_(word).xlsx
| | +---w2v
| | +---brown_editorial
| | | | result_(word).xlsx
| | +---brown_news
| | | result_(word).xlsx
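A hedged sketch of the top-k cosine-similarity step for the word2vec models, assuming gensim's most_similar (which ranks the vocabulary by cosine similarity) and pandas for the .xlsx output. The query word "old" is just an example; the tf-idf variant would rank by cosine similarity over tf-idf vectors instead.

    # Hedged sketch: write the top-k most cosine-similar words for one query word.
    # Writing .xlsx files with pandas requires the openpyxl package.
    import os

    import pandas as pd
    from gensim.models import Word2Vec

    model = Word2Vec.load("output/models/w2v_models/w2v_brown_editorial")

    def top_k_similar(word, k=10):
        # most_similar ranks the vocabulary by cosine similarity to `word`.
        neighbours = model.wv.most_similar(word, topn=k)
        return pd.DataFrame(neighbours, columns=["word", "cosine_similarity"])

    os.makedirs("output/cosine_similarity/w2v/brown_editorial", exist_ok=True)
    df = top_k_similar("old")
    df.to_excel("output/cosine_similarity/w2v/brown_editorial/result_old.xlsx", index=False)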
output/gold_dataset_perword : Contains the golden dataset for each word. This is the output of calling the preprocess_data.gold_dataset_perword function inside the eval() function in main.py.
+---output
| +---gold_dataset_perword
| | gold_(word).json
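A hedged sketch of this step, assuming gold_dataset_perword simply splits the combined gold list into one gold_(word).json file per query word:

    # Hedged sketch: write one gold_(word).json file per query word.
    import json
    import os

    with open("evaluation/gold_list/gold_list.json") as f:
        gold_list = json.load(f)

    os.makedirs("output/gold_dataset_perword", exist_ok=True)
    for word, candidates in gold_list.items():
        with open(f"output/gold_dataset_perword/gold_{word}.json", "w") as f:
            json.dump({word: candidates}, f, indent=2)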
output/models/w2v_models : This folder keeps the word2vec models. This is the output of the w2v() function in main.py.
+---output
| +---models
| +---w2v_models
| | | w2v_brown_editorial
| | | w2v_brown_news
output/pytrec : This folder contains the per-word nDCG evaluation results for the w2v and tfidf models, stored in the pytrec_eval_per_word and tfidf_pytrec_perword folders respectively. Each of these contains a brown_editorial and a brown_news folder, and within those the nDCG results are stored in a pytrec_result_perword folder. A hedged sketch of this step follows the layout below.
+---output
| +---pytrec
| | +---pytrec_eval_per_word
| | +---brown_editorial
| | | | +---pytrec_result_perword
| | | | | pytrec_(word).json
| | +---brown_news
| | | | +---pytrec_result_perword
| | | | | pytrec_(word).json
| | +---tfidf_pytrec_perword
| | +---brown_editorial
| | | | +---pytrec_result_perword
| | | | | pytrec_(word).json
| | +---brown_news
| | | | +---pytrec_result_perword
| | | | | pytrec_(word).json
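A hedged sketch of the per-word nDCG evaluation with pytrec_eval, assuming the gold list supplies graded relevance judgements (the qrel) and the model's cosine-similarity ranking supplies the run; the word "old" and the numbers are toy values for illustration only.

    # Hedged sketch: score one query word's ranking against its gold list with pytrec_eval.
    import json
    import os

    import pytrec_eval

    # qrel: query word -> {candidate: integer relevance grade} (toy values;
    # in the project these would come from the per-word gold dataset).
    qrel = {"old": {"new": 2, "fresh": 1}}
    # run: query word -> {candidate: retrieval score} (toy values; in the
    # project these would be the model's cosine similarities).
    run = {"old": {"new": 0.71, "ancient": 0.65, "fresh": 0.40}}

    evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"ndcg"})
    scores = evaluator.evaluate(run)

    out = "output/pytrec/pytrec_eval_per_word/brown_editorial/pytrec_result_perword/pytrec_old.json"
    os.makedirs(os.path.dirname(out), exist_ok=True)
    with open(out, "w") as f:
        json.dump(scores, f, indent=2)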
output/avg : Contains the average nDCG results of each baseline trained on the two selected corpora. A hedged sketch of the averaging step follows the layout below.
+---output
| +---avg
| | +---tfidf
| | | ndcg_avg_brown_editorial.json
| | | ndcg_avg_brown_news.json
| | +---w2v
| | | ndcg_avg_brown_editorial.json
| | | ndcg_avg_brown_news.json
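A hedged sketch of the averaging step (pytrec_avg.py), assuming it reads every per-word pytrec_(word).json file in a result folder and writes the mean nDCG to the corresponding ndcg_avg_*.json file:

    # Hedged sketch: average the per-word nDCG scores for one model/corpus combination.
    import glob
    import json
    import os

    def average_ndcg(result_dir, out_path):
        scores = []
        for path in glob.glob(f"{result_dir}/pytrec_*.json"):
            with open(path) as f:
                per_query = json.load(f)
            # Assumes each file holds {query_word: {"ndcg": value}} as produced by pytrec_eval.
            scores.extend(metrics["ndcg"] for metrics in per_query.values())
        avg = sum(scores) / len(scores) if scores else 0.0
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        with open(out_path, "w") as f:
            json.dump({"ndcg_avg": avg}, f, indent=2)

    average_ndcg(
        "output/pytrec/pytrec_eval_per_word/brown_editorial/pytrec_result_perword",
        "output/avg/w2v/ndcg_avg_brown_editorial.json",
    )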