
About the results of the Actinomock Example #5

Open
WayneWu01 opened this issue Jun 30, 2022 · 20 comments

Comments

@WayneWu01

I have a question about how to evaluate the results of the example. I read your paper, but I don't know how to find those FN values. Could you elaborate for me?

@Lizhen0909
Owner

You may find examples in the 'notebook' folder. (The data link in the notebook may not exist anymore, but the data can be downloaded from https://www.amazon.com/clouddrive/share/eTIKYVLckXUCMnMQSpO8TCqZOwekmBrx23ZhMa3XO8d.)

Also, a Python wrapper for the pretrained models can be found at https://github.com/Lizhen0909/pyLSHVec.

@WayneWu01
Author

WayneWu01 commented Jul 1, 2022 via email

@Lizhen0909
Owner

There are two tasks: embedding and classification.
For the actinomock example data, the second column looks like '47914-2616644829-Gammaproteobacteria-Proteobacteria'.
Here any of Proteobacteria, Gammaproteobacteria, or 2616644829 can be the label of this row (depending on which level of the taxonomy hierarchy you want).
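As an illustration, one could split that column and pick the label at a chosen level like this (a minimal sketch; the field order id-taxid-class-phylum is assumed from the example value above, and `pick_label` is a hypothetical helper, not part of LSHVec):

```python
# Sketch: extract a label at a chosen taxonomy level from the second
# column of the actinomock example data. The field order assumed here
# (read id, tax id, class, phylum) follows the example value above.
def pick_label(column_value: str, level: str = "phylum") -> str:
    read_id, tax_id, class_name, phylum = column_value.split("-")
    return {"taxid": tax_id, "class": class_name, "phylum": phylum}[level]

value = "47914-2616644829-Gammaproteobacteria-Proteobacteria"
print(pick_label(value, "phylum"))  # Proteobacteria
print(pick_label(value, "class"))   # Gammaproteobacteria
print(pick_label(value, "taxid"))   # 2616644829
```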

########
Embedding is an unsupervised model that produces embedding vectors for kmers/sequences.
Since it is unsupervised, there are no numerical metrics.
Also, because the dimension is high (e.g. 100), we have to use tools like t-SNE or UMAP to visualize the vectors in a low dimension (e.g. 2).
Then, with prior knowledge of the sequences (a.k.a. sequence labels, which are not used in training), we can judge whether the vectors are good or not (vectors with the same label are expected to cluster together).
(You may find visualization examples of language embeddings at https://projector.tensorflow.org/)
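The visualization step can be sketched like this (random vectors stand in for real kmer/sequence embeddings, and the t-SNE parameters are illustrative, not prescribed by LSHVec):

```python
# Sketch: project high-dimensional embedding vectors down to 2-D with
# t-SNE so they can be scatter-plotted and judged against known labels.
# `vectors` and `labels` are random placeholders for real embeddings
# (e.g. dimension 100) and held-out taxonomy labels.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 100))   # 200 sequences, 100-d embeddings
labels = rng.integers(0, 3, size=200)   # prior-knowledge labels, unused in training

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)
print(coords.shape)  # (200, 2) -- ready to scatter-plot, colored by `labels`
```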

###########
Classification is exactly the same as in machine learning.
Basically, the data is split into training and test datasets, the model is trained on the training dataset, and metrics are reported on the test dataset.
All metrics that apply to multiclass classification problems (accuracy, precision, recall, F1, AUC) can be used. I did not understand what FN meant earlier; I think you mean false negatives, which are part of the definition of precision and recall.
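To make the connection concrete, here is a small sketch of the usual multiclass metrics, including where the false negatives (FN) come from (the toy labels are made up for illustration):

```python
# Sketch: standard multiclass metrics on a held-out test set, plus the
# per-class false negatives (FN) that precision/recall are built from.
# `y_true`/`y_pred` are toy labels standing in for real test results.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = ["Proteobacteria", "Firmicutes", "Proteobacteria", "Firmicutes", "Proteobacteria"]
y_pred = ["Proteobacteria", "Proteobacteria", "Proteobacteria", "Firmicutes", "Firmicutes"]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# FN for a class = row sum of the confusion matrix minus its diagonal entry
cm = confusion_matrix(y_true, y_pred, labels=["Firmicutes", "Proteobacteria"])
fn = cm.sum(axis=1) - cm.diagonal()
print(acc, fn)  # 0.6 [1 1]
```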

@WayneWu01
Author

WayneWu01 commented Jul 1, 2022 via email

@Lizhen0909
Owner

Use the command line:
lshvec predict <model.bin> <test_data_file>
or
lshvec predict-prob <model.bin> <test_data_file>

@WayneWu01
Author

WayneWu01 commented Jul 3, 2022 via email

@Lizhen0909
Owner

Hashed data.
It should be the same as the training data.

@WayneWu01
Author

WayneWu01 commented Jul 7, 2022 via email

@WayneWu01
Author

WayneWu01 commented Jul 8, 2022 via email

@Lizhen0909
Owner

It should be something like this:

.../lshvec predict .../model.bin .../test.hash

Here predict or predict-prob is a subcommand; you can find them in src/main_fastseq.cc.
If you run the command without arguments, it should print the usage.
For example:

lshvec # prints all subcommands
lshvec predict # prints usage for the predict subcommand

@WayneWu01
Author

WayneWu01 commented Jul 9, 2022 via email

@WayneWu01
Author

WayneWu01 commented Jul 11, 2022 via email

@Lizhen0909
Owner

It seems that your model was trained for embedding but not for classification.
You may refer to https://github.com/Lizhen0909/LSHVec/blob/master/notebook/lsa_spike_fnv_classfication.ipynb, which is a bit older.
Notice cell 35, which uses the subcommand "supervised".

@WayneWu01
Author

WayneWu01 commented Jul 12, 2022 via email

@Lizhen0909
Owner

From the 'predict' subcommand you get predictions. Then it is up to you to evaluate the performance; there is no standard way.
For example, compare the predictions to the ground-truth labels to get accuracy, precision, and recall.
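That comparison can also be done "by hand" without any library, which makes the definitions explicit (a minimal sketch; the toy labels below are placeholders, and how you obtain `y_pred` from the predict output depends on its actual format):

```python
# Sketch: compare predicted labels against ground truth to get accuracy
# plus per-class precision and recall computed from first principles.
from collections import Counter

y_true = ["A", "B", "A", "A", "B"]  # ground-truth labels (placeholder)
y_pred = ["A", "B", "B", "A", "B"]  # labels from the predict step (placeholder)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)   # true positives per class
precision = {c: tp[c] / n for c, n in Counter(y_pred).items()}
recall = {c: tp[c] / n for c, n in Counter(y_true).items()}
print(accuracy, precision, recall)
```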

@WayneWu01
Author

WayneWu01 commented Jul 14, 2022 via email

@WayneWu01
Author

WayneWu01 commented Oct 11, 2022 via email

@WayneWu01
Author

WayneWu01 commented Oct 11, 2022 via email

@WayneWu01
Author

WayneWu01 commented Oct 11, 2022 via email
