About the results of the Actinomock Example #5
Comments
You may find examples in the 'notebook' folder. (The data link in the notebook may no longer exist, but the data can be downloaded from https://www.amazon.com/clouddrive/share/eTIKYVLckXUCMnMQSpO8TCqZOwekmBrx23ZhMa3XO8d.) Also, a Python wrapper for the pretrained models can be found at https://github.com/Lizhen0909/pyLSHVec. |
There are two tasks: embedding and classification. |
I could reproduce your result for the t-SNE part, but I have a question about how to
find the number of false negatives and the other numbers. Are you training the
model using, for example, 80 percent coverage of the sequences? How do you use
the mod.bin to predict the remaining 20 percent?
On Fri, Jul 1, 2022 at 5:08 PM Lizhen Shi ***@***.***> wrote:
There are two tasks: embedding and classification.
For the actinomock example data, the second column looks like
'47914-2616644829-Gammaproteobacteria-Proteobacteria'.
Here any of Proteobacteria, Gammaproteobacteria, or 2616644829 can be the
label of this row (depending on which level of the taxonomy
hierarchy you want).
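As a rough illustration of picking a label level, here is a minimal Python sketch. It assumes the second-column value always splits cleanly on hyphens (names that themselves contain hyphens would need extra handling), and the helper name `label_at` is hypothetical, not part of LSHVec.

```python
def label_at(field: str, position: int) -> str:
    """Return one hyphen-separated part of a second-column value.

    position counts from the right: 1 is the last part,
    2 is the part before it, and so on.
    """
    parts = field.split("-")
    if not 1 <= position <= len(parts):
        raise ValueError(f"position {position} is out of range for {field!r}")
    return parts[-position]


# The example value quoted in the reply above.
row = "47914-2616644829-Gammaproteobacteria-Proteobacteria"

for position in (1, 2, 3):
    # Each position gives one candidate label for this row.
    print(position, label_at(row, position))
```

Whichever position is chosen has to be used consistently for both the training and the test data.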
########
Embedding is an unsupervised model that produces embedding vectors for
kmers/sequences.
Since it is unsupervised, there are no numerical metrics.
Also, because the dimension is high (e.g. 100), we have to use tools like
t-SNE or UMAP to visualize the vectors in low dimension (e.g. 2).
Then, with prior knowledge of the sequences (i.e. the labels of the sequences,
which are not used in training), we can judge whether the vectors are good or not.
(It is expected that vectors with the same label cluster together.)
(You may find visualization examples of language embeddings at
https://projector.tensorflow.org/)
###########
Classification is exactly the same as it is elsewhere in machine learning.
Basically, the data is split into training and test datasets. The model is trained
on the training dataset. Metrics are reported on the test dataset.
All metrics (accuracy, precision, recall, F1, AUC) that apply to multiclass
classification problems can be used. I did not get what FN was earlier. I
think you mean false negative. It is part of the definition of recall,
TP / (TP + FN); precision, TP / (TP + FP), uses false positives instead.
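To make the FN counts concrete, here is a minimal Python sketch that derives per-class TP, FP, FN, precision and recall from true and predicted test labels. The labels below are made-up toy data, not results from the paper, and the function names are hypothetical; how the predicted labels are read from the tool's output is left out because that format is not described here.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1  # the true class was missed
            fp[pred] += 1   # the predicted class was wrongly assigned
    return tp, fp, fn


def precision_recall(y_true, y_pred):
    """Build a per-class report; a ratio with a zero denominator is reported as 0.0."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        report[label] = {
            "tp": tp[label],
            "fp": fp[label],
            "fn": fn[label],
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
        }
    return report


# Toy labels for illustration only (assumed, not from the actinomock data).
y_true = ["Proteobacteria", "Proteobacteria", "Firmicutes", "Firmicutes", "Actinobacteria"]
y_pred = ["Proteobacteria", "Firmicutes", "Firmicutes", "Firmicutes", "Proteobacteria"]

report = precision_recall(y_true, y_pred)
for label, stats in report.items():
    print(label, stats)
```

Averaging the per-class values (macro average) or summing the counts first (micro average) gives a single number if one is needed.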
|
use command line: |
Hashed data. |
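Should be something like this:
.../lshvec predict .../model.bin .../test.hash
Here predict or predict-prob is the subcommand; you can find them in src/main_fastseq.cc.
If you run the command without arguments, it should print the usage. For example:
lshvec #print all subcommands
lshvec predict #print usage for the predict subcommand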
|
I tried your way and it gave me all the details, but it still gave me the
error "Model needs to be supervised for prediction!". I read that k
and threshold are optional, so why did it still give me this error? So sorry
for bothering you again and again!
On Fri, Jul 8, 2022 at 12:59 PM Lizhen Shi ***@***.***>
wrote:
> Should be something like this:
>
> .../lshvec predict .../model.bin .../test.hash
>
> Here predict or predict-prob is subcommand, you can find them at
> src/main_fastseq.cc
> If you run command without arguments, it should print usages.
> For example:
>
> lshvec #print all subcommands
> lshvec predict #print usage for predict sub commands
>
|
I tried exactly your way for prediction. It still gave me the error
"Model needs to be supervised for prediction!". How did you run prediction
with your model? Sorry for bothering you again.
|
It seems that your model was trained for embedding but not for classification. |
The 'predict' subcommand gives you predictions. How you evaluate the performance is then up to you; there is no standard procedure. |
I have a question about how to evaluate the results of the example. I read your paper, but I don't know how to find the FN values. Could you elaborate?