2022_log
Zenodo API: could work but is a bit complex.
To upload I could have tried https://github.com/jhpoelen/zenodo-upload, but it wasn't necessary; in the end I just uploaded the files from my computer, scp'ed over from UA HPC.
DOI: https://zenodo.org/record/5821976
To retrieve a public Zenodo dataset we don't need anything fancy: just copy the download link and curl it with an output file name, as in this example from this Zenodo page:
curl -o test.pdf https://zenodo.org/record/5068997/files/Event%20metadata.pdf?download=1
This works for mine as well e.g.:
curl -o sample_metadata.tsv https://zenodo.org/record/5821976/files/sample_metadata.tsv?download=1
Login for lytic with UA NetID and password:
Need to be on the UA VPN, and for some reason it's using my old UA NetID password (see Enpass).
BCODMO:
Try to make cleaner OWL outputs for the patterns I've started playing with in /Users/kai/Desktop/scratch/BCODMO/model.
Tarql requires Java 1.8 or above.
Option 1) Trying this
From https://tarql.github.io/docs/ links to https://github.com/tarql/tarql/releases
wget https://github.com/tarql/tarql/releases/download/v1.2/tarql-1.2.tar.gz
tar -xvzf tarql-1.2.tar.gz
Had to type this manually; copying it from the text editor didn't work, likely because the pasted dash was an en-dash rather than a plain hyphen.
This seems to have worked; now test tarql.
location of tarql /home/u19/kblumberg/tarql-1.2/bin/tarql
/home/u19/kblumberg/tarql-1.2/bin/tarql sample-2.sparql TechCrunchcontinentalUSA.csv
This works. Now add /home/u19/kblumberg/tarql-1.2/bin to the PATH.
vim ~/.bash_profile
# Tarql
export PATH=$PATH:/home/u19/kblumberg/tarql-1.2/bin # add this line to the bash_profile
source ~/.bash_profile
Now tarql works, e.g. tarql sample-2.sparql TechCrunchcontinentalUSA.csv
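For the actual mappings, tarql writes Turtle to stdout, so a minimal sketch of triplifying one of my own tables might be (the mapping and CSV names here are hypothetical placeholders):
tarql my_mapping.sparql my_data.csv > my_data.ttl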
Option 2) Didn't finish this; went with Option 1.
git clone https://github.com/cygri/tarql
If you don't have mvn need to install it:
see https://maven.apache.org/install.html
Tarql step 2
mvn clean install -DskipTests
downloaded apache-jena-X.X.X.tar.gz from https://jena.apache.org/download/index.cgi
wget https://dlcdn.apache.org/jena/binaries/apache-jena-4.3.2.tar.gz
tar -xvzf apache-jena-4.3.2.tar.gz
Add the following to the bash_profile:
# Apache Jena
export JENA_HOME=/home/u19/kblumberg/apache-jena-4.3.2
export PATH=$PATH:$JENA_HOME/bin
Testing the Apache Jena install: tdb2.tdbloader
gives a Java error. Need to install Java 12 without sudo access. Made several attempts; the following finally worked.
From openjdk https://jdk.java.net/archive/ use this (note the similar instructions from https://opensource.com/article/19/11/install-java-linux)
wget https://download.java.net/java/GA/jdk12.0.2/e482c34c86bd4bf8b56c0b35558996b9/10/GPL/openjdk-12.0.2_linux-x64_bin.tar.gz
tar -xvzf openjdk-12.0.2_linux-x64_bin.tar.gz
Add the following to the bash_profile after consulting this page
# Java12
export JAVA_12_HOME=/home/u19/kblumberg/bin/jdk-12.0.2/
export PATH=$JAVA_12_HOME/bin:$PATH
## if these are commented out java8 becomes the default
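With Java 12 now first on the PATH, re-testing the loader should work; a minimal sketch, with the database directory and data file as assumed placeholder paths:
java -version # should now report openjdk 12.0.2
tdb2.tdbloader --loc=/path/to/tdb2_db data/example.ttl # hypothetical paths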
See downloads page
get the latest apache-jena-fuseki-X.X.X.tar.gz
wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-4.3.2.tar.gz
tar -xvzf apache-jena-fuseki-4.3.2.tar.gz
Add the following to the bash_profile:
# Apache fuseki
export FUSEKI_HOME=/home/u19/kblumberg/apache-jena-fuseki-4.3.2
export PATH=$PATH:$FUSEKI_HOME
It works in a test: I ran fuseki-server and it started.
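To serve an actual TDB2 database rather than the default in-memory dataset, the invocation is presumably along these lines (the database location and the /ds dataset name are assumptions based on the Fuseki docs):
fuseki-server --loc=/path/to/tdb2_db /ds # then browse http://localhost:3030/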
Next steps: see http://loopasam.github.io/jena-doc/documentation/serving_data/ for the Fuseki server. If I can get it to work over HTTP, try to mix in examples from https://github.com/apache/jena/tree/main/jena-fuseki2/examples and https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html; also https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html, https://github.com/JPL-IMCE/gov.nasa.jpl.imce.ontologies.fuseki, https://medium.com/@rrichajalota234/how-to-apache-jena-fuseki-3-x-x-1304dd810f09, and https://gist.github.com/afs/63a80512cdc55caf77d0
https://managewp.com/blog/how-to-access-a-local-website-from-internet-with-port-forwarding and https://www.linuxandubuntu.com/home/how-to-setup-a-web-server-and-host-website-on-your-own-linux-computer
Based on ip route and ifconfig, 10.140.114.14 should be lytic's IP.
Stack overflow post about access-localhost-from-the-internet
The above leads to https://ngrok.com/, which provides public URLs for exposing a local web server. Looks like I might be able to use it for free, but the URL will change when the server stops and starts. In principle that's not the worst thing: once it's configured, just set it up once and never turn it off. According to this page, "If you are not a paid user of ngrok then your ngrok session expires in 8 hours."
Also links to http://localtunnel.me/. Sounds similar, seems more open source (no paid plans). Same restriction: you receive a URL which remains active while your local instance is up. Requires Node.js, so need npm install. See https://docs.npmjs.com/downloading-and-installing-node-js-and-npm and https://github.com/nodesource/distributions on how to install npm; might need sudo to do it though, TBD. This seems very simple if I can get npm installed. Lytic has npm! EXCELLENT! Tried npm install -g localtunnel but I need sudo to run it. Could maybe ask Bonnie or Matt. Can perhaps try this on a CyVerse VM where I have sudo and see if it works to expose Fuseki.
Also links to http://localhost.run/, which uses ssh and is a bit more complicated than localtunnel.
See https://www.softwaretestinghelp.com/ngrok-alternatives/ for more including Serveo
Trying localtunnel on cyverse ATM image.
install npm with sudo works
sudo npm install -g localtunnel
-> has error Error: Missing required argument #1
From here try:
sudo npm install -g n
sudo n latest
Now sudo npm install -g localtunnel
worked.
Install fuseki and test local tunnel. To test:
fuseki-server ## or
./fuseki-server & # gave PID 29767
lt --port 3030 # http://localhost:3030/ -> your url is: https://heavy-mouse-68.loca.lt
Going to the above url I'm seeing the apache jena fuseki server!!!!!!! HELLL YAAAA!
Kill process from PID kill -15 29767
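With the tunnel up, a hedged sketch of querying the exposed SPARQL endpoint from anywhere (the /ds dataset name is an assumption; substitute whatever dataset the Fuseki UI shows):
curl -G 'https://heavy-mouse-68.loca.lt/ds/sparql' \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 5'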
https://ontologforum.org/index.php/KaiBlumberg
https://gleaner.io/ https://oceannetworks.ca/about-us https://github.com/iodepo/odis-arch https://book.oceaninfohub.org/ https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#metadata https://search.dataone.org/portals/toolik https://validator.schema.org/ http://knowwheregraph.org/prototypes/geoEnrichment/ (Mark S)
No longer will need triplestore/config/planet_microbe/supplemental_construct.sparql
or triplestore/planet_microbe_datatsv_to_triplify/qualitative_attributes_other.tsv
instead get this info and more from triplestore/wgs_annotation_data/sample_metadata.tsv
Then can remove the last output block from planet_microbe_qualitative.py
3:35 ran the wgs_annotation_data
part of the prepare triplestore script. Finished by 4:00.
https://knowledgestructure.pubpub.org/space
Resources for this session (link supplementary docs or presentations here)
We will add links to shared slides here, we have also added some links of interest here.
ESIP Semantic Technologies Committee:
RDF vs Property Graph from Global Data Geeks Youtube channel
Graph DBs: Neo4j (https://neo4j.com/), TinkerPop (TinkerPop DBs), SPARQL (triplestores)
Neo4j model TinkerPop RDF GQL
Emerging work on RDF and SPARQL "star". https://w3c.github.io/rdf-star/ To blend RDF and property graph approaches.
KGLab KG to ML toolkit
https://ontologforum.s3.amazonaws.com/OntologySummit2020/Communique/OntologySummit2020Communique.pdf
https://github.com/charlesvardeman/fuseki-geosparql-docker check this out to see if I can use it for my PhD triplestore.
https://github.com/topics/obofoundry
From Doug Fils: https://caddyserver.com/. A simple Caddyfile config for the above might be as easy as:
(cors) {
@origin header Origin *
header @origin Access-Control-Allow-Origin *
header @origin Access-Control-Request-Method *
}
example.org {
import cors
reverse_proxy localhost:3030
}
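If I try this, running Caddy with the file above should just be (assuming caddy is installed and the Caddyfile is in the current directory):
caddy run --config Caddyfile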
./bash/planet_microbe_data.sh: line 15: jq: command not found
...
SyntaxError: invalid syntax
File "../scripts//planet_microbe_numeric.py", line 56
print(msg, file=sys.stderr)
^
Make sure to install jq, and the default python might be Python 2, so I'll update the bash script to use python3.
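A minimal sketch of checking both prerequisites before rerunning the scripts; the conda-based jq install is an assumption (jq is packaged on conda-forge), not something the repo requires:
conda install -c conda-forge jq     # no-sudo way to get jq, assuming conda is available
command -v jq && python3 --version  # both should succeed before rerunning the bash scripts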
Stats on Loading my PhD triplestore:
06:10:08 INFO loader :: Loader = LoaderPhased
06:10:08 INFO loader :: Start: 13 files
06:10:15 INFO loader :: Add: 1,000,000 merged_go_data.ttl (Batch: 139,899 / Avg: 139,899)
06:10:20 INFO loader :: Add: 2,000,000 merged_go_data.ttl (Batch: 186,462 / Avg: 159,859)
06:10:29 INFO loader :: Add: 3,000,000 merged_go_data.ttl (Batch: 114,955 / Avg: 141,442)
06:10:32 INFO loader :: End file: merged_go_data.ttl (triples/quads = 3,391,879)
06:10:39 INFO loader :: Add: 4,000,000 merged_interpro_data.ttl (Batch: 93,405 / Avg: 125,328)
06:10:51 INFO loader :: Add: 5,000,000 merged_interpro_data.ttl (Batch: 89,919 / Avg: 116,179)
06:11:03 INFO loader :: Add: 6,000,000 merged_interpro_data.ttl (Batch: 82,603 / Avg: 108,808)
06:11:13 INFO loader :: Add: 7,000,000 merged_interpro_data.ttl (Batch: 93,799 / Avg: 106,376)
06:11:26 INFO loader :: Add: 8,000,000 merged_interpro_data.ttl (Batch: 79,396 / Avg: 102,042)
06:11:37 INFO loader :: Add: 9,000,000 merged_interpro_data.ttl (Batch: 86,460 / Avg: 100,038)
06:11:50 INFO loader :: Add: 10,000,000 merged_interpro_data.ttl (Batch: 80,619 / Avg: 97,685)
06:11:50 INFO loader :: Elapsed: 102.37 seconds [2022/01/28 06:11:50 MST]
06:12:02 INFO loader :: Add: 11,000,000 merged_interpro_data.ttl (Batch: 85,026 / Avg: 96,381)
06:12:13 INFO loader :: Add: 12,000,000 merged_interpro_data.ttl (Batch: 85,492 / Avg: 95,369)
06:12:25 INFO loader :: Add: 13,000,000 merged_interpro_data.ttl (Batch: 84,817 / Avg: 94,465)
06:12:36 INFO loader :: Add: 14,000,000 merged_interpro_data.ttl (Batch: 89,166 / Avg: 94,065)
06:12:47 INFO loader :: Add: 15,000,000 merged_interpro_data.ttl (Batch: 96,983 / Avg: 94,254)
06:12:58 INFO loader :: Add: 16,000,000 merged_interpro_data.ttl (Batch: 86,460 / Avg: 93,726)
06:13:09 INFO loader :: Add: 17,000,000 merged_interpro_data.ttl (Batch: 92,225 / Avg: 93,637)
06:13:19 INFO loader :: Add: 18,000,000 merged_interpro_data.ttl (Batch: 98,697 / Avg: 93,904)
06:13:31 INFO loader :: Add: 19,000,000 merged_interpro_data.ttl (Batch: 81,806 / Avg: 93,179)
06:13:39 INFO loader :: End file: merged_interpro_data.ttl (triples/quads = 16,193,847)
06:13:43 INFO loader :: Add: 20,000,000 merged_ncbitaxon_data.ttl (Batch: 86,400 / Avg: 92,815)
06:13:43 INFO loader :: Elapsed: 215.48 seconds [2022/01/28 06:13:43 MST]
06:13:54 INFO loader :: Add: 21,000,000 merged_ncbitaxon_data.ttl (Batch: 93,580 / Avg: 92,851)
06:14:07 INFO loader :: Add: 22,000,000 merged_ncbitaxon_data.ttl (Batch: 77,507 / Avg: 92,023)
06:14:17 INFO loader :: Add: 23,000,000 merged_ncbitaxon_data.ttl (Batch: 99,472 / Avg: 92,323)
06:14:28 INFO loader :: Add: 24,000,000 merged_ncbitaxon_data.ttl (Batch: 88,315 / Avg: 92,149)
06:14:40 INFO loader :: Add: 25,000,000 merged_ncbitaxon_data.ttl (Batch: 84,968 / Avg: 91,839)
06:14:51 INFO loader :: Add: 26,000,000 merged_ncbitaxon_data.ttl (Batch: 92,618 / Avg: 91,868)
06:15:02 INFO loader :: Add: 27,000,000 merged_ncbitaxon_data.ttl (Batch: 83,920 / Avg: 91,547)
06:15:14 INFO loader :: Add: 28,000,000 merged_ncbitaxon_data.ttl (Batch: 88,167 / Avg: 91,422)
06:15:25 INFO loader :: Add: 29,000,000 merged_ncbitaxon_data.ttl (Batch: 87,950 / Avg: 91,298)
06:15:38 INFO loader :: Add: 30,000,000 merged_ncbitaxon_data.ttl (Batch: 78,131 / Avg: 90,788)
06:15:38 INFO loader :: Elapsed: 330.44 seconds [2022/01/28 06:15:38 MST]
06:15:49 INFO loader :: Add: 31,000,000 merged_ncbitaxon_data.ttl (Batch: 91,810 / Avg: 90,820)
06:15:59 INFO loader :: Add: 32,000,000 merged_ncbitaxon_data.ttl (Batch: 94,544 / Avg: 90,932)
06:16:12 INFO loader :: Add: 33,000,000 merged_ncbitaxon_data.ttl (Batch: 79,189 / Avg: 90,526)
06:16:23 INFO loader :: Add: 34,000,000 merged_ncbitaxon_data.ttl (Batch: 93,571 / Avg: 90,612)
06:16:33 INFO loader :: Add: 35,000,000 merged_ncbitaxon_data.ttl (Batch: 95,265 / Avg: 90,739)
06:16:46 INFO loader :: Add: 36,000,000 merged_ncbitaxon_data.ttl (Batch: 76,487 / Avg: 90,272)
06:16:58 INFO loader :: Add: 37,000,000 merged_ncbitaxon_data.ttl (Batch: 84,459 / Avg: 90,104)
06:17:10 INFO loader :: Add: 38,000,000 merged_ncbitaxon_data.ttl (Batch: 87,374 / Avg: 90,030)
06:17:22 INFO loader :: Add: 39,000,000 merged_ncbitaxon_data.ttl (Batch: 80,919 / Avg: 89,771)
06:17:32 INFO loader :: Add: 40,000,000 merged_ncbitaxon_data.ttl (Batch: 100,745 / Avg: 90,016)
06:17:32 INFO loader :: Elapsed: 444.36 seconds [2022/01/28 06:17:32 MST]
06:17:43 INFO loader :: Add: 41,000,000 merged_ncbitaxon_data.ttl (Batch: 87,191 / Avg: 89,945)
06:17:56 INFO loader :: Add: 42,000,000 merged_ncbitaxon_data.ttl (Batch: 81,162 / Avg: 89,714)
06:17:58 INFO loader :: End file: merged_ncbitaxon_data.ttl (triples/quads = 22,542,188)
06:17:58 INFO loader :: End file: sample_metadata.ttl (triples/quads = 4,914)
06:17:58 INFO loader :: End file: broad_scale_data.ttl (triples/quads = 2,457)
06:17:58 INFO loader :: End file: local_scale_data.ttl (triples/quads = 2,457)
06:17:58 INFO loader :: End file: material_data.ttl (triples/quads = 2,457)
06:17:58 INFO loader :: End file: merged_numeric_data.ttl (triples/quads = 47,230)
06:17:58 INFO loader :: End file: spatiotemporal.ttl (triples/quads = 4,914)
06:17:58 INFO loader :: End file: supplemental_data.ttl (triples/quads = 2,457)
06:18:07 INFO loader :: Add: 43,000,000 go.owl (Batch: 84,889 / Avg: 89,595)
06:18:15 INFO loader :: End file: go.owl (triples/quads = 1,425,490)
06:18:25 INFO loader :: Add: 44,000,000 ncbitaxon.owl (Batch: 58,387 / Avg: 88,520)
06:18:43 INFO loader :: Add: 45,000,000 ncbitaxon.owl (Batch: 55,682 / Avg: 87,375)
06:18:56 INFO loader :: Add: 46,000,000 ncbitaxon.owl (Batch: 74,432 / Avg: 87,046)
06:19:11 INFO loader :: Add: 47,000,000 ncbitaxon.owl (Batch: 65,915 / Avg: 86,456)
06:19:28 INFO loader :: Add: 48,000,000 ncbitaxon.owl (Batch: 57,954 / Avg: 85,579)
06:19:45 INFO loader :: Add: 49,000,000 ncbitaxon.owl (Batch: 61,218 / Avg: 84,890)
06:20:01 INFO loader :: Add: 50,000,000 ncbitaxon.owl (Batch: 59,687 / Avg: 84,179)
06:20:01 INFO loader :: Elapsed: 593.97 seconds [2022/01/28 06:20:01 MST]
06:20:17 INFO loader :: Add: 51,000,000 ncbitaxon.owl (Batch: 63,275 / Avg: 83,637)
06:20:36 INFO loader :: Add: 52,000,000 ncbitaxon.owl (Batch: 52,364 / Avg: 82,687)
06:20:56 INFO loader :: Add: 53,000,000 ncbitaxon.owl (Batch: 51,316 / Avg: 81,745)
06:21:13 INFO loader :: Add: 54,000,000 ncbitaxon.owl (Batch: 57,208 / Avg: 81,100)
06:21:32 INFO loader :: Add: 55,000,000 ncbitaxon.owl (Batch: 52,328 / Avg: 80,298)
06:21:50 INFO loader :: Add: 56,000,000 ncbitaxon.owl (Batch: 57,097 / Avg: 79,719)
06:22:08 INFO loader :: Add: 57,000,000 ncbitaxon.owl (Batch: 54,767 / Avg: 79,087)
06:22:27 INFO loader :: Add: 58,000,000 ncbitaxon.owl (Batch: 54,418 / Avg: 78,474)
06:22:44 INFO loader :: Add: 59,000,000 ncbitaxon.owl (Batch: 57,590 / Avg: 77,994)
06:22:58 INFO loader :: End file: ncbitaxon.owl (triples/quads = 16,105,832)
06:23:03 INFO loader :: End file: pmo.owl (triples/quads = 268,611)
06:23:03 INFO loader :: Finished: 13 files: 59,994,733 tuples in 775.36s (Avg: 77,376)
06:23:26 INFO loader :: Finish - index SPO
06:23:26 INFO loader :: Start replay index SPO
06:23:26 INFO loader :: Index set: SPO => SPO->POS, SPO->OSP
06:23:26 INFO loader :: Add: 1,000,000 Index (Batch: 3,076,923 / Avg: 3,076,923)
06:23:31 INFO loader :: Add: 2,000,000 Index (Batch: 223,964 / Avg: 417,536)
06:23:37 INFO loader :: Add: 3,000,000 Index (Batch: 174,489 / Avg: 285,143)
06:23:43 INFO loader :: Add: 4,000,000 Index (Batch: 149,543 / Avg: 232,450)
06:23:49 INFO loader :: Add: 5,000,000 Index (Batch: 179,083 / Avg: 219,375)
06:23:53 INFO loader :: Add: 6,000,000 Index (Batch: 252,525 / Avg: 224,282)
06:24:00 INFO loader :: Add: 7,000,000 Index (Batch: 134,318 / Avg: 204,696)
06:24:08 INFO loader :: Add: 8,000,000 Index (Batch: 124,626 / Avg: 189,479)
06:24:14 INFO loader :: Add: 9,000,000 Index (Batch: 183,654 / Avg: 188,813)
06:24:21 INFO loader :: Add: 10,000,000 Index (Batch: 148,456 / Avg: 183,816)
06:24:21 INFO loader :: Elapsed: 54.40 seconds [2022/01/28 06:24:21 MST]
06:24:27 INFO loader :: Add: 11,000,000 Index (Batch: 148,104 / Avg: 179,873)
06:24:32 INFO loader :: Add: 12,000,000 Index (Batch: 234,631 / Avg: 183,441)
06:24:37 INFO loader :: Add: 13,000,000 Index (Batch: 195,121 / Avg: 184,289)
06:24:41 INFO loader :: Add: 14,000,000 Index (Batch: 232,234 / Avg: 187,048)
06:24:46 INFO loader :: Add: 15,000,000 Index (Batch: 206,825 / Avg: 188,248)
06:24:50 INFO loader :: Add: 16,000,000 Index (Batch: 239,463 / Avg: 190,798)
06:24:54 INFO loader :: Add: 17,000,000 Index (Batch: 233,208 / Avg: 192,861)
06:25:00 INFO loader :: Add: 18,000,000 Index (Batch: 186,776 / Avg: 192,513)
06:25:06 INFO loader :: Add: 19,000,000 Index (Batch: 151,423 / Avg: 189,802)
06:25:13 INFO loader :: Add: 20,000,000 Index (Batch: 140,567 / Avg: 186,535)
06:25:13 INFO loader :: Elapsed: 107.22 seconds [2022/01/28 06:25:13 MST]
06:25:19 INFO loader :: Add: 21,000,000 Index (Batch: 189,537 / Avg: 186,676)
06:25:24 INFO loader :: Add: 22,000,000 Index (Batch: 202,798 / Avg: 187,353)
06:25:29 INFO loader :: Add: 23,000,000 Index (Batch: 183,553 / Avg: 187,185)
06:25:35 INFO loader :: Add: 24,000,000 Index (Batch: 170,561 / Avg: 186,428)
06:25:41 INFO loader :: Add: 25,000,000 Index (Batch: 165,043 / Avg: 185,466)
06:25:50 INFO loader :: Add: 26,000,000 Index (Batch: 114,129 / Avg: 181,112)
06:25:58 INFO loader :: Add: 27,000,000 Index (Batch: 115,713 / Avg: 177,399)
06:26:03 INFO loader :: Add: 28,000,000 Index (Batch: 199,600 / Avg: 178,106)
06:26:11 INFO loader :: Add: 29,000,000 Index (Batch: 128,600 / Avg: 175,773)
06:26:18 INFO loader :: Add: 30,000,000 Index (Batch: 150,966 / Avg: 174,816)
06:26:18 INFO loader :: Elapsed: 171.61 seconds [2022/01/28 06:26:18 MST]
06:26:24 INFO loader :: Add: 31,000,000 Index (Batch: 166,417 / Avg: 174,531)
06:26:30 INFO loader :: Add: 32,000,000 Index (Batch: 169,606 / Avg: 174,373)
06:26:36 INFO loader :: Add: 33,000,000 Index (Batch: 170,940 / Avg: 174,267)
06:26:41 INFO loader :: Add: 34,000,000 Index (Batch: 180,050 / Avg: 174,432)
06:26:48 INFO loader :: Add: 35,000,000 Index (Batch: 147,123 / Avg: 173,512)
06:26:56 INFO loader :: Add: 36,000,000 Index (Batch: 116,103 / Avg: 171,161)
06:27:04 INFO loader :: Add: 37,000,000 Index (Batch: 132,872 / Avg: 169,838)
06:27:10 INFO loader :: Add: 38,000,000 Index (Batch: 168,265 / Avg: 169,796)
06:27:17 INFO loader :: Add: 39,000,000 Index (Batch: 132,890 / Avg: 168,596)
06:27:26 INFO loader :: Add: 40,000,000 Index (Batch: 115,888 / Avg: 166,700)
06:27:26 INFO loader :: Elapsed: 239.95 seconds [2022/01/28 06:27:26 MST]
06:27:32 INFO loader :: Add: 41,000,000 Index (Batch: 178,030 / Avg: 166,959)
06:27:38 INFO loader :: Add: 42,000,000 Index (Batch: 148,016 / Avg: 166,452)
06:27:45 INFO loader :: Add: 43,000,000 Index (Batch: 157,903 / Avg: 166,243)
06:27:53 INFO loader :: Add: 44,000,000 Index (Batch: 127,779 / Avg: 165,113)
06:27:58 INFO loader :: Add: 45,000,000 Index (Batch: 176,211 / Avg: 165,345)
06:28:05 INFO loader :: Add: 46,000,000 Index (Batch: 157,728 / Avg: 165,171)
06:28:11 INFO loader :: Add: 47,000,000 Index (Batch: 147,080 / Avg: 164,740)
06:28:17 INFO loader :: Add: 48,000,000 Index (Batch: 168,208 / Avg: 164,811)
06:28:25 INFO loader :: Add: 49,000,000 Index (Batch: 130,582 / Avg: 163,934)
06:28:32 INFO loader :: Add: 50,000,000 Index (Batch: 143,740 / Avg: 163,475)
06:28:32 INFO loader :: Elapsed: 305.86 seconds [2022/01/28 06:28:32 MST]
06:28:38 INFO loader :: Add: 51,000,000 Index (Batch: 164,122 / Avg: 163,487)
06:28:44 INFO loader :: Add: 52,000,000 Index (Batch: 176,959 / Avg: 163,727)
06:28:49 INFO loader :: Add: 53,000,000 Index (Batch: 184,911 / Avg: 164,082)
06:28:55 INFO loader :: Add: 54,000,000 Index (Batch: 185,770 / Avg: 164,437)
06:29:01 INFO loader :: Add: 55,000,000 Index (Batch: 148,434 / Avg: 164,115)
06:29:10 INFO loader :: Add: 56,000,000 Index (Batch: 120,467 / Avg: 163,060)
06:29:16 INFO loader :: Add: 57,000,000 Index (Batch: 156,225 / Avg: 162,935)
06:29:22 INFO loader :: Add: 58,000,000 Index (Batch: 176,772 / Avg: 163,155)
06:29:28 INFO loader :: Add: 59,000,000 Index (Batch: 156,985 / Avg: 163,047)
06:29:33 INFO loader :: Index set: SPO => SPO->POS, SPO->OSP [59,535,659 items, 366.6 seconds]
06:29:36 INFO loader :: Finish - index OSP
06:29:40 INFO loader :: Finish - index POS
06:29:40 INFO loader :: Time = 1,172.670 seconds : Triples = 59,994,733 : Rate = 51,161 /s
Observations on queries:
- The basic query works, but the env context local query with ABP (ENVO:01000813) only gets ~160 of the samples, presumably because layer terms (Tara and I think HOT) aren't subclasses of ABP in ENVO (or at least not in PMO's ENVO import).
- When I try basic query3 with BFO material entity instead of the local context it takes forever, which it didn't with subclasses of ABP. I'm wondering if it's doing something stupid like running the subclass query for each sample over and over. Should try, for example, using a HAVING() block for the ENVO triad terms like I did in my BCODMO example, i.e. a subquery where, after selecting on the other fields like quantifiers or GO terms, I then sub-select on the subclasses of ENVO terms. Could also try, like in my BCODMO example (or some iteration of it), using FILTER IN (list_of_subclasses); see the sketch below.
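A rough sketch of the subquery idea, with hypothetical names throughout (the property linking samples to their ENVO terms and the TDB2 location are placeholders, not the real schema): compute the ABP subclass list once in a subquery, then join it to the samples instead of repeating the subClassOf* walk per sample.
cat > envo_subquery_test.rq <<'EOF'
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo:  <http://purl.obolibrary.org/obo/>
SELECT ?sample ?env WHERE {
  { SELECT DISTINCT ?env WHERE { ?env rdfs:subClassOf* obo:ENVO_01000813 } }
  ?sample ?env_link ?env .   # ?env_link stands in for whatever property links samples to ENVO terms
}
EOF
tdb2.tdbquery --loc=/path/to/tdb2_db --query=envo_subquery_test.rq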
Rough triplestore testing: doing an NCBITaxon subclass* query with Synechococcales (NCBITaxon:1890424).
Medium/large NCBITaxon test; not finishing in the 1.5-2 hr range.
https://www.ontology-of-designing.ru/article/2021_4(42)/Ontology_Of_Designing_4_2021_402-421_Azamat_Abdoullaev.pdf see Figure 1 - Gartner Hype Cycle for Artificial Intelligence 2021
where semantic search is right at the slope of enlightenment. Could use this in my defense in the intro as a global picture of things. Maybe cite the paper in my dissertation intro too?
testing go_no_envo.rq
with GO:0043169
it took: 2m31.278s
Testing ncbitaxon_only.rq
with NCBITaxon:1890424
started ~noon; by 9 pm it was still going.
2022 OBI Workshop COB Data 2022
Doing subclass* on Synechococcales NCBITaxon:1890424 takes ~3 seconds, so that's not the problem.
Trying ncbitaxon_only.rq
with NCBITaxon:2784134
Gibliniella took 1m13.041s with no results; try a new small test.
Try again with NCBITaxon:2649294 unclassified Janibacter
as I know we should have at least NCBITaxon:2761047
in our test data. There are 181 subclasses, so it's a fairly small test but not tiny. Starting ~noon; after an hour it failed from a broken link to the server.
Try with just the subclasses of NCBITaxon:2761047, which should be just the one class. Took 0m22.204s. Good sanity check.
Trying the ncbitaxon query ncbitaxon_no_sc_query.rq without a subclass* constraint, i.e. get all samples with all nodes that have taxon annotations. Started 1:20. The idea is to do this first, then throw in a HAVING() block with the subclass* constraint as a subquery and see if that runs faster. It worked after 4m5.384s; I canceled it because it would have kept writing to STDOUT forever, but the query itself was quick.
Try ncbitax_having.rq with the subclass* constraint in a HAVING block, again with NCBITaxon:2649294; started 1:37, canceled at 3:21 for time reasons since CyVerse maintenance is coming up.
On lytic
Tested 1_sample_ncbitaxon_test.rq with NCBITaxon:2649294 (181 subclasses); it didn't finish, probably ran out of memory.
Tested 1_sample_ncbitaxon_test.rq with a subclass* query with only one class and for just one sample, using the having block HAVING (?taxon IN (?tax_list) && ?sample="SRR9178442"); it takes 3m28.302s, which is perhaps not ideal. This was with BIND. Try without BIND, given this Stack Overflow post, although that might just be for MarkLogic. Without BIND it took 2m59.466s, so a good 30 seconds shorter for that one sample, but it didn't return data, so the query is probably wrong.
Trying without the HAVING block or NCBITaxon, just samples: without BIND 0m1.744s, with BIND 0m1.802s, so perhaps not too different. Would need to try a longer query. TODO.
Also see https://hal.archives-ouvertes.fr/hal-01280951/file/Slides%20WebIST16.pdf, slide 21, about pushing down the filter with a nested SELECT and WHERE; the argument being that we don't know how good the DB is, and the filter uses the DB. In my case TDB2 is indexed, so maybe filtering first is faster? Try both.
Also see https://dotnetrdf.org/docs/stable/developer_guide/SPARQL-Optimization.html. Basically move the bind and filter statements up.
Trying taxslim instead of the full ncbitaxon.owl
-rw-r--r--. 1 kblumberg 1.8G Dec 14 03:58 ncbitaxon.owl
-rw-r--r--. 1 kblumberg 127M Jan 18 13:05 go.owl
-rw-r--r--. 1 kblumberg 22M Feb 1 07:51 pmo.owl
-rw-r--r--. 1 kblumberg 3.4M Feb 4 01:14 taxslim.obo
-rw-r--r--. 1 kblumberg 23M Feb 4 01:14 taxslim.owl
Wow, taxslim is a lot smaller! Hopefully we don't miss too many results with this... If it misses too much, I can make my own extract version of NCBITaxon and host it in the PMO repository. Using taxslim requires robot to do the conversion from obo to owl.
install robot:
wget https://github.com/ontodev/robot/releases/download/v1.8.3/robot.jar
curl https://raw.githubusercontent.com/ontodev/robot/master/bin/robot > robot
Note you also need to make robot executable with chmod +x, which might require sudo.
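With robot on the PATH, the obo-to-owl conversion mentioned above should just be (file names as in the listing above):
robot convert --input taxslim.obo --output taxslim.owl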
cut -f 2 /home/u19/kblumberg/planet-microbe-semantic-web-analysis/triplestore/wgs_annotation_data/merged_ncbitaxon_data.tsv | sort | uniq > ncbitaxon_terms.txt
cut -f 2 /home/u19/kblumberg/planet-microbe-semantic-web-analysis/triplestore/wgs_annotation_data/merged_go_data.tsv | sort | uniq > go_terms.txt
Testing the BOT and SLME extract methods for GO and NCBITaxon with our term lists:
cat go_BOT.csv | wc -l
6936
cat go_SLME.owl.csv | wc -l
6935
cat ncbitaxon_BOT.csv | wc -l
29267
cat ncbitaxon_SLME.csv | wc -l
29267
Only one more term in GO's BOT than SLME, and the same count for both in NCBITaxon, so let's just go with BOT. I also tested to make sure the remove commands don't get rid of terms, and they don't.
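For reference, a sketch of the extraction commands behind those counts (flags per the robot docs; the .owl output names are assumptions, and the CSV listings above were presumably derived from these extracts in a separate step):
robot extract --method BOT --input go.owl --term-file go_terms.txt --output go_BOT.owl
robot extract --method BOT --input ncbitaxon.owl --term-file ncbitaxon_terms.txt --output ncbitaxon_BOT.owl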
./start_fuseki_server.sh &
[1] 7377
This works on the VM but not on my computer with the localtunnel temp URL.
./query/assemble_query.py -u query/base_metadata.rq -o api_results/base_metadata.csv
#Try installing httpsclient to deal with EOF error in https://stackoverflow.com/questions/47142848/python-sslerror-bad-handshake-unexpected-eof#47816648
pip install ndg-httpsclient
lt --port 3030 &
./assemble_query.py -o output/go.csv -g GO:0043169
scp [email protected]:/home/kblumberg/planet-microbe-semantic-web-analysis/analysis/api_results/go.csv .
https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html
Check dates for summer graduation and send to Bonnie.
https://www.nature.com/articles/s41587-020-0603-3 for Heidi's work TBD Bonnie suggests to use this database.
https://www.linkedin.com/pulse/western-science-technology-decadence-china-takes-over-abdoullaev/?published=t https://www.linkedin.com/in/azamat-abdoullaev-335a0881/ https://www.linkedin.com/feed/update/urn:li:activity:6676906939940790272/?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A6676906939940790272%29 https://www.quora.com/What-are-some-countries-where-capitalism-has-failed/answer/Kiryl-Persianov https://www.linkedin.com/pulse/western-science-decadence-european-research-council-azamat-abdoullaev/?published=t
Chris Mungall and Justin Reese at the Ontology Summit 2022 on COVID-19 Knowledge Graphs
https://guianaplants.stir.ac.uk/seminar/materials/vegantutor.pdf http://www.djcxy.com/p/5222.html https://github.com/kaiiam/kblumberg_masters_thesis/blob/master/Digital_Supplement_Kai_Blumberg_MSC_Thesis/D.S.2/R_scripts/pcoa_analysis/pcoa_analysis.R https://rdrr.io/rforge/vegan/man/biplot.rda.html
https://www.frontiersin.org/articles/10.3389/fmars.2020.00105/full nickel story for marine phytoplankton.
Nickel (Ni) is a bio-essential element required for the growth of phytoplankton.
The generally lower surface concentrations in the NH subtropical gyre compared to the southern hemisphere (SH), might be related to a greater Ni uptake by nitrogen fixers that are stimulated by iron (Fe) deposition.
I bet this might apply to HOT, hence why we see more nickel at the surface.
The distribution of Ni resembles the distribution of cadmium (Cd)
but other elements such as cadmium (Cd), molybdenum (Mo), vanadium (V), and selenium (Se) are important for specific taxonomic groups [e.g. (Morel et al., 2014; De Baar et al., 2018)].
Basically, marine phytoplankton need nickel.
https://www.mbari.org/wp-content/static/chemsensor/ni/nickel.html -> Ni has a nutrient like vertical profile with low concentrations at the surface and values increasing with depth (data).
Manganese: https://www.mbari.org/wp-content/static/chemsensor/mn/manganese.html https://par.nsf.gov/servlets/purl/10179969 https://www.mbari.org/wp-content/static/chemsensor/mn/mngraph.html. It'd be really cool to plot the distributions of the various GO ion binding genes the way people plot nutrient profiles, and see if, for example, manganese follows the same pattern. Doing it with GO_0042301 phosphate ion binding and the HOT phosphate concentration would be awesome. Can try this with others? HOT has nitrate (nitrate and nitrite; could use that to get nitrite), phosphate, and oxygen.
Can also ask the question with ion transport, which would have subclasses like nitrate transport; maybe can do this for the oxic/anoxic story. Can also try plotting oxygen transport or oxygen carrier activity against the O2 distribution for either project.
- phosphate depth distribution against phosphate ion binding; probably more interesting in the HOT story.
- nitrate depth distribution against nitrate transport, or response to nitrate, or others; try both HOT and Tara.
- oxygen depth distribution with oxygen transport, oxygen carrier activity, oxygen binding, or oxygen sensor activity, or subclasses of response to oxygen levels, response to reactive oxygen species, etc. Try these with the Tara O2 story.
If HOT-DOGS had the iron profile I could also do it against ferrous iron binding, or cadmium or manganese or others. Perhaps better not to bring in more new data and to stick with what we have.
Try energy derivation by oxidation of organic compounds or cellular respiration with the Tara oxygen if I haven't already.
ESIP soil talk I gave last week
Notes from meeting with Dava:
* announcement of final defense in GradPath
* journal article form (certification of published manuscript)
* announcement of defense in GradPath, with committee signatures
* final oral defense approval form
* Word announcement flyer for the defense (Dava will upload it to D2L)
Check out https://www.proquest.com/pqdtglobal/advanced for examples of dissertations.
Also check the manual for the dissertation steps, with an outline of what needs to be included.
Aug 8: final submission of the thesis; need revisions from the committee and submit to ProQuest. The oral defense has to be 1-2 weeks before that, with time for revisions.
Date of May 13 for the in-person spring graduation; no summer ceremony (not required).
Meeting with Kathe Todd-Brown
Publications:
1) Soil methods ontology paper
2) data pub similar to https://essd.copernicus.org/articles/12/61/2020/ on integrated data
3) modeling and or statistical/ML analysis of data integrated in 2). Could be scope to do both modeling and ML and compare and contrast the results.
Lifestuff:
Remote ok
Pay: 2 years 60K/yr
Maintain OBO network a plus.
Extras:
Teaching undergrads a pipeline for data harmonization/ontology annotation. Examples https://github.com/ISCN/SOCDRaHR2/tree/master/R see the SOCDRaHR2/R/readCPEAT.R
She's going for https://www.usda.gov/climate-solutions/climate-smart-commodities unlikely in my opinion.
https://dvc.org/doc/start from Bonnie like github but for data.
Humann pipeline. Original version: https://github.com/biobakery/humann
Alise's version: https://github.com/aponsero/Humann_annotation_HPC
https://huttenhower.sph.harvard.edu/humann has concise docs for install and use (https://github.com/biobakery/humann elaborates more); starting with the former.
Doing it in /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann
we don't have pip, so install with conda instead:
# Create a new conda environment for the installation
conda create --name biobakery3 python=3.7
#might need to update conda
conda update -n base -c defaults conda
#activate biobakery3 conda environment
conda activate biobakery3
#Set conda channel priority:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --add channels biobakery
#Install HUMAnN 3.0 software with demo databases (will also automatically install MetaPhlAn 3.0); takes a little while to run
conda install humann -c biobakery
#Test installation: Run HUMAnN unit tests:
humann_test
# Download humann source code (for tests). Get the latest download link from https://pypi.org/project/humann/#files
wget https://files.pythonhosted.org/packages/27/f9/d07bd76dd7dd5732c4d29d58849e96e4828c8a7dc95cf7ae58622f37591a/humann-3.0.1.tar.gz
# Unzip the archive (might have to type this manually instead of copy-pasting it, so the dash comes through as a plain hyphen)
tar -xvzf humann-3.0.1.tar.gz
# install databases done in new folder database
mkdir database
cd database/
#install chocophlan
humann_databases --download chocophlan full .
humann_databases --download uniref uniref90_diamond .
humann_databases --download utility_mapping full .
#move into directory:
cd humann-3.0.1/examples/
#To run test switch to interactive mode
interactive
# reactivate biobakery3 conda environment
conda activate biobakery3
#test failed without database
#Run the HUMAnN demo:
humann -i demo.fastq.gz -o sample_results
after test run without databases:
Running metaphlan ........
CRITICAL ERROR: Error executing: /home/u19/kblumberg/miniconda2/envs/biobakery3/bin/metaphlan /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/tmpr_nv8ivq/tmpnx6snk_d -t rel_ab -o /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bugs_list.tsv --input_type fastq --bowtie2out /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bowtie2.txt
Error message returned from metaphlan :
No MetaPhlAn BowTie2 database found (--index option)!
Expecting location bowtie2db
Exiting..
after installing choco database
Database installed: /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/database/chocophlan
HUMAnN configuration file updated: database_folders : nucleotide = /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/database/chocophlan
When running the demo I still get the bowtie2 error, like in https://forum.biobakery.org/t/no-metaphlan-bowtie2-database-found-index-option/1688, except that fix doesn't work here. Perhaps try Alise's install instructions, which use conda to install pip and then pip to install humann, instead of the other way around.
##Steps Humann install
#conda create --name humann # Didn't work
conda create --name humann python=3.7 # tried this instead
#add this
conda activate humann
conda install pip
conda update -n base -c defaults conda
pip install humann # instead try: `pip install humann --no-binary :all:`
conda install -c bioconda metaphlan
# fix libtbb2 for bowtie2
conda install tbb=2020.2
Output files will be written to: /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results
Decompressing gzipped file ...
Running metaphlan ........
CRITICAL ERROR: Error executing: /home/u19/kblumberg/miniconda2/envs/humann/bin/metaphlan /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/tmpdv78lxao/tmp5icolv89 -t rel_ab -o /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bugs_list.tsv --input_type fastq --bowtie2out /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bowtie2.txt
Error message returned from metaphlan :
Downloading MetaPhlAn database
Please note due to the size this might take a few minutes
File /home/u19/kblumberg/miniconda2/envs/humann/lib/python3.7/site-packages/metaphlan/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.tar already present!
Downloading http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.md5
Downloading file of size: 0.00 MB
MD5 checksums do not correspond! If this happens again, you should remove the database files and rerun MetaPhlAn so they are re-downloaded
Decompressing gzipped file ...
Running metaphlan ........
CRITICAL ERROR: Error executing: /home/u19/kblumberg/miniconda2/envs/humann/bin/metaphlan /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/tmplyauebur/tmpzao7iyhj -t rel_ab -o /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bugs_list.tsv --input_type fastq --bowtie2out /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bowtie2.txt
Error message returned from metaphlan :
No MetaPhlAn BowTie2 database found (--index option)!
Expecting location bowtie2db
Exiting...
Trying again with clean conda environment:
##Steps Humann install
conda create --name humann_kai python=3.7
#Activate
conda activate humann_kai
#Set conda channel priority:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --add channels biobakery
#install pip
conda install pip
# install humann
pip install humann --no-binary :all:
# install metaphlan
conda install -c bioconda metaphlan
# fix libtbb2 for bowtie2
conda install tbb=2020.2
Dissertation 1st draft to Bonnie June 13.
Bonnie will give it back to me by the 3rd/4th week of June (~June 17th).
To committee by June 27th; back to me by July 15th.
Defense the week of the 18th or 25th.
Humann steps:
Run all samples (run with the MetaPhlAn output; there should be a flag for it).
Merge tables, normalize the table to counts per million, and unstratify (see the sketch after these notes).
demo_metaphlan_bugs_list.tsv is the taxonomic profile (like Kraken, but always at species level).
Reinstall the databases and unzip the files prior to running.
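A hedged sketch of the merge/normalize/unstratify steps using the humann utility scripts (the directory and file names are assumptions; check the humann docs for the exact flags):
humann_join_tables --input humann_results/ --output merged_genefamilies.tsv --file_name genefamilies
humann_renorm_table --input merged_genefamilies.tsv --output merged_genefamilies_cpm.tsv --units cpm
humann_split_stratified_table --input merged_genefamilies_cpm.tsv --output unstratified_out/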
Run bowtie2 with the human database and Trim Galore as an initial QC step before running humann. See https://github.com/aponsero/readbased_metagenomes_snakemake. Hopefully this will deal with the gzip issue.
Install bowtie2 and trim_galore into my conda env using bioconda.
Download the human genome for bowtie2:
wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
unzip GRCh38_noalt_as.zip
hdb="databases/GRCh38_noalt_as/GRCh38_noalt_as",
bowtie2 -p 8 -x {params.hdb} -U {input.f1} --un-gz output/{params.bowtiename}
trim_galore -o {trimgalore output folder} --fastqc {input to trimgalore}.fastq.gz
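The three lines above are excerpts from the Snakefile, so the {...} parts are snakemake placeholders; a concrete single-sample version might look like this (file names are hypothetical):
bowtie2 -p 8 -x databases/GRCh38_noalt_as/GRCh38_noalt_as \
  -U sample_R1.fastq.gz --un-gz output/sample_R1_hostremoved.fastq.gz
trim_galore -o trimgalore_out --fastqc output/sample_R1_hostremoved.fastq.gz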
https://github.com/aponsero/readbased_metagenomes_snakemake/blob/main/PBS_pipeline/Snakefile
https://www.ebi.ac.uk/GOA/InterPro2GO http://current.geneontology.org/ontology/external2go/interpro2go https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1160203/ https://www.ebi.ac.uk/GOA/ http://current.geneontology.org/ontology/external2go https://github.com/geneontology/go-site/uniprotkb_kw2go https://wiki.geneontology.org/index.php/Release_Pipeline https://www.ebi.ac.uk/QuickGO/term/GO:0008198
http://berkeleybop.org/index.html
https://www.ontotext.com/products/graphdb/
berkeleybop projects:
### Likely in scope
National Microbiome Data Collaborative (NMDC)
OBO Foundry
Gene Ontology
INCA: Intelligent Concept Assistant (inactive); perhaps some of this ML-cleaned NCBI database work could be brought into NMDC?
### Maybe:
Monarch Initiative -> semantically integrate genotype-phenotype data from many species and sources in order to support precision medicine, disease modeling, and mechanistic exploration
Phenomics First -> part of Monarch, developing tools so that biomedical information about genetic conditions is captured, stored, and exchanged. uPheno
NCATS Biomedical Translator project seeks to “translate” the results of biological research into clinical practice
SymbiOnt -> Augmenting and merging ontologies using ontology mappings and knowledge graph embeddings
### Less interesting to me:
Exomiser -> tool that finds potential disease-causing variants from whole-exome or whole-genome sequencing data part of Monarch
IDG2KG -> Illuminating the Druggable Genome
Alliance for Genome Resources -> model organisms to contribute to human health.
KG-COVID-19 -> COVID knowledge graph
GMOD -> Generic Model Organism Database
CCDH -> cancer harmonization (Not currently active)
https://github.com/cidgoh/DataHarmonizer from damion
https://asm.org/ASM/media/Academy/Academy%20Reports/Microbes-Climate-Change-Science,-People-Impacts-Report.pdf from Chris for NMDC.
Bland Altman analysis https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4470095/, https://rss.onlinelibrary.wiley.com/doi/abs/10.2307/2987937
A guide to appropriate use of Correlation coefficient in medical research
For Heidi
./count.sh ../../heidi/karnes_metagenomes/*.fastq.gz
../../heidi/karnes_metagenomes/Karnes-10-1_UA-NGsp-fastq_Karnes-C11-U035_Karnes-C11_S285_R1_001.fastq.gz 11G
Number of reads: 135756044
Number of bases in reads: 20311864640
../../heidi/karnes_metagenomes/Karnes-10-2_UA-NGsp-fastq_Karnes-C12-U036_Karnes-C12_S286_R1_001.fastq.gz 9.8G
Number of reads: 125171637
Number of bases in reads: 18750286185
../../heidi/karnes_metagenomes/Karnes-1-1_UA-NGsp-fastq_Karnes-F01-U061_Karnes-F01_S267_R1_001.fastq.gz 11G
Number of reads: 130472127
Number of bases in reads: 19453178652
../../heidi/karnes_metagenomes/Karnes-1-2_UA-NGsp-fastq_Karnes-F02-U062_Karnes-F02_S268_R1_001.fastq.gz 12G
Number of reads: 159006650
Number of bases in reads: 23401581189
../../heidi/karnes_metagenomes/Karnes-2-1_UA-NGsp-fastq_Karnes-F03-U063_Karnes-F03_S269_R1_001.fastq.gz 16G
Number of reads: 199491669
Number of bases in reads: 29728724644
| sample | Size (G) | Number of reads | Number of bases in reads | Finished in original run |
| --- | --- | --- | --- | --- |
| Karnes-10-1_UA-NGsp-fastq_Karnes-C11-U035_Karnes-C11_S285_R1_001.fastq.gz | 11 | 135756044 | 20311864640 | Yes |
| Karnes-10-2_UA-NGsp-fastq_Karnes-C12-U036_Karnes-C12_S286_R1_001.fastq.gz | 9.8 | 125171637 | 18750286185 | Yes |
| Karnes-1-1_UA-NGsp-fastq_Karnes-F01-U061_Karnes-F01_S267_R1_001.fastq.gz | 11 | 130472127 | 19453178652 | no |
| Karnes-1-2_UA-NGsp-fastq_Karnes-F02-U062_Karnes-F02_S268_R1_001.fastq.gz | 12 | 159006650 | 23401581189 | no |
| Karnes-2-1_UA-NGsp-fastq_Karnes-F03-U063_Karnes-F03_S269_R1_001.fastq.gz | 16 | 199491669 | 29728724644 | no |
Going with 20 million reads as the subsample size (4 FASTQ lines per read × 20 million reads = 80,000,000 lines) -> gunzip -c $INPUT_DIR/$SMPLE | head -n 80000000 | gzip > subsample/$SMPLE
Humann results:
pathabundance.tsv is MetaCyc; genefamilies.tsv is UniRef90. Can presumably use either for the functional rarefaction curve.
cut -f 1 Karnes-7-1_UA-NGsp-fastq_Karnes-B09-U021_Karnes-B09_S279_R1_001_trimmed_pathabundance.tsv | sort | uniq | wc -l
cut -f 1 Karnes-9-2_UA-NGsp-fastq_Karnes-C10-U034_Karnes-C10_S284_R1_001_trimmed_genefamilies.tsv | sort | uniq | wc -l
#inside temp folder(s)
cut -f 1 Karnes-9-2_UA-NGsp-fastq_Karnes-C10-U034_Karnes-C10_S284_R1_001_trimmed_metaphlan_bugs_list.tsv | sort | uniq | wc -l
./bash/wgs_annotation_data.sh: line 19: tarql: command not found
./bash/planet_microbe_data.sh: line 15: jq: command not found
./bash/planet_microbe_data.sh: line 116: python3: command not found
./bash/download_ontologies.sh: line 8: wget: command not found
./bash/create_triplestore.sh: line 17: tdb2.tdbloader: command not found
loading triplestore on tecti server Finished: 13 files: 42,656,343 tuples in 669.94s (Avg: 63,671)
Time = 906.027 seconds : Triples = 42,656,343 : Rate = 47,081 /s
Moved the server version of the repo to /opt/planet-microbe-semantic-web-analysis, a more appropriate place to keep it than my user folder, so that others can use it.
https://arizona.zoom.us/my/kaiblumberg
Meeting with Adam Michel
new endpoint: http://sparql.planetmicrobe.org/
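A quick smoke test of the public endpoint from my laptop (the /ds dataset path is an assumption; substitute the actual dataset name):
curl -G 'http://sparql.planetmicrobe.org/ds/sparql' \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }'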
The following commands (run with sudo) apply to the server:
#See status
systemctl status fuseki
#restart server
systemctl restart fuseki
# Output log
journalctl -xa -u fuseki -f
# Status of proxy
systemctl status nginx
will send log to `/var/log/nginx/access.log`
How it works: nginx is listening on port 80 and reverse proxying from local port 3030 to the public URL; the relevant config is `/etc/nginx/conf.d/fuseki.conf`. He had to change some group permissions on files in the repo to get it to work.
Fuseki runs as a systemd service, so the init process starts it as one of its daemons and it's now part of startup; `/etc/systemd/system/fuseki.service` contains everything the system daemon needs to start the Fuseki daemon. systemd is the init system in use on this Red Hat server.
https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html https://jena.apache.org/documentation/fuseki2/fuseki-webapp.html#fuseki-standalone-server https://stackoverflow.com/questions/31927012/disable-only-unauthenticated-adding-of-datasets-to-fuseki https://jena.apache.org/documentation/fuseki2/fuseki-security.html
http://ontology.buffalo.edu/philosophome/index_files/philosophome.html
From Damion: he's looking for USDA funding to help with https://fdc.nal.usda.gov/
From Alise: The Ocean Gene Atlas v2.0: online exploration of the biogeography and phylogeny of plankton genes. Should cite this in paper 3.
Kraken2 database from https://benlangmead.github.io/aws-indexes/k2; we used the standard one: archaea, bacteria, viral, plasmid, human, UniVec_Core.
Version from 5/17/2021
If Heidi wants more accurate taxonomic profiling, Alise would suggest using Kraken along with the HumGut database (https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01114-w#Sec10), but MetaPhlAn should be pretty good for most of the bugs.
https://www.linkedin.com/in/williamhsiao/ https://cidgoh.ca/ https://genepio.org/ https://github.com/GenEpiO/genepio https://irida.ca/ https://github.com/Public-Health-Bioinformatics
To merge the MetaPhlAn 3 results (from the Humann pipeline), merge_metaphlan_tables.py is in the conda environment.
merge_metaphlan_tables.py *_bugs_list.tsv > merged_abundance_table.txt
To make the merged taxa table, from inside the humann_results
folder, run the following:
mkdir bug_list
cp Karnes-10-1_UA-NGsp-fastq_Karnes-C11-U035_Karnes-C11_S285_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-10-2_UA-NGsp-fastq_Karnes-C12-U036_Karnes-C12_S286_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-1-1_UA-NGsp-fastq_Karnes-F01-U061_Karnes-F01_S267_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-1-2_UA-NGsp-fastq_Karnes-F02-U062_Karnes-F02_S268_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-2-1_UA-NGsp-fastq_Karnes-F03-U063_Karnes-F03_S269_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-2-2_UA-NGsp-fastq_Karnes-F04-U064_Karnes-F04_S270_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-3-1_UA-NGsp-fastq_Karnes-F05-U065_Karnes-F05_S271_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-3-2_UA-NGsp-fastq_Karnes-F06-U066_Karnes-F06_S272_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-4-1_UA-NGsp-fastq_Karnes-F07-U067_Karnes-F07_S273_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-4-2_UA-NGsp-fastq_Karnes-F08-U068_Karnes-F08_S274_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-5-1_UA-NGsp-fastq_Karnes-F09-U069_Karnes-F09_S275_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-5-2_UA-NGsp-fastq_Karnes-F10-U070_Karnes-F10_S276_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-6-1_UA-NGsp-fastq_Karnes-F11-U071_Karnes-F11_S277_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-6-2_UA-NGsp-fastq_Karnes-F12-U072_Karnes-F12_S278_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-7-1_UA-NGsp-fastq_Karnes-B09-U021_Karnes-B09_S279_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-7-2_UA-NGsp-fastq_Karnes-B10-U022_Karnes-B10_S280_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-8-1_UA-NGsp-fastq_Karnes-B11-U023_Karnes-B11_S281_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-8-2_UA-NGsp-fastq_Karnes-B12-U024_Karnes-B12_S282_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-9-1_UA-NGsp-fastq_Karnes-C09-U033_Karnes-C09_S283_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-9-2_UA-NGsp-fastq_Karnes-C10-U034_Karnes-C10_S284_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
merge_metaphlan_tables.py bug_list/*_bugs_list.tsv > merged_abundance_table.txt
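Note for next time: assuming all the temp folders follow the same *_humann_temp naming, the twenty cp lines above could probably be collapsed into a single glob:
cp *_humann_temp/*_bugs_list.tsv bug_list/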
https://github.com/biobakery/biobakery/wiki/metaphlan3 has some cool ideas on what to do with the metaphlan3 results.
From Jim Balhoff
Imageomics Institute (https://imageomics.osu.edu/). The Phenoscape team is involved in providing ontology expertise to Imageomics, in the form of using ontologies as a form of structured knowledge input to machine learning analyses, and also using ontologies as a knowledge representation tool for the outputs of machine learning analyses. Determining exactly how ontologies will be employed in Imageomics is in the formative stage.
Thoughts:
Not a totally new problem; people have been doing similar things, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5556681/. Presumably it should be possible (or has already been done) for a multi-class neural network (or similar ML model), given sufficient training data, to classify taxonomic assignments (at some level of granularity) from images.
Assuming the above works, we should be able to get a dataset of images of some known taxonomic rank, e.g. birds, or a specific bird of interest like a red-breasted robin.
I remember from my ML courses that when doing image analysis it should be possible to define features; I remember the example of ears and eyes being used as features within images for the classic dog-or-blueberry-muffin
classifier. Assuming that's possible, or that tools for defining features already exist (e.g. something like https://www.keymakr.com/blog/image-annotation-tool-for-your-machine-learning-application/), we could use them to manually annotate anatomical features (draw them onto a subset of our photo corpus), doing so with OBO ontology terms (namely UBERON), e.g. 'chin' (UBERON:0008199) or 'face' (UBERON:0001456). These "features" within ML models of marked-up images of anatomical body parts with ontology annotations could constitute the "computable traits". If that works we could use the intersection of the ontology and the data (photos with these ontology term annotations) to get more out of the data. For example, we could use the ontology marked-up features within new ML models to identify said features in more photos. Perhaps start by doing this for some species, then maybe move up to larger but related taxonomic groupings, and see if it's producing reasonable results. If so we could do some analyses on traits across or within species by quantifying something like the length of 'tail feather' (UBERON:0018537), either as a quantified size (if that's possible to do from photos) or relativized against some other feature to be able to make comparisons.
Additionally, we might be able to use the intersection of the ontology and data to validate that the model we make is making "correct" assignments, analogously to using a reasoner to check an ontology for inconsistencies.
For example 'chin'(UBERON:0008199) has the following axioms:
subdivision of head
structure with developmental contribution from neural crest
part of some lower jaw region
part of some face
If we can somehow translate some of these axioms (like the part-of relations) into a computable framework, then we can use the ontology axioms as rules to check whether the outputs make sense, i.e. the bounding box feature of the chin has to be within the bounding box feature of 'face' (UBERON:0001456).
Committee wants:
1) Include the known and unknown fractions of the community. I should have this; Alise mentioned it. % known and unknown for taxa and function. Hopefully we have this.
- Add supplemental protocols for the final publication (was planning on doing this anyway with protocols.io docs).
Nice to have: can try for a perspective paper, but not required for the dissertation.
Meeting with Jim/Hilmar
imageomics -> 5-year NSF institute; NLP, ML, biologists, trait evolution
Funding started last fall; things are slowly picking up. The proposal has funding to continue some work on Phenoscape, bringing ontologies to the ML project.
phenoscape: making ontologies useful
- knowledge-guided ML: how can UBERON help ID structures in image data; some work with fish and butterflies
- using ontologies as a vocabulary, and whether they can play a role in ML; outputs of ML tasks: what do the outputs look like, how are they stored and marked up with ontologies
Funding:
1st year full-time postdoc, half-time after
BGNN (biology-guided neural networks) -> can use structured bio knowledge to improve ML using deep-learning NNs; Paula Mabee, Jane Greenberg (metadata researcher group director) (and one other PI), Virginia Tech. Harness structured bio knowledge, e.g. knowledge of the relatedness of taxa in a taxonomy, and try to use it to inform the NN.
imageomics expands this to more bio images. Extracting traits from images or other bio knowledge: image -> deep-learning NN -> info on traits -> use other bio knowledge to guide it. Recycle the outputs of ML into improved knowledge and ML. Feed discovered traits or extracted knowledge back in?
using loss functions with ontologies?
Jane Greenberg, metadata group director -> not officially part of the project, so there's a gap in terms of taking a lead on metadata: what do we want to know and express using ontologies; at least that's what Hilmar wants.
https://spoke.ucsf.edu/
OBOE SSN LPG
Meeting Kyle McKillop USDA (with Damion and Will too)
Food data central (FDC)
foundation food -> dataset
methods and food comp lab
food data team (Kyle) making a new lab: health of humans and environment. Merging with the human studies facility to do things like specific diets (people eating nuts or avocados, measuring form of food chews, metabolisable energy). Push to use ontologies in USDA.
drafted letter of support for CDNO
me to help advise and bring in research questions
USDA components maybe 20 years out of date
issues with method and components
Colin K and phytochemical matching
portfolio pages data, ontology layer describes food -> want to get info to users
lab analytic data label data.
PTFI chemical compound data: no IDs for them in the incoming data.
read what is food data central
USDA's FoodData Central: what is it and why is it needed today? https://academic.oup.com/ajcn/article/115/3/619/6459205
food and nutrition services, child nutrition service, ERS, what is sustainable nutritional intake for food stamps etc., ARS
lack of standardization across USDA, using things like NCBITaxon
1) median first quartile or box plot by project
2) make a first dummy protocol for the work and add it to the thesis version
3) send out signatures page
4) results of oral dissertation in gradpath (go check gradpath)
Upload the final dissertation through the graduate college; it has margin checks; maybe ask Chunan.
Make some notes about how to graduate; make it a shared doc with all students
-> capture stuff like the 2-weeks-before requirements
https://arxiv.org/abs/2207.02056 Ontology Development Kit: a toolkit for building, maintaining, and standardising biomedical ontologies preprint.
https://github.com/mszep/pandoc_resume
Matt Miller ORCID: https://orcid.org/0000-0002-3491-8763