2022_log

Kai Blumberg edited this page Aug 19, 2022 · 121 revisions

01.05

The Zenodo API could work, but it is a bit complex.

To upload, could try https://github.com/jhpoelen/zenodo-upload. It wasn't necessary: in the end I just uploaded the files from my computer after scp'ing them from UA HPC.

Doi: https://zenodo.org/record/5821976

To retrieve a public Zenodo dataset we don't need anything fancy: just copy the download link and curl it with an output file name, as in this example from this Zenodo page (the URL is quoted so the shell doesn't try to interpret the ?):

curl -o test.pdf "https://zenodo.org/record/5068997/files/Event%20metadata.pdf?download=1"

This works for mine as well, e.g.:

curl -o sample_metadata.tsv "https://zenodo.org/record/5821976/files/sample_metadata.tsv?download=1"
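For use inside the Python scripts, the same download could be sketched roughly as below. This assumes the URL pattern seen in the two curl examples (record ID, then the percent-encoded file name, then ?download=1); the function names are made up for illustration and are not part of any Zenodo client.

```python
from urllib.parse import quote
from urllib.request import urlretrieve


def zenodo_file_url(record_id, filename):
    """Build the direct-download URL, following the pattern of the curl examples."""
    # quote() percent-encodes spaces, e.g. "Event metadata.pdf" -> "Event%20metadata.pdf".
    return "https://zenodo.org/record/{}/files/{}?download=1".format(
        record_id, quote(filename)
    )


def download_zenodo_file(record_id, filename, dest):
    """Download one public file from a Zenodo record to dest (needs network access)."""
    urlretrieve(zenodo_file_url(record_id, filename), dest)
    return dest


if __name__ == "__main__":
    # Only prints the URLs; call download_zenodo_file(...) to actually fetch a file.
    print(zenodo_file_url("5068997", "Event metadata.pdf"))
    print(zenodo_file_url("5821976", "sample_metadata.tsv"))
```

Calling download_zenodo_file("5821976", "sample_metadata.tsv", "sample_metadata.tsv") should then be equivalent to the second curl command.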

01.06

Login for lytic with UA netID and pass:

Need to be on the UA VPN, and for some reason it's using my old UA netID password (see enpass).

01.10

BCODMO:

https://lucid.app/lucidchart/d250366b-b450-416a-b6ec-06b1f58ca18e/edit?invitationId=inv_d4679598-4b02-4326-aae0-7ee3064f7df4&page=-Yhi-Y3pdZSY#?referringapp=slack

Try to make cleaner owl outputs for the patterns I've started trying to play with in /Users/kai/Desktop/scratch/BCODMO/model.

01.11

Install notes

Requires Java 1.8 or above.

Option 1) Trying this

From https://tarql.github.io/docs/ links to https://github.com/tarql/tarql/releases

wget https://github.com/tarql/tarql/releases/download/v1.2/tarql-1.2.tar.gz
tar -xvzf tarql-1.2.tar.gz

Had to type this manually; copying from the text editor didn't work, probably because the pasted dash was an en dash (–) rather than a plain hyphen (-).

This seems to have worked. Now test TARQL.

location of tarql /home/u19/kblumberg/tarql-1.2/bin/tarql

/home/u19/kblumberg/tarql-1.2/bin/tarql sample-2.sparql TechCrunchcontinentalUSA.csv

This works. Now add /home/u19/kblumberg/tarql-1.2/bin to the PATH.

vim ~/.bash_profile

# Tarql (add this line to the bash_profile)
export PATH=$PATH:/home/u19/kblumberg/tarql-1.2/bin

source ~/.bash_profile

Now tarql works from anywhere, e.g. tarql sample-2.sparql TechCrunchcontinentalUSA.csv

Option 2) Didn't finish this; went with option 1

git clone https://github.com/cygri/tarql

If you don't have mvn, you need to install it:

see https://maven.apache.org/install.html

Tarql step 2

mvn clean install -DskipTests

Downloaded apache-jena-X.X.X.tar.gz from https://jena.apache.org/download/index.cgi

wget https://dlcdn.apache.org/jena/binaries/apache-jena-4.3.2.tar.gz
tar -xvzf apache-jena-4.3.2.tar.gz

Add the following to the bash_profile:

# Apache Jena
export JENA_HOME=/home/u19/kblumberg/apache-jena-4.3.2
export PATH=$PATH:$JENA_HOME/bin

Testing the Apache Jena install: tdb2.tdbloader gives a Java error. Need to install Java 12 without sudo access. Made several attempts; the following finally worked.

From openjdk https://jdk.java.net/archive/ use this (note the similar instructions from https://opensource.com/article/19/11/install-java-linux)

wget https://download.java.net/java/GA/jdk12.0.2/e482c34c86bd4bf8b56c0b35558996b9/10/GPL/openjdk-12.0.2_linux-x64_bin.tar.gz
tar -xvzf openjdk-12.0.2_linux-x64_bin.tar.gz

Add the following to the bash_profile after consulting this page

# Java12
export JAVA_12_HOME=/home/u19/kblumberg/bin/jdk-12.0.2/
export PATH=$JAVA_12_HOME/bin:$PATH
## if these are commented out java8 becomes the default

Apache Jena Fuseki

See downloads page

get the latest apache-jena-fuseki-X.X.X.tar.gz

wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-4.3.2.tar.gz
tar -xvzf apache-jena-fuseki-4.3.2.tar.gz

Add the following to the bash_profile:

# Apache fuseki
export FUSEKI_HOME=/home/u19/kblumberg/apache-jena-fuseki-4.3.2
export PATH=$PATH:$FUSEKI_HOME

It works: as a test I ran fuseki-server and it started.

Next steps: see http://loopasam.github.io/jena-doc/documentation/serving_data/ for the Fuseki server. If I can get it to work over HTTP, try to mix in examples from https://github.com/apache/jena/tree/main/jena-fuseki2/examples and https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html

Also see https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html https://github.com/JPL-IMCE/gov.nasa.jpl.imce.ontologies.fuseki https://medium.com/@rrichajalota234/how-to-apache-jena-fuseki-3-x-x-1304dd810f09 https://gist.github.com/afs/63a80512cdc55caf77d0

https://managewp.com/blog/how-to-access-a-local-website-from-internet-with-port-forwarding and https://www.linuxandubuntu.com/home/how-to-setup-a-web-server-and-host-website-on-your-own-linux-computer

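Based on ip route and ifconfig, 10.140.114.14 should be lytic's IP. Note that 10.x.x.x addresses are in a private range, so this address is only reachable from inside the network (e.g. over the VPN), not from the public internet.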

01.12

Stack overflow post about access-localhost-from-the-internet

The above leads to https://ngrok.com/, which provides public URLs for exposing a local web server. Looks like I might be able to use it for free, but the URL will change when the server stops and starts. In principle that's not the worst thing: once it's configured, just set it up once and never turn it off. However, according to this page, "If you are not a paid user of ngrok then your ngrok session expires in 8 hours."

Also links to http://localtunnel.me/. Sounds similar, but seems more open source (no paid plans). Same restriction: you receive a URL which remains active while your local instance is up. Requires NodeJS, so need to install npm. See https://docs.npmjs.com/downloading-and-installing-node-js-and-npm and https://github.com/nodesource/distributions on how to install npm; might need sudo to do it though, TBD. This seems very simple if I can get npm installed. Lytic has npm! EXCELLENT! Tried npm install -g localtunnel, but I need sudo to run this. Could maybe ask Bonnie or Matt. Can perhaps try this on a CyVerse VM where I have sudo and see if it works to expose Fuseki.

Also links to http://localhost.run/, which uses ssh; a bit more complicated than localtunnel.

See https://www.softwaretestinghelp.com/ngrok-alternatives/ for more including Serveo

Trying localtunnel on cyverse ATM image.

Installing npm with sudo works.

sudo npm install -g localtunnel -> fails with Error: Missing required argument #1

From here try:

sudo npm install -g n
sudo n latest

Now sudo npm install -g localtunnel worked.

Install fuseki and test local tunnel. To test:

fuseki-server ## or 
./fuseki-server & # gave PID 29767
lt --port 3030 # http://localhost:3030/ -> your url is: https://heavy-mouse-68.loca.lt

Going to the above url I'm seeing the apache jena fuseki server!!!!!!! HELLL YAAAA!

Kill the process by its PID: kill -15 29767
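To check more than the landing page, a SPARQL query could be sent through the tunnel from any machine. A minimal sketch using only the standard library follows; the dataset name "ds" is a placeholder (use whatever dataset the server was started with), and the function names are made up for illustration. It assumes the standard SPARQL protocol (GET with a query parameter) on the dataset's sparql service.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_sparql_request(base_url, dataset, query):
    """Build a SPARQL-protocol GET request for <base_url>/<dataset>/sparql."""
    url = "{}/{}/sparql?{}".format(
        base_url.rstrip("/"), dataset, urlencode({"query": query})
    )
    # Ask for the JSON results format so the response is easy to parse.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_select(base_url, dataset, query, timeout=60):
    """Send a SELECT query and return the list of result bindings (needs network)."""
    request = build_sparql_request(base_url, dataset, query)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    # "ds" is a hypothetical dataset name; the tunnel URL changes on every restart.
    req = build_sparql_request(
        "http://localhost:3030", "ds", "SELECT * WHERE { ?s ?p ?o } LIMIT 5"
    )
    print(req.full_url)
```

Swapping http://localhost:3030 for the temporary loca.lt URL should exercise the tunnel rather than the local port.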

ESIP paper

01.18

https://ontologforum.org/index.php/KaiBlumberg

ESIP

https://gleaner.io/ https://oceannetworks.ca/about-us https://github.com/iodepo/odis-arch https://book.oceaninfohub.org/ https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#metadata https://search.dataone.org/portals/toolik https://validator.schema.org/ http://knowwheregraph.org/prototypes/geoEnrichment/ (Mark S)

01.20

No longer need triplestore/config/planet_microbe/supplemental_construct.sparql or triplestore/planet_microbe_datatsv_to_triplify/qualitative_attributes_other.tsv; instead get this info and more from triplestore/wgs_annotation_data/sample_metadata.tsv. Then I can remove the last output block from planet_microbe_qualitative.py.

3:35: ran the wgs_annotation_data part of the prepare triplestore script. Finished by 4:00.

ESIP

https://knowledgestructure.pubpub.org/space

Resources for this session (link supplementary docs or presentations here) We will add links to shared slides here, we have also added some links of interest here.
ESIP Semantic Technologies Committee: RDF vs Property Graph, from the Global Data Geeks YouTube channel. Graph DBs: https://neo4j.com/, TinkerPop (TinkerPop DBs), SPARQL (triplestores). Neo4j model, TinkerPop, RDF, GQL. Emerging work on RDF and SPARQL "star": https://w3c.github.io/rdf-star/ to blend RDF and property graph approaches. KGLab: KG to ML toolkit.

https://www.researchgate.net/publication/354291659_Aligning_Standards_Communities_Sustainable_Darwin_Core_MIxS_Interoperability

https://ontologforum.s3.amazonaws.com/OntologySummit2020/Communique/OntologySummit2020Communique.pdf

01.21

https://github.com/charlesvardeman/fuseki-geosparql-docker check this out to see if I can use it for my PhD triplestore.

https://github.com/topics/obofoundry

From Doug Fils https://caddyserver.com/ A simple Caddyfile config for the above might be as easy as:

(cors) {
	@origin header Origin *
	header @origin Access-Control-Allow-Origin *
	header @origin Access-Control-Request-Method *
}

example.org {
	import cors
	reverse_proxy localhost:3030
}

01.28

./bash/planet_microbe_data.sh: line 15: jq: command not found
...

SyntaxError: invalid syntax
  File "../scripts//planet_microbe_numeric.py", line 56
    print(msg, file=sys.stderr)
                   ^

Make sure to install jq. The SyntaxError suggests the default python is Python 2, so I'll update the bash script to call Python 3 explicitly.
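A small preflight check along these lines could catch both problems before the pipeline starts. This is only a sketch: the function name and the default tool list are assumptions, and it is written so that it still parses under Python 2, where the version check is what matters.

```python
import shutil
import sys


def preflight(tools=("jq",)):
    """Return a list of human-readable problems; an empty list means all good."""
    # Check the interpreter first: shutil.which only exists from Python 3.3 on.
    if sys.version_info < (3, 3):
        return ["Python 3.3+ is required; found {}.{}".format(*sys.version_info[:2])]
    # shutil.which returns None when a command is not on PATH.
    return [
        "{}: command not found".format(tool)
        for tool in tools
        if shutil.which(tool) is None
    ]


if __name__ == "__main__":
    problems = preflight()
    for problem in problems:
        print(problem)
    # A non-zero exit code lets a calling bash script stop early.
    sys.exit(1 if problems else 0)
```

The bash script could run this first and stop on a non-zero exit code.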

Stats on Loading my PhD triplestore:

06:10:08 INFO  loader          :: Loader = LoaderPhased
06:10:08 INFO  loader          :: Start: 13 files
06:10:15 INFO  loader          :: Add: 1,000,000 merged_go_data.ttl (Batch: 139,899 / Avg: 139,899)
06:10:20 INFO  loader          :: Add: 2,000,000 merged_go_data.ttl (Batch: 186,462 / Avg: 159,859)
06:10:29 INFO  loader          :: Add: 3,000,000 merged_go_data.ttl (Batch: 114,955 / Avg: 141,442)
06:10:32 INFO  loader          ::   End file: merged_go_data.ttl (triples/quads = 3,391,879)
06:10:39 INFO  loader          :: Add: 4,000,000 merged_interpro_data.ttl (Batch: 93,405 / Avg: 125,328)
06:10:51 INFO  loader          :: Add: 5,000,000 merged_interpro_data.ttl (Batch: 89,919 / Avg: 116,179)
06:11:03 INFO  loader          :: Add: 6,000,000 merged_interpro_data.ttl (Batch: 82,603 / Avg: 108,808)
06:11:13 INFO  loader          :: Add: 7,000,000 merged_interpro_data.ttl (Batch: 93,799 / Avg: 106,376)
06:11:26 INFO  loader          :: Add: 8,000,000 merged_interpro_data.ttl (Batch: 79,396 / Avg: 102,042)
06:11:37 INFO  loader          :: Add: 9,000,000 merged_interpro_data.ttl (Batch: 86,460 / Avg: 100,038)
06:11:50 INFO  loader          :: Add: 10,000,000 merged_interpro_data.ttl (Batch: 80,619 / Avg: 97,685)
06:11:50 INFO  loader          ::   Elapsed: 102.37 seconds [2022/01/28 06:11:50 MST]
06:12:02 INFO  loader          :: Add: 11,000,000 merged_interpro_data.ttl (Batch: 85,026 / Avg: 96,381)
06:12:13 INFO  loader          :: Add: 12,000,000 merged_interpro_data.ttl (Batch: 85,492 / Avg: 95,369)
06:12:25 INFO  loader          :: Add: 13,000,000 merged_interpro_data.ttl (Batch: 84,817 / Avg: 94,465)
06:12:36 INFO  loader          :: Add: 14,000,000 merged_interpro_data.ttl (Batch: 89,166 / Avg: 94,065)
06:12:47 INFO  loader          :: Add: 15,000,000 merged_interpro_data.ttl (Batch: 96,983 / Avg: 94,254)
06:12:58 INFO  loader          :: Add: 16,000,000 merged_interpro_data.ttl (Batch: 86,460 / Avg: 93,726)
06:13:09 INFO  loader          :: Add: 17,000,000 merged_interpro_data.ttl (Batch: 92,225 / Avg: 93,637)
06:13:19 INFO  loader          :: Add: 18,000,000 merged_interpro_data.ttl (Batch: 98,697 / Avg: 93,904)
06:13:31 INFO  loader          :: Add: 19,000,000 merged_interpro_data.ttl (Batch: 81,806 / Avg: 93,179)
06:13:39 INFO  loader          ::   End file: merged_interpro_data.ttl (triples/quads = 16,193,847)
06:13:43 INFO  loader          :: Add: 20,000,000 merged_ncbitaxon_data.ttl (Batch: 86,400 / Avg: 92,815)
06:13:43 INFO  loader          ::   Elapsed: 215.48 seconds [2022/01/28 06:13:43 MST]
06:13:54 INFO  loader          :: Add: 21,000,000 merged_ncbitaxon_data.ttl (Batch: 93,580 / Avg: 92,851)
06:14:07 INFO  loader          :: Add: 22,000,000 merged_ncbitaxon_data.ttl (Batch: 77,507 / Avg: 92,023)
06:14:17 INFO  loader          :: Add: 23,000,000 merged_ncbitaxon_data.ttl (Batch: 99,472 / Avg: 92,323)
06:14:28 INFO  loader          :: Add: 24,000,000 merged_ncbitaxon_data.ttl (Batch: 88,315 / Avg: 92,149)
06:14:40 INFO  loader          :: Add: 25,000,000 merged_ncbitaxon_data.ttl (Batch: 84,968 / Avg: 91,839)
06:14:51 INFO  loader          :: Add: 26,000,000 merged_ncbitaxon_data.ttl (Batch: 92,618 / Avg: 91,868)
06:15:02 INFO  loader          :: Add: 27,000,000 merged_ncbitaxon_data.ttl (Batch: 83,920 / Avg: 91,547)
06:15:14 INFO  loader          :: Add: 28,000,000 merged_ncbitaxon_data.ttl (Batch: 88,167 / Avg: 91,422)
06:15:25 INFO  loader          :: Add: 29,000,000 merged_ncbitaxon_data.ttl (Batch: 87,950 / Avg: 91,298)
06:15:38 INFO  loader          :: Add: 30,000,000 merged_ncbitaxon_data.ttl (Batch: 78,131 / Avg: 90,788)
06:15:38 INFO  loader          ::   Elapsed: 330.44 seconds [2022/01/28 06:15:38 MST]
06:15:49 INFO  loader          :: Add: 31,000,000 merged_ncbitaxon_data.ttl (Batch: 91,810 / Avg: 90,820)
06:15:59 INFO  loader          :: Add: 32,000,000 merged_ncbitaxon_data.ttl (Batch: 94,544 / Avg: 90,932)
06:16:12 INFO  loader          :: Add: 33,000,000 merged_ncbitaxon_data.ttl (Batch: 79,189 / Avg: 90,526)
06:16:23 INFO  loader          :: Add: 34,000,000 merged_ncbitaxon_data.ttl (Batch: 93,571 / Avg: 90,612)
06:16:33 INFO  loader          :: Add: 35,000,000 merged_ncbitaxon_data.ttl (Batch: 95,265 / Avg: 90,739)
06:16:46 INFO  loader          :: Add: 36,000,000 merged_ncbitaxon_data.ttl (Batch: 76,487 / Avg: 90,272)
06:16:58 INFO  loader          :: Add: 37,000,000 merged_ncbitaxon_data.ttl (Batch: 84,459 / Avg: 90,104)
06:17:10 INFO  loader          :: Add: 38,000,000 merged_ncbitaxon_data.ttl (Batch: 87,374 / Avg: 90,030)
06:17:22 INFO  loader          :: Add: 39,000,000 merged_ncbitaxon_data.ttl (Batch: 80,919 / Avg: 89,771)
06:17:32 INFO  loader          :: Add: 40,000,000 merged_ncbitaxon_data.ttl (Batch: 100,745 / Avg: 90,016)
06:17:32 INFO  loader          ::   Elapsed: 444.36 seconds [2022/01/28 06:17:32 MST]
06:17:43 INFO  loader          :: Add: 41,000,000 merged_ncbitaxon_data.ttl (Batch: 87,191 / Avg: 89,945)
06:17:56 INFO  loader          :: Add: 42,000,000 merged_ncbitaxon_data.ttl (Batch: 81,162 / Avg: 89,714)
06:17:58 INFO  loader          ::   End file: merged_ncbitaxon_data.ttl (triples/quads = 22,542,188)
06:17:58 INFO  loader          ::   End file: sample_metadata.ttl (triples/quads = 4,914)
06:17:58 INFO  loader          ::   End file: broad_scale_data.ttl (triples/quads = 2,457)
06:17:58 INFO  loader          ::   End file: local_scale_data.ttl (triples/quads = 2,457)
06:17:58 INFO  loader          ::   End file: material_data.ttl (triples/quads = 2,457)
06:17:58 INFO  loader          ::   End file: merged_numeric_data.ttl (triples/quads = 47,230)
06:17:58 INFO  loader          ::   End file: spatiotemporal.ttl (triples/quads = 4,914)
06:17:58 INFO  loader          ::   End file: supplemental_data.ttl (triples/quads = 2,457)
06:18:07 INFO  loader          :: Add: 43,000,000 go.owl (Batch: 84,889 / Avg: 89,595)
06:18:15 INFO  loader          ::   End file: go.owl (triples/quads = 1,425,490)
06:18:25 INFO  loader          :: Add: 44,000,000 ncbitaxon.owl (Batch: 58,387 / Avg: 88,520)
06:18:43 INFO  loader          :: Add: 45,000,000 ncbitaxon.owl (Batch: 55,682 / Avg: 87,375)
06:18:56 INFO  loader          :: Add: 46,000,000 ncbitaxon.owl (Batch: 74,432 / Avg: 87,046)
06:19:11 INFO  loader          :: Add: 47,000,000 ncbitaxon.owl (Batch: 65,915 / Avg: 86,456)
06:19:28 INFO  loader          :: Add: 48,000,000 ncbitaxon.owl (Batch: 57,954 / Avg: 85,579)
06:19:45 INFO  loader          :: Add: 49,000,000 ncbitaxon.owl (Batch: 61,218 / Avg: 84,890)
06:20:01 INFO  loader          :: Add: 50,000,000 ncbitaxon.owl (Batch: 59,687 / Avg: 84,179)
06:20:01 INFO  loader          ::   Elapsed: 593.97 seconds [2022/01/28 06:20:01 MST]
06:20:17 INFO  loader          :: Add: 51,000,000 ncbitaxon.owl (Batch: 63,275 / Avg: 83,637)
06:20:36 INFO  loader          :: Add: 52,000,000 ncbitaxon.owl (Batch: 52,364 / Avg: 82,687)
06:20:56 INFO  loader          :: Add: 53,000,000 ncbitaxon.owl (Batch: 51,316 / Avg: 81,745)
06:21:13 INFO  loader          :: Add: 54,000,000 ncbitaxon.owl (Batch: 57,208 / Avg: 81,100)
06:21:32 INFO  loader          :: Add: 55,000,000 ncbitaxon.owl (Batch: 52,328 / Avg: 80,298)
06:21:50 INFO  loader          :: Add: 56,000,000 ncbitaxon.owl (Batch: 57,097 / Avg: 79,719)
06:22:08 INFO  loader          :: Add: 57,000,000 ncbitaxon.owl (Batch: 54,767 / Avg: 79,087)
06:22:27 INFO  loader          :: Add: 58,000,000 ncbitaxon.owl (Batch: 54,418 / Avg: 78,474)
06:22:44 INFO  loader          :: Add: 59,000,000 ncbitaxon.owl (Batch: 57,590 / Avg: 77,994)
06:22:58 INFO  loader          ::   End file: ncbitaxon.owl (triples/quads = 16,105,832)
06:23:03 INFO  loader          ::   End file: pmo.owl (triples/quads = 268,611)
06:23:03 INFO  loader          :: Finished: 13 files: 59,994,733 tuples in 775.36s (Avg: 77,376)
06:23:26 INFO  loader          :: Finish - index SPO
06:23:26 INFO  loader          :: Start replay index SPO
06:23:26 INFO  loader          :: Index set:  SPO => SPO->POS, SPO->OSP
06:23:26 INFO  loader          :: Add: 1,000,000 Index (Batch: 3,076,923 / Avg: 3,076,923)
06:23:31 INFO  loader          :: Add: 2,000,000 Index (Batch: 223,964 / Avg: 417,536)
06:23:37 INFO  loader          :: Add: 3,000,000 Index (Batch: 174,489 / Avg: 285,143)
06:23:43 INFO  loader          :: Add: 4,000,000 Index (Batch: 149,543 / Avg: 232,450)
06:23:49 INFO  loader          :: Add: 5,000,000 Index (Batch: 179,083 / Avg: 219,375)
06:23:53 INFO  loader          :: Add: 6,000,000 Index (Batch: 252,525 / Avg: 224,282)
06:24:00 INFO  loader          :: Add: 7,000,000 Index (Batch: 134,318 / Avg: 204,696)
06:24:08 INFO  loader          :: Add: 8,000,000 Index (Batch: 124,626 / Avg: 189,479)
06:24:14 INFO  loader          :: Add: 9,000,000 Index (Batch: 183,654 / Avg: 188,813)
06:24:21 INFO  loader          :: Add: 10,000,000 Index (Batch: 148,456 / Avg: 183,816)
06:24:21 INFO  loader          ::   Elapsed: 54.40 seconds [2022/01/28 06:24:21 MST]
06:24:27 INFO  loader          :: Add: 11,000,000 Index (Batch: 148,104 / Avg: 179,873)
06:24:32 INFO  loader          :: Add: 12,000,000 Index (Batch: 234,631 / Avg: 183,441)
06:24:37 INFO  loader          :: Add: 13,000,000 Index (Batch: 195,121 / Avg: 184,289)
06:24:41 INFO  loader          :: Add: 14,000,000 Index (Batch: 232,234 / Avg: 187,048)
06:24:46 INFO  loader          :: Add: 15,000,000 Index (Batch: 206,825 / Avg: 188,248)
06:24:50 INFO  loader          :: Add: 16,000,000 Index (Batch: 239,463 / Avg: 190,798)
06:24:54 INFO  loader          :: Add: 17,000,000 Index (Batch: 233,208 / Avg: 192,861)
06:25:00 INFO  loader          :: Add: 18,000,000 Index (Batch: 186,776 / Avg: 192,513)
06:25:06 INFO  loader          :: Add: 19,000,000 Index (Batch: 151,423 / Avg: 189,802)
06:25:13 INFO  loader          :: Add: 20,000,000 Index (Batch: 140,567 / Avg: 186,535)
06:25:13 INFO  loader          ::   Elapsed: 107.22 seconds [2022/01/28 06:25:13 MST]
06:25:19 INFO  loader          :: Add: 21,000,000 Index (Batch: 189,537 / Avg: 186,676)
06:25:24 INFO  loader          :: Add: 22,000,000 Index (Batch: 202,798 / Avg: 187,353)
06:25:29 INFO  loader          :: Add: 23,000,000 Index (Batch: 183,553 / Avg: 187,185)
06:25:35 INFO  loader          :: Add: 24,000,000 Index (Batch: 170,561 / Avg: 186,428)
06:25:41 INFO  loader          :: Add: 25,000,000 Index (Batch: 165,043 / Avg: 185,466)
06:25:50 INFO  loader          :: Add: 26,000,000 Index (Batch: 114,129 / Avg: 181,112)
06:25:58 INFO  loader          :: Add: 27,000,000 Index (Batch: 115,713 / Avg: 177,399)
06:26:03 INFO  loader          :: Add: 28,000,000 Index (Batch: 199,600 / Avg: 178,106)
06:26:11 INFO  loader          :: Add: 29,000,000 Index (Batch: 128,600 / Avg: 175,773)
06:26:18 INFO  loader          :: Add: 30,000,000 Index (Batch: 150,966 / Avg: 174,816)
06:26:18 INFO  loader          ::   Elapsed: 171.61 seconds [2022/01/28 06:26:18 MST]
06:26:24 INFO  loader          :: Add: 31,000,000 Index (Batch: 166,417 / Avg: 174,531)
06:26:30 INFO  loader          :: Add: 32,000,000 Index (Batch: 169,606 / Avg: 174,373)
06:26:36 INFO  loader          :: Add: 33,000,000 Index (Batch: 170,940 / Avg: 174,267)
06:26:41 INFO  loader          :: Add: 34,000,000 Index (Batch: 180,050 / Avg: 174,432)
06:26:48 INFO  loader          :: Add: 35,000,000 Index (Batch: 147,123 / Avg: 173,512)
06:26:56 INFO  loader          :: Add: 36,000,000 Index (Batch: 116,103 / Avg: 171,161)
06:27:04 INFO  loader          :: Add: 37,000,000 Index (Batch: 132,872 / Avg: 169,838)
06:27:10 INFO  loader          :: Add: 38,000,000 Index (Batch: 168,265 / Avg: 169,796)
06:27:17 INFO  loader          :: Add: 39,000,000 Index (Batch: 132,890 / Avg: 168,596)
06:27:26 INFO  loader          :: Add: 40,000,000 Index (Batch: 115,888 / Avg: 166,700)
06:27:26 INFO  loader          ::   Elapsed: 239.95 seconds [2022/01/28 06:27:26 MST]
06:27:32 INFO  loader          :: Add: 41,000,000 Index (Batch: 178,030 / Avg: 166,959)
06:27:38 INFO  loader          :: Add: 42,000,000 Index (Batch: 148,016 / Avg: 166,452)
06:27:45 INFO  loader          :: Add: 43,000,000 Index (Batch: 157,903 / Avg: 166,243)
06:27:53 INFO  loader          :: Add: 44,000,000 Index (Batch: 127,779 / Avg: 165,113)
06:27:58 INFO  loader          :: Add: 45,000,000 Index (Batch: 176,211 / Avg: 165,345)
06:28:05 INFO  loader          :: Add: 46,000,000 Index (Batch: 157,728 / Avg: 165,171)
06:28:11 INFO  loader          :: Add: 47,000,000 Index (Batch: 147,080 / Avg: 164,740)
06:28:17 INFO  loader          :: Add: 48,000,000 Index (Batch: 168,208 / Avg: 164,811)
06:28:25 INFO  loader          :: Add: 49,000,000 Index (Batch: 130,582 / Avg: 163,934)
06:28:32 INFO  loader          :: Add: 50,000,000 Index (Batch: 143,740 / Avg: 163,475)
06:28:32 INFO  loader          ::   Elapsed: 305.86 seconds [2022/01/28 06:28:32 MST]
06:28:38 INFO  loader          :: Add: 51,000,000 Index (Batch: 164,122 / Avg: 163,487)
06:28:44 INFO  loader          :: Add: 52,000,000 Index (Batch: 176,959 / Avg: 163,727)
06:28:49 INFO  loader          :: Add: 53,000,000 Index (Batch: 184,911 / Avg: 164,082)
06:28:55 INFO  loader          :: Add: 54,000,000 Index (Batch: 185,770 / Avg: 164,437)
06:29:01 INFO  loader          :: Add: 55,000,000 Index (Batch: 148,434 / Avg: 164,115)
06:29:10 INFO  loader          :: Add: 56,000,000 Index (Batch: 120,467 / Avg: 163,060)
06:29:16 INFO  loader          :: Add: 57,000,000 Index (Batch: 156,225 / Avg: 162,935)
06:29:22 INFO  loader          :: Add: 58,000,000 Index (Batch: 176,772 / Avg: 163,155)
06:29:28 INFO  loader          :: Add: 59,000,000 Index (Batch: 156,985 / Avg: 163,047)
06:29:33 INFO  loader          :: Index set:  SPO => SPO->POS, SPO->OSP [59,535,659 items, 366.6 seconds]
06:29:36 INFO  loader          :: Finish - index OSP
06:29:40 INFO  loader          :: Finish - index POS
06:29:40 INFO  loader          :: Time = 1,172.670 seconds : Triples = 59,994,733 : Rate = 51,161 /s
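To compare load runs later, the final summary line can be parsed and the rate re-derived. A minimal sketch, assuming the summary format shown in the last log line above; the regular expression and function name are my own:

```python
import re

# Matches the final summary line of the loader log above.
SUMMARY_RE = re.compile(
    r"Time = (?P<seconds>[\d,.]+) seconds : "
    r"Triples = (?P<triples>[\d,]+) : "
    r"Rate = (?P<rate>[\d,]+) /s"
)


def parse_summary(line):
    """Return (seconds, triples, reported_rate) from a loader summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not a loader summary line: {!r}".format(line))
    seconds = float(match.group("seconds").replace(",", ""))
    triples = int(match.group("triples").replace(",", ""))
    rate = int(match.group("rate").replace(",", ""))
    return seconds, triples, rate


line = (
    "06:29:40 INFO  loader          :: Time = 1,172.670 seconds : "
    "Triples = 59,994,733 : Rate = 51,161 /s"
)
seconds, triples, rate = parse_summary(line)
# The reported rate is the triple count divided by total time (load plus indexing).
print("{:,} triples in {:.1f} min, {:,.0f} triples/s".format(
    triples, seconds / 60, triples / seconds
))
```

So the whole load, including the index replay, takes a little under 20 minutes for roughly 60 million triples.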

Observations on queries

  1. The basic query works with the local environmental context set to ABP (ENVO:01000813), but it only returns ~160 of the samples, presumably because the layer terms (Tara and, I think, HOT) aren't subclasses of ABP in ENVO (or at least in PMO's ENVO import).

  2. When I try basic query3 with BFO material entity instead of the local context, the query takes forever, which it didn't with subclasses of ABP. I'm wondering if it's doing something stupid like re-running the subclass query for each sample over and over. Should try, for example, a HAVING() block for the ENVO triad terms like I did in my BCODMO example, i.e. a subquery where, after selecting on the other fields like quantifiers or GO terms, I sub-select on the subclasses of ENVO terms. Could also try, as in my BCODMO example (or some iteration of it), using FILTER IN (list_of_subclasses).
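One variant worth trying is a nested SELECT that names the subclass closure once and then joins the samples against it. The sketch below only builds the query string; the sample predicate is a placeholder (http://example.org/localContext is not a real PMO property), and whether TDB2 actually runs this form faster is exactly what needs testing.

```python
def obo_iri(curie):
    """Turn an OBO CURIE such as ENVO:01000813 into its PURL."""
    prefix, local_id = curie.split(":", 1)
    return "http://purl.obolibrary.org/obo/{}_{}".format(prefix, local_id)


def subclass_first_query(root_curie, sample_predicate):
    """Build a SELECT whose inner query collects the subclasses of root_curie."""
    return """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sample ?env_class WHERE {{
  # Inner SELECT: the subclass closure, written so it does not depend on ?sample.
  {{ SELECT ?env_class WHERE {{ ?env_class rdfs:subClassOf* <{root}> }} }}
  # Outer pattern: join samples against the classes found above.
  ?sample <{predicate}> ?env_class .
}}""".format(root=obo_iri(root_curie), predicate=sample_predicate)


if __name__ == "__main__":
    # The predicate IRI is hypothetical; replace it with the real property.
    print(subclass_first_query("ENVO:01000813", "http://example.org/localContext"))
```

The same builder can be pointed at BFO material entity to see whether the slowdown comes from the size of the closure or from how the join is planned.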

01.29

Rough triplestore testing: doing an NCBITaxon subclass* query with Synechococcales (NCBITaxon:1890424) as a medium/large NCBITaxon test. Not finishing within a 1.5-2 hr time range.

01.30

https://www.ontology-of-designing.ru/article/2021_4(42)/Ontology_Of_Designing_4_2021_402-421_Azamat_Abdoullaev.pdf see Figure 1 - Gartner Hype Cycle for Artificial Intelligence 2021 where semantic search is right at the slope of enlightenment. Could use this in my defense in the intro as a global picture of things. Maybe cite the paper in my dissertation intro too?

01.31

testing go_no_envo.rq with GO:0043169 it took: 2m31.278s

Testing ncbitaxon_only.rq with NCBITaxon:1890424: started ~noon, and by 9pm it's still going.

2022 OBI Workshop COB Data 2022

02.01

Doing subclass* on Synechococcales NCBITaxon:1890424 takes ~3 seconds. So that's not the problem.

Trying ncbitaxon_only.rq with NCBITaxon:2784134 Gibliniella took 1m13.041s. No results; try a new small test.

Try again with NCBITaxon:2649294 unclassified Janibacter, as I know we should have at least NCBITaxon:2761047 in our test data. There are 181 subclasses, so a fairly small test but not tiny. Started ~noon. After an hour it failed from a broken link to the server.

Try with just subclasses of NCBITaxon:2761047, which should be just the one class. Took 0m22.204s. Good sanity check.

Trying the ncbitaxon query ncbitaxon_no_sc_query.rq without a subclass* constraint, i.e. get all samples with all nodes with taxon annotations. Started 1:20. The idea is to try this first, then throw in a HAVING() block with the SC* constraint as a subquery to see if that runs faster. It worked: after 4m5.384s I canceled because it would be writing to STDOUT forever, but the query itself was quick.

Try ncbitax_having.rq with the SC* constraint in a HAVING block, again with NCBITaxon:2649294. Started 1:37; canceling at 3:21 for time reasons, as CyVerse maintenance is coming up.

On lytic

Tested 1_sample_ncbitaxon_test.rq with NCBITaxon:2649294 (181 SCs); it didn't finish, probably ran out of memory.

Tested 1_sample_ncbitaxon_test.rq with a SC* query with only one class and for just one sample, with the having block HAVING (?taxon IN (?tax_list) && ?sample="SRR9178442" ). It takes 3m28.302s, perhaps not ideal. This was with BIND. Tried without BIND, given this stack overflow post (although that might just be for MarkLogic): it took 2m59.466s, so a good 30 seconds shorter for that one sample, but it didn't return data so the query is probably wrong.

Trying without the HAVING block or NCBI, just samples: without BIND 0m1.744s, with BIND 0m1.802s, so perhaps not too different. Would need to try a longer query. TODO.
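Single timings this close (1.744s vs 1.802s) are within run-to-run noise, so repeating each variant and comparing medians would be more telling. A minimal timing harness, assuming each variant is wrapped in a zero-argument Python callable (for example a function that sends one query file to the endpoint); the names here are made up for illustration:

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func repeatedly and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def compare_variants(variants, repeats=5):
    """Map each variant name to its median run time."""
    return {name: time_runs(func, repeats) for name, func in variants.items()}


if __name__ == "__main__":
    # Stand-in callables; in practice these would each run one query variant.
    results = compare_variants(
        {"with_bind": lambda: sum(range(100000)), "without_bind": lambda: sum(range(100000))},
        repeats=3,
    )
    for name, seconds in sorted(results.items()):
        print("{}: {:.4f}s".format(name, seconds))
```

The median is used rather than the mean so that one slow run (e.g. a cold cache on the first query) doesn't dominate the comparison.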

Also see https://hal.archives-ouvertes.fr/hal-01280951/file/Slides%20WebIST16.pdf file:///Users/kai/Downloads/Slides%20WebIST16.pdf See slide 21 about pushing down the filter with a nested SELECT and WHERE, saying we don't know how good the DB is and the filter uses the DB. In my case the TDB2 store is indexed, so maybe filtering first is faster? Try both.

Also see https://dotnetrdf.org/docs/stable/developer_guide/SPARQL-Optimization.html. Basically move the bind and filter statements up.

02.04

Trying taxslim instead of the full ncbitaxon.owl

-rw-r--r--. 1 kblumberg 1.8G Dec 14 03:58 ncbitaxon.owl
-rw-r--r--. 1 kblumberg 127M Jan 18 13:05 go.owl
-rw-r--r--. 1 kblumberg  22M Feb  1 07:51 pmo.owl
-rw-r--r--. 1 kblumberg 3.4M Feb  4 01:14 taxslim.obo
-rw-r--r--. 1 kblumberg  23M Feb  4 01:14 taxslim.owl

Wow, taxslim is a lot smaller! Hopefully we don't miss too many results with this... If we do, I can make my own extract version of NCBITaxon and host it in the PMO repository. Using taxslim requires robot to do the conversion from obo to owl.

install robot:

wget https://github.com/ontodev/robot/releases/download/v1.8.3/robot.jar

curl https://raw.githubusercontent.com/ontodev/robot/master/bin/robot > robot

Note you also need to make robot executable with chmod +x robot, which might require sudo.

cut -f 2 /home/u19/kblumberg/planet-microbe-semantic-web-analysis/triplestore/wgs_annotation_data/merged_ncbitaxon_data.tsv | sort | uniq > ncbitaxon_terms.txt

cut -f 2 /home/u19/kblumberg/planet-microbe-semantic-web-analysis/triplestore/wgs_annotation_data/merged_go_data.tsv | sort | uniq > go_terms.txt

Testing the BOT and SLME extract methods for GO and NCBITaxon with our term lists:

cat go_BOT.csv | wc -l
6936
cat go_SLME.owl.csv | wc -l
6935
cat ncbitaxon_BOT.csv | wc -l
29267
cat ncbitaxon_SLME.csv | wc -l
29267

Only one more line in GO's BOT output than in SLME's, and the counts are the same for both in NCBITaxon, so let's just go with BOT. I also tested to make sure the remove commands don't get rid of terms, and they don't.
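Line counts only show that the two outputs differ in size, not which term differs. A sketch for listing the actual differences, assuming each CSV has a header row and the term ID in the first column (I haven't confirmed that layout for these particular files); the function names are hypothetical:

```python
import csv


def read_terms(path, column=0, has_header=True):
    """Read one column of a CSV file into a set of term IDs."""
    with open(path, newline="") as handle:
        rows = csv.reader(handle)
        if has_header:
            next(rows, None)  # skip the header row
        return {row[column] for row in rows if row}


def compare_extracts(bot_path, slme_path):
    """Return (terms only in the BOT file, terms only in the SLME file), sorted."""
    bot = read_terms(bot_path)
    slme = read_terms(slme_path)
    return sorted(bot - slme), sorted(slme - bot)
```

For example, compare_extracts("go_BOT.csv", "go_SLME.owl.csv") should show which single GO term accounts for the 6936 vs 6935 difference, provided the column assumption holds.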

02.10

./start_fuseki_server.sh &
[1] 7377

This works on the VM, but not from my computer using the localtunnel temporary URL:
./query/assemble_query.py -u query/base_metadata.rq -o api_results/base_metadata.csv

# Try installing ndg-httpsclient to deal with the EOF error described in https://stackoverflow.com/questions/47142848/python-sslerror-bad-handshake-unexpected-eof#47816648
pip install ndg-httpsclient

lt --port 3030 &

./assemble_query.py -o output/go.csv -g GO:0043169

scp [email protected]:/home/kblumberg/planet-microbe-semantic-web-analysis/analysis/api_results/go.csv .

https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html

check dates for summer grad/ and send to bonnie.

https://www.nature.com/articles/s41587-020-0603-3 for Heidi's work TBD Bonnie suggests to use this database.

02.11

https://www.linkedin.com/pulse/western-science-technology-decadence-china-takes-over-abdoullaev/?published=t https://www.linkedin.com/in/azamat-abdoullaev-335a0881/ https://www.linkedin.com/feed/update/urn:li:activity:6676906939940790272/?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A6676906939940790272%29 https://www.quora.com/What-are-some-countries-where-capitalism-has-failed/answer/Kiryl-Persianov https://www.linkedin.com/pulse/western-science-decadence-european-research-council-azamat-abdoullaev/?published=t

Chris Mungall and Justin Reese at the Ontology Summit 2022 on COVID-19 Knowledge Graphs

02.21

https://guianaplants.stir.ac.uk/seminar/materials/vegantutor.pdf http://www.djcxy.com/p/5222.html https://github.com/kaiiam/kblumberg_masters_thesis/blob/master/Digital_Supplement_Kai_Blumberg_MSC_Thesis/D.S.2/R_scripts/pcoa_analysis/pcoa_analysis.R https://rdrr.io/rforge/vegan/man/biplot.rda.html

02.23

https://aws.amazon.com/blogs/database/build-interactive-graph-data-analytics-and-visualizations-using-amazon-neptune-amazon-athena-federated-query-and-amazon-quicksight/

02.25

https://www.frontiersin.org/articles/10.3389/fmars.2020.00105/full the nickel story for marine phytoplankton.

Nickel (Ni) is a bio-essential element required for the growth of phytoplankton.

The generally lower surface concentrations in the NH subtropical gyre compared to the southern hemisphere (SH), might be related to a greater Ni uptake by nitrogen fixers that are stimulated by iron (Fe) deposition.

I bet this might apply to HOT, hence why we see more nickel at the surface.

The distribution of Ni resembles the distribution of cadmium (Cd)

but other elements such as cadmium (Cd), molybdenum (Mo), vanadium (V), and selenium (Se) are important for specific taxonomic groups [e.g. (Morel et al., 2014; De Baar et al., 2018)].

basically marine phytoplankton need nickel.

https://www.mbari.org/wp-content/static/chemsensor/ni/nickel.html -> Ni has a nutrient like vertical profile with low concentrations at the surface and values increasing with depth (data).

Manganese https://www.mbari.org/wp-content/static/chemsensor/mn/manganese.html https://par.nsf.gov/servlets/purl/10179969 https://www.mbari.org/wp-content/static/chemsensor/mn/mngraph.html it'd be really cool to plot the distributions of the various GO ion binding genes like how people plot the profiles of nutrients see if for example manganese follows the same pattern. Doing it with the GO_0042301 phosphate ion binding and the HOT phosphate concentration would be awesome. Can try this with others? Hot has nitrate (nitrate and nitrite could use to get nitrite), phosphate, oxygen

Can also do the question with ion transport would have subclasses like nitrate transport maybe can do this for the oxic/anoxic story. Can also try ploting oxygen transport or oxygen carrier activity against O2 distribution for either project.

  1. phosphate depth distribution against phosphate ion binding prob more interesting in HOT story.

  2. nitrate depth distribution against nitrate transport or response to nitrate or other; try both HOT and Tara.

  3. oxygen depth distribution with oxygen transport, oxygen carrier activity, oxygen binding or oxygen sensor activity or subclasses of response to oxygen levels, response to reactive oxygen species etc. Try these with the Tara O2 story.

If HOT-DOGS had the iron profile, could also do it against ferrous iron binding, or cadmium, manganese or others. Perhaps better not to bring in more new data and stick with what we have.

try energy derivation by oxidation of organic compounds or cellular respiration with the Tara oxygen if I haven't already.

03.08

ESIP soil talk I gave last week

03.15

Notes from meeting with Dava:

* announcement of final def gradpath

* Journal article form (cert published manuscript)

* announcement of defense in gradpath, with committees signature. 

* final oral def approval form 

* Word announcement flyer for defense (DAVA will upload to  D2L)

Check out https://www.proquest.com/pqdtglobal/advanced for examples of dissertation
Also check manual for dissertation steps with an outline for what needs to be included

Aug 8 final submission of thesis need revisions from committee submit to proquest. Oral defense has to be 1-2 weeks before with time for revisions. 

Date for may 13 for in person spring grad, no summer (not required)

03.16

Meeting with Kathe Todd-Brown

Publications:
1) Soil methods ontology paper
2) data pub similar to https://essd.copernicus.org/articles/12/61/2020/ on integrated data
3) modeling and/or statistical/ML analysis of the data integrated in 2). Could be scope to do both modeling and ML and compare and contrast the results.

Lifestuff:
Remote ok
Pay: 2 years  60K/yr

Maintain OBO network a plus. 

Extras:
Teaching undergrads a pipeline for data harmonization/ontology annotation. Examples https://github.com/ISCN/SOCDRaHR2/tree/master/R see the  SOCDRaHR2/R/readCPEAT.R 
She's going for https://www.usda.gov/climate-solutions/climate-smart-commodities unlikely in my opinion. 

04.05

https://dvc.org/doc/start from Bonnie like github but for data.

04.06

Humann pipeline. Original version https://github.com/biobakery/humann

Alise's version: https://github.com/aponsero/Humann_annotation_HPC

https://huttenhower.sph.harvard.edu/humann has concise docs for install and use (https://github.com/biobakery/humann elaborates on more) starting with the former.

Doing it in /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann

we don't have pip, so install with conda instead:

# Create a new conda environment for the installation
conda create --name biobakery3 python=3.7

#might need to update conda
conda update -n base -c defaults conda

#activate biobakery3 conda environment
conda activate biobakery3

#Set conda channel priority: 
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --add channels biobakery

#Install HUMAnN 3.0 software with demo databases: (will also automatically install MetaPhlAn 3.0.) take a little bit to run
conda install humann -c biobakery

#Test installation: Run HUMAnN unit tests:
humann_test


# Download HUMAnN source code (for the test data). Get the latest download link from https://pypi.org/project/humann/#files
wget https://files.pythonhosted.org/packages/27/f9/d07bd76dd7dd5732c4d29d58849e96e4828c8a7dc95cf7ae58622f37591a/humann-3.0.1.tar.gz

# Unpack the archive (copy-pasting may turn the hyphen into an en dash, which tar rejects; retype it if so)
tar -xvzf humann-3.0.1.tar.gz

# install databases done in new folder database
mkdir database
cd database/

# install ChocoPhlAn, UniRef90 (DIAMOND) and utility mapping databases into the current folder
humann_databases --download chocophlan full .
humann_databases --download uniref uniref90_diamond .
humann_databases --download utility_mapping full .

#move into directory:
cd humann-3.0.1/examples/

#To run test switch to interactive mode
interactive

# reactivate biobakery3 conda environment
conda activate biobakery3

#test failed without database

#Run the HUMAnN demo: 
humann -i demo.fastq.gz -o sample_results

after test run without databases:

Running metaphlan ........

CRITICAL ERROR: Error executing: /home/u19/kblumberg/miniconda2/envs/biobakery3/bin/metaphlan /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/tmpr_nv8ivq/tmpnx6snk_d -t rel_ab -o /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bugs_list.tsv --input_type fastq --bowtie2out /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bowtie2.txt

Error message returned from metaphlan :
No MetaPhlAn BowTie2 database found (--index option)!
Expecting location bowtie2db
Exiting..

after installing choco database

Database installed: /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/database/chocophlan

HUMAnN configuration file updated: database_folders : nucleotide = /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/database/chocophlan

When running the demo I still get the bowtie2 error, like in https://forum.biobakery.org/t/no-metaphlan-bowtie2-database-found-index-option/1688, except the fix there doesn't work. Perhaps try Alise's install instructions, which use conda to install pip and then pip to install HUMAnN instead of the other way around.

##Steps Humann install

#conda create --name humann # Didn't work
conda create --name humann python=3.7 #tried this instead

#add this
conda activate humann

conda install pip
conda update -n base -c defaults conda

pip install humann # instead try: `pip install humann --no-binary :all:`
conda install -c bioconda metaphlan
# fix libtbb2 for bowtie2
conda install tbb=2020.2
Output files will be written to: /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results
Decompressing gzipped file ...


Running metaphlan ........

CRITICAL ERROR: Error executing: /home/u19/kblumberg/miniconda2/envs/humann/bin/metaphlan /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/tmpdv78lxao/tmp5icolv89 -t rel_ab -o /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bugs_list.tsv --input_type fastq --bowtie2out /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bowtie2.txt

Error message returned from metaphlan :

Downloading MetaPhlAn database
Please note due to the size this might take a few minutes

File /home/u19/kblumberg/miniconda2/envs/humann/lib/python3.7/site-packages/metaphlan/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.tar already present!

Downloading http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.md5
Downloading file of size: 0.00 MB
MD5 checksums do not correspond! If this happens again, you should remove the database files and rerun MetaPhlAn so they are re-downloaded

04.07

Decompressing gzipped file ...


Running metaphlan ........

CRITICAL ERROR: Error executing: /home/u19/kblumberg/miniconda2/envs/humann/bin/metaphlan /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/tmplyauebur/tmpzao7iyhj -t rel_ab -o /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bugs_list.tsv --input_type fastq --bowtie2out /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai/humann/humann-3.0.1/examples/sample_results/demo_humann_temp/demo_metaphlan_bowtie2.txt

Error message returned from metaphlan :
No MetaPhlAn BowTie2 database found (--index option)!
Expecting location bowtie2db
Exiting...

Trying again with clean conda environment:

##Steps Humann install

conda create --name humann_kai python=3.7 

#Activate
conda activate humann_kai

#Set conda channel priority: 
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --add channels biobakery

#install pip
conda install pip

# install humann
pip install humann --no-binary :all:

# install metaphlan
conda install -c bioconda metaphlan
# fix libtbb2 for bowtie2
conda install tbb=2020.2

Meeting with Bonnie

Dissertation 1st draft to Bonnie Jun 13

Bonnie will give me back by 3rd /4th week of June 17th.

To committee by June 27th back to me by July 15th.

Defense week of 18th or 25th.

From Alise

Humann steps:

Run all samples (Run with Metaphlan output, there should be a flag for it)

merge tables

normalize table by count per million

unstratify

demo_metaphlan_bugs_list.tsv is the taxonomic profile (like kraken but always at species level)
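To make the "normalize by count per million" and "unstratify" steps concrete, here is a minimal sketch of the arithmetic in awk. It is not the HUMAnN tooling itself (HUMAnN ships `humann_join_tables`, `humann_renorm_table` and `humann_split_stratified_table` for the real steps); the function name `unstratified_cpm` is hypothetical, and it assumes a single-sample, tab-separated table whose stratified rows carry a `|` in the first column.

```shell
# Hypothetical helper, for illustration only (not a HUMAnN utility).
# Keeps community-level rows (no "|" in column 1) and rescales column 2
# so that the kept rows sum to one million (CPM).
unstratified_cpm() {
    table="$1"
    awk -F'\t' '
        NR == FNR {                       # first pass: total of community-level rows
            if ($0 !~ /^#/ && index($1, "|") == 0) total += $2
            next
        }
        /^#/ { print; next }              # second pass: keep header line(s) unchanged
        index($1, "|") == 0 && total > 0 {
            printf "%s\t%.2f\n", $1, $2 / total * 1000000
        }
    ' "$table" "$table"
}
```

Usage would look like `unstratified_cpm sample_genefamilies.tsv > sample_genefamilies_cpm_unstratified.tsv`, where the file names are placeholders.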

04.21

reinstall databases and unzip the files prior to running. 

Run bowtie2 with the human database and trimgalore as an initial QC step before running HUMAnN. See https://github.com/aponsero/readbased_metagenomes_snakemake Hopefully this will deal with the gzip issue.

install bowtie2 and trim_galore into my conda env using bioconda 

download Human genome for bowtie2

wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
unzip GRCh38_noalt_as.zip

hdb="databases/GRCh38_noalt_as/GRCh38_noalt_as",
bowtie2 -p 8 -x {params.hdb} -U {input.f1} --un-gz output/{params.bowtiename}  


trim_galore -o {trimgalore output folder} --fastqc {input to trimgalore}.fastq.gz

https://github.com/aponsero/readbased_metagenomes_snakemake/blob/main/PBS_pipeline/Snakefile

04.27

https://www.ebi.ac.uk/GOA/InterPro2GO http://current.geneontology.org/ontology/external2go/interpro2go https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1160203/ https://www.ebi.ac.uk/GOA/ http://current.geneontology.org/ontology/external2go https://github.com/geneontology/go-site/uniprotkb_kw2go https://wiki.geneontology.org/index.php/Release_Pipeline https://www.ebi.ac.uk/QuickGO/term/GO:0008198

http://berkeleybop.org/index.html

https://www.ontotext.com/products/graphdb/

04.28

berkeleybop projects:

### Likely in scope
National Microbiome Data Collaborative (NMDC)
OBO Foundry
Gene Ontology
INCA: Intelligent Concept Assistant (inactive) perhaps can bring some of this ML cleaned NCBIdatabase into NMDC?

### Maybe:
Monarch Initiative -> semantically integrate genotype-phenotype data from many species and sources in order to support precision medicine, disease modeling, and mechanistic exploration
Phenomics First -> part of Monarch developing tools for biomedical information about genetic conditions is captured, stored, and exchanged. upheno
NCATS Biomedical Translator project seeks to “translate” the results of biological research into clinical practice
SymbiOnt -> Augmenting and merging ontologies using ontology mappings and knowledge graph embeddings

### Less interesting to me:
Exomiser -> tool that finds potential disease-causing variants from whole-exome or whole-genome sequencing data part of Monarch
IDG2KG -> Illuminating the Druggable Genome 
Alliance for Genome Resources -> model organisms to contribute to human health. 
KG-COVID-19 -> covid knowledge graph
GMOD -> Generic Model Organism Database 
CCDH -> cancer harmonization (Not currently active)

04.29

https://github.com/cidgoh/DataHarmonizer from damion

https://asm.org/ASM/media/Academy/Academy%20Reports/Microbes-Climate-Change-Science,-People-Impacts-Report.pdf from Chris for NMDC.

Bland Altman analysis https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4470095/, https://rss.onlinelibrary.wiley.com/doi/abs/10.2307/2987937

A guide to appropriate use of Correlation coefficient in medical research

05.15

For Heidi

./count.sh ../../heidi/karnes_metagenomes/*.fastq.gz

../../heidi/karnes_metagenomes/Karnes-10-1_UA-NGsp-fastq_Karnes-C11-U035_Karnes-C11_S285_R1_001.fastq.gz  11G
    Number of reads: 135756044
    Number of bases in reads: 20311864640
../../heidi/karnes_metagenomes/Karnes-10-2_UA-NGsp-fastq_Karnes-C12-U036_Karnes-C12_S286_R1_001.fastq.gz 9.8G
    Number of reads: 125171637
    Number of bases in reads: 18750286185
../../heidi/karnes_metagenomes/Karnes-1-1_UA-NGsp-fastq_Karnes-F01-U061_Karnes-F01_S267_R1_001.fastq.gz 11G
    Number of reads: 130472127
    Number of bases in reads: 19453178652
../../heidi/karnes_metagenomes/Karnes-1-2_UA-NGsp-fastq_Karnes-F02-U062_Karnes-F02_S268_R1_001.fastq.gz 12G
    Number of reads: 159006650
    Number of bases in reads: 23401581189
../../heidi/karnes_metagenomes/Karnes-2-1_UA-NGsp-fastq_Karnes-F03-U063_Karnes-F03_S269_R1_001.fastq.gz 16G
    Number of reads: 199491669
    Number of bases in reads: 29728724644
sample	Size G	Number of reads	Number of bases in reads	Finished in original run
Karnes-10-1_UA-NGsp-fastq_Karnes-C11-U035_Karnes-C11_S285_R1_001.fastq.gz	11	135756044	20311864640	Yes
Karnes-10-2_UA-NGsp-fastq_Karnes-C12-U036_Karnes-C12_S286_R1_001.fastq.gz	9.8	125171637	18750286185	Yes
Karnes-1-1_UA-NGsp-fastq_Karnes-F01-U061_Karnes-F01_S267_R1_001.fastq.gz	11	130472127	19453178652	no
Karnes-1-2_UA-NGsp-fastq_Karnes-F02-U062_Karnes-F02_S268_R1_001.fastq.gz	12	159006650	23401581189	no
Karnes-2-1_UA-NGsp-fastq_Karnes-F03-U063_Karnes-F03_S269_R1_001.fastq.gz	16	199491669	29728724644	no

Going with 20 million reads as the subsamples -> gunzip -c $INPUT_DIR/$SMPLE | head -n 80000000 | gzip > subsample/$SMPLE
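The 80,000,000 in the command above comes from FASTQ using four lines per read (20,000,000 reads x 4 lines). A small sketch that makes the conversion explicit; `subsample_fastq` is a hypothetical name, and it assumes standard four-line FASTQ records with no wrapped sequence lines.

```shell
# Hypothetical helper: keep the first N reads of a gzipped FASTQ file.
# Assumes 4 lines per read (header, sequence, "+", qualities).
subsample_fastq() {
    input="$1"
    n_reads="$2"
    output="$3"
    n_lines=$(( n_reads * 4 ))            # e.g. 20000000 reads -> 80000000 lines
    gunzip -c "$input" | head -n "$n_lines" | gzip > "$output"
}
```

For example, `subsample_fastq "$INPUT_DIR/$SMPLE" 20000000 "subsample/$SMPLE"`. Note that this takes the first reads in the file rather than a random draw.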

05.17

Humann results:

pathabundance.tsv is MetaCyc; genefamilies.tsv is UniRef90. Can presumably use either for the functional rarefaction curve.


cut -f 1 Karnes-7-1_UA-NGsp-fastq_Karnes-B09-U021_Karnes-B09_S279_R1_001_trimmed_pathabundance.tsv | sort | uniq | wc -l

cut -f 1 Karnes-9-2_UA-NGsp-fastq_Karnes-C10-U034_Karnes-C10_S284_R1_001_trimmed_genefamilies.tsv  | sort | uniq | wc -l

#inside temp folder(s)
cut -f 1 Karnes-9-2_UA-NGsp-fastq_Karnes-C10-U034_Karnes-C10_S284_R1_001_trimmed_metaphlan_bugs_list.tsv | sort | uniq | wc -l
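One caveat with the counts above: `cut | sort | uniq | wc -l` also counts the `#` header line and, if the HUMAnN tables are still stratified, every per-species row (those with a `|` in the first column). A hedged sketch of a count restricted to community-level features follows; `count_features` is a hypothetical name. It is meant for the HUMAnN tables only, since MetaPhlAn profiles use `|` inside the lineage string and would need a different filter.

```shell
# Hypothetical helper: count distinct community-level features in column 1
# of a HUMAnN table, skipping "#" header lines and stratified ("|") rows.
count_features() {
    grep -v '^#' "$1" | cut -f 1 | grep -v '|' | sort -u | wc -l | tr -d ' '
}
```

Special rows such as UNMAPPED are still counted, so subtract them if only real pathways or gene families are wanted.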

05.25

./bash/wgs_annotation_data.sh: line 19: tarql: command not found
./bash/planet_microbe_data.sh: line 15: jq: command not found
./bash/planet_microbe_data.sh: line 116: python3: command not found
./bash/download_ontologies.sh: line 8: wget: command not found
./bash/create_triplestore.sh: line 17: tdb2.tdbloader: command not found
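The errors above are all missing executables, so a quick pre-flight check before running the scripts may save a failed run. A minimal sketch, with `missing_tools` as a hypothetical helper name:

```shell
# Print each required command that is not on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For this repo the call would be `missing_tools tarql jq python3 wget tdb2.tdbloader`.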

05.27

loading triplestore on tecti server Finished: 13 files: 42,656,343 tuples in 669.94s (Avg: 63,671)

Time = 906.027 seconds : Triples = 42,656,343 : Rate = 47,081 /s

Moved the server version of the repo to /opt/planet-microbe-semantic-web-analysis, a more appropriate place to keep it than in my user folder, so that others can use it.

https://arizona.zoom.us/my/kaiblumberg

Meeting with Adam Michel

new endpoint: http://sparql.planetmicrobe.org/

The following commands (run with sudo) apply to the server:

#See status 
systemctl status fuseki

#restart server
systemctl restart fuseki

# Output log 
journalctl -xa -u fuseki -f

# Status of proxy 
systemctl status nginx

nginx sends its log to `/var/log/nginx/access.log`

How it works: nginx listens on the web server's port 80 and reverse proxies the public URL to local port 3030; the relevant config is `/etc/nginx/conf.d/fuseki.conf`. He had to change some group permissions on files in the repo to get it to work.


It uses a system service: the init process starts a set of daemons, and Fuseki is now part of startup. The systemd unit `/etc/systemd/system/fuseki.service` holds everything the system daemon needs to start the Fuseki daemon. systemd is the init daemon we're using on this Red Hat server.

05.28

https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html https://jena.apache.org/documentation/fuseki2/fuseki-webapp.html#fuseki-standalone-server https://stackoverflow.com/questions/31927012/disable-only-unauthenticated-adding-of-datasets-to-fuseki https://jena.apache.org/documentation/fuseki2/fuseki-security.html

http://ontology.buffalo.edu/philosophome/index_files/philosophome.html

06.15

From Damion: he's looking for USDA funding to help with https://fdc.nal.usda.gov/

From Alise: "The Ocean Gene Atlas v2.0: online exploration of the biogeography and phylogeny of plankton genes"; should cite this in paper 3.

06.16

Kraken2 database from https://benlangmead.github.io/aws-indexes/k2 we used the Standard: archaea, bacteria, viral, plasmid, human1, UniVec_Core. Version from 5/17/2021.

06.20

if Heidi wants a more accurate taxonomic profiling Alise would suggest using kraken along with the HumGut database https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01114-w#Sec10 But metaphlan should be pretty good for most of the bugs

06.22

https://www.linkedin.com/in/williamhsiao/ https://cidgoh.ca/ https://genepio.org/ https://github.com/GenEpiO/genepio https://irida.ca/ https://github.com/Public-Health-Bioinformatics

07.07

To merge metaphlan3 results (from the HUMAnN pipeline): merge_metaphlan_tables.py is in the conda environment.

merge_metaphlan_tables.py *_bugs_list.tsv > merged_abundance_table.txt

To make the merged taxa table, from inside the humann_results folder, run the following:

mkdir bug_list
cp Karnes-10-1_UA-NGsp-fastq_Karnes-C11-U035_Karnes-C11_S285_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-10-2_UA-NGsp-fastq_Karnes-C12-U036_Karnes-C12_S286_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-1-1_UA-NGsp-fastq_Karnes-F01-U061_Karnes-F01_S267_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-1-2_UA-NGsp-fastq_Karnes-F02-U062_Karnes-F02_S268_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-2-1_UA-NGsp-fastq_Karnes-F03-U063_Karnes-F03_S269_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-2-2_UA-NGsp-fastq_Karnes-F04-U064_Karnes-F04_S270_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-3-1_UA-NGsp-fastq_Karnes-F05-U065_Karnes-F05_S271_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-3-2_UA-NGsp-fastq_Karnes-F06-U066_Karnes-F06_S272_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-4-1_UA-NGsp-fastq_Karnes-F07-U067_Karnes-F07_S273_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-4-2_UA-NGsp-fastq_Karnes-F08-U068_Karnes-F08_S274_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-5-1_UA-NGsp-fastq_Karnes-F09-U069_Karnes-F09_S275_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-5-2_UA-NGsp-fastq_Karnes-F10-U070_Karnes-F10_S276_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-6-1_UA-NGsp-fastq_Karnes-F11-U071_Karnes-F11_S277_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-6-2_UA-NGsp-fastq_Karnes-F12-U072_Karnes-F12_S278_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-7-1_UA-NGsp-fastq_Karnes-B09-U021_Karnes-B09_S279_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-7-2_UA-NGsp-fastq_Karnes-B10-U022_Karnes-B10_S280_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-8-1_UA-NGsp-fastq_Karnes-B11-U023_Karnes-B11_S281_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-8-2_UA-NGsp-fastq_Karnes-B12-U024_Karnes-B12_S282_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-9-1_UA-NGsp-fastq_Karnes-C09-U033_Karnes-C09_S283_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/
cp Karnes-9-2_UA-NGsp-fastq_Karnes-C10-U034_Karnes-C10_S284_R1_001_trimmed_humann_temp/*_bugs_list.tsv bug_list/

merge_metaphlan_tables.py bug_list/*_bugs_list.tsv > merged_abundance_table.txt
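The twenty `cp` lines above all follow one pattern, so a glob over the per-sample `*_humann_temp` folders should do the same job. A sketch under that assumption; `collect_bug_lists` is a hypothetical name and the folder layout is taken from the commands above.

```shell
# Hypothetical helper: copy every *_bugs_list.tsv from the per-sample
# *_humann_temp folders under results_dir into results_dir/bug_list.
collect_bug_lists() {
    results_dir="$1"
    mkdir -p "$results_dir/bug_list"
    for f in "$results_dir"/*_humann_temp/*_bugs_list.tsv; do
        [ -e "$f" ] || continue           # skip the literal pattern if nothing matches
        cp "$f" "$results_dir/bug_list/"
    done
}
```

Run from inside the humann_results folder as `collect_bug_lists .`, then merge as above.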

https://github.com/biobakery/biobakery/wiki/metaphlan3 has some cool ideas on what to do with the metaphlan3 results.

07.08

Phenoscape/Imageomics post-doc possibility

From Jim Balhoff

Imageomics Institute (https://imageomics.osu.edu/). The Phenoscape team is involved in providing ontology expertise to Imageomics, in the form of using ontologies as a form of structured knowledge input to machine learning analyses, and also using ontologies as a knowledge representation tool for the outputs of machine learning analyses. Determining exactly how ontologies will be employed in Imageomics is in the formative stage.

Thoughts:

Not a totally new problem for sure; people have been doing similar things, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5556681/. Presumably it should be possible (or has already been done) for a multi-class Neural Network (or similar ML model), given sufficient training data, to classify taxonomic assignments (at some level of granularity) from images.

Assuming the above works we should be able to get a dataset of images of some known taxonomic rank, e.g. birds or a specific bird of interest like a red-breasted robin.

I remember from my ML courses that it should be possible when doing image analysis to define features; I remember the example of ears and eyes being used as features within images when doing the classic dog or blueberry muffin classifier. Assuming that's possible or tools for defining features already exist (e.g. something like https://www.keymakr.com/blog/image-annotation-tool-for-your-machine-learning-application/) we use them to manually annotate anatomical features (draw them onto a subset of our photo corpus), doing so with OBO ontology terms (namely UBERON) e.g. 'chin'(UBERON:0008199), or 'face'(UBERON:0001456). These "features" within ML models of marked up images of anatomical body parts with ontology annotations could constitute the "computable traits". If that works we could use the intersection of the ontology and the data (photos with these ontology term annotations) to get more out of the data. For example we could use the ontology marked up features within new ML models to identify said features in more photos. Perhaps start by doing this for some species, and maybe move up to larger but related taxonomic groupings, and see if it's producing reasonable results. If so we could do some analyses on traits across or within species by quantifying something like the lengths of 'tail feather'(UBERON:0018537) either as a quantified size (if that's possible to do from photos) or relativized against some other feature to be able to make comparisons.

Additionally, we might be able to use the intersection of the ontology/data to validate that the model we make is making "correct" assignments, analogously to using a reasoner to check an ontology for inconsistencies.

For example 'chin'(UBERON:0008199) has the following axioms:

subdivision of head
structure with developmental contribution from neural crest
part of some lower jaw region
part of some face

If we can somehow translate some of these axioms (like the part of relations) into a computable framework then we can use the ontology axioms as rules to see if the outputs make sense. I.e. the bounding box feature of the chin has to be within the bounding box feature of 'face'(UBERON:0001456).
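As a toy illustration of turning a part-of axiom into a check, the rule "chin part of some face" could become a bounding-box containment test on the model's outputs. This is only a sketch of the idea: `bbox_inside` is a hypothetical helper, and the boxes are assumed to be integer pixel coordinates given as x_min y_min x_max y_max.

```shell
# Hypothetical check: is box A (the part, e.g. chin) fully inside
# box B (the whole, e.g. face)? Arguments: A's x_min y_min x_max y_max,
# then B's x_min y_min x_max y_max, all integers.
bbox_inside() {
    ax1=$1; ay1=$2; ax2=$3; ay2=$4
    bx1=$5; by1=$6; bx2=$7; by2=$8
    [ "$ax1" -ge "$bx1" ] && [ "$ay1" -ge "$by1" ] &&
        [ "$ax2" -le "$bx2" ] && [ "$ay2" -le "$by2" ]
}
```

For example, `bbox_inside 40 80 60 95 10 10 100 100` succeeds, while a chin box that extends below the face box fails and would flag the prediction for review.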

07.18

Committee wants:

1) Include known and unknown fractions of the community. I should have this; Alise mentioned it. % known and unknown for taxa and function. Hopefully we have this.

  1. Add supplemental protocols for final publication (was planning on doing this anyway with protocols.io docs).

Nice to have: can try for a perspective paper, but it's not required for the dissertation.

07.20

Meeting with Jim/Hilmar

imageomics -> 5 year NSF institute NLP ML biologists trait evolution

Funding started last fall, things slowly picking up. Proposal has funding to continue some work on Phenoscape, bringing ontologies to the ML project.

phenoscape making ontologies useful

  1. knowledge guided ML. How can UBERON help ID structures in image data? Some work with fish and butterflies.

  2. using ontologies as a vocabulary, if they can play a role in ML? Outputs of ML tasks. What do the outputs look like, and how are they stored and marked up with ontologies?

Funding:

1st year full time post doc half time after

BGNN bio guided Neural Networks -> can use structured bio knowledge to improve ML using deep learning NN, Paula Mabee, Jane Greenberg (metadata researcher group director) (and one other PI) Virginia Tech. Harness structured bio knowledge. Knowledge of relatedness of taxa in taxonomy; try to use it to inform the NN.

Imageomics expands this to more bio images. Extracting traits from images or other bio knowledge. image -> deep learning NN -> info on traits -> use other bio knowledge to guide it. Recycle outputs of ML to improve knowledge and ML. Feed discovered traits or knowledge extracted into ?

using loss functions with ontologies?

Jane Greenberg metadata group director -> not officially part of the project so there's a gap in terms of taking a lead on metadata. What we want to know and express using ontologies, at least for Hilmar, what's that.

Meeting Mark S, Matt J

https://spoke.ucsf.edu/

OBOE SSN LPG

07.21

Meeting Kyle McKillop USDA (with Damion and Will too)

Food data central (FDC)

foundation food -> dataset 

methods  and food comp lab

Food data team (Kyle) making new lab: health of humans and environment. Merging with the human studies facility to do things like specific diets (people eating nuts or avocados, measuring form of food chews, metabolisable energy). Push to use ontologies in USDA.

drafter letter of support for CDNO 

me to help advise and bring in research questions 

USDA components maybe 20 years out of date 

issues with method and components 

Colin K and phytochemical matching 

portfolio pages data, ontology layer describes food -> want to get info to users 
lab analytic data label data. 

PTFI chemical compound data: no IDs for them in the incoming data.


read what is food data central 

USDA's FoodData Central: what is it and why is it needed today? https://academic.oup.com/ajcn/article/115/3/619/6459205

food and nutrition services, child nutrition service, ERS, what is sustainable nutritional intake for food stamps etc, ARS

lack of standardization across USDA, using things like NCBITaxon 

07.28

1) median first quartile or box plot by project 

2) make a first dummy protocol or the work and add to thesis version 

3) send out signatures page 

4) results of oral dissertation in gradpath (go check gradpath)

upload final dissertation through graduate college has margin checks maybe ask Chunan. 

Make some notes about how to graduate make it as a shared doc with all students 
-> capture stuff like 2 weeks before 

08.19

https://arxiv.org/abs/2207.02056 Ontology Development Kit: a toolkit for building, maintaining, and standardising biomedical ontologies preprint.
