A sample biodiversity dataset graph created with Preston.
This repository provides a small subset of the biodiversity datasets registered with GBIF. The purpose is to show how to (1) track, (2) pre-process, and (3) analyze datasets with Preston and Apache Spark.
The data directory contains the original datasets and provenance information created with the help of Preston using the following recipe:
- Download a preston jar from https://github.com/bio-guoda/preston/releases/download/0.0.10/preston.jar .
- In a terminal, run Preston to track a graph of biodiversity occurrence datasets related to the Amazon using GBIF's API:
$ java -jar preston.jar track "http://api.gbif.org/v1/dataset/suggest?q=Amazon&type=OCCURRENCE"
<https://preston.guoda.org> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> .
<https://preston.guoda.org> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> .
<https://preston.guoda.org> <http://purl.org/dc/terms/description> "Preston is a software program that finds, archives and provides access to biodiversity datasets."@en .
<83e11589-3579-47d8-ad5b-126e640112cd> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> .
<83e11589-3579-47d8-ad5b-126e640112cd> <http://purl.org/dc/terms/description> "A crawl event that discovers biodiversity archives."@en .
<83e11589-3579-47d8-ad5b-126e640112cd> <http://www.w3.org/ns/prov#startedAtTime> "2018-09-05T09:42:39.614Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
...
- You have now tracked and locally archived some occurrence datasets related to the Amazon using the GBIF dataset registry.
- Locate a Darwin Core Archive (dwca) in the biodiversity dataset graph using:
$ java -jar preston.jar ls -l tsv | grep "application/dwca" | cut -f1 | tail -n1
http://plazi.cs.umb.edu/GgServer/dwca/341C9C4FFFDDFFEF8D1DFFCBCB25FF90.zip
- Now, explore the version of a single archive using:
$ java -jar preston.jar ls | grep "http://plazi.cs.umb.edu/GgServer/dwca/341C9C4FFFDDFFEF8D1DFFCBCB25FF90.zip"
<663199f1-3528-4289-8069-d27552f62f10> <http://www.w3.org/ns/prov#hadMember> <http://plazi.cs.umb.edu/GgServer/dwca/341C9C4FFFDDFFEF8D1DFFCBCB25FF90.zip> .
<http://plazi.cs.umb.edu/GgServer/dwca/341C9C4FFFDDFFEF8D1DFFCBCB25FF90.zip> <http://purl.org/dc/elements/1.1/format> "application/dwca" .
<http://plazi.cs.umb.edu/GgServer/dwca/341C9C4FFFDDFFEF8D1DFFCBCB25FF90.zip> <http://purl.org/pav/hasVersion> <hash://sha256/e96d41772596daee7ebf7dd73239e236ae03c81d5ac39f8df4f911fc08776e98> .
- Get the content-addressed file and list its contents (a sha256 integrity check is sketched right after this recipe) using:
$ java -jar preston.jar get hash://sha256/e96d41772596daee7ebf7dd73239e236ae03c81d5ac39f8df4f911fc08776e98 > dwca.zip
$ unzip -l dwca.zip
Archive: dwca.zip
Length Date Time Name
--------- ---------- ----- ----
11694 2016-01-03 13:36 meta.xml
5085 2016-01-03 13:36 eml.xml
3720 2017-06-20 02:41 taxa.txt
284 2017-06-20 02:41 occurrences.txt
19069 2017-06-20 02:41 description.txt
54 2017-06-20 02:41 distribution.txt
53610 2017-06-20 02:41 media.txt
10738 2017-06-20 02:41 references.txt
33 2017-06-20 02:41 vernaculars.txt
--------- -------
104287 9 files
- For more information, see https://github.com/bio-guoda/preston .
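Because Preston addresses content by its sha256 hash, the retrieved archive can be verified against the identifier it was requested with. A minimal Python sketch (assuming the dwca.zip produced above):

import hashlib

# Recompute the sha256 of the retrieved archive; the hex digest should match the
# hash://sha256/... identifier passed to `preston get` above.
with open("dwca.zip", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())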
Now that we have a collection of datasets of known provenance, we use Apache Spark together with idigbio-spark, a Spark library, to transform the data for further analysis. Our example collection contains Darwin Core Archives, which are typically distributed as zip files. To prepare these Darwin Core Archives for analysis, we take the following pre-processing steps (a small Python sketch of the first step follows the links below):
- unzip the Darwin Core Archives and bzip2 the entries
- collect the data schemas from all meta.xml files into meta.xml.seq
- use the data schemas to convert the core table of each dataset into an individual core.parquet file
- combine all the individual core.parquet files into a single, aggregated core.parquet
Note that Apache Parquet and Hadoop Sequence Files are file formats optimized for distributed/parallel processing.
See https://github.com/bio-guoda/preston-amazon and https://github.com/bio-guoda/preston-scripts for examples.
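As a minimal, hedged illustration of the first pre-processing step (not the idigbio-spark implementation), the Python sketch below unzips a single Darwin Core Archive and re-compresses each entry with bzip2; the paths are illustrative:

import bz2
import zipfile
from pathlib import Path

def unzip_and_bzip2(dwca_zip, out_dir):
    # Extract each entry of a Darwin Core Archive and re-compress it with bzip2.
    out = Path(out_dir)
    with zipfile.ZipFile(dwca_zip) as zf:
        for name in zf.namelist():
            if name.endswith('/'):  # skip directory entries
                continue
            target = out / (name + '.bz2')
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(bz2.compress(zf.read(name)))

unzip_and_bzip2('dwca.zip', 'dwca-unpacked')  # e.g., the archive retrieved above

The remaining steps (building meta.xml.seq and the parquet files) are handled by idigbio-spark.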
After pre-processing the data into formats suitable for scalable analysis, we can easily discover metadata in the eml.xml and meta.xml files. Also, the data itself can be queried much like a database using Spark-supported languages such as Python, R, Java, and Scala. For the examples below, we'll use pyspark, an interactive Spark Python shell. For information on how to install and start pyspark, see https://spark.apache.org/docs/latest/#running-the-examples-and-shell .
>>> datasets = spark.read.parquet("file:///some/path/preston-amazon/data-processed/core.parquet") # load aggregate data
>>> datasets.count() # count all rows
3858
>>> datasets.columns # print columns
['http://rs.tdwg.org/dwc/terms/datasetID', 'http://rs.tdwg.org/dwc/terms/specificEpithet', 'http://rs.tdwg.org/dwc/terms/order', 'http://rs.tdwg.org/dwc/terms/taxonID', 'http://rs.tdwg.org/dwc/terms/country', 'http://plazi.org/terms/1.0/basionymYear', 'http://gbif.org/dwc/terms/1.0/canonicalName', 'undefined0', 'http://rs.tdwg.org/dwc/terms/basisOfRecord', 'http://plazi.org/terms/1.0/combinationYear', 'http://plazi.org/terms/1.0/basionymAuthors', 'http://rs.tdwg.org/dwc/terms/scientificName', 'http://rs.tdwg.org/dwc/terms/decimalLatitude', 'http://rs.tdwg.org/dwc/terms/eventDate', 'http://rs.tdwg.org/dwc/terms/waterBody', 'http://rs.tdwg.org/dwc/terms/acceptedNameUsageID', 'http://rs.tdwg.org/dwc/terms/locationID', 'http://rs.tdwg.org/dwc/terms/taxonRank', 'http://rs.tdwg.org/dwc/terms/institutionCode', 'http://rs.tdwg.org/dwc/terms/phylum', 'http://purl.org/dc/terms/references', 'http://rs.tdwg.org/dwc/terms/originalNameUsageID', 'http://rs.tdwg.org/dwc/terms/individualCount', 'http://rs.tdwg.org/dwc/terms/kingdom', 'http://rs.tdwg.org/dwc/terms/year', 'http://rs.tdwg.org/dwc/terms/eventID', 'http://rs.tdwg.org/dwc/terms/identificationQualifier', 'http://rs.tdwg.org/dwc/terms/namePublishedIn', 'http://rs.tdwg.org/dwc/terms/scientificNameAuthorship', 'http://rs.tdwg.org/dwc/terms/taxonomicStatus', 'http://rs.tdwg.org/dwc/terms/decimalLongitude', 'http://rs.tdwg.org/dwc/terms/locality', 'http://rs.tdwg.org/dwc/terms/parentNameUsageID', 'http://rs.tdwg.org/dwc/terms/catalogNumber', 'http://rs.tdwg.org/dwc/terms/collectionCode', 'http://plazi.org/terms/1.0/combinationAuthors', 'http://purl.org/dc/terms/license', 'http://purl.org/dc/terms/bibliographicCitation', 'http://purl.org/dc/terms/accessRights', 'http://rs.tdwg.org/dwc/terms/family', 'http://rs.tdwg.org/dwc/terms/dynamicProperties', 'http://plazi.org/terms/1.0/verbatimScientificName', 'http://rs.tdwg.org/dwc/terms/eventRemarks', 'http://rs.tdwg.org/dwc/terms/class', 'http://rs.tdwg.org/dwc/terms/occurrenceID', 'http://rs.tdwg.org/dwc/terms/nomenclaturalStatus', 'http://rs.tdwg.org/dwc/terms/genus', 'http://purl.org/dc/terms/rightsHolder', 'http://www.w3.org/ns/prov#wasDerivedFrom']
>>> datasets.select('`http://rs.tdwg.org/dwc/terms/scientificName`').show(10) # show top 10 scientific names
+-------------------------------------------+
|http://rs.tdwg.org/dwc/terms/scientificName|
+-------------------------------------------+
| Acestrorhynchus g...|
| Charax sp. “Madeira”|
| Ageneiosus inermi...|
| Cyphocharax spilu...|
| Moenkhausia dichr...|
| Ctenobrycon spilu...|
| Hemiodus microlep...|
| Loricariichthys sp.|
| Trachydoras parag...|
| Hoplias malabaric...|
+-------------------------------------------+
only showing top 10 rows
>>> datasets.select('`http://rs.tdwg.org/dwc/terms/scientificName`').count()
3858
>>> datasets.select('`http://rs.tdwg.org/dwc/terms/scientificName`').distinct().count()
708
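As an additional, hedged sketch (not part of the original session, so no output is shown), the aggregated table can also be queried with aggregate expressions, for example to count how often each scientific name was recorded:

>>> from pyspark.sql.functions import col, desc
>>> sci_name = col('`http://rs.tdwg.org/dwc/terms/scientificName`')  # backticks quote the URI-style column name
>>> datasets.groupBy(sci_name).count().orderBy(desc('count')).show(5, truncate=False)  # five most frequently recorded names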
Preston data was pre-processed using a dedicated Spark library, idigbio-spark. This library helps create sequence files of the meta.xml and eml.xml files in a Preston archive. The advantage of sequence files is that you can quickly iterate over many items: large Preston archives consist of hundreds of thousands or millions of small metadata files, and while accessing these files individually can be done through Spark, it is very slow due to IO and memory overhead. The data-processed/meta.xml.seq file contains tuples/pairs of the originating data archive's content hash and the content of its meta.xml file. To access this data:
>>> metas = sc.sequenceFile("file:///some/path/preston-amazon/data-processed/meta.xml.seq") # create the meta.xml RDD with (hash, xml string) tuples
>>> metas.map( lambda pair: pair[0] ).distinct().count() # count number of unique archive content hashes
37
>>> print(metas.map( lambda pair: pair[1] ).first()) # print the first meta.xml file
<archive metadata="eml.xml" xmlns="http://rs.tdwg.org/dwc/text/">
<core rowType="http://rs.tdwg.org/dwc/terms/Taxon" ignoreHeaderLines="1" fieldsEnclosedBy="" linesTerminatedBy="\n" fieldsTerminatedBy="\t" encoding="UTF-8">
<files>
<location>taxa.txt</location>
</files>
<id index="0"/>
<field term="http://rs.tdwg.org/dwc/terms/taxonID" index="0"/>
<field term="http://rs.tdwg.org/dwc/terms/namePublishedIn" index="1"/>
<field term="http://rs.tdwg.org/dwc/terms/acceptedNameUsageID" index="2"/>
<field term="http://rs.tdwg.org/dwc/terms/parentNameUsageID" index="3"/>
<field term="http://rs.tdwg.org/dwc/terms/originalNameUsageID" index="4"/>
<field term="http://rs.tdwg.org/dwc/terms/kingdom" index="5"/>
<field term="http://rs.tdwg.org/dwc/terms/phylum" index="6"/>
<field term="http://rs.tdwg.org/dwc/terms/class" index="7"/>
<field term="http://rs.tdwg.org/dwc/terms/order" index="8"/>
<field term="http://rs.tdwg.org/dwc/terms/family" index="9"/>
<field term="http://rs.tdwg.org/dwc/terms/genus" index="10"/>
<field term="http://rs.tdwg.org/dwc/terms/taxonRank" index="11"/>
<field term="http://rs.tdwg.org/dwc/terms/scientificName" index="12"/>
<field term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship" index="13"/>
<field term="http://gbif.org/dwc/terms/1.0/canonicalName" index="14"/>
<field term="http://plazi.org/terms/1.0/verbatimScientificName" index="15"/>
<field term="http://plazi.org/terms/1.0/basionymAuthors" index="16"/>
<field term="http://plazi.org/terms/1.0/basionymYear" index="17"/>
<field term="http://plazi.org/terms/1.0/combinationAuthors" index="18"/>
<field term="http://plazi.org/terms/1.0/combinationYear" index="19"/>
<field term="http://rs.tdwg.org/dwc/terms/taxonomicStatus" index="20"/>
<field term="http://rs.tdwg.org/dwc/terms/nomenclaturalStatus" index="21"/>
<field term="http://purl.org/dc/terms/references" index="22"/>
</core>
<snip/> <!-- removed a bunch of stuff here -->
</archive>
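To illustrate how such a data schema can be used (a hedged sketch based on Python's standard library, not idigbio-spark's own conversion code), the snippet below extracts the ordered Darwin Core field terms from each meta.xml in the sequence file:

>>> import xml.etree.ElementTree as ET
>>> DWC_NS = '{http://rs.tdwg.org/dwc/text/}'
>>> def core_field_terms(meta_xml):
...     # parse a meta.xml string and return the core table's field terms, ordered by column index
...     core = ET.fromstring(meta_xml).find(DWC_NS + 'core')
...     fields = sorted(core.findall(DWC_NS + 'field'), key=lambda f: int(f.get('index')))
...     return [f.get('term') for f in fields]
...
>>> metas.map(lambda pair: core_field_terms(pair[1])).first()[:3]  # e.g., taxonID, namePublishedIn, acceptedNameUsageID for the schema above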
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.