Import and conversion scripts related to Preston data. The scripts are intended to provide examples on how to use Preston in combination with GUODA's cluster (e.g., HDFS, Apache Spark, Mesos).
Includes scripts to:
- HDFS Import: import a Preston archive into HDFS.
- Preston to DwC-A: convert Preston DwC-A archives into sequence files and Parquet files.
- Create Taxonomic Checklist: use converted Preston DwC-A archives to generate taxonomic checklists given specified taxon and geospatial constraints.
Please submit any issues you may have using https://github.com/bio-guoda/guoda-services/issues/ .
The Hadoop Distributed File System (HDFS) is a widely used distributed filesystem designed for parallel processing. Initially designed for Hadoop MapReduce, it is now also used with processing engines like Apache Spark.
preston2hdfs.sh is a script to help migrate a Preston instance into HDFS. This is a work in progress, so please read the script before you use it.
To use:
- Start a terminal via https://jupyter.idigbio.org
- Clone this repository
git clone https://github.com/bio-guoda/preston-scripts
cd preston-scripts
- Inspect ./preston2hdfs.sh and change settings when needed.
- By default, the preston2hdfs.sh script uses an example Preston instance, https://github.com/bio-guoda/preston-amazon , as the Preston remote and /user/[your username]/guoda/data/source=preston-amazon/ as the HDFS target.
- Run ./preston2hdfs.sh to migrate the Preston remote to the specified HDFS target.
- Inspect the HDFS target and the work directory preston2hdfs.tmp for results.
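As a rough sketch of the final inspection step, the snippet below builds the default HDFS target path so it can be passed to hdfs dfs -ls. The function preston_hdfs_target and its arguments are hypothetical helpers for illustration, not part of preston2hdfs.sh. The sketch assumes the default target layout described above.

```shell
# Hypothetical helper (not part of preston2hdfs.sh): print the default HDFS
# target for a given user name and Preston source name.
preston_hdfs_target() {
  local user="$1"
  local source_name="${2:-preston-amazon}"   # default matches the example instance
  printf '/user/%s/guoda/data/source=%s/\n' "$user" "$source_name"
}

# Example usage on a machine with the hdfs client installed:
#   hdfs dfs -ls "$(preston_hdfs_target "$USER")"
```

If you changed the target in preston2hdfs.sh, pass that path to hdfs dfs -ls directly instead.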
Now that the Preston data has been moved into HDFS, we can use idigbio-spark to convert DwC-A files in the Preston data to formats like Parquet and sequence files. This can be done using an interactive Spark shell (spark-shell or pyspark), or by using spark-submit.
- Repeat steps 0-2 of the previous recipe (start a terminal, clone this repository, and inspect the script settings)
- Type
hdfs dfs -ls /user/[your username]/guoda/data/source=preston-amazon/
- Confirm that the data and prov folders exist and have sub-directories.
- Inspect ./dwca2parquet.sh
- Run ./dwca2parquet.sh with appropriate settings. By default it uses /user/[your username]/guoda/data/source=preston-amazon/data as its input and /user/[your username]/guoda/data/source=preston-amazon/dwca as its output.
- Once the job is done, inspect the HDFS output directory at /user/[your username]/guoda/data/source=preston-amazon/dwca for results.
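The check that the data and prov folders exist can also be scripted. The following is a minimal sketch, assuming the hdfs client is on the PATH. check_preston_layout is a hypothetical helper name and is not part of this repository. It only tests that both folders exist. It does not verify that they contain sub-directories.

```shell
# Hypothetical helper: succeed only if both the data and prov folders exist
# under the given HDFS target. Uses `hdfs dfs -test -d`, which exits with
# status 0 when the path is a directory.
check_preston_layout() {
  local target="${1%/}"   # strip one trailing slash, if present
  local dir
  for dir in data prov; do
    if ! hdfs dfs -test -d "$target/$dir"; then
      echo "missing: $target/$dir" >&2
      return 1
    fi
  done
  echo "ok: $target"
}

# Example usage:
#   check_preston_layout "/user/$USER/guoda/data/source=preston-amazon/"
```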
Similar to the previous recipe, but instead of using the spark-job-submit.sh script, do the following:
- start a Jupyter terminal at https://jupyter.idigbio.org
- download https://github.com/bio-guoda/idigbio-spark/releases/download/0.0.1/iDigBio-LD-assembly-1.5.9.jar
- start a spark-shell using
spark-shell --conf spark.sql.caseSensitive=true --jars iDigBio-LD-assembly-1.5.9.jar
- now, run the following in the spark-shell
import bio.guoda.preston.spark.PrestonUtil
implicit val sparky = spark
PrestonUtil.main(Array("hdfs:///guoda/data/source=preston-amazon/data", "hdfs:///guoda/data/source=preston-amazon/dwca"))
- after the job is done, confirm that
val data = spark.read.parquet("/guoda/data/source=preston-amazon/dwca/core.parquet") // replace with suitable target directory
data.count
returns a non-zero count, after replacing the HDFS paths with your desired input and output paths.
Note that you can also run the spark-shell locally on your machine and point the paths at a local file system using file:/// URLs.
Also note that a similar approach can be taken using pyspark (Python) and a spark-shell that runs the executors in the cluster. See the Apache Spark documentation for more information.
Taxonomic checklists can be generated after converting Preston DwC-A to Parquet files.
To generate a taxonomic checklist:
- inspect ./create-checklist.sh
- run ./create-checklist.sh in a jupyter.idigbio.org terminal using appropriate parameters. By default, a checklist for birds and frogs in an area covering the Amazon rainforest is created.
- inspect the results in hdfs:///user/[your user name]/guoda/checklist and hdfs:///user/[your user name]/guoda/checklist-summary, or the non-default location that you used to calculate the checklist, using:
$ hdfs dfs -ls /user/[your user name]/guoda/checklist
$ hdfs dfs -ls /user/[your user name]/guoda/checklist-summary
- to use checklists in spark, start a spark-shell (or pyspark) and run commands like:
$ spark-shell
...
scala> val checklists = spark.read.parquet("/user/[your user]/guoda/checklist")
...
scala> checklists.show(10) // to show first 10 items in checklist
...
Use path /user/[your user]/guoda/checklist-summary to discover summaries of generated checklists.
- to export checklists to csv files, use:
$ spark-shell
scala> val checklists = spark.read.parquet("/user/[your user]/guoda/checklist")
...
scala> checklists.write.csv("/user/[your user]/my-checklist.csv")
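Spark writes CSV output as a directory of part files rather than a single file, so the path given to write.csv ends up as a directory in HDFS. As a minimal sketch, assuming the hdfs client is available, the part files can be merged into one local file with hdfs dfs -getmerge. The helper name fetch_checklist_csv and the default local file name are hypothetical.

```shell
# Hypothetical helper: merge the part files of a Spark CSV output directory
# in HDFS into a single local file using `hdfs dfs -getmerge`.
fetch_checklist_csv() {
  local hdfs_dir="$1"
  local local_file="${2:-checklist.csv}"   # assumed default output name
  hdfs dfs -getmerge "$hdfs_dir" "$local_file" || return 1
  echo "wrote $local_file"
}

# Example usage:
#   fetch_checklist_csv "/user/$USER/my-checklist.csv" my-checklist.csv
```

Plain concatenation is safe here because write.csv does not emit a header row unless the header option is set.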
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.