Automated and parallelized pipeline for segmentation, dimensionality reduction, and clustering of animal vocalizations.
This pipeline uses BigDataScript (BDS) to wrap the Animal Vocalization Generative Network (AVGN) segmentation, Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction, and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering workflow from the AVGN example notebooks. This workflow is used extensively throughout this awesome preprint from Tim Sainburg of the Gentner Lab.
This pipeline is a wrapper for awesome code other people wrote and which I had no part in developing. Check them out:
- Should work on any Linux or Mac capable of running Java. I have it running locally on Ubuntu 19.10 (Eoan) and Red Hat Enterprise Linux Server 7.2 (Maipo) on the cluster.
- Java - required for BDS
- R with the tidyverse - plotting and analysis
Super easy, it self deploys. Takes a few minutes:
git clone https://github.com/mattisabrat/yoUMAP_vocalizations.git
cd yoUMAP_vocalizations
./Install.sh
Maybe not SUPER easy, but still pretty easy as far as cluster implementations go.
- Follow the directions for local installation
- Edit the BDS config file found at yoUMAP_vocalizations/.bin/bds/bds.config according to the BDS docs on Cluster Options that apply to your cluster environment
- Generic cluster scripts for some common environments can be found on the BDS GitHub
- I included the scripts I use to interface with SLURM in yoUMAP_vocalizations/.bin/bds_clusterGeneric_SLURM/
./yoUMAP_vocalizations.sh -e /Path/to/Experimental/Directory/ -n Threads_per_Task -c Path/to/Config/File
- -n: Defaults to 1
- -c: Defaults to yoUMAP_vocalizations/Defaults.config
Your Experimental_Directory/ must be correctly formatted for the pipeline to run. It must contain a subdirectory, Experimental_Directory/Raw_Inputs/. Raw_Inputs/ should contain one subdirectory per sample, let's call each one a sample_folder/, named after the sample. The pipeline uses the sample_folder/ name as the sample's name in the output. Each sample_folder/ should contain all the .wav files associated with that sample. The sample_folder/ name CANNOT include "wav". It will break the code, so don't do it. I don't feel like rewriting that step without regular expressions.
- Experimental_Directory/
  - Raw_Inputs/
    - Bird_1/
      - 1.wav
      - 2.wav
      - ...
    - .../
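The layout rules above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: `check_layout` is a hypothetical helper name, and it only tests the three rules stated here (Raw_Inputs/ exists, sample folder names do not contain "wav", each sample folder holds at least one .wav file).

```shell
#!/usr/bin/env bash
# check_layout EXP_DIR
# Hypothetical helper (not shipped with the pipeline). Prints one line per
# layout problem and returns non-zero if any problem was found.
check_layout() {
    local exp_dir="${1%/}"
    local raw="$exp_dir/Raw_Inputs"
    local status=0
    local n_samples=0
    local sample name found

    if [ ! -d "$raw" ]; then
        echo "missing directory: $raw"
        return 1
    fi

    for sample in "$raw"/*/; do
        # If there are no sample folders the glob stays literal, so skip it.
        [ -d "$sample" ] || continue
        n_samples=$((n_samples + 1))
        name="$(basename "$sample")"

        # The sample folder name must not contain "wav" (lowercase check only).
        case "$name" in
            *wav*)
                echo "bad sample name (contains \"wav\"): $name"
                status=1
                ;;
        esac

        # Each sample folder should hold at least one .wav file.
        found="$(find "$sample" -maxdepth 1 -type f -name '*.wav' | head -n 1)"
        if [ -z "$found" ]; then
            echo "no .wav files in sample: $name"
            status=1
        fi
    done

    if [ "$n_samples" -eq 0 ]; then
        echo "no sample folders in: $raw"
        status=1
    fi

    return "$status"
}
```

Source the file and call `check_layout /Path/to/Experimental/Directory/`; no output and a zero exit status mean the layout matches the description above.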
The final output of the pipeline can be found at /Experimental_Directory/yoUMAPped_syllables.rds.
- Experimental_Directory/
  - Raw_Inputs/
  - Segmented_Songs/
    - Bird_1/
      - song_chk.txt
      - Bird_1/
        - wavs/
          - 1000-01-01_00-00-00-000000.wav
          - ...
        - specs/
          - 1000-01-01_00-00-00-000000.png
          - ...
        - csv/
          - 1000-01-01_00-00-00-000000.csv
          - ...
    - .../
  - Segmented_Syllables/
    - Bird_1/
      - Bird_1_segmented_syllables.hdf5
    - .../
  - Clustered_Syllables/
    - Bird_1/
      - Bird_1_clustered_syllables.csv
    - .../
  - yoUMAPped_syllables.rds
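After a run, the output tree above gives a quick way to spot samples that did not make it through. This is a rough sketch rather than anything the pipeline provides: `missing_outputs` is a hypothetical helper, and it assumes each sample's clustered CSV sits at Clustered_Syllables/<sample>/<sample>_clustered_syllables.csv, as in the listing.

```shell
#!/usr/bin/env bash
# missing_outputs EXP_DIR
# Hypothetical helper (not shipped with the pipeline). Lists every sample in
# Raw_Inputs/ without a clustered CSV, and reports a missing final .rds file.
missing_outputs() {
    local exp_dir="${1%/}"
    local sample name csv

    for sample in "$exp_dir"/Raw_Inputs/*/; do
        # If there are no sample folders the glob stays literal, so skip it.
        [ -d "$sample" ] || continue
        name="$(basename "$sample")"

        # Assumed location of the per-sample clustering output.
        csv="$exp_dir/Clustered_Syllables/$name/${name}_clustered_syllables.csv"
        if [ ! -f "$csv" ]; then
            echo "missing clustered CSV: $name"
        fi
    done

    # The combined output sits at the top of the experiment directory.
    if [ ! -f "$exp_dir/yoUMAPped_syllables.rds" ]; then
        echo "missing final output: $exp_dir/yoUMAPped_syllables.rds"
    fi

    return 0
}
```

An empty result from `missing_outputs /Path/to/Experimental/Directory/` means every sample has a clustered CSV and the final .rds file exists.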
The output data can be loaded into R using:
syll_tbls <- readRDS('Path/To/Experimental_Directory/yoUMAPped_syllables.rds')
This produces a named list of tibbles with the following column names:
names(syll_tbls[['Bird_1']])
[1] "spectrograms" "syll_length_s" "start_time_rel_wav" "animal" "labels" "sequence_syllable" "sequence_num"
[8] "z1" "z2" "seg_song_wav" "orig_wav"
- spectrograms : Spectrogram matrix stored as a factor. Can be converted back to a matrix, see Functions
- syll_length_s : Length of syllable in seconds
- start_time_rel_wav : Start time of syllable within seg_song_wav
- animal : Animal name, same as list element name, inherited from the sample_folder/ name
- labels : HDBSCAN assigned cluster label, -1 is unassigned
- sequence_syllable : Syllable's ordinal position within its sequence
- sequence_num : Sequence identifier
- z1 : Component 1 of syllable's representation in low dimensional space
- z2 : Component 2 of syllable's representation in low dimensional space
- seg_song_wav : Location of the segmented song containing the syllable
- orig_wav : Location of raw input containing the syllable
To load the supplied R functions, run:
source('/Path/to/yoUMAP_vocalizations/.bin/r_functions.R')
image_spectrogram(spectrogram_as_factor,show=TRUE)
- Returns the input spectrogram as a matrix
- Displays the spectrogram if show=TRUE
sample_cluster(syll_tbl, cluster_label, n, r_seed=42, show=TRUE)
- Returns n randomly sampled spectrograms from cluster_label in syll_tbl as matrices
- Displays the spectrograms if show=TRUE
- Set the random seed with r_seed
scatter_clusters(syll_tbl, show=TRUE, size=0.5, alpha=1, filter_unlabled=TRUE)
- Scatter plot of syllables in low dimensional space, colored by syll_tbl$labels; example plot below in Tests
line_seqs(syll_tbl, show=TRUE, alpha=0.05)
- Line plot of syllable sequences in low dimensional space; example plot below in Tests
I intentionally made the output a list of tibbles to simplify data processing across all samples using lapply in combination with the tidyverse. For example:
lapply(X=syll_tbls, FUN=scatter_clusters)
returns the scatter plots for all samples in the list. Similarly:
library('dplyr')
syll_tbls <- lapply(X=syll_tbls, FUN=function(syll_tbl){
  # some_function is a placeholder for your own function of z1 and z2
  new_syll_tbl <- syll_tbl %>%
    mutate(new_col = some_function(z1, z2))
  return(new_syll_tbl)
})
This adds a column, "new_col", to all tibbles in syll_tbls. This "new_col" is some_function of the syllable's position in low dimensional space. The same pattern could be used to map additional experimental variables such as "days_post_lesion" or "optogenetic_state" using the orig_wav column.
The pipeline comes with a one-animal test dataset taken from [Katahira K, Suzuki K, Kagawa H, Okanoya K (2013) A simple explanation for the evolution of complex song syntax in Bengalese finches. Biology Letters 9(6): 20130842.](https://doi.org/10.1098/rsbl.2013.0842) The data are available at https://datadryad.org//resource/doi:10.5061/dryad.6pt8g. The unmodified Defaults.config is appropriate for processing these data.
Assuming you installed in your home (~) directory, run the following:
cd yoUMAP_vocalizations/
./yoUMAP_vocalizations.sh -e test_dir/
Now open R in RStudio or a Jupyter notebook and run:
source('~/yoUMAP_vocalizations/.bin/r_functions.R')
syll_tbls <- readRDS('~/yoUMAP_vocalizations/test_dir/yoUMAPped_syllables.rds')
scatter_clusters(syll_tbls[['Bird0']])
line_seqs(syll_tbls[['Bird0']])
This should produce the following plots. If it doesn't, you have an issue:
The pipeline produces a lot of console vomit, sorry. It's on the to-do list.
- Animal level clustering?
- Tame the console vomit.
- Turn the R functions into a proper R package. Does that have to be its own repo?
- Figure out exactly how to assist users in tuning the segmentation parameters. Probably a jupyter notebook capable of writing out a config file.
This pipeline was written and is maintained by Matt Daveport.