yoUMAP_vocalizations

Automated and parallelized pipeline for segmentation, dimesionality reduction, and clustering of animal vocalizations.

What does it do?

This pipeline uses BigDataScript (BDS) to wrap the Animal Vocalization Generative Network (AVGN) segmentation, Uniform Manifold Approximation and Projection (UMAP) dimensionality redction, and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering workflow from the AVGN example notebooks. This workflow is used extensively throughout this awesome preprint from Tim Sainburg of the Gentner Lab.

Credit Where Credit is Due

This pipeline is a wrapper for awesome code other people wrote and which I had no part in developing. Check them out:

System Prerequisites

Should work on any Linux or Mac capable of running Java. I have it running locally on Ubuntu 19.10 (Eoan) and Red Hat Enterprise Linux Server 7.2 (Maipo) on the cluster.
Java - required for BDS
R with the tidyverse- plotting and analysis

Installation - Local Machine

Super easy, it self deploys. Takes a few minutes:

  git clone https://github.com/mattisabrat/yoUMAP_vocalizations.git
  cd yoUMAP_vocalizations
  ./Install.sh

Installation - Cluster Environment

Maybe not SUPER easy, but still pretty easy as far as cluster implementations go.

Follow the directions for local installation
Edit the bds config file found at yoUMAP_vocalizations/.bin/bds/bds.config according the appropriate BDS docs on Cluster Options for your cluster environment
- Generic cluster scripts for some common environments can be found on the BDS github
- I included the scripts I use to interface with SLURM in yoUMAP_vocalizations/.bin/bds_clusterGeneric_SLURM/

Usage

  ./yoUMAP_vocalizations -e /Path/to/Experimental/Directory/ -n Threads_per_Task -c Path/to/Config/File

-n: Defaults to 1
-c: Defaults to yoUMAP_vocalizations/Defaults.config

Input Structure / Experimental Directory Formatting

Your Experimental_Directory/ must be correctly formatted for the pipeline to run. The Experimental_Directory/ must contain a sudirectory Experimental_Directory/Raw_Inputs/. Raw_Inputs/ should contain a subdirectory for each sample, lets call them sample_folder/ s, with the sample's name as the sample_folder/ name. This sample_folder/ name will be taken by the pipeline as the sample's name in the output. Each sample_folder should contain all the .wav files associated with that sample. The sample_folder/ name CANNOT include "wav". It will break the code, so don't do it. I don't feel rewriting that step without regular expressions.

Example

Experimental_Directory/
- Raw_Inputs/
  - Bird_1/
    - 1.wav
    - 2.wav
    - ...
  - .../

Output Structure

The final output of the pipeline can be found at /Experimental_Directory/yoUMAPped_syllables.rds.

Example

Experimental_Directory/
- Raw_Inputs/
- Segmented_Songs/
  - Bird_1/
    - song_chk.txt
    - Bird_1/
      - wavs/
        
        1000-01-01_00-00-00-000000.wav
        
        ...
      - specs/
        
        1000-01-01_00-00-00-000000.png
        
        ...
      - csv/
        
        1000-01-01_00-00-00-000000.csv
        
        ...
  - .../
- Segmented_Syllables/
  - Bird_1/
    - Bird_1_segmented_syllables.hdf5
  - .../
- Clustered_Syllables/
  - Bird_1/
    - Bird_1_clustered_syllables.csv
  - .../
- yoUMAPped_syllables.rds

Getting Up and Running in R

The output data can be loaded into R using:

  syll_tbls <- readRDS('Path/To/Experimental_Directory/yoUMAPped_syllables.rds')

This produces a named list of tibbles with the following column names:

 names(syll_tbls[['Bird_1']])

 [1] "spectrograms"       "syll_length_s"      "start_time_rel_wav" "animal"             "labels"             "sequence_syllable"  "sequence_num"      
 [8] "z1"                 "z2"                 "seg_song_wav"       "orig_wav"

spectrograms : Matrix of the spectrogram as a Factor. Can be converted to matrix, see Functions
syll_length_s : Length of syllable in seconds
start_time_rel_wav : Start time of syllable within seg_song_wav
animal : Animal name, same as list element name, inherited from the sample_folder/ name
labels : HDBSCAN assigned cluster label, -1 is unassigned
sequence_syllable : Syllable's ordinal position within it's sequence
sequence_number : Sequence identifier
z1 : Component 1 of syllable's representation in low dimensional space
z2 : Component 2 of syllabel's representation in low dimensional space
seg_song_wav : Location of the segmented song containing the syllable
orig_wav : Location of raw input containing the syllable

Functions

To load the supplied R functions, run:

  source('/Path/to/yoUMAP_vocalizations/.bin/r_functions.R')

Descriptions

  image_spectrogram(spectrogram_as_factor,show=TRUE)

Returns the input spectrogram as a matrix

Displays the spectrogram if show=TRUE

sample_cluster(syll_tbl, cluster_label, n, r_seed=42, show=TRUE)

Returns n randomly sampled spectrograms from cluster_label in syll_tbl as matrices
Displays the spectrograms if show=TRUE

Set the random seed with r_seed

scatter_clusters(syll_tbl, show=TRUE, size=0.5, alpha=1, filter_unlabled=TRUE)

Scatter plot of syllables in low dimensional space, colored by syll_tbl$labels, example plot below in Tests
```
line_seqs <- function(syll_tbl, show=TRUE, alpha=0.05)
```
Line plot of syllable sequences in low dimensional space, example plot below in Tests

Data processing using lapply

I intentionally made the output a list of tibble to simplify data processing across all samples using lapply in combination with the tidyverse. For example:

  lapply(X=syll_tlbs, FUN=scatter_clusters)

returns the scatter plots for all samples in the list. Similarly:

  library('dplyr')
  syll_tbls <- lapply(X=syll_tbls, FUN=function(syll_tbl){
      new_syll_tbl <- syll_tbl %>%
          mutute(new_col = some_function(z1,z2))
      return(new_syll_tbl)
  })

Adds a column, "new_col" to all tibbles in syll_tlbs. This "new_col" is some_function of the syllable's position in low dimenstional space. This could also be used to map additional experimental variables such as "days_post_lesion" or "optogenetic_state" using the orig_wav column.

Tests

The pipeline comes with a one animal test dataset taken from [Katahira K, Suzuki K, Kagawa H, Okanoya K (2013) A simple explanation for the evolution of complex song syntax in Bengalese finches. Biology Letters 9(6): 20130842.](https://doi.org/10.1098/rsbl.2013.0842 https://datadryad.org//resource/doi:10.5061/dryad.6pt8g) The unmodified Defaults.config is appropriate for the processing of these data.

Assuming you installed in your home (~) directory run the following:

  cd yoUMAP_vocalizations/
  ./yoUMAP_vocalizations.sh -e test_dir/

Now open R in RStudio or a Jupyter notebook and run:

  source('~/yoUMAP_vocalizations/.bin/r_functions.R')
  syll_tbls <- readRDS('~/yoUMAP_vocalizations/test_dir/yoUMAPped_Syllables.rds')
  scatter_clusters(syll_tbls[['Bird0']])
  line_seqs(syll_tbls[['Bird0']])

This should produce the following plots, if it doesn't you have an issue:

Console output

The pipeline produces a lot of console vomit, sorry. Its a to do.

To Do

Animal level clustering?
Tame the console vomit.
Turn the R functions into a proper R package. Does that have to be its own repo?
Figure out exactly how to assist users in tuning the segmentation parameters. Probably a jupyter notebook capable of writing out a config file.

Contribution

This pipeline was written and is maintained by Matt Daveport ([email protected])

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.Install_Scripts		.Install_Scripts
.Pipe_Modules		.Pipe_Modules
.bin		.bin
test_dir/Raw_Inputs/Bird0		test_dir/Raw_Inputs/Bird0
.Defaults.config		.Defaults.config
.gitignore		.gitignore
.yoUMAP_vocalizations.bds		.yoUMAP_vocalizations.bds
Install.sh		Install.sh
LICENSE		LICENSE
README.md		README.md
yoUMAP_vocalizations.sh		yoUMAP_vocalizations.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

yoUMAP_vocalizations

What does it do?

Credit Where Credit is Due

System Prerequisites

Installation - Local Machine

Installation - Cluster Environment

Usage

Input Structure / Experimental Directory Formatting

Example

Output Structure

Example

Getting Up and Running in R

Functions

Descriptions

Data processing using lapply

Tests

Console output

To Do

Contribution

About

Releases

Packages

Languages

License

mattisabrat/yoUMAP_vocalizations

Folders and files

Latest commit

History

Repository files navigation

yoUMAP_vocalizations

What does it do?

Credit Where Credit is Due

System Prerequisites

Installation - Local Machine

Installation - Cluster Environment

Usage

Input Structure / Experimental Directory Formatting

Example

Output Structure

Example

Getting Up and Running in R

Functions

Descriptions

Data processing using lapply

Tests

Console output

To Do

Contribution

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages