Neurophysiology Pipeline
- Clone the repository and enter it:

  ```
  git clone git@github.com:theunissenlab/songephys.git
  cd songephys
  ```

- `mkdir data` (or symlink to a data directory)

- `virtualenv env -p python3` (using Python 3.5 seems to be the most stable)

- `source bin/activate` (activates the virtualenv and sets the path correctly), then:

  ```
  pip install -r requirements.txt
  ```
- `pip install statsmodels` (because of a bug, it has to be installed after the other requirements)

- You will also need the spike sorting code in solid-garbanzo (suss). In your Code folder:

  ```
  git clone git@github.com:theunissenlab/solid-garbanzo.git
  ```

  Then make a soft link to that folder from songephys (that is how the PYTHONPATH is set up in `activate`). If you are in songephys you can type:

  ```
  ln -s ~/Code/solid-garbanzo solid-garbanzo
  ```

  Finally, make sure all of its requirements are also installed; see its requirements.txt file.
After this, use the command `source bin/activate` whenever you want to enter this python environment to run the pipeline. Note that if you are not using virtualenv (for example, if you use conda environments), you should still run the sh commands in `bin/activate` that set the paths correctly.
Some comments from installing on MacOS:
- The module `setup.py` in pipeline conflicts with other `setup.py` files from a Conda installation; in my version, I renamed it to `file_setup.py`. That means you also need to change the other python files in pipeline that `import setup` (change them to `import file_setup`).
- To use `cmake` on Catalina you will have to install Xcode. The Xcode for Catalina is 11.0, and it installs with command line tools that give you system libraries version 10.15 (MacOSX10.15.sdk). You should find this in /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/. You will need to install version 10.13, which you can get from https://github.com/phracker/MacOSX-SDKs/releases. Download that version, put it in the same folder, and create a soft link to it:

  ```
  ln -s MacOSX10.13.sdk MacOSX.sdk
  ```
After the environment is set up and activated, run the following script to set up the automated Google Drive access needed for the data pipeline:

```
python scripts/setup_gdrive.py
```

Follow the printed instructions. Make sure you are creating the API access in the right Google Drive account (probably your Berkeley one, which has access to the `Intan Ephys Data` folder). The `Intan Ephys Data` folder must be accessible from the top level of your Google Drive!

To run the following steps, you must be authenticated for API access through Google Drive.
- Create a site directory in the subject's `sites` directory. By convention, we name a site by the name of the first session at that site:

  ```
  mkdir data/birds/GreYel3594M/sites/GreYel3594M___170919_094646
  ```
- Make a file called `filelist` in the site directory you just created. The first line should indicate the "lab", for now used only to distinguish differences in data organization between data collected in Berkeley ("theunissen") and Seewiesen ("seewiesen"). Subsequent lines should list all the rhd file names and csv file names, in any order. The rhd and csv files for all the sessions must be uploaded to Google Drive.

  Example of `data/birds/GreYel3594M/sites/GreYel3594M___170919_094646/filelist`:

  ```
  theunissen
  GreYel3594M___170919_094646.rhd
  GreYel3594M___170919_111646.rhd
  GreYel3594M___170919_124646.rhd
  GreYel3594M_trialdata_20170919094951.csv
  GreYel3594M_trialdata_20170919102348.csv
  GreYel3594M_trialdata_20170919125344.csv
  GreYel3594M_trialdata_20170919140635.csv
  ```
- Run the SiteConfig task with the following command, which will automatically generate a yaml configuration file for the site:

  ```
  luigi --module file_setup SiteConfig \
      --site-dir data/birds/GreYel3594M/sites/GreYel3594M___170919_094646 \
      --local-scheduler
  ```

  This should create a yaml file in the site directory that defines the names of the sessions, the pyoperant files associated with each session, and the start time of each session relative to the first one (a quick way to sanity-check those offsets is sketched after the example below).

  Example of `data/birds/GreYel3594M/sites/GreYel3594M___170919_094646/sessions.yaml`:

  ```yaml
  lab: theunissen
  sessions:
  - id: GreYel3594M___170919_094646
    playbacks:
    - GreYel3594M_trialdata_20170919094951
    - GreYel3594M_trialdata_20170919102348
    t: 0.0
  - id: GreYel3594M___170919_111646
    t: 5400.0
  - id: GreYel3594M___170919_124646
    playbacks:
    - GreYel3594M_trialdata_20170919125344
    - GreYel3594M_trialdata_20170919140635
    t: 10800.0
  ```
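The `t` offsets should match the gaps between the session-name timestamps. A minimal sketch for sanity-checking them, assuming session names end in a YYMMDD_HHMMSS timestamp as in the example above:

```python
from datetime import datetime

# Session names end in a YYMMDD_HHMMSS timestamp; the `t` field in
# sessions.yaml is the start time in seconds relative to the first session.
names = [
    "GreYel3594M___170919_094646",
    "GreYel3594M___170919_111646",
    "GreYel3594M___170919_124646",
]

starts = [datetime.strptime(n.split("___")[-1], "%y%m%d_%H%M%S") for n in names]
for name, start in zip(names, starts):
    print(name, (start - starts[0]).total_seconds())  # expect 0.0, 5400.0, 10800.0
```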
- From the top level of the songephys directory, run the pipeline:

  ```
  luigi --module playback_categories RunSiteCategories \
      --site-dir data/birds/GreYel3594M/sites/GreYel3594M___170919_094646 --local-scheduler
  ```

  If you are running Mac OS X, you might have to install python.app and use `pythonw`, as in:

  ```
  pythonw -m luigi --module playback_categories RunSiteCategories \
      --site-dir data/birds/GreYel3594M/sites/GreYel3594M___170919_094646 --local-scheduler
  ```

  This will download all files and pre-process them up to the manual spike sorting point.
- Do spike sorting! This involves cloning https://github.com/theunissenlab/solid-garbanzo and installing its requirements.txt file as well. It is probably best to do this in a separate python environment. You can then modify `solid-garbanzo/suss/gui/config.py` with the path to your spikes (e.g. `data/birds/GreYel3594M/sites/GreYel3594M___170919_094646/sorted_spikes`) for easier access. For each of the files in `sorted_spikes` (there should be 16 of them), sort the spikes and save the sorted pkl file in `data/birds/GreYel3594M/sites/GreYel3594M___170919_094646/manually_curated`. Then repeat the luigi command from the previous step to finish the pipeline.
Often there are errors in intermediate steps of the pipeline, and it can be difficult to figure out what the problem is.

- Check that the sessions.yaml file is correct. Possible errors include mistyped session names or time offsets that were not computed correctly.
- If the error occurs early in the pipeline, check that the rhd download was not interrupted and that the rhd to nix conversion was not interrupted. These steps can leave an incomplete file that looks done but contains corrupted data (a quick check of a converted file is sketched below).
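For the second case, one quick check is to try reading the converted file back with neo. This is a minimal sketch, assuming neo's NixIO backend (and the nixio package) is installed; the path is the example site used throughout this page, and the exact location of the .nix file under your site directory may differ:

```python
from neo.io import NixIO

# Example path based on the site directory used above.
nix_path = ("data/birds/GreYel3594M/sites/GreYel3594M___170919_094646/"
            "segmentation/GreYel3594M___170919_094646.nix")

io = NixIO(filename=nix_path, mode="ro")
try:
    block = io.read_block()
    # A complete conversion should contain one segment spanning the whole session.
    print("segments:", len(block.segments))
    for sig in block.segments[0].analogsignals:
        print(sig.name, sig.shape, sig.sampling_rate)
finally:
    io.close()
```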
Diagram of the data pipeline. See below for a detailed description of these functions.
This task locates and downloads the rhd file, pyoperant files, and stimulus wav files associated with the specified recording session.
- RHD file: e.g. `raw/GreYel3594M___170919_094646.rhd`
  - This is the raw data file
- One or more pyoperant files: `raw/___________.csv`
  - These files are generated by pyoperant and contain stimulus presentation times, trial numbers, etc., for a set of stimuli played
- Stimuli directories at `stimuli/*` containing all .wav files referenced by the pyoperant files
This task converts the rhd file into a neo block stored as a .nix file.
- NIX file: e.g. `segmentation/GreYel3594M___170919_094646.nix`
  - 1 segment spanning the entire session
  - 3 channel indices (2-channel mic, 1 digital pulse for stimulus playbacks, 16-channel electrode)
This task combines one or more pyoperant csvs into a single pandas dataframe. It also loads the digital signals of the neo block to determine which trials were present in the current session. (A quick way to inspect the result is sketched after the list below.)

- Combined playback data in a pickle file: e.g. `segmentation/playbacks.pkl`
  - A pandas dataframe with each row corresponding to one trial played during the session.
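A minimal way to inspect that dataframe, assuming the pickle sits under the site directory as in the example paths above (the exact columns depend on the pyoperant output):

```python
import pandas as pd

# Load the combined playback table produced by this task.
playbacks = pd.read_pickle(
    "data/birds/GreYel3594M/sites/GreYel3594M___170919_094646/"
    "segmentation/playbacks.pkl"
)

print(len(playbacks), "trials")      # one row per trial played during the session
print(playbacks.columns.tolist())    # stimulus names, presentation times, trial numbers, ...
print(playbacks.head())
```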
This task can be split up into two separate tasks. First, it detects periods of vocal activity using the microphone channels and combines them with stimulus playback information to define the times during which either a live or playback sound occurred. Second, it splits the main segment of the neo block into several 5 second segments. (A sketch of loading the resulting dictionary follows the list below.)

- Segments defined in a dictionary: e.g. `segmentation/GreYel3594M___170919_094646-segments.npy`
  - numpy file containing a dictionary with keys "live", "playback", and "consolidated". "live" contains all live sound periods, "playback" contains all playback periods, and "consolidated" contains non-overlapping live and playback periods, keeping only live periods that do not overlap with a playback
- Neo block split into 5 s segments: e.g. `segmentation/GreYel3594M___170919_094646-segmented.pkl`
  - Up to 1080 segments (5 s each) spanning the entire session
  - 3 channel indices (2-channel mic, 1 digital pulse for stimulus playbacks, 16-channel electrode)
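A minimal sketch for loading the segments dictionary, assuming it was written with np.save (so the dict comes back as a 0-d object array) and lives under the site directory as in the example path above:

```python
import numpy as np

seg_path = ("data/birds/GreYel3594M/sites/GreYel3594M___170919_094646/"
            "segmentation/GreYel3594M___170919_094646-segments.npy")

# np.save stores the dict inside a 0-d object array; .item() unwraps it.
periods = np.load(seg_path, allow_pickle=True).item()

for key in ("live", "playback", "consolidated"):
    print(key, len(periods[key]), "periods")
```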
Filter electrode signals using a method that attempts to remove correlations within and across electrodes to reduce motion artifacts and other correlated noise. It does this by attempting to model the amplitude at one electrode from the signals on all other electrodes in a preceding time window. (A conceptual sketch of this idea follows the list below.)

- Filtered Neo block: e.g. `segmentation/GreYel3594M___170919_094646-filtered.pkl`
  - Up to 1080 segments (5 s each) spanning the entire session
  - 19 channel indices (2-channel mic, 1 digital pulse for stimulus playbacks, 1 channel index for each electrode)
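The pipeline's actual filtering code lives in the repository; the following is only a conceptual sketch of the idea, using ordinary least squares to predict one channel from lagged samples of the other channels and keeping the unexplained residual:

```python
import numpy as np

def decorrelate_channel(signals, target, lag_samples=32):
    """Conceptual sketch (not the pipeline's implementation): predict channel
    `target` from the preceding `lag_samples` samples of every other channel,
    then return what that prediction cannot explain.

    signals: array of shape (n_samples, n_channels)
    """
    n_samples, n_channels = signals.shape
    others = [c for c in range(n_channels) if c != target]

    # Design matrix: one column of lagged samples per (other channel, lag),
    # aligned with the prediction targets.
    X = np.column_stack([
        signals[lag_samples - lag:n_samples - lag, c]
        for c in others
        for lag in range(1, lag_samples + 1)
    ])
    y = signals[lag_samples:, target]

    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coefs  # residual: activity not shared with the other channels
```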
Threshold spikes with a channel- and segment-dependent threshold. (A common form of such a threshold is sketched after the list below.)

- Thresholded Neo block: e.g. `spikes/GreYel3594M___170919_094646-thresholded.pkl`
  - Up to 1080 segments (5 s each) spanning the entire session
  - 19 channel indices (2-channel mic, 1 digital pulse for stimulus playbacks, 1 channel index per electrode with 1 spiketrain per segment)
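The exact threshold rule is defined in the pipeline code; a common channel- and segment-dependent choice, shown here only as a hypothetical sketch, is a multiple of a robust noise estimate computed per channel within each 5 s segment:

```python
import numpy as np

def segment_threshold(trace, k=4.5):
    """Hypothetical sketch of a channel- and segment-dependent threshold:
    a multiple of the robust (MAD-based) noise estimate of one channel's
    trace within one segment. The pipeline's actual rule may differ."""
    sigma = np.median(np.abs(trace - np.median(trace))) / 0.6745
    return k * sigma

# Samples crossing the threshold (e.g. falling below -segment_threshold(trace)
# for negative-going spikes) would mark putative spike times.
```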
Merges segment information for live and playback periods from multiple sessions into one file. Offsets the timestamps of each session by its start time relative to the first.

- Merged vocal period dict in numpy format: e.g. `vocal_periods.npy`
Merges neo blocks with thresholded spikes from multiple sessions into one file. Offsets the timestamps of each session by its start time relative to the first.

- Merged Neo block: e.g. `thresholded.pkl`
  - Up to 1080 segments (5 s each) spanning the entire session
  - 19 channel indices (2-channel mic, 1 digital pulse for stimulus playbacks, 1 channel index per electrode with 1 spiketrain per segment)
- One pickle file for each electrode channel: e.g. `spikes/spike_waveforms-e0.pkl`
  - Has a SpikeDataset object with keys for "times" and "waveforms"
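To peek at one of these files you can unpickle it. This is a sketch that assumes the suss package (solid-garbanzo) is importable, since the pickle contains a suss SpikeDataset, and that "times" and "waveforms" are reachable as attributes; the exact access pattern depends on the suss implementation:

```python
import pickle

path = ("data/birds/GreYel3594M/sites/GreYel3594M___170919_094646/"
        "spikes/spike_waveforms-e0.pkl")

# Unpickling requires the suss package on the path, since the file
# contains a suss SpikeDataset object.
with open(path, "rb") as f:
    dataset = pickle.load(f)

print(len(dataset.times), "putative spikes")
print(dataset.waveforms.shape)  # presumably (n_spikes, n_waveform_samples)
```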
Hierarchical spike sorting over time.
- One pickle file for each electrode channel: e.g. `sorted/sorted-e0.pkl`
  - Has a ClusterDataset object representing each cluster. Several of these need to be merged in the following manual curation step.
Manual step using a GUI for merging and deleting clusters created in SortSpikes. Each "sorted" file is curated and then saved to the "manually_curated" directory.

- The output of SortSpikes for each electrode is generated and saved in `/<bird_name>/sites/<site_name>/sort_results` and named `sorted-e{8-24}.pkl`
- To sort, run the sorting GUI:

  ```
  python -m suss.gui.app
  ```

- Load a sorted dataset. Combine, split, and delete clusters.
- When done, right click -> tag them with any tags that make sense (e.g. Single Unit, Multiunit)
- Save the file in a directory named `/<bird_name>/sites/<site_name>/manually_curated/` with the names `curated-e{8-24}.pkl`
- One pickle file for each electrode channel: e.g. `manually_curated/curated-e0.pkl`
  - Has a ClusterDataset object representing each cluster.
Write the results of the above sorting steps into the neo block.
- Sorted Neo block: e.g. `sorted.pkl`
  - Up to 1080 segments (5 s each) spanning the entire session
  - 19 channel indices (2-channel mic, 1 digital pulse for stimulus playbacks, 1 channel index per electrode with 1 spiketrain per unit per segment)
Resegments the data, previously cut into 5 s chunks spanning the recording period, into variable-length segments corresponding to live and playback vocal periods. (A sketch of iterating over the result follows the list below.)

- Segmented Neo block: e.g. `sorted-resegmented.pkl`
  - Variable number of segments depending on the detected live and playback periods
  - 19 channel indices (2-channel mic, 1 channel index per electrode with 1 spiketrain per unit per segment)
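A minimal sketch for iterating over the resegmented block, assuming it was saved with pickle under the site directory (neo and its dependencies must be importable to unpickle the block, and the exact filename/location may differ):

```python
import pickle

path = ("data/birds/GreYel3594M/sites/GreYel3594M___170919_094646/"
        "sorted-resegmented.pkl")

with open(path, "rb") as f:
    block = pickle.load(f)

# Each segment now corresponds to one live or playback vocal period.
for seg in block.segments:
    print(seg.name, float(seg.t_start), float(seg.t_stop),
          len(seg.spiketrains), "spiketrains")
```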
Things that would be nice:

- Automatic link between rhd and csv
  - Could be done at recording time: have pyoperant see which rhd file is in progress at the start and stop of playbacks and write it to a csv / yaml file.
- Making filtering fast and verifying that filtering is working
  - Currently takes about 10 hours per full 90-minute file.
- Better algorithms to link spike clusters
  - Would make the manual curation step easier / faster
- Write the end file to NIX
  - Saving our files to nix is very slow, so they are currently still in pickle files, which can't be lazy-loaded
- Vocal categorization and motif alignment
  - Some of this can be done manually now
Our neurophysiological data pipeline is organized as a sequence of Tasks (data transformations) defined in the songephys repository. We use Luigi, a pipeline framework for python that helps to organize Tasks, Targets (data files), and their dependency structure.

The primary things the pipeline does are to process electrode data, join together continuous recordings that were split into different files, and manage spike sorting.

Our data is recorded continuously but is broken up into separate files containing 90-minute chunks (sessions) of raw electrode and microphone input. The majority of the ~12 GB file size is due to 18 channels (16 electrode, 2 microphone) sampled at 30 kHz for 90 minutes, stored as 32-bit floats (about 2.9 billion data points). This data is all in the initial .rhd file for a session.
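A back-of-envelope check of that estimate, using only the numbers quoted above:

```python
channels = 18          # 16 electrode + 2 microphone
rate_hz = 30_000       # samples per second per channel
seconds = 90 * 60      # one 90-minute session
bytes_per_sample = 4   # 32-bit float

samples = channels * rate_hz * seconds
print(samples)                                 # 2_916_000_000 data points
print(samples * bytes_per_sample / 1e9, "GB")  # ~11.7 GB, i.e. roughly 12 GB
```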
The file size grows as the data is processed; many of the intermediate steps in the pipeline take measures to limit the size of individual files and the amount of data that must be loaded into memory at one time. The file sizes shrink toward the end of the pipeline, when putative spikes have been extracted and the raw electrode channels are dropped from the file.

Multiple sessions recorded continuously at one site (one depth) are merged together once putative spikes have been extracted from the electrode data and the electrode trace can be dropped, forming a chunk of variable length (the site). One site is therefore derived from the data of multiple sessions, concatenated and appropriately offset.