#4 initiate directory and file structure
vskode committed Nov 11, 2024
1 parent 99fda00 commit aeba190
Showing 19 changed files with 228 additions and 162 deletions.
136 changes: 1 addition & 135 deletions README.md
@@ -9,7 +9,7 @@ Models currently include:

| Name| ref paper| ref code| sampling rate| input length| embedding dimension |
|---|---|---|---|---|---|
| Animal2vec_XC| [paper](https://arxiv.org/abs/2406.01253) | [code](https://github.com/livingingroups/animal2vec) | 8 kHz (?)| 5 s| 768 |
| Animal2vec_XC| [paper](https://arxiv.org/abs/2406.01253) | [code](https://github.com/livingingroups/animal2vec) | 24 kHz| 5 s| 768 |
| Animal2vec_MK| [paper](https://arxiv.org/abs/2406.01253) | [code](https://github.com/livingingroups/animal2vec) | 8 kHz| 10 s| 1024 |
| AudioMAE | [paper](https://proceedings.neurips.cc/paper_files/paper/2022/hash/b89d5e209990b19e33b418e14f323998-Abstract-Conference.html) | [code](https://github.com/facebookresearch/AudioMAE) | 16 kHz| 10 s| 768 |
| AVES | [paper](https://arxiv.org/abs/2210.14493) | [code](https://github.com/earthspecies/aves) | 16 kHz| 1 s| 768 |
@@ -28,140 +28,6 @@ Models currently include:
| UMAP | [paper](https://arxiv.org/abs/1802.03426) | [code](https://github.com/lmcinnes/umap) | - | - | |
| VGGish | [paper](https://ieeexplore.ieee.org/document/7952132) | [code](https://github.com/tensorflow/models/tree/master/research/audioset/vggish) | 16 kHz| 0.96 s| 128 |

## Brief description of models
All information is extracted from the respective repositories and manuscripts. Please refer to them for more details.

### Animal2vec_XC
- raw waveform input
- self-supervised model
- transformer
- trained on bird song data

The animal2vec_XC weights come from self-supervised pretraining on xeno-canto data. The model is based on data2vec 2.0 with a number of bioacoustic-specific adaptations. See the paper for more details.

### Animal2vec_MK
- raw waveform input
- self-supervised pretrained model, fine-tuned
- transformer
- trained on meerkat vocalizations

The animal2vec_MK weights come from self-supervised pretraining on meerkat data with fine-tuning on a curated meerkat dataset. The model is based on data2vec 2.0 with a number of bioacoustic-specific adaptations. See the paper for more details.

### AudioMAE
- spectrogram input
- self-supervised pretrained model, fine-tuned
- vision transformer
- trained on general audio

AudioMAE, from the Facebook research group, is a vision transformer pretrained on AudioSet-2M data and fine-tuned on AudioSet-20K.

### AVES
- transformer
- self-supervised pretrained model
- trained on general audio

AVES is short for Animal Vocalization Encoder based on Self-Supervision. The model is based on the HuBERT-base architecture and is pretrained on the unannotated audio datasets AudioSet-20K and FSD50K as well as the animal sounds from AudioSet and VGGSound.


### BioLingual
- transformer
- spectrogram input
- contrastive-learning
- self-supervised pretrained model
- trained on animal sound data (primarily bird song)

BioLingual is a language-audio model trained on captioned bioacoustic datasets including xeno-canto and iNaturalist. The architecture is based on the [CLAP](https://arxiv.org/pdf/2211.06687) model.

### BirdAVES
- transformer
- self-supervised pretrained model
- trained on general audio and bird song data

AVES is short for Animal Vocalization Encoder based on Self-Supervision. This variant is based on the HuBERT-large architecture and is pretrained on the unannotated audio datasets AudioSet-20K and FSD50K, the animal sounds from AudioSet and VGGSound, as well as bird vocalizations from xeno-canto.

### BirdNET
- CNN
- supervised training model
- trained on bird song data

BirdNET (v2.4) is based on an EfficientNet-B0 architecture. The model is trained on a large amount of bird vocalizations from the xeno-canto database alongside other bird song databases.

### EchoPaSST
- transformer
- supervised pretrained model, fine-tuned
- pretrained on general audio and bird song data

EchoPaSST is a vision transformer trained on AudioSet and (deep) fine-tuned on xeno-canto. The model is based on the [PaSST](https://github.com/kkoutini/PaSST) framework.


### HumpbackNET
- CNN
- supervised training model
- trained on humpback whale song

HumpbackNET is a binary classifier based on a ResNet-50 model trained on humpback whale data from different parts of the North Atlantic.

### Insect66NET
- CNN
- supervised training model
- trained on insect sounds

Insect66NET is an [EfficientNet v2 s](https://pytorch.org/vision/main/models/generated/torchvision.models.efficientnet_v2_s.html) model trained on the [Insect66 dataset](https://zenodo.org/records/8252141), which includes sounds of grasshoppers, crickets, and cicadas. It was developed by the winning team of the Capgemini Global Data Science Challenge 2023.


### Mix2
- CNN
- supervised training model
- trained on frog sounds

Mix2 is a [MobileNet v3](https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py) model trained on the [AnuranSet](https://github.com/soundclim/anuraset), which includes sounds of 42 different frog species from different regions in Brazil. The model was trained using a mixture of Mixup augmentations to handle the class imbalance of the data.
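
As a minimal, generic sketch of the Mixup idea (the function name and `alpha` value are assumptions for illustration, not the Mix2 training code):

```python
# Generic Mixup sketch: blend two examples and their label vectors
# with a Beta-distributed weight. Illustrative only.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two training examples; alpha controls how strong the mixing is."""
    lam = np.random.beta(alpha, alpha)
    x_mixed = lam * x1 + (1 - lam) * x2
    y_mixed = lam * y1 + (1 - lam) * y2
    return x_mixed, y_mixed

# usage: mix two spectrogram arrays with their one-hot labels
# x_mix, y_mix = mixup(spec_a, onehot_a, spec_b, onehot_b)
```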

### RCL_FS_BSED
- CNN
- supervised contrastive learning
- trained on the DCASE 2023 Task 5 dataset [link](https://zenodo.org/records/6482837)

RCL_FS_BSED stands for Regularized Contrastive Learning for Few-shot Bioacoustic Sound Event Detection and features a ResNet-based model. The model was originally created for the DCASE bioacoustic few-shot challenge (Task 5) and later improved.

### ProtoCLR
- transformer
- supervised contrastive learning
- trained on bird song data

ProtoCLR stands for Prototypical Contrastive Learning for robust representation learning. The architecture is a CvT-13 (Convolutional vision transformer) with 20M parameters. ProtoCLR has been validated on transfer learning tasks for bird sound classification, showing strong domain-invariance in few-shot scenarios. The model was trained on the xeno-canto dataset.
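
As a minimal sketch of the prototype idea only (not ProtoCLR's actual contrastive objective; the helper below is hypothetical):

```python
# Prototype idea behind prototypical approaches: class prototypes are mean
# embeddings, and a query is assigned to the nearest prototype.
import numpy as np

def nearest_prototype(query, support_embeddings, support_labels):
    classes = np.unique(support_labels)
    prototypes = np.stack(
        [support_embeddings[support_labels == c].mean(axis=0) for c in classes]
    )
    distances = np.linalg.norm(prototypes - query, axis=1)
    return classes[np.argmin(distances)]
```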


### Perch
- CNN
- supervised training model
- trained on bird song data

Perch is an EfficientNet-B1 model trained on the entire xeno-canto database.

### SurfPerch
- CNN
- supervised training model
- trained on bird song, fine-tuned on tropical reef data

SurfPerch is an EfficientNet-B1 model trained on the entire xeno-canto database and fine-tuned on coral reef and unrelated sounds.

### WhalePerch
- CNN
- supervised training model
- trained on 7 whale species

WhalePerch (multispecies_whale) is an EfficientNet-B0 model trained on whale sounds.

### UMAP
see [repo](https://github.com/lmcinnes/umap)
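
A minimal usage sketch with the umap-learn package (the array shape and parameter values are illustrative, not the settings used in this repository):

```python
# Reduce a set of embedding vectors to 2D with umap-learn; illustrative values only.
import numpy as np
import umap

features = np.random.rand(500, 768)              # e.g. 500 embeddings of dimension 768
reducer = umap.UMAP(n_neighbors=50, n_components=2)
embedding_2d = reducer.fit_transform(features)   # shape (500, 2), ready for plotting
```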

### VGGish
- CNN
- supervised training model
- trained on general audio

VGGish is a model based on the [VGG](https://arxiv.org/pdf/1409.1556) architecture. The model is trained on audio from YouTube videos (YouTube-8M).

## Installation

Create a virtual environment using python3.11 and virtualenv
11 changes: 2 additions & 9 deletions bacpipe/config.yaml
@@ -1,23 +1,16 @@
## MAIN VARS ##

# PATHS:
# define paths for your data; these are the standard directories if you
# used the file condenser. If you do not have annotated data and all of your
# sound files are in one directory, change accordingly
audio_dir : "bacpipe/test_files/audio/audio_test_files"
# audio_dir : "/media/vincent/Extreme SSD/MA/20221019-Benoit/transfer_1780104_files_cfc5b86f/SABA01_201511_201604_SN275/resampled_2kHz/wav"

# fixed path, embeddings will be stored here, DO NOT CHANGE
embed_parent_dir : "bacpipe/test_files/embeds"
umap_parent_dir : "bacpipe/test_files/umap_embeds"


# embedding model name
embedding_model: 'rcl_fs_bsed'

# supported formats of audio files
audio_suffixes: ['.wav', '.WAV', '.aif', '.mp3']

# specify your device ['cpu', 'cuda', ...]
device: 'cpu'

# UMAP settings
n_neighbors : 50
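
For orientation, this file is read with `yaml.safe_load` in `bacpipe/generate_embeddings.py` (see the diff further down); a minimal sketch of loading and accessing these keys, with illustrative variable names:

```python
# Minimal sketch mirroring the yaml.safe_load call in bacpipe/generate_embeddings.py;
# the key names follow bacpipe/config.yaml, the variable names are illustrative.
import yaml

with open("bacpipe/config.yaml", "r") as f:
    config = yaml.safe_load(f)

audio_dir = config["audio_dir"]          # where the sound files live
model_name = config["embedding_model"]   # e.g. 'rcl_fs_bsed'
device = config["device"]                # 'cpu', 'cuda', ...
```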
Empty file added bacpipe/embeddings/README.md
Empty file.
1 change: 1 addition & 0 deletions bacpipe/evaluation/README.md
@@ -0,0 +1 @@
# Evaluation scripts and data to analyze model performance
Empty file.
Empty file.
1 change: 1 addition & 0 deletions bacpipe/evaluation/datasets/README.md
@@ -0,0 +1 @@
# Audio data used to evaluate the models
Binary file not shown.
1 change: 1 addition & 0 deletions bacpipe/evaluation/datasets/benchmark/test/README.md
@@ -0,0 +1 @@
# Test data to benchmark model performance
1 change: 1 addition & 0 deletions bacpipe/evaluation/datasets/benchmark/train/README.md
@@ -0,0 +1 @@
# Train data to benchmark model performance
1 change: 1 addition & 0 deletions bacpipe/evaluation/results/metrics/README.md
@@ -0,0 +1 @@
# Metrics generated during evaluation will be placed here
1 change: 1 addition & 0 deletions bacpipe/evaluation/results/plots/README.md
@@ -0,0 +1 @@
# Plots generated during evaluation will be placed here
Empty file.
11 changes: 7 additions & 4 deletions bacpipe/generate_embeddings.py
@@ -15,10 +15,12 @@ def __init__(
        self,
        check_if_combination_exists=True,
        model_name="umap",
        audio_dir=None,
        testing=False,
        **kwargs,
    ):
        self.model_name = model_name
        self.audio_dir = audio_dir

        with open("bacpipe/config.yaml", "r") as f:
            self.config = yaml.safe_load(f)
@@ -72,7 +74,7 @@ def check_embeds_already_exist(self):
            if (
                self.model_name in d.stem
                and Path(self.audio_dir).stem in d.stem
                and self.embedding_model in d.stem
                and self.model_name in d.stem
            ):

                num_files = len(
@@ -136,7 +138,7 @@ def get_embeddings(self):
            {"embedding_files": [], "embedding_dimensions": []}
        )
        self.embed_dir = Path(self.umap_parent_dir).joinpath(
            self.get_timestamp_dir() + f"-{self.embedding_model}"
            self.get_timestamp_dir() + f"-{self.model_name}"
        )

    def get_embedding_dir(self):
@@ -157,7 +159,7 @@ def get_embedding_dir(self):
        embed_dirs = [
            d
            for d in self.embed_parent_dir.iterdir()
            if self.audio_dir.stem in d.stem and self.embedding_model in d.stem
            if self.audio_dir.stem in d.stem and self.model_name in d.stem
        ]
        # check if timestamp of umap is after timestamp of model embeddings
        embed_dirs.sort()
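
Read in isolation, the changed filter keeps directories whose names contain both the audio folder stem and the model name, and the timestamp prefix makes a plain sort put the newest run last; a self-contained sketch of that logic (the helper name is hypothetical, not part of bacpipe):

```python
# Self-contained sketch of the filter-and-sort logic above.
from pathlib import Path

def newest_matching_dir(embed_parent_dir: Path, audio_dir: Path, model_name: str) -> Path:
    matches = [
        d
        for d in embed_parent_dir.iterdir()
        if audio_dir.stem in d.stem and model_name in d.stem
    ]
    matches.sort()      # timestamp-prefixed directory names sort chronologically
    return matches[-1]  # most recent matching embedding run
```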
@@ -217,7 +219,8 @@ def get_embeddings_from_model(self, sample):
        samples = self.model.preprocess(frames)
        start = time.time()

        embeds = self.model(samples)
        batched_samples = self.model.init_dataloader(samples)
        embeds = self.model.batch_inference(batched_samples)
        if not isinstance(embeds, np.ndarray):
            embeds = embeds.numpy()
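
A hedged sketch of what an `init_dataloader`/`batch_inference` pair on a model wrapper could look like; the class name, batch size, and PyTorch internals below are assumptions, not the bacpipe implementation:

```python
# Hypothetical wrapper illustrating the new two-step call; not the bacpipe code.
import torch
from torch.utils.data import DataLoader, TensorDataset

class ExampleModelWrapper:
    def __init__(self, net: torch.nn.Module, batch_size: int = 16):
        self.net = net.eval()
        self.batch_size = batch_size

    def init_dataloader(self, samples: torch.Tensor) -> DataLoader:
        # wrap the preprocessed frames so inference runs in fixed-size batches
        return DataLoader(TensorDataset(samples), batch_size=self.batch_size)

    def batch_inference(self, batched_samples: DataLoader) -> torch.Tensor:
        outputs = []
        with torch.no_grad():
            for (batch,) in batched_samples:
                outputs.append(self.net(batch))
        return torch.cat(outputs)
```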
