1 Ilyass Moummad, 2 Romain Serizel, 3 Emmanouil Benetos, 1 Nicolas Farrugia
1 IMT Atlantique, Lab-STICC, Brest, France
2 Université de Lorraine, LORIA, INRIA, Nancy, France
3 C4DM, Queen Mary University of London, London, UK
This repository introduces ProtoCLR, a Prototypical Contrastive Learning approach designed for robust representation learning. ProtoCLR has been validated on transfer learning tasks for bird sound classification, showing strong domain-invariance in few-shot scenarios.
In our approach, focal recordings are used for pre-training, while soundscape recordings serve as the evaluation data, highlighting ProtoCLR's robustness to domain shift. The initial results in the preprint are based on models trained for 100 epochs; an update with results from extended 300-epoch training is expected in the coming days/weeks (see the table at the end of this page).
Read the full paper: Domain-Invariant Representation Learning of Bird Sounds
The pre-trained models (cross-entropy, SimCLR, SupCon, and ProtoCLR), each trained for 300 epochs, are available on Hugging Face and can be directly downloaded and integrated into your bioacoustic projects.
To use the model effectively, ensure your audio meets the following criteria:
- Mono Channel (Mandatory): If the audio has multiple channels, average them to create a single mono channel.
- Sample rate (Mandatory): Resample your audio to a sample rate of 16 kHz.
- Padding (Recommended): For audio shorter than 6 seconds, either pad with zeros or repeat the audio until it reaches 6 seconds.
- Chunking (Recommended): For audio longer than 6 seconds, consider splitting it into 6-second chunks (see the sketch below).
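The following is a minimal sketch of how the padding and chunking recommendations above can be implemented with plain PyTorch; the helper name `pad_or_chunk` and the zero-padding strategy are illustrative choices, not part of the released code.

```python
import torch

SAMPLE_RATE = 16000            # required sample rate (16 kHz)
SEGMENT_LEN = 6 * SAMPLE_RATE  # 6-second window in samples

def pad_or_chunk(waveform: torch.Tensor) -> torch.Tensor:
    """Split a mono waveform of shape [num_samples] into 6-second chunks,
    zero-padding the last (or only) chunk to the full segment length."""
    chunks = list(waveform.split(SEGMENT_LEN))
    last = chunks[-1]
    if last.shape[-1] < SEGMENT_LEN:
        chunks[-1] = torch.nn.functional.pad(last, (0, SEGMENT_LEN - last.shape[-1]))
    return torch.stack(chunks)  # shape: [num_chunks, SEGMENT_LEN]
```

Repeating the audio instead of zero-padding, as suggested above, is an equally valid choice for short recordings.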
This example demonstrates how to load an audio file, preprocess it, and run inference with a pre-trained model.
First, download the model and code from the Hugging Face repository using the following command:
git clone https://huggingface.co/ilyassmoummad/ProtoCLR
After downloading the code and model weights, use the following Python script to preprocess an audio file and run inference:
import torch
from cvt import cvt13  # Import the model architecture
from melspectrogram import MelSpectrogramProcessor  # Import the Mel spectrogram processor

# Initialize the preprocessor and the model
preprocessor = MelSpectrogramProcessor()
model = cvt13()

# Load ONE of the following checkpoints (each call overwrites the previously loaded weights):
# Weights trained using Cross-Entropy
model.load_state_dict(torch.load("ce.pth", map_location="cpu")['encoder'])
# Weights trained using SimCLR (self-supervised contrastive learning)
model.load_state_dict(torch.load("simclr.pth", map_location="cpu"))
# Weights trained using SupCon (supervised contrastive learning)
model.load_state_dict(torch.load("supcon.pth", map_location="cpu"))
# Weights trained using ProtoCLR (supervised contrastive learning using prototypes)
model.load_state_dict(torch.load("protoclr.pth", map_location="cpu"))

model.eval()

# Load and preprocess a sample audio waveform
def load_waveform(file_path):
    # Replace with audio loading code, e.g., torchaudio to load and resample (see the sketch below)
    pass

waveform = load_waveform("path/to/audio.wav")  # Load your audio here
# Ensure the waveform is sampled at 16 kHz, then pad/chunk it to a 6-second length
input_tensor = preprocessor.process(waveform).unsqueeze(0)  # Add batch dimension

# Run the model on the preprocessed audio
with torch.no_grad():
    output = model(input_tensor)
print("Model output shape:", output.shape)
We use datasets from the information retrieval benchmark BIRB and adapt them for few-shot learning, preparing them in .pt format. Xeno-Canto, which includes focal recordings of over 10,000 bird species, serves as the pretraining dataset. For evaluation, various downstream soundscape datasets are provided, each consisting of 6-second audio segments selected for peak bird activation using CNN14 from PANNs, with all recordings downsampled to 16 kHz. This lightweight data format simplifies the training and evaluation of deep neural networks for bird sound classification, making it especially suited for few-shot learning.
The Xeno-Canto dataset contains a large collection of bird sound recordings optimized for few-shot learning tasks in the context of bird species classification.
- Dataset Summary:
- Total Examples: 684,744 audio segments
- Segment Length: Each segment is 6 seconds long
- Sampling Rate: 16kHz
- Classes: 10,127 unique bird species, each represented by its eBird species code
- Download the Xeno-Canto training data using this script from Hugging Face. Make sure to set the variable `DESTINATION_PATH` to your desired download location.
- Merge and decompress the downloaded tar files:

cat DESTINATION_PATH/*tar* > DESTINATION_PATH/xc-6s-16khz-pann.tar
tar -xvf DESTINATION_PATH/xc-6s-16khz-pann.tar -C DATASET_PATH
The evaluation datasets are sourced from the BIRB benchmark, a collection of soundscape datasets for evaluating bird sound classification in challenging, real-world conditions.
Download the evaluation datasets (validation and test) from Zenodo.
- Validation Dataset:
  - File: `pow.pt`
  - Contains 16,047 examples across 43 classes, organized as a dictionary with `data` and `label` keys to enable efficient and rapid loading during validation. Classes with only one example are omitted to support one-shot classification tasks. A loading sketch is given after the table below.
- Test Datasets: Each test dataset is provided as `.pt` files organized by species code within subfolders. Use the metadata file from the training set to map species codes to common names.

Dataset | Subfolder | Examples | Classes | Zenodo Source
---|---|---|---|---
SSW | `ssw/` | 50,760 | 96 | Link
NES | `coffee_farms/` | 6,952 | 89 | Link
UHH | `hawaii/` | 59,583 | 27 | Link
HSN | `high_sierras/` | 10,296 | 19 | Link
SNE | `sierras_kahl/` | 20,147 | 56 | Link
PER | `peru/` | 14,768 | 132 | Link
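As a minimal loading sketch (assuming the dictionary layout described above; variable names are illustrative), the validation file can be read directly with PyTorch:

```python
import torch

# pow.pt is a dictionary with "data" and "label" keys
pow_set = torch.load("pow.pt", map_location="cpu")
waveforms, labels = pow_set["data"], pow_set["label"]
print(f"Loaded {len(labels)} validation examples")
```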
Install the dependencies listed in requirements.txt from this repository:
pip install -r requirements.txt
ProtoCLR
python3 train_encoder.py --loss protoclr --epochs 300 --nworkers 16 --bs 256 --lr 5e-4 --wd 1e-6 --device cuda:0 --traindir Path_to_Xeno-Canto-6s-16khz/ --evaldir Path_to_parent_folder_of_pow.pt --save --savefreq --freq 100
Cross-Entropy
python3 train_encoder.py --loss ce --epochs 300 --nworkers 16 --bs 256 --lr 5e-4 --wd 1e-6 --device cuda:0 --traindir Path_to_Xeno-Canto-6s-16khz/ --evaldir Path_to_parent_folder_of_pow.pt --save --savefreq --freq 100
- `--loss`: Specifies the loss function to use. The following losses are supported: `protoclr`, `supcon`, `simclr`, and `ce` (cross-entropy).
- `--traindir`: Path to the training data directory containing the Xeno-Canto bird sound dataset. This directory should contain the decompressed data downloaded from Hugging Face.
- `--evaldir`: Path to the evaluation data directory where the validation file (`pow.pt`) is stored. This file will be used for evaluating model performance during training.
For more details about the arguments, refer to args.py.
Note: Adjust the number of workers (`--nworkers`) based on your machine to avoid data loader bottlenecks, which can slow down training.
To evaluate the model on one- and five-shot tasks, run the following script:
python3 test_fewshot.py --modelckpt /path/to/weights.pth --bs 1024 --nworkers 16 --evaldir /path/to/soundscapes --device cuda:0 --report
Note: Reduce the batch size (`--bs`) if it does not fit in your GPU memory.
The following table presents the classification accuracy of various models on one-shot and five-shot bird sound classification tasks, evaluated across different soundscape datasets.
Model | Model Size | PER | NES | UHH | HSN | SSW | SNE | Mean |
---|---|---|---|---|---|---|---|---|
Random Guessing | - | 0.75 | 1.12 | 3.70 | 5.26 | 1.04 | 1.78 | 2.22 |
1-Shot Classification | | | | | | | |
BirdAVES-biox-base | 90M | 7.41±1.0 | 26.4±2.3 | 13.2±3.1 | 9.84±3.5 | 8.74±0.6 | 14.1±3.1 | 13.2 |
BirdAVES-bioxn-large | 300M | 7.59±0.8 | 27.2±3.6 | 13.7±2.9 | 12.5±3.6 | 10.0±1.4 | 14.5±3.2 | 14.2 |
BioLingual | 28M | 6.21±1.1 | 37.5±2.9 | 17.8±3.5 | 17.6±5.1 | 22.5±4.0 | 26.4±3.4 | 21.3 |
Perch | 80M | 9.10±5.3 | 42.4±4.9 | 19.8±5.0 | 26.7±9.8 | 22.3±3.3 | 29.1±5.9 | 24.9 |
CE (Ours) | 19M | 9.55±1.5 | 41.3±3.6 | 19.7±4.7 | 25.2±5.7 | 17.8±1.4 | 31.5±5.4 | 24.2 |
SimCLR (Ours) | 19M | 7.85±1.1 | 31.2±2.4 | 14.9±2.9 | 19.0±3.8 | 10.6±1.1 | 24.0±4.1 | 17.9 |
SupCon (Ours) | 19M | 8.53±1.1 | 39.8±6.0 | 18.8±3.0 | 20.4±6.9 | 12.6±1.6 | 23.2±3.1 | 20.5 |
ProtoCLR (Ours) | 19M | 9.23±1.6 | 38.6±5.1 | 18.4±2.3 | 21.2±7.3 | 15.5±2.3 | 25.8±5.2 | 21.4 |
5-Shot Classification | | | | | | | |
BirdAVES-biox-base | 90M | 11.6±0.8 | 39.7±1.8 | 22.5±2.4 | 22.1±3.3 | 16.1±1.7 | 28.3±2.3 | 23.3 |
BirdAVES-bioxn-large | 300M | 15.0±0.9 | 42.6±2.7 | 23.7±3.8 | 28.4±2.4 | 18.3±1.8 | 27.3±2.3 | 25.8 |
BioLingual | 28M | 13.6±1.3 | 65.2±1.4 | 31.0±2.9 | 34.3±3.5 | 43.9±0.9 | 49.9±2.3 | 39.6 |
Perch | 80M | 21.2±1.2 | 71.7±1.5 | 39.5±3.0 | 52.5±5.9 | 48.0±1.9 | 59.7±1.8 | 48.7 |
CE (Ours) | 19M | 21.4±1.3 | 69.2±1.8 | 35.6±3.4 | 48.2±5.5 | 39.9±1.1 | 57.5±2.3 | 45.3 |
SimCLR (Ours) | 19M | 15.4±1.0 | 54.0±1.8 | 23.0±2.3 | 32.8±4.0 | 22.0±1.2 | 40.7±2.4 | 31.3 |
SupCon (Ours) | 19M | 17.2±1.3 | 64.6±2.4 | 34.1±2.9 | 42.5±2.9 | 30.8±0.8 | 48.1±2.4 | 39.5 |
ProtoCLR (Ours) | 19M | 19.2±1.1 | 67.9±2.8 | 36.1±4.3 | 48.0±4.3 | 34.6±2.3 | 48.6±2.8 | 42.4 |
@misc{moummad2024dirlbs,
title={Domain-Invariant Representation Learning of Bird Sounds},
author={Ilyass Moummad and Romain Serizel and Emmanouil Benetos and Nicolas Farrugia},
year={2024},
eprint={2409.08589},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2409.08589},
}