1 Ilyass Moummad, 2 Romain Serizel, 3 Emmanouil Benetos, 1 Nicolas Farrugia
1 IMT Atlantique, Lab-STICC, Brest, France
2 Université de Lorraine, LORIA, INRIA, Nancy, France
3 C4DM, Queen Mary University of London, London, UK
This repository introduces ProtoCLR, a Prototypical Contrastive Learning approach designed for robust representation learning. ProtoCLR has been validated on transfer learning tasks for bird sound classification, showing strong domain-invariance in few-shot scenarios.
In our approach, focal recordings are used for pre-training, while soundscape recordings serve as the evaluation data, highlighting ProtoCLR's robustness to domain shift. The initial results in the preprint are based on models trained for 100 epochs; an update with results from extended 300-epoch training is expected in the coming days/weeks (see the table at the end of this page).
Read the full paper: Domain-Invariant Representation Learning of Bird Sounds
The pre-trained models (cross-entropy, SimCLR, SupCon, and ProtoCLR), each trained for 300 epochs, are available on Hugging Face and can be directly downloaded and integrated into your bioacoustic projects.
To use the model effectively, ensure your audio meets the following criteria:
- Mono Channel (Mandatory): If the audio has multiple channels, average them to create a single mono channel.
- Sample rate (Mandatory): Resample your audio to a sample rate of 16 kHz.
- Padding (Recommended): For audio shorter than 6 seconds, either pad with zeros or repeat the audio until it reaches 6 seconds.
- Chunking (Recommended): For audio longer than 6 seconds, consider splitting it into 6-second chunks (see the sketch below).
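The following is a minimal sketch of how the padding and chunking recommendations above can be implemented with plain PyTorch; the helper name `pad_or_chunk` and the zero-padding strategy are illustrative choices, not part of the released code.

```python
import torch

SAMPLE_RATE = 16000            # required sample rate (16 kHz)
SEGMENT_LEN = 6 * SAMPLE_RATE  # 6-second window in samples

def pad_or_chunk(waveform: torch.Tensor) -> torch.Tensor:
    """Split a mono waveform of shape [num_samples] into 6-second chunks,
    zero-padding the last (or only) chunk to the full segment length."""
    chunks = list(waveform.split(SEGMENT_LEN))
    last = chunks[-1]
    if last.shape[-1] < SEGMENT_LEN:
        chunks[-1] = torch.nn.functional.pad(last, (0, SEGMENT_LEN - last.shape[-1]))
    return torch.stack(chunks)  # shape: [num_chunks, SEGMENT_LEN]
```

Repeating the audio instead of zero-padding, as suggested above, is an equally valid choice for short recordings.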
This example demonstrates how to load an audio file, preprocess it, and run inference with a pre-trained model.
First, download the model and code from the Hugging Face repository using the following command:
git clone https://huggingface.co/ilyassmoummad/ProtoCLR
After downloading the code and model weights, use the following Python script to preprocess an audio file and run inference:
import torch
from cvt import cvt13  # Import the model architecture
from melspectrogram import MelSpectrogramProcessor  # Import the Mel spectrogram processor

# Initialize the preprocessor and the model
preprocessor = MelSpectrogramProcessor()
model = cvt13()

# Load ONE of the following checkpoints (each call overwrites the previously loaded weights):
# Weights trained using Cross-Entropy
model.load_state_dict(torch.load("ce.pth", map_location="cpu")['encoder'])
# Weights trained using SimCLR (self-supervised contrastive learning)
model.load_state_dict(torch.load("simclr.pth", map_location="cpu"))
# Weights trained using SupCon (supervised contrastive learning)
model.load_state_dict(torch.load("supcon.pth", map_location="cpu"))
# Weights trained using ProtoCLR (supervised contrastive learning using prototypes)
model.load_state_dict(torch.load("protoclr.pth", map_location="cpu"))

model.eval()

# Load and preprocess a sample audio waveform
def load_waveform(file_path):
    # Replace with audio loading code, e.g., torchaudio to load and resample (see the sketch below)
    pass

waveform = load_waveform("path/to/audio.wav")  # Load your audio here
# Ensure the waveform is sampled at 16 kHz, then pad/chunk it to a 6-second length
input_tensor = preprocessor.process(waveform).unsqueeze(0)  # Add batch dimension

# Run the model on the preprocessed audio
with torch.no_grad():
    output = model(input_tensor)
print("Model output shape:", output.shape)
We use datasets from the information retrieval benchmark BIRB and adapt them for few-shot learning, preparing them in .pt format. Xeno-Canto, which includes focal recordings of over 10,000 bird species, serves as the pretraining dataset. For evaluation, various downstream soundscape datasets are provided, each consisting of 6-second audio segments selected for peak bird activation using CNN14 from PANNs, with all recordings downsampled to 16 kHz. This lightweight data format simplifies the training and evaluation of deep neural networks for bird sound classification, making it especially suited for few-shot learning.
The Xeno-Canto dataset contains a large collection of bird sound recordings optimized for few-shot learning tasks in the context of bird species classification.
- Dataset Summary:
- Total Examples: 684,744 audio segments
- Segment Length: Each segment is 6 seconds long
- Sampling Rate: 16kHz
- Classes: 10,127 unique bird species, each represented by its eBird species code
- Download the Xeno-Canto training data using this script from Hugging Face. Make sure to set the variable `DESTINATION_PATH` to your desired download location.
- Merge and decompress the downloaded tar files:

cat DESTINATION_PATH/*tar* > DESTINATION_PATH/xc-6s-16khz-pann.tar
tar -xvf DESTINATION_PATH/xc-6s-16khz-pann.tar -C DATASET_PATH
The evaluation datasets are sourced from the BIRB benchmark, a collection of soundscape datasets for evaluating bird sound classification in challenging, real-world conditions.
Download the evaluation datasets (validation and test) from Zenodo.
- Validation Dataset:
  - File: `pow.pt`
  - Contains 16,047 examples across 43 classes, organized as a dictionary with `data` and `label` keys to enable efficient and rapid loading during validation. Classes with only one example are omitted to support one-shot classification tasks. A loading sketch is given after the table below.
- Test Datasets: Each test dataset is provided as `.pt` files organized by species code within subfolders. Use the metadata file from the training set to map species codes to common names.

Dataset | Subfolder | Examples | Classes | Zenodo Source
---|---|---|---|---
SSW | `ssw/` | 50,760 | 96 | Link
NES | `coffee_farms/` | 6,952 | 89 | Link
UHH | `hawaii/` | 59,583 | 27 | Link
HSN | `high_sierras/` | 10,296 | 19 | Link
SNE | `sierras_kahl/` | 20,147 | 56 | Link
PER | `peru/` | 14,768 | 132 | Link
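As a minimal loading sketch (assuming the dictionary layout described above; variable names are illustrative), the validation file can be read directly with PyTorch:

```python
import torch

# pow.pt is a dictionary with "data" and "label" keys
pow_set = torch.load("pow.pt", map_location="cpu")
waveforms, labels = pow_set["data"], pow_set["label"]
print(f"Loaded {len(labels)} validation examples")
```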
Install the dependencies listed in requirements.txt from this repository:
pip install -r requirements.txt
ProtoCLR
python3 train_encoder.py --loss protoclr --epochs 300 --nworkers 16 --bs 256 --lr 5e-4 --wd 1e-6 --device cuda:0 --traindir Path_to_Xeno-Canto-6s-16khz/ --evaldir Path_to_parent_folder_of_pow.pt --save --savefreq --freq 100
Cross-Entropy
python3 train_encoder.py --loss ce --epochs 300 --nworkers 16 --bs 256 --lr 5e-4 --wd 1e-6 --device cuda:0 --traindir Path_to_Xeno-Canto-6s-16khz/ --evaldir Path_to_parent_folder_of_pow.pt --save --savefreq --freq 100
- `--loss`: Specifies the loss function to use. The following losses are supported: `protoclr`, `supcon`, `simclr`, and `ce` (cross-entropy).
- `--traindir`: Path to the training data directory containing the Xeno-Canto bird sound dataset. This directory should contain the decompressed data downloaded from Hugging Face.
- `--evaldir`: Path to the evaluation data directory where the validation file (`pow.pt`) is stored. This file will be used for evaluating model performance during training.
For more details about the arguments, refer to args.py.
Note: Adjust the number of workers (`--nworkers`) based on your machine to avoid data loader bottlenecks, which can slow down training.
To evaluate the model on one- and five-shot tasks, run the following script:
python3 test_fewshot.py --modelckpt /path/to/weights.pth --bs 1024 --nworkers 16 --evaldir /path/to/soundscapes --device cuda:0 --report
Note: Reduce the batch size (`--bs`) if it does not fit in your GPU memory.
The following table presents the classification accuracy of various models on one-shot and five-shot bird sound classification tasks, evaluated across different soundscape datasets.
Model | Model Size | PER | NES | UHH | HSN | SSW | SNE | Mean |
---|---|---|---|---|---|---|---|---|
Random Guessing | - | 0.75 | 1.12 | 3.70 | 5.26 | 1.04 | 1.78 | 2.22 |
1-Shot Classification | | | | | | | |
BirdAVES-biox-base | 90M | 7.41±1.0 | 26.4±2.3 | 13.2±3.1 | 9.84±3.5 | 8.74±0.6 | 14.1±3.1 | 13.2 |
BirdAVES-bioxn-large | 300M | 7.59±0.8 | 27.2±3.6 | 13.7±2.9 | 12.5±3.6 | 10.0±1.4 | 14.5±3.2 | 14.2 |
BioLingual | 28M | 6.21±1.1 | 37.5±2.9 | 17.8±3.5 | 17.6±5.1 | 22.5±4.0 | 26.4±3.4 | 21.3 |
Perch | 80M | 9.10±5.3 | 42.4±4.9 | 19.8±5.0 | 26.7±9.8 | 22.3±3.3 | 29.1±5.9 | 24.9 |
CE (Ours) | 19M | 9.55±1.5 | 41.3±3.6 | 19.7±4.7 | 25.2±5.7 | 17.8±1.4 | 31.5±5.4 | 24.2 |
SimCLR (Ours) | 19M | 7.85±1.1 | 31.2±2.4 | 14.9±2.9 | 19.0±3.8 | 10.6±1.1 | 24.0±4.1 | 17.9 |
SupCon (Ours) | 19M | 8.53±1.1 | 39.8±6.0 | 18.8±3.0 | 20.4±6.9 | 12.6±1.6 | 23.2±3.1 | 20.5 |
ProtoCLR (Ours) | 19M | 9.23±1.6 | 38.6±5.1 | 18.4±2.3 | 21.2±7.3 | 15.5±2.3 | 25.8±5.2 | 21.4 |
5-Shot Classification | | | | | | | |
BirdAVES-biox-base | 90M | 11.6±0.8 | 39.7±1.8 | 22.5±2.4 | 22.1±3.3 | 16.1±1.7 | 28.3±2.3 | 23.3 |
BirdAVES-bioxn-large | 300M | 15.0±0.9 | 42.6±2.7 | 23.7±3.8 | 28.4±2.4 | 18.3±1.8 | 27.3±2.3 | 25.8 |
BioLingual | 28M | 13.6±1.3 | 65.2±1.4 | 31.0±2.9 | 34.3±3.5 | 43.9±0.9 | 49.9±2.3 | 39.6 |
Perch | 80M | 21.2±1.2 | 71.7±1.5 | 39.5±3.0 | 52.5±5.9 | 48.0±1.9 | 59.7±1.8 | 48.7 |
CE (Ours) | 19M | 21.4±1.3 | 69.2±1.8 | 35.6±3.4 | 48.2±5.5 | 39.9±1.1 | 57.5±2.3 | 45.3 |
SimCLR (Ours) | 19M | 15.4±1.0 | 54.0±1.8 | 23.0±2.3 | 32.8±4.0 | 22.0±1.2 | 40.7±2.4 | 31.3 |
SupCon (Ours) | 19M | 17.2±1.3 | 64.6±2.4 | 34.1±2.9 | 42.5±2.9 | 30.8±0.8 | 48.1±2.4 | 39.5 |
ProtoCLR (Ours) | 19M | 19.2±1.1 | 67.9±2.8 | 36.1±4.3 | 48.0±4.3 | 34.6±2.3 | 48.6±2.8 | 42.4 |
@misc{moummad2024dirlbs,
title={Domain-Invariant Representation Learning of Bird Sounds},
author={Ilyass Moummad and Romain Serizel and Emmanouil Benetos and Nicolas Farrugia},
year={2024},
eprint={2409.08589},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2409.08589},
}