Quick library to extract pause length features from audio files.
I'm assuming you are running this on a Mac computer (this is the only operating system tested).
First, make sure you have installed Python3, FFmpeg, and SoX via Homebrew:
brew install python3 sox ffmpeg
Now, clone the repository and install all require dependencies:
git clone [email protected]:jim-schwoebel/pauses.git
cd pauses
pip3 install -r requirements.txt
The extract_pauselength.py script uses sys.argv[] convention to pass through variables in the terminal. For more information on this, check out this StackOverflow post.
To simplify things a bit, I recorded a few files that I could use for reference (in ./data folder) - slow, moderate, moderate-fast, and fast speaking (reading the constitution of the US).
I then used pydub to segment based on a threshold of 50 milleseconds segments and -32 dBFS (to allow for detection of fast speaking events) as a silence interval. This parameter likely needs to be tuned to the dataset and speaker power, etc. and is likely overfitted to my voice. Nonetheless, this gives a proof-of-concept implementation of how to segment speaking segments from non-speaking segments with a threshold. I then calculated pause length as total duration (seconds) over the counted number of segments (e.g. number of pauses) - to get a sec/pause.
Run the script in the terminal with:
python3 extract_pauses_1.py n y
If you want to record a file you can do this by:
python3 extract_pauses_1.py y n
After you record it it will display the pause length and create a .JSON file.
If you want to both record a file (10 seconds) and process all the files in the ./data director you can run
python3 extract_pauses_1.py y y
Another technique that can be used is to train a machine learning model to detect pause lengths. In this case, I trained a quick machine learning model from 5-6 files separating the files into 200 millisecond windows and labeling each one as a 'pause' or a 'speech' event. I used the train_audioTPOT.py script found in the voicebook repository with the librosa feature embedding (librosa_features.py). The model achieves around 91.22807017543859% accuracy with an optimized SVM model.
To run this script, you must first put some files in the load_dir folder when you clone the repository (e.g. 'fast.wav').
Next, run the script:
python3 extract_pauses_2.py
The audio files in ./load_dir are then spliced into 200 millisecond segments and classified as silence or speech events. What results is a file in the ./load_dir that corresponds with the speech file (e.g. fast.wav --> fast.json) with the following information:
{"filename": "fast.wav", "total_length": 1.0, "mean": 0.4, "std": 0.20000000000000007, "max_value": 0.6000000000000001, "min_pause": 0.2, "median": 0.4}
As you can see, you get a bit more information here. Note this was a proof-of-concept and likely needs to be augmented with other datasets for it to work robustly across speakers.
Both scripts are limited to low-noise environments. If there is a lot of background noise in your file, I'd first suggest cleaning them and removing noise (e.g. with SoX) before using this script to calculate pause lengths.
Any feedback this repository is greatly appreciated.
- If you find something that is missing or doesn't work, please consider opening a GitHub issue.
- If you want to learn more about voice computing, check out Voice Computing in Python book.
- If you'd like to be mentored by someone on our team, check out the Innovation Fellows Program.
- If you want to talk to me directly, please send me an email @ [email protected].
This repository is licensed under the Apache 2.0 License.