Your opinion on PEASS? #78
If you have access to the MATLAB Wavelet Toolbox (for the cqt/icqt functions: https://www.mathworks.com/help/wavelet/ref/cqt.html), I have written this algorithm for harmonic/percussive/vocal source separation, loosely based on an iterative version of Fitzgerald's soft-masking median-filtering HPSS. For now it only works on mono wav files. It obtains better PEASS scores than UMX (to compare with UMX, I use the MUSDB18-HQ pretrained PyTorch model, and set …). Usage:
function HarmonicPercussiveVocal(filename, varargin)
p = inputParser;
WindowSizeP = 1024;
HopSizeP = 256;
Power = 2;
LHarmSTFT = 17;
LPercSTFT = 17;
LHarmCQT = 17;
LPercCQT = 7;
defaultOutDir = '.';
addRequired(p, 'filename', @ischar);
addOptional(p, 'OutDir', defaultOutDir, @ischar);
parse(p, filename, varargin{:});
[x, fs] = audioread(p.Results.filename);
%%%%%%%%%%%%%%%%%%%
% FIRST ITERATION %
%%%%%%%%%%%%%%%%%%%
% CQT of original signal
[cfs1,~,g1,fshifts1] = cqt(x, 'SamplingFrequency', fs, 'BinsPerOctave', 96);
cmag1 = abs(cfs1); % use the magnitude CQT for creating masks
H1 = movmedian(cmag1, LHarmCQT, 2);
P1 = movmedian(cmag1, LPercCQT, 1);
% soft masks, Fitzgerald 2010 - p is usually 1 or 2
Hp1 = H1 .^ Power;
Pp1 = P1 .^ Power;
total1 = Hp1 + Pp1;
Mh1 = Hp1 ./ total1;
Mp1 = Pp1 ./ total1;
% recover the complex CQT coefficients H and P using the masks
H1 = Mh1 .* cfs1;
P1 = Mp1 .* cfs1;
% finally icqt to convert back to audio
xh1 = icqt(H1, g1, fshifts1);
xp1 = icqt(P1, g1, fshifts1);
%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SECOND ITERATION, VOCAL %
%%%%%%%%%%%%%%%%%%%%%%%%%%%
xim2 = xp1;
% CQT of the first-pass percussive estimate
[cfs2,~,g2,fshifts2] = cqt(xim2, 'SamplingFrequency', fs, 'BinsPerOctave', 24);
cmag2 = abs(cfs2); % use the magnitude CQT for creating masks
H2 = movmedian(cmag2, LHarmCQT, 2);
P2 = movmedian(cmag2, LPercCQT, 1);
% soft mask
Hp2 = H2 .^ Power;
Pp2 = P2 .^ Power;
total2 = Hp2 + Pp2;
Mh2 = Hp2 ./ total2;
Mp2 = Pp2 ./ total2;
% todo - set bins of mask below 100hz to 0
% recover the complex CQT coefficients H and P using the masks
H2 = Mh2 .* cfs2;
P2 = Mp2 .* cfs2;
% finally icqt to convert back to audio
xh2 = icqt(H2, g2, fshifts2);
xp2 = icqt(P2, g2, fshifts2);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THIRD ITERATION, PERCUSSIVE %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
xim3 = xp1 + xp2;
% STFT parameters
winLen3 = WindowSizeP;
fftLen3 = winLen3 * 2;
overlapLen3 = HopSizeP;
win3 = sqrt(hann(winLen3, "periodic"));
% STFT of the combined estimate from the previous passes
S3 = stft(xim3, "Window", win3, "OverlapLength", overlapLen3, ...
"FFTLength", fftLen3, "Centered", true);
halfIdx3 = 1:ceil(size(S3, 1) / 2); % only half the STFT matters
Shalf3 = S3(halfIdx3, :);
Smag3 = abs(Shalf3); % use the magnitude STFT for creating masks
% median filters
H3 = movmedian(Smag3, LHarmSTFT, 2);
P3 = movmedian(Smag3, LPercSTFT, 1);
% soft masks, Fitzgerald 2010 - p is usually 1 or 2
Hp3 = H3 .^ Power;
Pp3 = P3 .^ Power;
total3 = Hp3 + Pp3;
Mp3 = Pp3 ./ total3;
% recover the complex STFT H and P from S using the masks
P3 = Mp3 .* Shalf3;
% we previously dropped the redundant second half of the fft
P3 = cat(1, P3, flipud(conj(P3)));
% finally istft to convert back to audio
xp3 = istft(P3, "Window", win3, "OverlapLength", overlapLen3,...
"FFTLength", fftLen3, "ConjugateSymmetric", true);
% fix up some lengths
if size(xh1, 1) < size(x, 1)
xh1 = [xh1; x(size(xh1, 1)+1:size(x, 1))];
end
if size(xp3, 1) < size(x, 1)
xp3 = [xp3; x(size(xp3, 1)+1:size(x, 1))];
end
if size(xh2, 1) < size(x, 1)
xh2 = [xh2; x(size(xh2, 1)+1:size(x, 1))];
xp2 = [xp2; x(size(xp2, 1)+1:size(x, 1))];
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FOURTH ITERATION, REFINE HARMONIC %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% use 2nd iter vocal estimation to improve harmonic sep
x_vocal = xh2;
x_harmonic = xh1;
x_percussive = xp3;
% CQT of harmonic signal, at a coarser resolution (12 bins per octave)
[cfs4,~,g4,fshifts4] = cqt(x_harmonic, 'SamplingFrequency', fs, 'BinsPerOctave', 12);
[cfs4_vocal,~,~,~] = cqt(x_vocal, 'SamplingFrequency', fs, 'BinsPerOctave', 12);
[cfs4_percussive,~,~,~] = cqt(x_percussive, 'SamplingFrequency', fs, 'BinsPerOctave', 12);
cmag4 = abs(cfs4); % use the magnitude CQT for creating masks
cmag4_vocal = abs(cfs4_vocal);
cmag4_percussive = abs(cfs4_percussive);
% soft masks, Fitzgerald 2010 - p is usually 1 or 2
H4 = cmag4 .^ Power;
V4 = cmag4_vocal .^ Power;
P4 = cmag4_percussive .^ Power;
total4 = H4 + V4 + P4;
Mh4 = H4 ./ total4;
H4 = Mh4 .* cfs4;
% finally icqt to convert back to audio
xh4 = icqt(H4, g4, fshifts4);
[~,fname,~] = fileparts(p.Results.filename);
splt = split(fname, "_");
prefix = splt{1};
% fix up some lengths
if size(xh4, 1) < size(x, 1)
xh4 = [xh4; x(size(xh4, 1)+1:size(x, 1))];
end
xhOut = sprintf("%s/%s_harmonic.wav", p.Results.OutDir, prefix);
xpOut = sprintf("%s/%s_percussive.wav", p.Results.OutDir, prefix);
xvOut = sprintf("%s/%s_vocal.wav", p.Results.OutDir, prefix);
audiowrite(xhOut, xh4, fs);
audiowrite(xpOut, xp3, fs);
audiowrite(xvOut, xh2, fs);
end
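For readers without the Wavelet Toolbox, the core operation of each pass above (median filtering the magnitude spectrogram along time and frequency, then Fitzgerald's soft masks) can be sketched in Python with numpy/scipy. This is an STFT-based illustration under assumed parameters, not a port of the CQT code; the function name and kernel sizes are mine.

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import stft, istft

def hpss_soft_mask(x, fs, n_fft=2048, hop=512, l_harm=17, l_perc=17, power=2.0):
    # STFT of the mono input signal
    _, _, S = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(S)
    # median-filter the magnitudes along time (harmonic) and frequency (percussive)
    H = median_filter(mag, size=(1, l_harm))
    P = median_filter(mag, size=(l_perc, 1))
    # Fitzgerald (2010) Wiener-like soft masks, p usually 1 or 2
    Hp, Pp = H ** power, P ** power
    total = Hp + Pp + 1e-12  # avoid division by zero in silent bins
    # apply the masks to the complex STFT and invert
    _, xh = istft(Hp / total * S, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, xp = istft(Pp / total * S, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return xh[: len(x)], xp[: len(x)]
```

As with LHarm/LPerc in the MATLAB code, the behaviour on real tracks depends heavily on the median-filter kernel lengths and on the STFT resolution.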
Hi. Concerning PEASS, long story short: it's super slow, trained on antediluvian data, and would need a serious update, but the idea is nice. It was never really adopted, mostly due to its slowness.
Yes, it's inspired by his algorithm (two passes: a large window plus a small window), and Fitzgerald later revisited it and added a multipass version with a CQT for voice separation: https://arrow.tudublin.ie/cgi/viewcontent.cgi?article=1007&context=argart
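That two-pass idea (a long window for frequency resolution to lift out the harmonic part, then a short window on the residual for time resolution) can be sketched as follows; the window lengths and filter sizes here are illustrative assumptions, not Fitzgerald's exact settings.

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import stft, istft

def _separate(x, fs, n_fft, hop, kernel=17, power=2.0):
    # one soft-mask HPSS pass at a given STFT resolution
    _, _, S = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(S)
    Hp = median_filter(mag, size=(1, kernel)) ** power  # smooth over time
    Pp = median_filter(mag, size=(kernel, 1)) ** power  # smooth over frequency
    total = Hp + Pp + 1e-12
    _, xh = istft(Hp / total * S, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, xp = istft(Pp / total * S, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return xh[: len(x)], xp[: len(x)]

def two_pass_hpss(x, fs):
    # pass 1: long window -> good frequency resolution, extract harmonic
    xh, res = _separate(x, fs, n_fft=4096, hop=1024)
    # pass 2: short window on the residual -> good time resolution, extract percussive
    mid, xp = _separate(res, fs, n_fft=512, hop=128)
    return xh, mid, xp  # harmonic, leftover residual, percussive
```

In Fitzgerald's vocal variant the leftover residual of the second pass is what carries most of the voice; here it is just returned as-is.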
Hello,
I'm working on a project where I implement and compare various source separation algorithms. I am using PEASS:
http://bass-db.gforge.inria.fr/peass/ as an evaluation metric (supposedly a perceptual evolution of BSS Eval).
In a specific case, I noticed one algorithm gets higher PEASS scores across the board (artifact/APS, interference/IPS, target/TPS) than https://github.com/sigsep/open-unmix-pytorch, but lower BSS Eval scores across the board.
Has the sigsep community looked at PEASS? Compared it to BSS Eval?
Thanks.
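For context on what the BSS Eval numbers measure: they ultimately report energy ratios in dB between a reference and the estimation error. A much-simplified SDR (ignoring bsseval's projection-based decomposition into interference and artifact terms) looks like this; the function name is mine, not part of any of the toolkits above.

```python
import numpy as np

def simple_sdr(reference, estimate):
    # signal-to-distortion ratio in dB: reference energy over error energy
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + 1e-12))

rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)
print(simple_sdr(ref, ref))  # near-perfect estimate: very large SDR
print(simple_sdr(ref, ref + 0.1 * rng.standard_normal(8000)))  # roughly 20 dB
```

PEASS replaces these raw energy ratios with learned perceptual scores, which is exactly why the two metric families can disagree on a ranking.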