
Your opinion on PEASS? #78

Open
sevagh opened this issue Feb 13, 2021 · 3 comments

@sevagh commented Feb 13, 2021

Hello,

I'm working on a project where I implement and compare various source separation algorithms. I am using PEASS (http://bass-db.gforge.inria.fr/peass/) for evaluation; it is supposedly a perceptual evolution of BSS Eval.

In a specific case, I noticed one algorithm gets higher PEASS scores across the board (artifact/APS, interference/IPS, target/TPS) than https://github.com/sigsep/open-unmix-pytorch, but lower BSS Eval scores across the board.

Has the sigsep community looked at PEASS? Compared it to BSS Eval?

Thanks.
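(For reference, the BSS Eval side of this comparison is a single toolbox call; a minimal MATLAB sketch, assuming the BSS Eval toolbox's bss_eval_sources is on the path and using placeholder file names:

% score one mono estimate against its reference with BSS Eval
% (reference_vocals.wav / estimated_vocals.wav are hypothetical names)
[s_ref, ~] = audioread('reference_vocals.wav');
[s_est, ~] = audioread('estimated_vocals.wav');
n = min(size(s_ref, 1), size(s_est, 1));   % align lengths before scoring
[SDR, SIR, SAR] = bss_eval_sources(s_est(1:n).', s_ref(1:n).');
fprintf('SDR %.2f dB, SIR %.2f dB, SAR %.2f dB\n', SDR, SIR, SAR);
)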

@sevagh (Author) commented Feb 19, 2021

If you have access to the MATLAB Wavelet Toolbox (for the cqt/icqt functions: https://www.mathworks.com/help/wavelet/ref/cqt.html), here is an algorithm I wrote for harmonic/percussive/vocal source separation, loosely based on an iterative version of Fitzgerald's soft-masking median-filtering HPSS. For now it only works on mono WAV files.

It obtains better PEASS scores than UMX (to compare with UMX, I use the MUSDB18-HQ pretrained PyTorch model and set bass + other = harmonic, drums = percussive, vocals = vocal), but much worse BSSv4 scores (or even original BSS Eval scores).

Usage is HarmonicPercussiveVocal("path_to_mix.wav"), which writes the harmonic, percussive, and vocal component files to the current directory; see the driver sketch after the listing.

function HarmonicPercussiveVocal(filename, varargin)
p = inputParser;

WindowSizeP = 1024;
HopSizeP = 256;

Power = 2;

LHarmSTFT = 17;
LPercSTFT = 17;

LHarmCQT = 17;
LPercCQT = 7;

defaultOutDir = '.';

addRequired(p, 'filename', @ischar);
addOptional(p, 'OutDir', defaultOutDir, @ischar);

parse(p, filename, varargin{:});

[x, fs] = audioread(p.Results.filename);

%%%%%%%%%%%%%%%%%%%
% FIRST ITERATION %
%%%%%%%%%%%%%%%%%%%

% CQT of original signal
[cfs1,~,g1,fshifts1] = cqt(x, 'SamplingFrequency', fs, 'BinsPerOctave', 96);

cmag1 = abs(cfs1); % use the magnitude CQT for creating masks
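
% why two median filters (Fitzgerald 2010): harmonic partials trace
% horizontal ridges in the time-frequency plane, so a median filter along
% time (dim 2) preserves them; percussive hits are broadband vertical
% ridges, so a median filter along frequency (dim 1) preserves those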

H1 = movmedian(cmag1, LHarmCQT, 2);
P1 = movmedian(cmag1, LPercCQT, 1);

% soft masks, Fitzgerald 2010 - p is usually 1 or 2
Hp1 = H1 .^ Power;
Pp1 = P1 .^ Power;
total1 = Hp1 + Pp1;
Mh1 = Hp1 ./ total1;
Mp1 = Pp1 ./ total1;

% recover the complex harmonic and percussive CQT coefficients using the masks
H1 = Mh1 .* cfs1;
P1 = Mp1 .* cfs1;

% finally, invert the CQT to get back to audio
xh1 = icqt(H1, g1, fshifts1);
xp1 = icqt(P1, g1, fshifts1);

%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SECOND ITERATION, VOCAL %
%%%%%%%%%%%%%%%%%%%%%%%%%%%
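
% the first pass's percussive estimate still carries the vocals; a second
% HPSS pass at a coarser 24 bins per octave pushes the vocals onto the
% "harmonic" side (xh2) and leaves the rest as percussive (xp2)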

xim2 = xp1;

% CQT of the first-pass percussive estimate
[cfs2,~,g2,fshifts2] = cqt(xim2, 'SamplingFrequency', fs, 'BinsPerOctave', 24);

cmag2 = abs(cfs2); % use the magnitude CQT for creating masks

H2 = movmedian(cmag2, LHarmCQT, 2);
P2 = movmedian(cmag2, LPercCQT, 1);

% soft mask
Hp2 = H2 .^ Power;
Pp2 = P2 .^ Power;
total2 = Hp2 + Pp2;
Mh2 = Hp2 ./ total2;
Mp2 = Pp2 ./ total2;

% todo - set bins of mask below 100hz to 0

% recover the complex harmonic and percussive CQT coefficients using the masks
H2 = Mh2 .* cfs2;
P2 = Mp2 .* cfs2;

% finally, invert the CQT to get back to audio
xh2 = icqt(H2, g2, fshifts2);
xp2 = icqt(P2, g2, fshifts2);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THIRD ITERATION, PERCUSSIVE %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
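
% input: the first pass's percussive estimate plus the second pass's
% percussive remainder; a short 1024-sample STFT window gives better time
% resolution for the final percussive mask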

xim3 = xp1 + xp2;

% STFT parameters
winLen3 = WindowSizeP;
fftLen3 = winLen3 * 2;
overlapLen3 = HopSizeP;
win3 = sqrt(hann(winLen3, "periodic"));

% STFT of original signal
S3 = stft(xim3, "Window", win3, "OverlapLength", overlapLen3, ...
  "FFTLength", fftLen3, "Centered", true);

halfIdx3 = 1:ceil(size(S3, 1) / 2); % the STFT of a real signal is conjugate-symmetric, so keep only half
Shalf3 = S3(halfIdx3, :);
Smag3 = abs(Shalf3); % use the magnitude STFT for creating masks

% median filters
H3 = movmedian(Smag3, LHarmSTFT, 2);
P3 = movmedian(Smag3, LPercSTFT, 1);

% soft masks, Fitzgerald 2010 - p is usually 1 or 2
Hp3 = H3 .^ Power;
Pp3 = P3 .^ Power;
total3 = Hp3 + Pp3;
Mp3 = Pp3 ./ total3;

% recover the complex percussive STFT from S using the mask
P3 = Mp3 .* Shalf3;

% we previously dropped the redundant second half of the fft
P3 = cat(1, P3, flipud(conj(P3)));

% finally istft to convert back to audio
xp3 = istft(P3, "Window", win3, "OverlapLength", overlapLen3,...
  "FFTLength", fftLen3, "ConjugateSymmetric", true);

% pad short outputs to the input length using the tail of the mix
if size(xh1, 1) < size(x, 1)
    xh1 = [xh1; x(size(xh1, 1)+1:size(x, 1))];
end

if size(xp3, 1) < size(x, 1)
    xp3 = [xp3; x(size(xp3, 1)+1:size(x, 1))];
end

if size(xh2, 1) < size(x, 1)
    xh2 = [xh2; x(size(xh2, 1)+1:size(x, 1))];
    xp2 = [xp2; x(size(xp2, 1)+1:size(x, 1))];
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FOURTH ITERATION, REFINE HARMONIC %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
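
% refinement: build a three-way soft mask from the magnitude CQTs of the
% harmonic, vocal, and percussive estimates, and keep only the harmonic
% share of the harmonic signal's CQT (a rough Wiener-filter-style step)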

% use 2nd iter vocal estimation to improve harmonic sep
x_vocal = xh2;
x_harmonic = xh1;
x_percussive = xp3;

% CQT of the harmonic, vocal, and percussive estimates, at a coarse
% 12 bins per octave (versus the first pass's 96)
[cfs4,~,g4,fshifts4] = cqt(x_harmonic, 'SamplingFrequency', fs, 'BinsPerOctave', 12);
[cfs4_vocal,~,~,~] = cqt(x_vocal, 'SamplingFrequency', fs, 'BinsPerOctave', 12);
[cfs4_percussive,~,~,~] = cqt(x_percussive, 'SamplingFrequency', fs, 'BinsPerOctave', 12);

cmag4 = abs(cfs4); % use the magnitude CQT for creating masks
cmag4_vocal = abs(cfs4_vocal);
cmag4_percussive = abs(cfs4_percussive);

% soft masks, Fitzgerald 2010 - p is usually 1 or 2
H4 = cmag4 .^ Power;
V4 = cmag4_vocal .^ Power;
P4 = cmag4_percussive .^ Power;
total4 = H4 + V4 + P4;
Mh4 = H4 ./ total4;

H4 = Mh4 .* cfs4;

% finally, invert the CQT to get back to audio
xh4 = icqt(H4, g4, fshifts4);

[~,fname,~] = fileparts(p.Results.filename);
splt = split(fname, "_");
prefix = splt{1};

% pad the refined harmonic output to the input length using the tail of the mix
if size(xh4, 1) < size(x, 1)
    xh4 = [xh4; x(size(xh4, 1)+1:size(x, 1))];
end

xhOut = sprintf("%s/%s_harmonic.wav", p.Results.OutDir, prefix);
xpOut = sprintf("%s/%s_percussive.wav", p.Results.OutDir, prefix);
xvOut = sprintf("%s/%s_vocal.wav", p.Results.OutDir, prefix);

audiowrite(xhOut, xh4, fs);
audiowrite(xpOut, xp3, fs);
audiowrite(xvOut, xh2, fs);
end
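
For completeness, a minimal driver sketch (the file names and UMX stem paths are hypothetical; it assumes UMX's four MUSDB stems have already been exported as WAVs):

% separate a mono mix; writes <prefix>_harmonic/_percussive/_vocal.wav to cwd
HarmonicPercussiveVocal("song_mix.wav");

% the UMX mapping described above: bass + other -> harmonic,
% drums -> percussive, vocals -> vocal (file names are placeholders)
[b, fs] = audioread("umx/bass.wav");
[o, ~] = audioread("umx/other.wav");
umx_harmonic = b + o;                                          % sum the two stems
umx_harmonic = umx_harmonic / max(1, max(abs(umx_harmonic(:)))); % avoid clipping
audiowrite("umx_harmonic.wav", umx_harmonic, fs);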

@aliutkus (Member) commented Feb 20, 2021

Hi,
Nice! This actually makes me think of the work @jonathandriedger did back in the day!

Concerning PEASS, long story short: it's super slow, trained on antediluvian data, and would need a serious update, but the idea is nice. It was never really adopted, mostly due to its slowness.

@sevagh (Author) commented Feb 20, 2021

Yes, it's inspired by his algorithm (two passes, with a large window and then a small window); Fitzgerald also later revisited the idea and added a multipass version using the CQT for voice separation: https://arrow.tudublin.ie/cgi/viewcontent.cgi?article=1007&context=argart
