Improve BIC Search Algorithm #65

Open · michaelweylandt opened this issue Jun 15, 2020 · 0 comments

michaelweylandt commented Jun 15, 2020

When doing nested BIC search, do we need to update both U and V? (The current code holds V constant while optimizing over U, and vice versa.) Formally, in the BIC expression, v is a function of u_hat, so it seems odd not to update v. On the flip side, we're just optimizing a regression (with a constraint). My concern is that, if we're just solving the penalized regression problem and Xv is a sub-unit vector, then we can always choose u to be exactly Xv, which gives zero residual variance and stumbles into a log(0) problem in the BIC. (Or at least, that's what I got out of a quick skim. Please correct me if I'm wrong, @Banana1530.)
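
To make that failure mode concrete, here is a minimal, self-contained sketch (illustrative values only, not from the codebase) of the degenerate choice described above:

X = 0.1 * randn(10, 4);      % small X so that ||X * v_hat|| < 1 (the "sub-unit" case)
v_hat = randn(4, 1); v_hat = v_hat / norm(v_hat);
norm(X * v_hat)              % typically < 1 here, so u = X * v_hat is feasible
u_degenerate = X * v_hat;    % "perfect fit" choice of u
sigma_hat_sq = mean((X * v_hat - u_degenerate).^2)  % exactly zero
log(sigma_hat_sq)            % -Inf: the BIC's log-variance term blows up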

For a well-initialized V, this doesn't make a big difference, so we can usually get away with it: in particular, if V is well initialized (maybe because we're already searching over a full grid, or the tuning parameters are small enough that the SVD solution is close to correct), then V_init and the stationary point V_stationary are close, so BIC(U, V_init) and BIC(U, V_stationary) will be similar. That seems like too strong an assumption to rely on in practice, though.
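
A self-contained sketch of that comparison (made-up sizes, not from the original post): perturb a near-stationary v slightly and check that the BIC at fixed u barely moves.

n = 50; p = 25; X = randn(n, p);
[Us, ~, Vs] = svd(X); u = Us(:, 1); v_stat = Vs(:, 1);   % treat the SVD pair as stationary
v_init = v_stat + 0.01 * randn(p, 1); v_init = v_init / norm(v_init);
bic = @(u, v) log(mean((X * v - u).^2) / n) + log(n) / n * sum(u ~= 0);  % same form as below
abs(bic(u, v_init) - bic(u, v_stat))   % small when v_init is near the stationary point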

The experiments below show what I mean. If we set UPDATE_V = 0 but leave INITIALIZE_SVD = 1, we are saved by the fact that v_star and V1 are quite close; but if we set both flags to 0, corresponding to a random initialization of v_hat with no updates, things go badly. If we set INITIALIZE_SVD = 0 but keep UPDATE_V = 1, we do better, but not nearly as well as with the SVD initialization.

UPDATE_V = 1;        % Set to 0 to fix V at its initialization value
INITIALIZE_SVD = 1;  % Set to 0 to initialize U, V to random unit vectors

n = 50; p = 25; s = 5; 
u_star = [ones(s, 1); zeros(n - s, 1)]; v_star = [zeros(p - s, 1); ones(s, 1)]; d = 3; 

N = randn(n, p);
S = d * u_star * v_star'; 

X = S + N;  
[U, ~, V] = svd(X); 

st = @(x, lambda) sign(x) .* max(abs(x) - lambda, 0);  % soft-thresholding (elementwise l1 prox)
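% For example (illustrative values): st([-3 0.5 2], 1) returns [-2 0 1];
% entries shrink toward zero by lambda, and anything with |x| <= lambda
% is zeroed exactly.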

% Initialize SFPCA
U1 = U(:, 1); V1 = V(:, 1);


% SFPCA search on lambda_U
% Keep v parameters fixed for now...
lambda_u_range = linspace(0, 5, 51);
n_lambda_u     = size(lambda_u_range, 2);

sigma_hat_holder = zeros(size(lambda_u_range)); 
bic_holder     = zeros(size(lambda_u_range));
df_holder      = zeros(size(lambda_u_range)); 

for lu_ix=1:n_lambda_u
  lu = lambda_u_range(lu_ix);
  
  % Quick and dirty SFPCA - only doing sparsity in U
  
  if INITIALIZE_SVD
      u_hat = U1; 
      v_hat = V1; 
  else 
      u_hat = randn(n, 1); u_hat = u_hat / norm(u_hat); 
      v_hat = randn(p, 1); v_hat = v_hat / norm(v_hat);
  end
  
  u_hat_old = u_hat + 5000; v_hat_old = v_hat + 5000;  % offset to force entry into the loops below
  
  while norm(u_hat - u_hat_old) + norm(v_hat - v_hat_old) > 1e-6
      while norm(u_hat - u_hat_old) > 1e-6
          u_hat_old = u_hat; 
          u_hat = st(X * v_hat, lu); 
          u_hat = u_hat / norm(u_hat); 
      end
      
      if UPDATE_V
          while norm(v_hat - v_hat_old) > 1e-6
              v_hat_old = v_hat;
              v_hat = st(u_hat' * X, 0);
              v_hat = v_hat / norm(v_hat);
              v_hat = v_hat'; % Keep sizes correct
          end
      else
          v_hat_old = v_hat;
      end
  end
  
  sigma_hat_sq = mean((X * v_hat - u_hat).^2);  % residual variance of the fit

  sigma_hat_holder(lu_ix) = sigma_hat_sq;
  df_holder(lu_ix)  = sum(u_hat ~= 0);          % degrees of freedom = size of the support of u
  bic_holder(lu_ix) = log(sigma_hat_sq / n) + 1 / n * log(n) * sum(u_hat ~= 0);  % log-variance term + (log n / n) * df penalty
end

[min_bic, min_bic_ind] = min(bic_holder);

lambda_u_optimal = lambda_u_range(min_bic_ind);  % BIC-optimal lambda_u; refit below
  
% Refit at lambda_u_optimal: same quick and dirty SFPCA as above, only sparsity in U

if INITIALIZE_SVD
    u_hat = U1; 
    v_hat = V1; 
else 
    u_hat = randn(n, 1); u_hat = u_hat / norm(u_hat); 
    v_hat = randn(p, 1); v_hat = v_hat / norm(v_hat);
end
  
u_hat_old = u_hat + 5000; v_hat_old = v_hat + 5000;  % offset to force entry into the loops below
  
while norm(u_hat - u_hat_old) + norm(v_hat - v_hat_old) > 1e-6
    while norm(u_hat - u_hat_old) > 1e-6
        u_hat_old = u_hat; 
        u_hat = st(X * v_hat, lambda_u_optimal); 
        u_hat = u_hat / norm(u_hat); 
    end
     
    if UPDATE_V
        while norm(v_hat - v_hat_old) > 1e-6
            v_hat_old = v_hat;
            v_hat = st(u_hat' * X, 0);
            v_hat = v_hat / norm(v_hat);
            v_hat = v_hat'; % Keep sizes correct
        end
    else
        v_hat_old = v_hat;
    end
end

snr = max(svd(X)) / max(svd(N));  % SNR: leading singular value of X over that of the noise

u_hat_supp = u_hat ~= 0;        % estimated support of u
u_star_supp = u_star ~= 0;      % true support
u_star_nonsupp = u_star == 0;   % true non-support

tpr = mean(u_hat_supp(u_star_supp));     % true positive rate
fpr = mean(u_hat_supp(u_star_nonsupp));  % false positive rate
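
For reference, a minimal sketch of how the 1000-replicate summary below could be produced; run_experiment is a hypothetical wrapper around the script above, not an existing function:

% Hypothetical replicate driver: run_experiment(update_v, initialize_svd)
% wraps the script above and returns [tpr, fpr, snr] for one random draw.
n_reps = 1000;
results = zeros(n_reps, 3);
for rep = 1:n_reps
    results(rep, :) = run_experiment(UPDATE_V, INITIALIZE_SVD);
end
mean(results, 1)   % average [TPR, FPR, SNR], as tabulated below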

Running 1000 replicates, I see

                                 | TPR | FPR  | SNR
UPDATE_V = 1, INITIALIZE_SVD = 1 | 96% | 0%   | 1.5
UPDATE_V = 1, INITIALIZE_SVD = 0 | 42% | 26%  | 1.5
UPDATE_V = 0, INITIALIZE_SVD = 1 | 96% | 0.1% | 1.5
UPDATE_V = 0, INITIALIZE_SVD = 0 | 36% | 18%  | 1.5

My hunch is that the UPDATE_V = 0, INITIALIZE_SVD = 1 case would suffer more in harder settings than this one: the angle between V1, the leading right singular vector of S + N, and v_star isn't very large for this problem.
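
As a quick check of that last claim, the angle can be measured directly (a short sketch reusing V1 and v_star from the script above):

v_star_unit = v_star / norm(v_star);               % normalize the true direction
angle_deg = acosd(min(abs(V1' * v_star_unit), 1))  % small angle => SVD init is nearly oracle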
