Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SparseOptimization pattern discrepancy #77

Open
rpalaganas opened this issue Jan 10, 2024 · 21 comments · May be fixed by #125
Open

SparseOptimization pattern discrepancy #77

rpalaganas opened this issue Jan 10, 2024 · 21 comments · May be fixed by #125

Comments

@rpalaganas
Copy link

rpalaganas commented Jan 10, 2024

Good afternoon! I recently ran into an issue where there is pattern discrepancy between runs with sparseOptimization set to TRUE versus FALSE. The code I ran and the output is below. With sparseOptimization set to TRUE I noticed that the ChiSq value was -nan and during the equilibration phase, the P matrix was 0. With sparseOptimization set to FALSE there seemed to be no problems, however the number of patterns learned differed in either case, i.e. SparseOptimization = TRUE gave 5 patterns while SparseOptimization = FALSE gave 6 patterns. This was true for a range of patterns that I ran (5-50)

SPARSE OPTIMIZATION ENABLED

params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, 
sparseOptimization=TRUE,
distributed="genome-wide")

params <- setDistributedParams(params, nSets=6)

Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          6 
cut            5 
minNS          3 
maxNS          9 

Creating subsets...
set sizes (min, mean, max): (5067, 5067.833, 5072)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 4 is starting!
    worker 6 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 13376(A), 1242(P), ChiSq: -nan, Time: 00:00:45 / 01:16:13
...
30000 of 30000, Atoms: 20636(A), 1461(P), ChiSq: -nan, Time: 00:35:40 / 01:16:38
-- Sampling Phase --
1000 of 30000, Atoms: 20671(A), 1460(P), ChiSq: -nan, Time: 00:36:54 / 01:16:28
...
29000 of 30000, Atoms: 20645(A), 1469(P), ChiSq: -nan, Time: 01:12:07 / 01:13:27
    worker 2 is finished! Time: 01:12:22
30000 of 30000, Atoms: 20670(A), 1484(P), ChiSq: -nan, Time: 01:13:21 / 01:13:21
    worker 1 is finished! Time: 01:13:21
    worker 3 is finished! Time: 01:13:24
    worker 5 is finished! Time: 01:15:26
    worker 4 is finished! Time: 01:15:26
    worker 6 is finished! Time: 01:19:08

Matching Patterns Across Subsets...
Running Final Stage...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 6 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 10022(A), 0(P), ChiSq: -nan, Time: 00:00:27 / 00:45:43
...
30000 of 30000, Atoms: 15174(A), 0(P), ChiSq: -nan, Time: 00:47:13 / 00:47:13
    worker 1 is finished! Time: 00:47:13
    worker 2 is finished! Time: 00:47:28
    worker 5 is finished! Time: 00:47:34
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

SPARSE OPTIMIZATION DISABLED

params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42,
distributed="genome-wide")

params <- setDistributedParams(params, nSets=6)

Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   FALSE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          6 
cut            5 
minNS          3 
maxNS          9 

Creating subsets...
set sizes (min, mean, max): (5067, 5067.833, 5072)
Running Across Subsets...

    worker 2 is starting!
    worker 3 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 4 is starting!
    worker 5 is starting!
    worker 6 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 4665(A), 966(P), ChiSq: 5137063, Time: 00:01:16 / 02:08:43
...
30000 of 30000, Atoms: 9933(A), 2460(P), ChiSq: 4886798, Time: 00:49:52 / 01:47:09
-- Sampling Phase --
1000 of 30000, Atoms: 10033(A), 2514(P), ChiSq: 4886740, Time: 00:51:31 / 01:46:45
...
30000 of 30000, Atoms: 9953(A), 2489(P), ChiSq: 4886819, Time: 01:34:05 / 01:34:05
    worker 1 is finished! Time: 01:34:05
    worker 5 is finished! Time: 01:44:52
    worker 4 is finished! Time: 01:54:06
    worker 2 is finished! Time: 01:54:29
    worker 6 is finished! Time: 01:54:31
    worker 3 is finished! Time: 01:54:38

Matching Patterns Across Subsets...
Running Final Stage...

    worker 5 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 2 is starting!
    worker 6 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 5928(A), 0(P), ChiSq: 14908930, Time: 00:00:10 / 00:16:56
...
30000 of 30000, Atoms: 10469(A), 0(P), ChiSq: 14908930, Time: 00:08:47 / 00:18:52
-- Sampling Phase --
1000 of 30000, Atoms: 10403(A), 0(P), ChiSq: 14908930, Time: 00:09:00 / 00:18:39
...
30000 of 30000, Atoms: 10379(A), 0(P), ChiSq: 14908930, Time: 00:15:17 / 00:15:17
    worker 1 is finished! Time: 00:15:17
    worker 5 is finished! Time: 00:16:29
    worker 3 is finished! Time: 00:19:47
    worker 2 is finished! Time: 00:20:37
    worker 4 is finished! Time: 00:20:38
    worker 6 is finished! Time: 00:20:45
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

After obtaining the patterns, I ran patternMarkers on patterns learned with sparseOptimization = TRUE. When I set threshold = “all”, I would get this error.

test <- patternMarkers_all(Hoxd10_matnp5, threshold = "all")

Error in colnames(markerScores)[apply(markerScores, 1, which.min)] : 
  invalid subscript type 'list'
This error would not trigger when threshold was set to “cut”.
PatternMarkers worked normally when run on patterns learned without sparseOptimization. 

UPDATE @dimalvovs  - delete rows for readability
@ejfertig
Copy link
Collaborator

ejfertig commented Jan 10, 2024 via email

@rpalaganas
Copy link
Author

I did not filter genes with zero expression

@ejfertig
Copy link
Collaborator

ejfertig commented Jan 10, 2024 via email

@rpalaganas
Copy link
Author

rpalaganas commented Jan 11, 2024

Good morning,
Sorry for the delay. After removing zero variance and zero expression genes with

#drop zero expression genes 
row_sums <- rowSums(Hoxd10.mat)
zeroindx <- which(row_sums == 0)
Hoxd10mat_filtered <- Hoxd10.mat[-zeroindx,]

#drop zero variance genes 
row_var <- rowVars(Hoxd10mat_filtered)
zervarindx <- which(row_var == 0)
Hoxd10mat_filtered <- Hoxd10mat_filtered[-zervarindx,]

I reran CoGAPS with and without sparse optimization. Looks like it gave a similar result with -nan ChiSq

> Hoxd10_mat <- readRDS('~Hoxd10mat_filtered.RDS')

> params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, sparseOptimization=TRUE, distributed="genome-wide")
> params <- setDistributedParams(params, nSets=5)
setting distributed parameters - call this again if you change nPatterns

> Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 16620(A), 1167(P), ChiSq: -nan, Time: 00:00:52 / 01:28:04
...
30000 of 30000, Atoms: 25197(A), 1328(P), ChiSq: -nan, Time: 00:41:15 / 01:28:38
-- Sampling Phase --
1000 of 30000, Atoms: 25152(A), 1329(P), ChiSq: -nan, Time: 00:42:42 / 01:28:29
...
30000 of 30000, Atoms: 25114(A), 1344(P), ChiSq: -nan, Time: 01:24:45 / 01:24:45
    worker 1 is finished! Time: 01:24:45
    worker 2 is finished! Time: 01:24:49
    worker 3 is finished! Time: 01:25:26
    worker 5 is finished! Time: 01:28:35
    worker 4 is finished! Time: 01:30:20

Matching Patterns Across Subsets...
Running Final Stage...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 5 is starting!
    worker 3 is starting!
    worker 4 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 12640(A), 0(P), ChiSq: -nan, Time: 00:00:34 / 00:57:35
...
30000 of 30000, Atoms: 19022(A), 0(P), ChiSq: -nan, Time: 00:25:15 / 00:54:15
-- Sampling Phase --
1000 of 30000, Atoms: 19126(A), 0(P), ChiSq: -nan, Time: 00:26:07 / 00:54:07
...
30000 of 30000, Atoms: 18941(A), 0(P), ChiSq: -nan, Time: 00:51:43 / 00:51:43
    worker 1 is finished! Time: 00:51:43
    worker 3 is finished! Time: 00:52:06
    worker 5 is finished! Time: 00:52:45
    worker 2 is finished! Time: 00:52:56
    worker 4 is finished! Time: 00:53:10
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

While running without sparse optimization looked normal.

> Hoxd10_mat <- readRDS('~Hoxd10mat_filtered.RDS')
> params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, distributed="genome-wide")
> params <- setDistributedParams(params, nSets=5)
setting distributed parameters - call this again if you change nPatterns

> Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   FALSE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

    worker 2 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 5351(A), 1181(P), ChiSq: 6223970, Time: 00:01:14 / 02:05:20
...
30000 of 30000, Atoms: 12093(A), 2528(P), ChiSq: 5907389, Time: 00:55:27 / 01:59:09
-- Sampling Phase --
1000 of 30000, Atoms: 12133(A), 2524(P), ChiSq: 5907130, Time: 00:57:20 / 01:58:48
...
22000 of 30000, Atoms: 12115(A), 2522(P), ChiSq: 5906996, Time: 01:38:09 / 01:54:53
    worker 5 is finished! Time: 01:39:32
23000 of 30000, Atoms: 12027(A), 2495(P), ChiSq: 5907204, Time: 01:40:02 / 01:54:40
24000 of 30000, Atoms: 12127(A), 2530(P), ChiSq: 5906893, Time: 01:41:46 / 01:54:16
25000 of 30000, Atoms: 12067(A), 2549(P), ChiSq: 5907026, Time: 01:43:30 / 01:53:54
26000 of 30000, Atoms: 12093(A), 2520(P), ChiSq: 5907063, Time: 01:45:13 / 01:53:31
    worker 4 is finished! Time: 01:46:33
27000 of 30000, Atoms: 12146(A), 2533(P), ChiSq: 5907351, Time: 01:46:57 / 01:53:09
    worker 3 is finished! Time: 01:48:18
28000 of 30000, Atoms: 12094(A), 2504(P), ChiSq: 5906844, Time: 01:48:37 / 01:52:44
    worker 2 is finished! Time: 01:49:34
29000 of 30000, Atoms: 12145(A), 2491(P), ChiSq: 5907016, Time: 01:50:00 / 01:52:03
30000 of 30000, Atoms: 12155(A), 2493(P), ChiSq: 5907008, Time: 01:51:34 / 01:51:34
    worker 1 is finished! Time: 01:51:34

Matching Patterns Across Subsets...
Running Final Stage...

    worker 5 is starting!
    worker 3 is starting!
    worker 4 is starting!
    worker 2 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 6726(A), 0(P), ChiSq: 18248056, Time: 00:00:08 / 00:13:33
...
30000 of 30000, Atoms: 11486(A), 0(P), ChiSq: 18248056, Time: 00:07:05 / 00:15:13
-- Sampling Phase --
1000 of 30000, Atoms: 11367(A), 0(P), ChiSq: 18248056, Time: 00:07:20 / 00:15:11
...
30000 of 30000, Atoms: 11467(A), 0(P), ChiSq: 18248056, Time: 00:14:32 / 00:14:32
    worker 1 is finished! Time: 00:14:32
    worker 2 is finished! Time: 00:15:12
    worker 5 is finished! Time: 00:16:59
    worker 3 is finished! Time: 00:17:06
    worker 4 is finished! Time: 00:17:30
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

This time, each run generated the same number of patterns, however the values differed.

> range(sparseTRUE@featureLoadings)
[1] 0.000000 9.607491
> range(sparseFALSE@featureLoadings)
[1] 7.684777e-09 5.845305e+00

PatternMarkers with threshold = 'all' also did not work on the CoGAPS object generated with sparseOptimization = "TRUE". PatternMarkers worked on the object generated without sparse optimization.

@ejfertig
Copy link
Collaborator

ejfertig commented Jan 11, 2024 via email

@dimalvovs
Copy link
Contributor

Chisq is still not nan if we run on exact same dimensions and parameters

c <- 380
r <- 27277
simdata <- matrix(runif(r*c), nrow=r, ncol=c)
params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, distributed="genome-wide", sparseOptimization=TRUE)
params <- setDistributedParams(params, nSets=5)
res <- CoGAPS(simdata, params = params, outputFrequency = 1000)

This is CoGAPS version 3.22.0 
Running genome-wide CoGAPS on simdata (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 3 is starting!
-- Equilibration Phase --
    worker 4 is starting!
    worker 2 is starting!
    worker 5 is starting!
1000 of 30000, Atoms: 28296(A), 1016(P), ChiSq: 171672640, Time: 00:00:55 / 01:33:09
2000 of 30000, Atoms: 31014(A), 983(P), ChiSq: 171320192, Time: 00:02:02 / 01:32:27
3000 of 30000, Atoms: 32194(A), 934(P), ChiSq: 171210704, Time: 00:03:07 / 01:28:59
4000 of 30000, Atoms: 33225(A), 922(P), ChiSq: 171167024, Time: 00:04:12 / 01:26:24

@dimalvovs
Copy link
Contributor

Making data 50% sparse still runs fine. @rpalaganas what's the sparsity of your data?

c <- 380
r <- 27277
dense <- runif(r*c)
sparse <- sample(c(dense, rep(0, length(dense))),
                size = length(dense), replace = T)
sum(sparse==0)/length(sparse)
simdata <- matrix(dense, nrow=r, ncol=c)
params <- CogapsParams(nPatterns = 5, nIterations = 30000, seed = 42, distributed = "genome-wide", 
                       sparseOptimization = TRUE)
params <- setDistributedParams(params, nSets=5)
res <- CoGAPS(simdata, params = params, outputFrequency = 1000)
This is CoGAPS version 3.22.0 
Running genome-wide CoGAPS on simdata (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
    worker 2 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 5 is starting!
1000 of 30000, Atoms: 28519(A), 990(P), ChiSq: 171619760, Time: 00:00:56 / 01:34:51
2000 of 30000, Atoms: 30973(A), 967(P), ChiSq: 171261920, Time: 00:02:04 / 01:33:58

@rpalaganas
Copy link
Author

The sparsity of the matrix that gave the -nans is 0.71.
coop::sparsity(Hoxd10) #0.7125972

I also do not get -nan ChiSq when testing a matrix that is almost exactly as sparse.

x <- matrix(0, 380, 27277)
x[sample(length(x), size = round(0.29 * length(x)))] <- 1
coop::sparsity(x)  #0.7099992

params <- CogapsParams(nPatterns = 5, nIterations = 30000, seed = 42, distributed = "genome-wide", 
                       sparseOptimization = TRUE)
params <- setDistributedParams(params, nSets=5)
res <- CoGAPS(t(x), params = params, outputFrequency = 1000)


This is CoGAPS version 3.21.5 
Running genome-wide CoGAPS on t(x) (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...    worker 5 is starting!
    worker 4 is starting!
    worker 2 is starting!
Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
    worker 3 is starting!
1000 of 30000, Atoms: 10004(A), 1354(P), ChiSq: 42395544, Time: 00:00:15 / 00:25:24
2000 of 30000, Atoms: 12855(A), 1780(P), ChiSq: 42189580, Time: 00:00:38 / 00:28:47

@dimalvovs
Copy link
Contributor

I thought that something is wrong in some genes' distributions, and used this fun to remove genes that would yield chisq nan in the results.

failRemoveRowsAll <- function(data){
  i <- 1
  j <- 6 #to keep genes > patterns 
  failnames <- c()
  params <- CogapsParams(nPatterns = 5, nIterations = 10, seed = 1,
                         sparseOptimization = TRUE)
  while (j <= nrow(data)) {
    res <- CoGAPS(data[c(i:j), ], params = params, outputFrequency = 10,
                  messages = FALSE)
    if (sum(is.na(res@metadata$chisq)) > 0) {
      failname <- rownames(data)[j]
      message(failname, ", at: ", j, " fails: ", length(failnames))
      failnames <- c(failnames, failname)
      data <- data[-c(j),]
    } else {
        j <- j + 1
    }
  }
  return(failnames)
}

afterwards, the chisq is not nan anymore, but the value itself is huge:

This is CoGAPS version 3.22.0 
Running Standard CoGAPS on hoxdata[!(rownames(hoxdata) %in% failed_by_all), ] (28470 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          10000 
seed                 1 
sparseOptimization   TRUE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
-- Equilibration Phase --
1000 of 10000, Atoms: 108128(A), 625(P), ChiSq: 1839920873269298576727893606400, Time: 00:01:01 / 00:30:39
2000 of 10000, Atoms: 115884(A), 588(P), ChiSq: 2292072881336368004474865713152, Time: 00:02:15 / 00:30:21
3000 of 10000, Atoms: 120147(A), 593(P), ChiSq: 2783612750416574213534292901888, Time: 00:03:29 / 00:29:30
4000 of 10000, Atoms: 123364(A), 572(P), ChiSq: 2550784517891173166148455759872, Time: 00:04:44 / 00:28:53
5000 of 10000, Atoms: 126268(A), 568(P), ChiSq: 2541503896605446561631530123264, Time: 00:06:01 / 00:28:30
6000 of 10000, Atoms: 126646(A), 563(P), ChiSq: 2905690684144169274910709907456, Time: 00:07:20 / 00:28:16
7000 of 10000, Atoms: 126687(A), 555(P), ChiSq: 2674804590947979129294038237184, Time: 00:08:38 / 00:27:57
8000 of 10000, Atoms: 126574(A), 561(P), ChiSq: 3147559868411558326579587186688, Time: 00:09:56 / 00:27:41
9000 of 10000, Atoms: 126684(A), 553(P), ChiSq: 2698749482425631185700149788672, Time: 00:11:13 / 00:27:23
10000 of 10000, Atoms: 126485(A), 553(P), ChiSq: 2957405810624188978088997027840, Time: 00:12:30 / 00:27:06
-- Sampling Phase --
1000 of 10000, Atoms: 126720(A), 556(P), ChiSq: 2646788641772774809142103638016, Time: 00:13:47 / 00:26:51
2000 of 10000, Atoms: 126701(A), 559(P), ChiSq: 2966075017676645483900815015936, Time: 00:15:05 / 00:26:40
3000 of 10000, Atoms: 126748(A), 561(P), ChiSq: 2824885175666762749901468073984, Time: 00:16:23 / 00:26:29
4000 of 10000, Atoms: 126825(A), 548(P), ChiSq: 2910313314246920713217492647936, Time: 00:17:41 / 00:26:19
5000 of 10000, Atoms: 126857(A), 542(P), ChiSq: 3040585044948408827482774437888, Time: 00:18:58 / 00:26:08
6000 of 10000, Atoms: 126800(A), 551(P), ChiSq: 3114951210047638030192943824896, Time: 00:20:16 / 00:25:59
7000 of 10000, Atoms: 126686(A), 548(P), ChiSq: 2923385429134413698483590529024, Time: 00:21:33 / 00:25:49
8000 of 10000, Atoms: 126779(A), 559(P), ChiSq: 2855560157182209446923168907264, Time: 00:22:51 / 00:25:41
9000 of 10000, Atoms: 126407(A), 551(P), ChiSq: 2890611147933206197900012421120, Time: 00:24:09 / 00:25:34
10000 of 10000, Atoms: 126802(A), 552(P), ChiSq: 2753871361865324913892758913024, Time: 00:25:26 / 00:25:26

compared to results on the same data with sparseOptimization = FALSE:

-- Equilibration Phase --
10000 of 10000, Atoms: 57241(A), 3038(P), ChiSq: 26866870, Time: 00:29:54 / 01:04:51

sparse sampler cannot find a proper solution? btw is that normal for chisq to increase over iterations?

@jeanettejohnson
Copy link
Collaborator

hey all-- a few notes regarding this issue
@rpalaganas since the run is distributed, the P atoms being 0 in the second phase is correct, because at that point one matrix has been fixed and cogaps is learning the cognate (A) matrix

@jeanettejohnson
Copy link
Collaborator

Is this data more than 80% sparse?

@rpalaganas
Copy link
Author

Is this data more than 80% sparse?

Slightly less, ~71% sparse

@dimalvovs
Copy link
Contributor

so, we have technically addressed all the points addressed in the issue report:

  1. ChiSq value was -nan: these -nans appear as the actual value of ChiSq are too large, as can be confirmed here
  2. during the equilibration phase, the P matrix was 0: it is expected as in the distributed run one of the matrices is fixed (see comment above)
  3. SparseOptimization = TRUE gave 5 patterns while SparseOptimization = FALSE gave 6 patterns: number of patterns returned can differ from the number of patterns requested in the distributed mode, as the number of patterns is a superset of patterns matched across nSets, controlled by maxNS parameter. We may want to set the maxNS parameter to nPatterns by default to avoid this confusion.

The unsolved problem that is motivated by this issue is why the ChisQ is so large for a given dataset compared to a simulated dataset with similar dimensions and sparsity parameters, as demonstrated here.

@dimalvovs
Copy link
Contributor

Interestingly the boostrapped version of the original data also fails

> resamp <- sample(hoxdata, size = length(hoxdata), replace = T)
> resamp <- matrix(resamp, ncol = 380)
> params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, sparseOptimization=TRUE, distributed="genome-wide")
> params <- setDistributedParams(params, nSets=5)
setting distributed parameters - call this again if you change nPatterns
> res <- CoGAPS(resamp, params)

This is CoGAPS version 3.23.1 
Running genome-wide CoGAPS on resamp (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (6081, 6081.4, 6083)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
    worker 3 is starting!
    worker 4 is starting!
Loading Data...Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
    worker 5 is starting!
    worker 2 is starting!
1000 of 30000, Atoms: 24149(A), 137(P), ChiSq: nan, Time: 00:00:15 / 00:25:24

@dimalvovs
Copy link
Contributor

Also sparseOptimization=TRUE fails for non-distributed mode:

#sparse optimized and not distributed
params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, sparseOptimization=TRUE)
params <- setDistributedParams(params, nSets=5)
res <- CoGAPS(hoxdata, params)

This is CoGAPS version 3.22.0 
Running Standard CoGAPS on hoxdata (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
-- Equilibration Phase --
1000 of 30000, Atoms: 86929(A), 369(P), ChiSq: nan, Time: 00:00:38 / 01:04:21
2000 of 30000, Atoms: 99716(A), 375(P), ChiSq: nan, Time: 00:01:29 / 01:07:26
3000 of 30000, Atoms: 103343(A), 386(P), ChiSq: nan, Time: 00:02:23 / 01:08:03

@dimalvovs
Copy link
Contributor

Interestingly, sampling from a histogram does not fail:

#sample from histogram of data
hox_hist <- hist(hoxdata, breaks = 100, plot = FALSE)

hox_sim <- sample(hox_hist$mids, size = length(hoxdata),
  replace = T, prob = hox_hist$density)
hox_sim <- matrix(jitter(hox_sim), ncol = 380)

params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, sparseOptimization=TRUE)
res <- CoGAPS(hox_sim, params)

This is CoGAPS version 3.22.0 
Running Standard CoGAPS on hox_sim (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
-- Equilibration Phase --
1000 of 30000, Atoms: 73945(A), 1873(P), ChiSq: 221560368, Time: 00:05:48 / 09:49:26
2000 of 30000, Atoms: 83256(A), 1960(P), ChiSq: 220883600, Time: 00:11:54 / 09:01:04
3000 of 30000, Atoms: 94329(A), 1951(P), ChiSq: 220712496, Time: 00:19:01 / 09:03:02
4000 of 30000, Atoms: 103541(A), 1927(P), ChiSq: 220577920, Time: 00:26:22 / 09:02:24
5000 of 30000, Atoms: 111937(A), 1922(P), ChiSq: 220456768, Time: 01:02:02 / 16:30:34
^C

@dimalvovs
Copy link
Contributor

the problem is that for some datasets, including the dataset where the error was generated, yield a 0 in the denominator for chi^2 calculation in the following line:

chisq += 1 + dot * (dot - 2 * get<1>(it) - dsq * dot) / dsq;

adding 1 to dsq fixes the issue and chisq is not NaN anymore

> Hoxd10mat_5 <- CoGAPS(Hoxd10_mat, params, outputFrequency = 100)

This is CoGAPS version 3.25.2 
Running Standard CoGAPS on Hoxd10_mat (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          1000 
seed                 2 
sparseOptimization   TRUE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:01)
-- Equilibration Phase --
100 of 1000, Atoms: 35559(A), 118(P), ChiSq: 301204768, Time: 00:00:02 / 00:01:13
200 of 1000, Atoms: 65582(A), 130(P), ChiSq: 300587776, Time: 00:00:10 / 00:02:33
300 of 1000, Atoms: 83274(A), 137(P), ChiSq: 300368096, Time: 00:00:21 / 00:03:16
400 of 1000, Atoms: 92869(A), 138(P), ChiSq: 300274784, Time: 00:00:34 / 00:03:45
500 of 1000, Atoms: 98478(A), 136(P), ChiSq: 300240576, Time: 00:00:50 / 00:04:13
600 of 1000, Atoms: 101806(A), 136(P), ChiSq: 300215040, Time: 00:01:08 / 00:04:37
700 of 1000, Atoms: 103477(A), 135(P), ChiSq: 300206656, Time: 00:01:25 / 00:04:49
800 of 1000, Atoms: 104286(A), 137(P), ChiSq: 300208928, Time: 00:01:43 / 00:04:59
900 of 1000, Atoms: 104929(A), 136(P), ChiSq: 300204416, Time: 00:01:59 / 00:05:01
1000 of 1000, Atoms: 105183(A), 135(P), ChiSq: 300204288, Time: 00:02:12 / 00:04:55

@dimalvovs dimalvovs linked a pull request Nov 15, 2024 that will close this issue
@dimalvovs
Copy link
Contributor

It took some time but it looks we're approaching the resolution. @rpalaganas would you be open to a functional test by installing from the feature branch and running on your dataset? Please do not use distributed="genome-wide" though.

@rpalaganas
Copy link
Author

sure, see below

#sparse optimized and not distributed
devtools::install_github("FertigLab/CoGAPS@77-sparseoptimization-pattern-discrepancy")
library(CoGAPS)
params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, sparseOptimization=TRUE)
params <- setDistributedParams(params, nSets=5)
res <- CoGAPS(hoxdata, params)
This is CoGAPS version 3.25.2 
Running Standard CoGAPS on hoxdata (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
-- Equilibration Phase --
1000 of 30000, Atoms: 81088(A), 377(P), ChiSq: 237356768, Time: 00:00:52 / 01:28:04
2000 of 30000, Atoms: 94278(A), 383(P), ChiSq: 237524992, Time: 00:02:04 / 01:33:58
3000 of 30000, Atoms: 99328(A), 398(P), ChiSq: 237576224, Time: 00:03:19 / 01:34:42
4000 of 30000, Atoms: 101559(A), 395(P), ChiSq: 237613648, Time: 00:04:33 / 01:33:36
5000 of 30000, Atoms: 103355(A), 386(P), ChiSq: 237643600, Time: 00:05:50 / 01:33:08
6000 of 30000, Atoms: 104854(A), 404(P), ChiSq: 237652800, Time: 00:07:07 / 01:32:27

@dimalvovs
Copy link
Contributor

Nice, chisq is not NaN anymore. Do the actual results make sense?

@rpalaganas
Copy link
Author

I don't see any issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants