Update mnsf-tutorial-dlpfc.md
yiwang12 authored Nov 17, 2024
1 parent 3e88779 commit aa61db5
Showing 1 changed file with 98 additions and 42 deletions.
140 changes: 98 additions & 42 deletions tutorial/mnsf-tutorial-dlpfc.md
Make sure to replace `dir_mNSF_functions` with the actual path to the mNSF functions.
Before running mNSF, you need to set up some key parameters. This step is crucial as it defines how the model will behave:

```python
nchunk = 1 # Number of chunks per sample
L = 2 # Number of factors
nsample = 2 # Number of samples

mpth = path.join("models")          # directory for storing the trained models
misc.mkdir_p(mpth)                  # create the directory if it does not already exist
pp = path.join(mpth, "pp", str(2))  # subdirectory for storing preprocessing results
```

Let's break this down:

2. `nsample = 2`: This tells the model how many distinct samples are in your dataset. Make sure this matches the actual number of samples you're analyzing.

3. `nchunk = 1`: This sets the number of chunks per sample. Splitting each sample's spots into chunks keeps memory usage manageable for large datasets (chunking is covered in Section 6).

4. Setting up output directories:
- `mpth = path.join("models")`: This creates a path for storing the trained models.
- `misc.mkdir_p(mpth)`: This creates the directory if it doesn't exist.
- `pp = path.join(mpth, "pp", str(2))`: This creates a subdirectory for storing preprocessing results.
```python
list_D = []
list_X = []

for ksample in range(nsample):
    Y = pd.read_csv(f'path/to/Y_sample{ksample+1}.csv')
    X = pd.read_csv(f'path/to/X_sample{ksample+1}.csv')
    D = process_multiSample.get_D(X, Y)
    list_D.append(D)
    list_X.append(D["X"])
```
Make sure to replace 'path/to/' with the actual path to your data files.
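
Before moving on, it can help to confirm that each sample's expression and coordinate files exist and agree on the number of spots. Below is a minimal sketch using the same hypothetical file names as above; the assumption that each row of `Y` (expression) and `X` (coordinates) refers to the same spot should be checked against your own data.

```python
import os
import pandas as pd

for ksample in range(nsample):
    y_path = f'path/to/Y_sample{ksample+1}.csv'  # hypothetical paths; replace with your own
    x_path = f'path/to/X_sample{ksample+1}.csv'
    for p in (y_path, x_path):
        if not os.path.exists(p):
            raise FileNotFoundError(f"Missing input file: {p}")
    Y = pd.read_csv(y_path)
    X = pd.read_csv(x_path)
    # Assumption: rows of Y and X describe the same spots, so row counts should match.
    if Y.shape[0] != X.shape[0]:
        raise ValueError(f"Sample {ksample+1}: {Y.shape[0]} expression rows vs {X.shape[0]} coordinate rows")
    print(f"Sample {ksample+1}: {Y.shape[0]} spots, {Y.shape[1]} genes")
```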
After loading the data, we need to prepare it for input into mNSF:

```python
list_sampleID = process_multiSample.get_listSampleID(list_D)

```

`list_sampleID = process_multiSample.get_listSampleID(list_D)`: Extracts sample IDs from the processed data. This helps keep track of which data belongs to which sample.
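
If you want to double-check how spots are assigned to samples, a quick inspection like the following can help. This assumes each entry of `list_sampleID` holds the spot indices belonging to one sample, as described above; verify the returned structure on your own data.

```python
import numpy as np

# Illustrative check: report how many spots are assigned to each sample.
# Assumes each entry of list_sampleID is an array-like of spot indices.
for ksample, idx in enumerate(list_sampleID):
    print(f"Sample {ksample+1}: {np.asarray(idx).size} spots")
```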


### 5.3 Choose the number of factors to be used

The "best" number of factors often involves a nuanced balance between statistical fit and biological interpretability.



## 6. Model Training

### 6.1 Optimization Techniques

Before training the model, we'll implement two key optimization techniques that make mNSF practical for large datasets: induced points and data chunking.

#### Induced Points
Induced points reduce computational complexity by selecting representative spatial locations. This is crucial for:
- Managing memory usage with large datasets
- Reducing computational time
- Maintaining model accuracy while improving efficiency

#### Data Chunking
Data chunking divides the data into manageable pieces (illustrated with a toy example below), enabling:
- Processing of datasets too large to fit in memory
- Potential parallel processing
- Better memory management during training
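
Conceptually, chunking simply partitions a sample's spots into smaller groups that are processed one at a time. The toy example below illustrates the idea with plain `numpy`; it is not the mNSF implementation, which is handled by `process_multiSample.get_chunked_data` in the next subsection.

```python
import numpy as np

coords = np.random.rand(1000, 2)   # toy spatial coordinates for one sample
n_chunks = 4                       # number of chunks for this illustration

# Split the spot indices into roughly equal groups; each group can then be
# processed separately, which bounds peak memory usage during training.
chunks = np.array_split(np.arange(coords.shape[0]), n_chunks)
for i, idx in enumerate(chunks):
    print(f"Chunk {i+1}: {idx.size} spots")
```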

### 6.2 Setting Up Optimization

First, let's implement both optimization techniques:

```python
# Process data chunking
list_D_chunked = list()
list_X_chunked = list()
for ksample in range(nsample):
    Y = pd.read_csv(f'path/to/Y_sample{ksample+1}.csv')
    X = pd.read_csv(f'path/to/X_sample{ksample+1}.csv')
    list_D_sampleTmp, list_X_sampleTmp = process_multiSample.get_chunked_data(X, Y, nchunk)
    list_D_chunked = list_D_chunked + list_D_sampleTmp
    list_X_chunked = list_X_chunked + list_X_sampleTmp

# Extract the training data in the format required for model training
list_Dtrain = process_multiSample.get_listDtrain(list_D_chunked)

# Set up induced points for each sample
for ksample in range(nsample):
    # Select 15% of spots as induced points
    ninduced = round(list_D_chunked[ksample]['X'].shape[0] * 0.15)
    rd_ = random.sample(range(list_D_chunked[ksample]['X'].shape[0]), ninduced)
    list_D_chunked[ksample]["Z"] = list_D_chunked[ksample]['X'][rd_, :]
```
Setting up induced points:
- Induced points are a subset of spatial locations used to reduce computational complexity while maintaining model accuracy.
- For each sample:
  - `ninduced = round(list_D_chunked[ksample]['X'].shape[0] * 0.15)`: Calculates the number of induced points as 15% of total spots.
  - `rd_ = random.sample(...)`: Randomly selects the induced points.
  - `list_D_chunked[ksample]["Z"] = list_D_chunked[ksample]['X'][rd_, :]`: Stores the selected points in the data structure.

The number of induced points (15% here) is a trade-off between computational efficiency and accuracy. You might need to adjust this percentage based on your dataset size and available computational resources.

Key parameters to consider:
- Induced points percentage (15% here): Balance between speed and accuracy
- Number of chunks per sample (1 here, as set by `nchunk`): Depends on dataset size and available memory (see the sketch below for one way to scale these settings)
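
If you want a rough starting point for scaling the induced-point fraction with dataset size, a simple heuristic like the one below can be used. The thresholds are illustrative assumptions, not values prescribed by mNSF.

```python
def suggest_induced_fraction(n_spots):
    """Illustrative heuristic: use a smaller fraction of induced points
    as the number of spots grows, to keep training tractable."""
    if n_spots <= 2000:
        return 0.15
    elif n_spots <= 10000:
        return 0.10
    return 0.05

for ksample in range(nsample):
    n_spots = list_D_chunked[ksample]['X'].shape[0]
    print(f"Sample {ksample+1}: {n_spots} spots -> suggested induced fraction "
          f"{suggest_induced_fraction(n_spots):.2f}")
```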

### 6.3 Model Initialization

Now we can initialize the model with our optimized data structure:

```python
list_fit = process_multiSample.ini_multiSample(list_D_chunked, L, "nb")
```

Parameters:
- `list_D_chunked`: Our chunked data structure
- `L`: Number of factors to identify
- `"nb"`: Specifies negative binomial distribution

### 6.4 Training the Model

With optimization techniques in place, we can train the model:

```python
list_fit = training_multiSample.train_model_mNSF(
    list_fit,            # Initialized model
    pp,                  # Directory for preprocessing results
    list_Dtrain,         # Chunked training data
    list_D_chunked,      # Full chunked dataset
    num_epochs=2,        # Number of training iterations
    nsample=nsample,     # Number of samples
    nchunk=nchunk        # Number of chunks
)
```

#### Training Parameters
- `num_epochs`: Number of training iterations (500 recommended for real data)
- The function automatically handles:
- Processing data chunks
- Managing induced points
- Optimizing model parameters
- Combining results across chunks

Note that `num_epochs=2` is far too low for real-world applications and is used here only for demonstration. In practice, a value on the order of 500 (as noted above) usually gives better results, though training time increases accordingly; you may need to experiment to find the right balance for your dataset. The function returns a list of trained model objects containing the optimized parameters that best explain the spatial patterns in your data.
### 6.5 Monitoring Training

During training, you should monitor the following (a simple way to track memory use and runtime is sketched after this list):
1. Memory usage: If too high, increase number of chunks
2. Training progress: Watch for convergence
3. Error messages: May indicate need to adjust parameters
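
A lightweight way to keep an eye on memory and runtime is to wrap the training call as sketched below. This uses the optional `psutil` package; the wrapper is an illustration, not part of mNSF.

```python
import time
import psutil

process = psutil.Process()
t0 = time.time()
mem_before = process.memory_info().rss / 1e9  # resident memory in GB

list_fit = training_multiSample.train_model_mNSF(
    list_fit, pp, list_Dtrain, list_D_chunked,
    num_epochs=2, nsample=nsample, nchunk=nchunk
)

mem_after = process.memory_info().rss / 1e9
print(f"Training took {time.time() - t0:.1f} s; "
      f"resident memory went from {mem_before:.2f} GB to {mem_after:.2f} GB")
# If memory use is too high, increase nchunk and re-run the chunking step in Section 6.2.
```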

## 7. Visualizing Results

After training, we can visualize the results. Here's how to plot the mNSF factors for a sample:

```python
Fplot = misc.t2np(list_fit[0].sample_latent_GP_funcs(list_D_chunked[0]["X"], S=3, chol=False)).T
hmkw = {"figsize": (4, 4), "bgcol": "white", "subplot_space": 0.1, "marker": "s", "s": 10}
fig, axes = visualize.multiheatmap(list_D[0]["X"], Fplot, (1, 2), cmap="RdBu", **hmkw)
```
Expand Down Expand Up @@ -366,7 +422,7 @@ Let's break this down:

This will produce a figure with two heatmaps, one for each factor, showing how these factors vary across the spatial dimensions of your sample.
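
To keep these results for later use, you can save both the figure and the per-spot factor values. The sketch below assumes `fig` behaves like a standard matplotlib figure and that `Fplot` has one row per spot and one column per factor, with two coordinate columns in `list_D_chunked[0]["X"]`; adjust the file names as needed.

```python
import pandas as pd

# Save the heatmap figure (assuming fig is a matplotlib Figure).
fig.savefig("sample1_factors.png", dpi=300, bbox_inches="tight")

# Save the factor values alongside the spot coordinates for downstream analysis.
factors = pd.DataFrame(Fplot, columns=[f"factor{k+1}" for k in range(Fplot.shape[1])])
coords = pd.DataFrame(list_D_chunked[0]["X"], columns=["x", "y"])
pd.concat([coords, factors], axis=1).to_csv("sample1_factors.csv", index=False)
```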

## 8. Calculate Moran's I for each factor

After obtaining the spatial factors from mNSF, it's important to quantify how spatially structured these factors are. One way to do this is by calculating Moran's I statistic for each factor. Moran's I is a measure of spatial autocorrelation, which tells us whether similar values tend to cluster together in space.


