Commit 16cff68: Update mnsf-tutorial-mouse.md
yiwang12 authored Nov 17, 2024
1 parent fd3b7cb

Showing 1 changed file with 100 additions and 45 deletions: tutorial/mnsf-tutorial-mouse.md

@@ -1,4 +1,4 @@
# Tutorial for mNSF Package - mouse sagittal section data example

**Authors:** Yi Wang and Kasper Hansen
**Date:** May 2024
@@ -140,13 +140,17 @@

Make sure to replace `dir_mNSF_functions` with the actual path to the mNSF functions.
Before running mNSF, you need to set up some key parameters. This step is crucial as it defines how the model will behave:

```python
nchunk = 1   # Number of chunks per sample
nsample = 2  # Number of samples
L = 2        # Number of factors

# Set up paths for outputs
dir_output = "/path/to/output/directory/"
os.chdir(dir_output)

mpth = path.join("models")
misc.mkdir_p(mpth)
pp = path.join(mpth, "pp")
misc.mkdir_p(pp)
```

@@ -156,7 +160,9 @@

Let's break this down:

2. `nsample = 2`: This tells the model how many distinct samples are in your dataset. Make sure this matches the actual number of samples you're analyzing.

3. `nchunk = 1`: This sets the number of chunks per sample. Chunking splits each sample's spots into smaller pieces so that large datasets remain manageable in memory (see Section 6.1).

4. Setting up output directories:
- `mpth = path.join("models")`: This creates a path for storing the trained models.
- `misc.mkdir_p(mpth)`: This creates the directory if it doesn't exist.
- `pp = path.join(mpth, "pp", str(2))`: This creates a subdirectory for storing preprocessing results.
@@ -168,15 +174,15 @@

These directories will be used to save the results of your analysis, including the trained models and preprocessing results.
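
For orientation, the resulting layout looks roughly like this (a sketch; the exact contents written into these folders depend on later steps):

```python
# Assumed layout after the setup above:
# /path/to/output/directory/
# └── models/        # trained models (mpth)
#     └── pp/        # preprocessing results (pp)
```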

### 5.1 Loading and Processing Data

mNSF requires your data to be in a specific format. Here's how to load and process your data (data chunking, which is applied at the loading stage, is covered in Section 6.2):

```python
list_D = []
list_X = []

for ksample in range(nsample):
    Y = pd.read_csv(f'path/to/Y_sample{ksample+1}.csv')
    X = pd.read_csv(f'path/to/X_sample{ksample+1}.csv')
    D = process_multiSample.get_D(X, Y)
    list_D.append(D)
    list_X.append(D["X"])
```
@@ -200,30 +206,15 @@

Make sure to replace 'path/to/' with the actual path to your data files. The file naming here assumes one expression file (`Y`) and one coordinate file (`X`) per sample.
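
If your files are not yet in this shape, here is a minimal toy example of the expected inputs (a sketch for illustration only; the column layout, with spots as rows of `Y`, genes as columns, and one coordinate pair per spot in `X`, is an assumption to check against your own data):

```python
import pandas as pd

# Hypothetical toy inputs: 3 spots, 2 genes (layout assumed for illustration)
X_toy = pd.DataFrame({"x": [1.0, 1.0, 2.0], "y": [1.0, 2.0, 1.0]})  # spot coordinates
Y_toy = pd.DataFrame({"geneA": [0, 3, 1], "geneB": [5, 0, 2]})      # expression counts

D_toy = process_multiSample.get_D(X_toy, Y_toy)  # same call as in the loop above
```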
After loading the data, we need to prepare it for input into mNSF:

```python
list_sampleID = process_multiSample.get_listSampleID(list_D)
```

`list_sampleID = process_multiSample.get_listSampleID(list_D)`: Extracts sample IDs from the processed data. This helps keep track of which data belongs to which sample. Training-data extraction and induced-point setup are covered in Section 6.2, after the data have been chunked.
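
As a quick sanity check, you can inspect what was extracted (a minimal sketch; it assumes each entry of `list_sampleID` is an array of the spot indices belonging to one sample):

```python
# Print the number of spots assigned to each sample (structure assumed)
for ks, ids in enumerate(list_sampleID):
    print(f"Sample {ks + 1}: {len(ids)} spots")
```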

### 5.3 Choose the number of factors to be used

@@ -297,49 +288,113 @@

Other approaches for selecting the number of factors are possible as well.
The "best" number of factors often involves a nuanced balance between statistical fit, biological interpretability, computational resources, and research objectives. It's often helpful to try a few different values and compare the results before making a final decision. The process may involve iterative refinement and integration of multiple lines of evidence.


## 6. Model Training

### 6.1 Optimization Techniques

Before training the model, we'll implement two key optimization techniques that make mNSF practical for large datasets: induced points and data chunking.

#### Induced Points
Induced points reduce computational complexity by selecting representative spatial locations. This is crucial for:
- Managing memory usage with large datasets
- Reducing computational time
- Maintaining model accuracy while improving efficiency
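
As a rough worked example: for a sample with 4,000 spots, the 15% rate used later in this tutorial selects 600 induced points, so the expensive Gaussian-process computations involve 600 rather than 4,000 locations.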

#### Data Chunking
Data chunking divides the data into manageable pieces, enabling:
- Processing of datasets too large to fit in memory
- Potential parallel processing
- Better memory management during training

### 6.2 Setting Up Optimization

First, let's implement both optimization techniques:

```python
# Process data chunking: re-load each sample and split it into chunks
list_D_chunked = list()
list_X_chunked = list()
for ksample in range(nsample):
    Y = pd.read_csv(f'path/to/Y_sample{ksample+1}.csv')
    X = pd.read_csv(f'path/to/X_sample{ksample+1}.csv')
    list_D_sampleTmp, list_X_sampleTmp = process_multiSample.get_chunked_data(X, Y, nchunk)
    list_D_chunked = list_D_chunked + list_D_sampleTmp
    list_X_chunked = list_X_chunked + list_X_sampleTmp

# Extract the training data in the format required for model training
list_Dtrain = process_multiSample.get_listDtrain(list_D_chunked)

# Set up induced points for each sample
for ksample in range(nsample):
    # Select 15% of spots as induced points
    ninduced = round(list_D_chunked[ksample]['X'].shape[0] * 0.15)
    rd_ = random.sample(range(list_D_chunked[ksample]['X'].shape[0]), ninduced)
    list_D_chunked[ksample]["Z"] = list_D_chunked[ksample]['X'][rd_, :]
```
Setting up induced points:
- Induced points are a subset of spatial locations used to reduce computational complexity while maintaining model accuracy.
- For each sample:
  - `ninduced = round(list_D_chunked[ksample]['X'].shape[0] * 0.15)`: Calculates the number of induced points as 15% of total spots.
  - `rd_ = random.sample(...)`: Randomly selects the induced points.
  - `list_D_chunked[ksample]["Z"] = list_D_chunked[ksample]['X'][rd_, :]`: Stores the selected points in the data structure.

The number of induced points (15% here) is a trade-off between computational efficiency and accuracy. You might need to adjust this percentage based on your dataset size and available computational resources.

Key parameters to consider:
- Induced points percentage (15% here): Balance between speed and accuracy
- Number of chunks per sample (`nchunk`, 1 here): Depends on dataset size and available memory

### 6.3 Model Initialization

Now we can initialize the model with our optimized data structure:

```python
list_fit = process_multiSample.ini_multiSample(list_D_chunked, L, "nb", chol=False)
```

Parameters:
- `list_D_chunked`: Our chunked data structure
- `L`: Number of factors to identify
- `"nb"`: Specifies the negative binomial distribution, which suits count data like gene expression because it handles overdispersion better than a Poisson distribution
- `chol=False`: Disables Cholesky decomposition for better memory usage

The function returns a list of initialized model objects containing the initial parameter values that will be optimized during training.

### 6.4 Training the Model

With optimization techniques in place, we can train the model:

```python
list_fit = training_multiSample.train_model_mNSF(
    list_fit,        # Initialized model
    pp,              # Directory for preprocessing results
    list_Dtrain,     # Chunked training data
    list_D_chunked,  # Full chunked dataset
    num_epochs=2     # Number of training iterations (demonstration value)
)
```

#### Training Parameters:
- `num_epochs`: Number of training iterations. The value of 2 here is for demonstration only; for real data, 100 or more is recommended, with training time increasing accordingly.
- The function automatically handles:
- Processing data chunks
- Managing induced points
- Optimizing model parameters
- Combining results across chunks

### 6.5 Monitoring Training

During training, you should monitor:
1. Memory usage: If too high, increase number of chunks
2. Training progress: Watch for convergence
3. Error messages: May indicate need to adjust parameters
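
For the memory check, a lightweight option is to print the resident memory of the Python process between runs (a sketch; it assumes the third-party `psutil` package is installed):

```python
import psutil  # assumed available: pip install psutil

# Report resident memory of the current process; if this grows too large,
# increase nchunk and re-run the chunking step.
rss_gb = psutil.Process().memory_info().rss / 1e9
print(f"Current memory usage: {rss_gb:.2f} GB")
```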

`train_model_mNSF` returns a list of trained model objects, one for each sample. These objects contain the optimized parameters that best explain the spatial patterns in your data according to the mNSF model.

## 7. Visualizing Results

After training, we can visualize the results. Here's how to plot the mNSF factors for a sample:

```python
Fplot = misc.t2np(list_fit[0].sample_latent_GP_funcs(list_D_chunked[0]["X"], S=3, chol=False)).T
hmkw = {"figsize": (4, 4), "bgcol": "white", "subplot_space": 0.1, "marker": "s", "s": 10}
fig, axes = visualize.multiheatmap(list_D_chunked[0]["X"], Fplot, (1, 2), cmap="RdBu", **hmkw)
```
@@ -371,7 +426,7 @@

Let's break this down:

This will produce a figure with two heatmaps, one for each factor, showing how these factors vary across the spatial dimensions of your sample.
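
Assuming the returned `fig` is a standard matplotlib figure (as the `fig, axes` unpacking suggests), you can save the plot in the usual way; the filename here is arbitrary:

```python
# Save the factor heatmaps to the output directory
fig.savefig(path.join(dir_output, "factors_sample1.png"), dpi=300, bbox_inches="tight")
```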

## 8. Calculate Moran's I for each factor

After obtaining the spatial factors from mNSF, it's important to quantify how spatially structured these factors are. One way to do this is by calculating Moran's I statistic for each factor. Moran's I is a measure of spatial autocorrelation, which tells us whether similar values tend to cluster together in space.
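
If you want to compute it without extra dependencies, here is a minimal, self-contained sketch of Moran's I using a binary k-nearest-neighbour weight matrix (an illustrative implementation only; the tutorial's own code may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

def morans_i(coords, values, k=6):
    """Moran's I of `values` observed at `coords`, with binary kNN weights."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = values.size
    # Query k + 1 neighbours because each point's nearest neighbour is itself
    _, idx = cKDTree(coords).query(coords, k=k + 1)
    W = np.zeros((n, n))
    for i, neighbours in enumerate(idx[:, 1:]):
        W[i, neighbours] = 1.0
    z = values - values.mean()
    return (n / W.sum()) * (z @ W @ z) / (z ** 2).sum()

# e.g., for the first factor plotted above (shapes assumed):
# i0 = morans_i(list_D_chunked[0]["X"], Fplot[:, 0])
```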

