Commit 16cff68: Update mnsf-tutorial-mouse.md
yiwang12 authored Nov 17, 2024
1 parent fd3b7cb

Showing 1 changed file with 100 additions and 45 deletions: tutorial/mnsf-tutorial-mouse.md

@@ -1,4 +1,4 @@
# Tutorial for mNSF Package - mouse sagittal section data example

**Authors:** Yi Wang and Kasper Hansen
**Date:** May 2024
@@ -140,13 +140,17 @@

Make sure to replace `dir_mNSF_functions` with the actual path to the mNSF functions.
Before running mNSF, you need to set up some key parameters. This step is crucial as it defines how the model will behave:

```python
nchunk = 1   # Number of chunks per sample
nsample = 2  # Number of samples
L = 2        # Number of factors

# Set up paths for outputs
dir_output = "/path/to/output/directory/"
os.chdir(dir_output)

mpth = path.join("models")
misc.mkdir_p(mpth)
pp = path.join(mpth, "pp")
misc.mkdir_p(pp)
```

@@ -156,7 +160,9 @@

Let's break this down:

2. `nsample = 2`: This tells the model how many distinct samples are in your dataset. Make sure this matches the actual number of samples you're analyzing.

3. `nchunk = 1`: This sets the number of chunks per sample. Chunking splits each sample's spots into smaller pieces so that large datasets remain manageable in memory (see Section 6.1).

4. Setting up output directories:
- `mpth = path.join("models")`: This creates a path for storing the trained models.
- `misc.mkdir_p(mpth)`: This creates the directory if it doesn't exist.
- `pp = path.join(mpth, "pp", str(2))`: This creates a subdirectory for storing preprocessing results.
@@ -168,15 +174,15 @@

These directories will be used to save the results of your analysis, including the trained models and preprocessing results.
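
For orientation, the resulting layout looks roughly like this (a sketch; the exact contents written into these folders depend on later steps):

```python
# Assumed layout after the setup above:
# /path/to/output/directory/
# └── models/        # trained models (mpth)
#     └── pp/        # preprocessing results (pp)
```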

### 5.1 Loading and Processing Data

mNSF requires your data to be in a specific format. Here's how to load and process your data (data chunking, which is applied at the loading stage, is covered in Section 6.2):

```python
list_D = []
list_X = []

for ksample in range(nsample):
    Y = pd.read_csv(f'path/to/Y_sample{ksample+1}.csv')
    X = pd.read_csv(f'path/to/X_sample{ksample+1}.csv')
    D = process_multiSample.get_D(X, Y)
    list_D.append(D)
    list_X.append(D["X"])
```
@@ -200,30 +206,15 @@

Make sure to replace 'path/to/' with the actual path to your data files. The file naming here assumes one expression file (`Y`) and one coordinate file (`X`) per sample.
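
If your files are not yet in this shape, here is a minimal toy example of the expected inputs (a sketch for illustration only; the column layout, with spots as rows of `Y`, genes as columns, and one coordinate pair per spot in `X`, is an assumption to check against your own data):

```python
import pandas as pd

# Hypothetical toy inputs: 3 spots, 2 genes (layout assumed for illustration)
X_toy = pd.DataFrame({"x": [1.0, 1.0, 2.0], "y": [1.0, 2.0, 1.0]})  # spot coordinates
Y_toy = pd.DataFrame({"geneA": [0, 3, 1], "geneB": [5, 0, 2]})      # expression counts

D_toy = process_multiSample.get_D(X_toy, Y_toy)  # same call as in the loop above
```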
After loading the data, we need to prepare it for input into mNSF:

```python
list_sampleID = process_multiSample.get_listSampleID(list_D)
```

`list_sampleID = process_multiSample.get_listSampleID(list_D)`: Extracts sample IDs from the processed data. This helps keep track of which data belongs to which sample. Training-data extraction and induced-point setup are covered in Section 6.2, after the data have been chunked.
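
As a quick sanity check, you can inspect what was extracted (a minimal sketch; it assumes each entry of `list_sampleID` is an array of the spot indices belonging to one sample):

```python
# Print the number of spots assigned to each sample (structure assumed)
for ks, ids in enumerate(list_sampleID):
    print(f"Sample {ks + 1}: {len(ids)} spots")
```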

### 5.3 Choose the number of factors to be used

@@ -297,49 +288,113 @@

Other approaches for selecting the number of factors are possible as well.
The "best" number of factors often involves a nuanced balance between statistical fit, biological interpretability, computational resources, and research objectives. It's often helpful to try a few different values and compare the results before making a final decision. The process may involve iterative refinement and integration of multiple lines of evidence.


## 6. Model Training

### 6.1 Optimization Techniques

Before training the model, we'll implement two key optimization techniques that make mNSF practical for large datasets: induced points and data chunking.

#### Induced Points
Induced points reduce computational complexity by selecting representative spatial locations. This is crucial for:
- Managing memory usage with large datasets
- Reducing computational time
- Maintaining model accuracy while improving efficiency
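
As a rough worked example: for a sample with 4,000 spots, the 15% rate used later in this tutorial selects 600 induced points, so the expensive Gaussian-process computations involve 600 rather than 4,000 locations.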

#### Data Chunking
Data chunking divides the data into manageable pieces, enabling:
- Processing of datasets too large to fit in memory
- Potential parallel processing
- Better memory management during training

### 6.2 Setting Up Optimization

First, let's implement both optimization techniques:

```python
# Process data chunking: re-load each sample and split it into chunks
list_D_chunked = list()
list_X_chunked = list()
for ksample in range(nsample):
    Y = pd.read_csv(f'path/to/Y_sample{ksample+1}.csv')
    X = pd.read_csv(f'path/to/X_sample{ksample+1}.csv')
    list_D_sampleTmp, list_X_sampleTmp = process_multiSample.get_chunked_data(X, Y, nchunk)
    list_D_chunked = list_D_chunked + list_D_sampleTmp
    list_X_chunked = list_X_chunked + list_X_sampleTmp

# Extract the training data in the format required for model training
list_Dtrain = process_multiSample.get_listDtrain(list_D_chunked)

# Set up induced points for each sample
for ksample in range(nsample):
    # Select 15% of spots as induced points
    ninduced = round(list_D_chunked[ksample]['X'].shape[0] * 0.15)
    rd_ = random.sample(range(list_D_chunked[ksample]['X'].shape[0]), ninduced)
    list_D_chunked[ksample]["Z"] = list_D_chunked[ksample]['X'][rd_, :]
```
Setting up induced points:
- Induced points are a subset of spatial locations used to reduce computational complexity while maintaining model accuracy.
- For each sample:
  - `ninduced = round(list_D_chunked[ksample]['X'].shape[0] * 0.15)`: Calculates the number of induced points as 15% of total spots.
  - `rd_ = random.sample(...)`: Randomly selects the induced points.
  - `list_D_chunked[ksample]["Z"] = list_D_chunked[ksample]['X'][rd_, :]`: Stores the selected points in the data structure.

The number of induced points (15% here) is a trade-off between computational efficiency and accuracy. You might need to adjust this percentage based on your dataset size and available computational resources.

Key parameters to consider:
- Induced points percentage (15% here): Balance between speed and accuracy
- Number of chunks per sample (`nchunk`, 1 here): Depends on dataset size and available memory

### 6.3 Model Initialization

Now we can initialize the model with our optimized data structure:

```python
list_fit = process_multiSample.ini_multiSample(list_D_chunked, L, "nb", chol=False)
```

Parameters:
- `list_D_chunked`: Our chunked data structure
- `L`: Number of factors to identify
- `"nb"`: Specifies the negative binomial distribution, which suits count data like gene expression because it handles overdispersion better than a Poisson distribution
- `chol=False`: Disables Cholesky decomposition for better memory usage

The function returns a list of initialized model objects containing the initial parameter values that will be optimized during training.

### 6.4 Training the Model

With optimization techniques in place, we can train the model:

```python
list_fit = training_multiSample.train_model_mNSF(
    list_fit,        # Initialized model
    pp,              # Directory for preprocessing results
    list_Dtrain,     # Chunked training data
    list_D_chunked,  # Full chunked dataset
    num_epochs=2     # Number of training iterations (demonstration value)
)
```

#### Training Parameters:
- `num_epochs`: Number of training iterations. The value of 2 here is for demonstration only; for real data, 100 or more is recommended, with training time increasing accordingly.
- The function automatically handles:
- Processing data chunks
- Managing induced points
- Optimizing model parameters
- Combining results across chunks

### 6.5 Monitoring Training

During training, you should monitor:
1. Memory usage: If too high, increase number of chunks
2. Training progress: Watch for convergence
3. Error messages: May indicate need to adjust parameters
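
For the memory check, a lightweight option is to print the resident memory of the Python process between runs (a sketch; it assumes the third-party `psutil` package is installed):

```python
import psutil  # assumed available: pip install psutil

# Report resident memory of the current process; if this grows too large,
# increase nchunk and re-run the chunking step.
rss_gb = psutil.Process().memory_info().rss / 1e9
print(f"Current memory usage: {rss_gb:.2f} GB")
```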

`train_model_mNSF` returns a list of trained model objects, one for each sample. These objects contain the optimized parameters that best explain the spatial patterns in your data according to the mNSF model.

## 7. Visualizing Results

After training, we can visualize the results. Here's how to plot the mNSF factors for a sample:

```python
Fplot = misc.t2np(list_fit[0].sample_latent_GP_funcs(list_D_chunked[0]["X"], S=3, chol=False)).T
hmkw = {"figsize": (4, 4), "bgcol": "white", "subplot_space": 0.1, "marker": "s", "s": 10}
fig, axes = visualize.multiheatmap(list_D_chunked[0]["X"], Fplot, (1, 2), cmap="RdBu", **hmkw)
```
@@ -371,7 +426,7 @@

Let's break this down:

This will produce a figure with two heatmaps, one for each factor, showing how these factors vary across the spatial dimensions of your sample.
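
Assuming the returned `fig` is a standard matplotlib figure (as the `fig, axes` unpacking suggests), you can save the plot in the usual way; the filename here is arbitrary:

```python
# Save the factor heatmaps to the output directory
fig.savefig(path.join(dir_output, "factors_sample1.png"), dpi=300, bbox_inches="tight")
```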

## 8. Calculate Moran's I for each factor

After obtaining the spatial factors from mNSF, it's important to quantify how spatially structured these factors are. One way to do this is by calculating Moran's I statistic for each factor. Moran's I is a measure of spatial autocorrelation, which tells us whether similar values tend to cluster together in space.
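
If you want to compute it without extra dependencies, here is a minimal, self-contained sketch of Moran's I using a binary k-nearest-neighbour weight matrix (an illustrative implementation only; the tutorial's own code may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

def morans_i(coords, values, k=6):
    """Moran's I of `values` observed at `coords`, with binary kNN weights."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = values.size
    # Query k + 1 neighbours because each point's nearest neighbour is itself
    _, idx = cKDTree(coords).query(coords, k=k + 1)
    W = np.zeros((n, n))
    for i, neighbours in enumerate(idx[:, 1:]):
        W[i, neighbours] = 1.0
    z = values - values.mean()
    return (n / W.sum()) * (z @ W @ z) / (z ** 2).sum()

# e.g., for the first factor plotted above (shapes assumed):
# i0 = morans_i(list_D_chunked[0]["X"], Fplot[:, 0])
```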

