Update mnsf-tutorial-dlpfc.md
yiwang12 authored Nov 17, 2024
1 parent 3e88779 commit aa61db5
Showing 1 changed file with 98 additions and 42 deletions.
140 changes: 98 additions & 42 deletions tutorial/mnsf-tutorial-dlpfc.md
Make sure to replace `dir_mNSF_functions` with the actual path to the mNSF functions.
Before running mNSF, you need to set up some key parameters. This step is crucial as it defines how the model will behave:

```python
nchunk = 1 # Number of chunks per sample
L = 2 # Number of factors
nsample = 2 # Number of samples

mpth = path.join("models")          # directory for storing the trained models
misc.mkdir_p(mpth)                  # create the directory if it does not already exist
pp = path.join(mpth, "pp", str(2))  # subdirectory for storing preprocessing results
```

Let's break this down:

2. `nsample = 2`: This tells the model how many distinct samples are in your dataset. Make sure this matches the actual number of samples you're analyzing.

3. `nchunk = 1`: This sets the number of chunks per sample. Splitting each sample's spots into chunks keeps memory usage manageable for large datasets (chunking is covered in Section 6).

4. Setting up output directories:
- `mpth = path.join("models")`: This creates a path for storing the trained models.
- `misc.mkdir_p(mpth)`: This creates the directory if it doesn't exist.
- `pp = path.join(mpth, "pp", str(2))`: This creates a subdirectory for storing preprocessing results.
```python
list_D = []
list_X = []

for ksample in range(nsample):
    Y = pd.read_csv(f'path/to/Y_sample{ksample+1}.csv')
    X = pd.read_csv(f'path/to/X_sample{ksample+1}.csv')
    D = process_multiSample.get_D(X, Y)
    list_D.append(D)
    list_X.append(D["X"])
```
Make sure to replace 'path/to/' with the actual path to your data files.
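
Before moving on, it can help to confirm that each sample's expression and coordinate files exist and agree on the number of spots. Below is a minimal sketch using the same hypothetical file names as above; the assumption that each row of `Y` (expression) and `X` (coordinates) refers to the same spot should be checked against your own data.

```python
import os
import pandas as pd

for ksample in range(nsample):
    y_path = f'path/to/Y_sample{ksample+1}.csv'  # hypothetical paths; replace with your own
    x_path = f'path/to/X_sample{ksample+1}.csv'
    for p in (y_path, x_path):
        if not os.path.exists(p):
            raise FileNotFoundError(f"Missing input file: {p}")
    Y = pd.read_csv(y_path)
    X = pd.read_csv(x_path)
    # Assumption: rows of Y and X describe the same spots, so row counts should match.
    if Y.shape[0] != X.shape[0]:
        raise ValueError(f"Sample {ksample+1}: {Y.shape[0]} expression rows vs {X.shape[0]} coordinate rows")
    print(f"Sample {ksample+1}: {Y.shape[0]} spots, {Y.shape[1]} genes")
```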
After loading the data, we need to prepare it for input into mNSF:

```python
list_sampleID = process_multiSample.get_listSampleID(list_D)

```

`list_sampleID = process_multiSample.get_listSampleID(list_D)`: Extracts sample IDs from the processed data. This helps keep track of which data belongs to which sample.
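
If you want to double-check how spots are assigned to samples, a quick inspection like the following can help. This assumes each entry of `list_sampleID` holds the spot indices belonging to one sample, as described above; verify the returned structure on your own data.

```python
import numpy as np

# Illustrative check: report how many spots are assigned to each sample.
# Assumes each entry of list_sampleID is an array-like of spot indices.
for ksample, idx in enumerate(list_sampleID):
    print(f"Sample {ksample+1}: {np.asarray(idx).size} spots")
```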


### 5.3 Choose the number of factors to be used

The "best" number of factors often involves a nuanced balance between statistical fit and biological interpretability.



## 6. Model Training

### 6.1 Optimization Techniques

Before training the model, we'll implement two key optimization techniques that make mNSF practical for large datasets: induced points and data chunking.

#### Induced Points
Induced points reduce computational complexity by selecting representative spatial locations. This is crucial for:
- Managing memory usage with large datasets
- Reducing computational time
- Maintaining model accuracy while improving efficiency

#### Data Chunking
Data chunking divides the data into manageable pieces (illustrated with a toy example below), enabling:
- Processing of datasets too large to fit in memory
- Potential parallel processing
- Better memory management during training
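
Conceptually, chunking simply partitions a sample's spots into smaller groups that are processed one at a time. The toy example below illustrates the idea with plain `numpy`; it is not the mNSF implementation, which is handled by `process_multiSample.get_chunked_data` in the next subsection.

```python
import numpy as np

coords = np.random.rand(1000, 2)   # toy spatial coordinates for one sample
n_chunks = 4                       # number of chunks for this illustration

# Split the spot indices into roughly equal groups; each group can then be
# processed separately, which bounds peak memory usage during training.
chunks = np.array_split(np.arange(coords.shape[0]), n_chunks)
for i, idx in enumerate(chunks):
    print(f"Chunk {i+1}: {idx.size} spots")
```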

### 6.2 Setting Up Optimization

First, let's implement both optimization techniques:

```python
# Process data chunking
list_D_chunked = list()
list_X_chunked = list()
for ksample in range(nsample):
    Y = pd.read_csv(f'path/to/Y_sample{ksample+1}.csv')
    X = pd.read_csv(f'path/to/X_sample{ksample+1}.csv')
    list_D_sampleTmp, list_X_sampleTmp = process_multiSample.get_chunked_data(X, Y, nchunk)
    list_D_chunked = list_D_chunked + list_D_sampleTmp
    list_X_chunked = list_X_chunked + list_X_sampleTmp

# Extract the training data in the format required for model training
list_Dtrain = process_multiSample.get_listDtrain(list_D_chunked)

# Set up induced points for each sample
for ksample in range(nsample):
    # Select 15% of spots as induced points
    ninduced = round(list_D_chunked[ksample]['X'].shape[0] * 0.15)
    rd_ = random.sample(range(list_D_chunked[ksample]['X'].shape[0]), ninduced)
    list_D_chunked[ksample]["Z"] = list_D_chunked[ksample]['X'][rd_, :]
```
Setting up induced points:
- Induced points are a subset of spatial locations used to reduce computational complexity while maintaining model accuracy.
- For each sample:
  - `ninduced = round(list_D_chunked[ksample]['X'].shape[0] * 0.15)`: Calculates the number of induced points as 15% of total spots.
  - `rd_ = random.sample(...)`: Randomly selects the induced points.
  - `list_D_chunked[ksample]["Z"] = list_D_chunked[ksample]['X'][rd_, :]`: Stores the selected points in the data structure.

The number of induced points (15% here) is a trade-off between computational efficiency and accuracy. You might need to adjust this percentage based on your dataset size and available computational resources.

Key parameters to consider:
- Induced points percentage (15% here): Balance between speed and accuracy
- Number of chunks per sample (1 here, as set by `nchunk`): Depends on dataset size and available memory (see the sketch below for one way to scale these settings)
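
If you want a rough starting point for scaling the induced-point fraction with dataset size, a simple heuristic like the one below can be used. The thresholds are illustrative assumptions, not values prescribed by mNSF.

```python
def suggest_induced_fraction(n_spots):
    """Illustrative heuristic: use a smaller fraction of induced points
    as the number of spots grows, to keep training tractable."""
    if n_spots <= 2000:
        return 0.15
    elif n_spots <= 10000:
        return 0.10
    return 0.05

for ksample in range(nsample):
    n_spots = list_D_chunked[ksample]['X'].shape[0]
    print(f"Sample {ksample+1}: {n_spots} spots -> suggested induced fraction "
          f"{suggest_induced_fraction(n_spots):.2f}")
```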

### 6.3 Model Initialization

Now we can initialize the model with our optimized data structure:

```python
list_fit = process_multiSample.ini_multiSample(list_D_chunked, L, "nb")
```

Parameters:
- `list_D_chunked`: Our chunked data structure
- `L`: Number of factors to identify
- `"nb"`: Specifies negative binomial distribution

### 6.4 Training the Model

With optimization techniques in place, we can train the model:

```python
list_fit = training_multiSample.train_model_mNSF(
    list_fit,            # Initialized model
    pp,                  # Directory for preprocessing results
    list_Dtrain,         # Chunked training data
    list_D_chunked,      # Full chunked dataset
    num_epochs=2,        # Number of training iterations
    nsample=nsample,     # Number of samples
    nchunk=nchunk        # Number of chunks
)
```

#### Training Parameters
- `num_epochs`: Number of training iterations (500 recommended for real data)
- The function automatically handles:
- Processing data chunks
- Managing induced points
- Optimizing model parameters
- Combining results across chunks

Note that `num_epochs=2` is far too low for real-world applications and is used here only for demonstration. In practice, a value on the order of 500 (as noted above) usually gives better results, though training time increases accordingly; you may need to experiment to find the right balance for your dataset. The function returns a list of trained model objects containing the optimized parameters that best explain the spatial patterns in your data.
### 6.5 Monitoring Training

During training, you should monitor the following (a simple way to track memory use and runtime is sketched after this list):
1. Memory usage: If too high, increase number of chunks
2. Training progress: Watch for convergence
3. Error messages: May indicate need to adjust parameters
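
A lightweight way to keep an eye on memory and runtime is to wrap the training call as sketched below. This uses the optional `psutil` package; the wrapper is an illustration, not part of mNSF.

```python
import time
import psutil

process = psutil.Process()
t0 = time.time()
mem_before = process.memory_info().rss / 1e9  # resident memory in GB

list_fit = training_multiSample.train_model_mNSF(
    list_fit, pp, list_Dtrain, list_D_chunked,
    num_epochs=2, nsample=nsample, nchunk=nchunk
)

mem_after = process.memory_info().rss / 1e9
print(f"Training took {time.time() - t0:.1f} s; "
      f"resident memory went from {mem_before:.2f} GB to {mem_after:.2f} GB")
# If memory use is too high, increase nchunk and re-run the chunking step in Section 6.2.
```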

## 7. Visualizing Results

After training, we can visualize the results. Here's how to plot the mNSF factors for a sample:

```python
Fplot = misc.t2np(list_fit[0].sample_latent_GP_funcs(list_D_chunked[0]["X"], S=3, chol=False)).T
hmkw = {"figsize": (4, 4), "bgcol": "white", "subplot_space": 0.1, "marker": "s", "s": 10}
fig, axes = visualize.multiheatmap(list_D[0]["X"], Fplot, (1, 2), cmap="RdBu", **hmkw)
```
Expand Down Expand Up @@ -366,7 +422,7 @@ Let's break this down:

This will produce a figure with two heatmaps, one for each factor, showing how these factors vary across the spatial dimensions of your sample.
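
To keep these results for later use, you can save both the figure and the per-spot factor values. The sketch below assumes `fig` behaves like a standard matplotlib figure and that `Fplot` has one row per spot and one column per factor, with two coordinate columns in `list_D_chunked[0]["X"]`; adjust the file names as needed.

```python
import pandas as pd

# Save the heatmap figure (assuming fig is a matplotlib Figure).
fig.savefig("sample1_factors.png", dpi=300, bbox_inches="tight")

# Save the factor values alongside the spot coordinates for downstream analysis.
factors = pd.DataFrame(Fplot, columns=[f"factor{k+1}" for k in range(Fplot.shape[1])])
coords = pd.DataFrame(list_D_chunked[0]["X"], columns=["x", "y"])
pd.concat([coords, factors], axis=1).to_csv("sample1_factors.csv", index=False)
```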

## 8. Calculate Moran's I for each factor

After obtaining the spatial factors from mNSF, it's important to quantify how spatially structured these factors are. One way to do this is by calculating Moran's I statistic for each factor. Moran's I is a measure of spatial autocorrelation, which tells us whether similar values tend to cluster together in space.


