When using Akita for single-cell prediction on the HCT116 cell line, the Pearson correlation coefficient is much lower. #186

Open
1944498970 opened this issue Nov 25, 2023 · 8 comments

Comments

@1944498970

Hi, I have downloaded the sequence.bed file and the cool file for the HCT116 cell line from your Google resource. I have performed data processing and training using your provided code and the parameters given in the tutorial. However, the Pearson correlation coefficient (PearsonR) is significantly lower than 0.6.

@davek44
Contributor

davek44 commented Dec 9, 2023

The tutorial parameters are intended to demonstrate the model and training. You'll want to take parameters from the full model in order to reproduce the paper results.

@1944498970
Author

Thank you for your response. I have used the model parameters mentioned in the paper (link: https://github.com/calico/basenji/blob/master/manuscripts/akita/params.json). What parameters should be adjusted to replicate the experiment? Are these related to data processing?

@davek44
Contributor

davek44 commented Dec 15, 2023

You don't need to adjust the parameters to replicate.

@1944498970
Author

1944498970 commented Dec 15, 2023 via email

@davek44
Contributor

davek44 commented Dec 15, 2023

It's impossible to say with the information you've given. Could you provide more details?

@1944498970
Author

Of course. I downloaded https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz, and from https://storage.googleapis.com/basenji_hic I downloaded Unsynchronized_all.hg38.2048.cool and sequences.bed. When running akita_data.py, I read the contents of sequences.bed and assigned them to mseqs to ensure that I am using the same data as you. For processing, I used the options -l 1048576 --crop 65536 --local --as_obsexp -p 16. I then trained with the parameters from https://github.com/calico/basenji/blob/master/manuscripts/akita/params.json, using the -k option of akita_train.py. I disabled early stopping so that the model would train for enough epochs. After 140 epochs the Pearson R has stabilized, and R on the validation set does not exceed 0.45.

The parameters are the same as in that params.json.
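For readers comparing numbers, here is a minimal sketch of a plain Pearson R between predicted and target contact maps, computed over all finite entries after flattening. The function name `pearson_r` is hypothetical, and the metric reported by the training script may aggregate differently (for example per target or per batch), so values from this sketch are only a rough cross-check.

```python
import numpy as np


def pearson_r(pred, target):
    """Pearson correlation over all finite entries of two same-shaped arrays."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    # Ignore positions where either array is NaN or infinite.
    keep = np.isfinite(pred) & np.isfinite(target)
    p = pred[keep] - pred[keep].mean()
    t = target[keep] - target[keep].mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    # A constant input has zero variance, so the correlation is undefined.
    return float((p * t).sum() / denom) if denom > 0 else float("nan")
```

Comparing this value on held-out maps before and after a change in target preprocessing can help show whether the targets, rather than the model parameters, explain a low score.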

@davek44
Contributor

davek44 commented Dec 17, 2023

I brainstormed a bit with Geoff, and one thing we caught was that you need to set -k 1 for the akita_data.py script to perform Gaussian smoothing of the data and make sure the values are getting clipped to [-2,2]. I'm copy-pasting the Methods paragraph with these details.

To focus on locus-specific patterns and mitigate the impact of sparse sampling present in even the currently highest-resolution Hi-C maps, we adaptively coarse-grain, normalize for the distance-dependent decrease in contact frequency, take a natural log, clip to (−2,2), linearly interpolate missing bins and convolve with a small 2D Gaussian filter (sigma, 1 and width, 5). The first to third steps use cooltools functions (https://github.com/mirnylab/cooltools). Interpolation of low-coverage bins filtered out in typical Hi-C pipelines was crucial for learning with log(observed/expected) Hi-C targets, greatly outperforming replacing these bins with zeros.
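The last four steps of that paragraph (natural log, clip to (−2,2), interpolate missing bins, Gaussian smoothing with sigma 1 and width 5) can be illustrated with a small NumPy-only sketch. This is not the code in akita_data.py: all function names here are hypothetical, the input is assumed to be an observed/expected matrix that has already been coarse-grained and distance-normalized, and the row-then-column interpolation is a simplification of whatever interpolation the real pipeline uses.

```python
import numpy as np


def gaussian_kernel_2d(sigma=1.0, width=5):
    """Normalized 2D Gaussian kernel of shape (width, width); width assumed odd."""
    half = width // 2
    x = np.arange(-half, half + 1)
    g = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def interp_missing(mat):
    """Fill NaNs by 1D linear interpolation along rows, then along columns."""
    out = np.array(mat, dtype=float)
    for _ in range(2):  # pass 1 works on rows; after the transpose, pass 2 works on columns
        idx = np.arange(out.shape[1])
        for row in out:  # each row is a view, so assignment edits `out` in place
            bad = np.isnan(row)
            if bad.any() and not bad.all():
                row[bad] = np.interp(idx[bad], idx[~bad], row[~bad])
        out = out.T
    # Anything still missing (e.g. an all-NaN matrix) falls back to zero.
    return np.nan_to_num(out, nan=0.0)


def smooth(mat, kernel):
    """2D convolution with edge padding (the kernel is symmetric)."""
    half = kernel.shape[0] // 2
    padded = np.pad(mat, half, mode="edge")
    out = np.zeros(mat.shape, dtype=float)
    rows, cols = mat.shape
    for di in range(kernel.shape[0]):
        for dj in range(kernel.shape[1]):
            out += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out


def preprocess(obs_over_exp, clip=2.0, sigma=1.0, width=5):
    """log -> clip -> interpolate NaN bins -> Gaussian smoothing."""
    mat = np.log(np.asarray(obs_over_exp, dtype=float))  # NaN bins stay NaN
    mat = np.clip(mat, -clip, clip)                       # clipping also preserves NaN
    mat = interp_missing(mat)
    return smooth(mat, gaussian_kernel_2d(sigma, width))
```

A quick sanity check on processed targets is that they contain no NaN values and stay within the clip range. If either fails, the clipping or smoothing step was probably skipped.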

@1944498970
Author

Thank you very much. I set the 'clip' in the target.txt file, but I forgot to use the '-k 1' parameter during processing. I will add the parameter and retrain to see the results.
