Discrepancies Between Genome Sequences from BED Coordinates and tfrecords in Basenji Dataset #37
Comments
If it's convenient for you, could you please review the logic I used to test if the sequences are identical? My h5 files have all been converted from tfrecords, such that they should be exactly the same. I used the BED files and genome that you provided, but in the end, only two samples in the validation set had matching sequences.
I'm not sure that I completely understand what you're doing here. But I can assure you the sequences in the tfrecords match the hg38/mm10 reference genomes. My ".ml" versions only remove chromosomes that I don't want to train on. My best guess for why you're seeing a problem is that the tfrecords contain 131072 length sequences, but Ziga trained Enformer by further extending the sequences. If that's not it, I'd convert your sequences to DNA and blat via UCSC to track down the discrepancy.
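To make the two checks suggested above concrete, here is a minimal sketch in plain Python. It assumes the stored sequences are one-hot encoded with columns in A, C, G, T order (an assumption; check it against your own conversion code) and that any length difference comes from symmetric extension around the same center. The function names `one_hot_to_dna` and `centered_offset` are hypothetical helpers, not part of Basenji or Enformer.

```python
def one_hot_to_dna(one_hot, alphabet="ACGT"):
    """Decode a (length x 4) one-hot matrix into a DNA string.

    Assumes column order A, C, G, T. Rows that are not exactly
    one-hot (for example all zeros) are written as 'N'.
    """
    bases = []
    for row in one_hot:
        row = list(row)
        if sum(row) == 1 and max(row) == 1:
            bases.append(alphabet[row.index(1)])
        else:
            bases.append("N")
    return "".join(bases)


def centered_offset(short_seq, long_seq):
    """Return the start offset if short_seq is the centered window
    of long_seq (case-insensitive), otherwise None."""
    extra = len(long_seq) - len(short_seq)
    if extra < 0 or extra % 2 != 0:
        return None
    start = extra // 2
    window = long_seq[start:start + len(short_seq)]
    return start if window.upper() == short_seq.upper() else None
```

The decoded string can then be pasted into BLAT, and `centered_offset` gives a quick answer to whether a 131072-length sequence is simply the middle of a longer, extended one.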
I may have found the issue. It appears that the order of the data in the bed file does not correspond with the order in the TFRecords. For example, the second training data entry in the TFRecords corresponds to the 256th line in the bed file.
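One way to confirm an ordering mismatch like this is to match sequences by content instead of by position. The sketch below assumes both sets of sequences are already available as DNA strings of equal length; `match_records_to_bed` is a hypothetical helper name.

```python
def match_records_to_bed(record_seqs, bed_seqs):
    """For each record sequence, list the 0-based BED line indices
    whose extracted sequence is identical (case-insensitive).

    An empty list means no BED line produced that sequence.
    """
    index = {}
    for line_no, seq in enumerate(bed_seqs):
        index.setdefault(seq.upper(), []).append(line_no)
    return [index.get(seq.upper(), []) for seq in record_seqs]
```

If most records map to exactly one BED line, just not the line at the same position, the sequences themselves are consistent and only the order differs.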
I have identified discrepancies between the sequences extracted from the hg38.ml.fa reference genome using BED file coordinates and those stored in the tfrecords within the Basenji dataset hosted on Google Cloud (https://console.cloud.google.com/storage/browser/basenji_barnyard?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22). This inconsistency is concerning as it affects the reliability of our data used for genomic analyses.

Detailed Observations:
I retrieved sequences using the sequences.bed file from the hg38.ml.fa reference genome.
Upon comparing these sequences with those recorded in the tfrecords dataset, I discovered mismatches. For example, in the validation set for human data, only two samples had sequences that matched their coordinates.

Questions:
What could be causing these discrepancies between the retrieved sequences and those in the tfrecords?
How is the hg38.ml.fa genome processed, and how does it differ from the standard reference genome?

Any assistance or guidance on how to verify the consistency of the data in the Google Cloud repository would be greatly appreciated.
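For reference, the retrieval step described above can be sketched as follows. This is only an illustration under stated assumptions: the BED file is taken to have chromosome, start, and end in its first three columns with 0-based, half-open coordinates (the standard BED convention), and the whole FASTA is loaded into memory, which is simple but heavy for a full genome. `read_fasta` and `bed_sequences` are hypothetical helper names.

```python
def read_fasta(handle):
    """Read a FASTA stream into a dict of {sequence name: sequence}.

    The name is the first whitespace-delimited token after '>'.
    """
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def bed_sequences(bed_handle, genome):
    """Return (chrom, start, end, sequence) for each BED line.

    Assumes 0-based, half-open intervals; sequences are upper-cased
    so that soft-masked (lower-case) bases compare equal.
    """
    rows = []
    for line in bed_handle:
        fields = line.split()
        if len(fields) < 3 or line.startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        rows.append((chrom, start, end, genome[chrom][start:end].upper()))
    return rows
```

Two common sources of apparent mismatches at this step are an off-by-one from treating BED starts as 1-based and comparing soft-masked lower-case bases against upper-case ones.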