Discrepancies Between Genome Sequences from BED Coordinates and tfrecords in Basenji Dataset #37

yangzhao1230 · 2024-06-25T02:35:44Z

I have identified discrepancies between the sequences extracted from the hg38.ml.fa reference genome using BED file coordinates and those stored in the tfrecords within the Basenji dataset hosted on Google Cloud (https://console.cloud.google.com/storage/browser/basenji_barnyard?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22). This inconsistency is concerning as it affects the reliability of our data used for genomic analyses.

Detailed Observations:

I retrieved sequences using the sequences.bed file from the hg38.ml.fa reference genome.
Upon comparing these sequences with those recorded in the tfrecords dataset, I discovered mismatches. For example, in the validation set for human data, only two samples had sequences that matched their coordinates.

Questions:

What could be causing these discrepancies between the retrieved sequences and those in the tfrecords?
How is the hg38.ml.fa genome processed, and how does it differ from the standard reference genome?

Any assistance or guidance on how to verify the consistency of the data in the Google Cloud repository would be greatly appreciated.

yangzhao1230 · 2024-06-25T02:43:21Z

If it's convenient for you, could you please review the logic I used to test if the sequences are identical? My h5 files have all been converted from tfrecords, such that they should be exactly the same. I used the BED files and genome that you provided, but in the end, only two samples in the validation set had matching sequences.

import h5py
import torch
from torch.utils.data import Dataset
import numpy as np
from Bio import SeqIO
import pandas as pd
from tqdm import tqdm

ENFORMER_INPUT_LENGTH = 196_608
BASENJI_INPUT_LENGTH = 131_072

class H5Dataset(Dataset):
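    def __init__(self, h5_file_path, genome_dict, df):
        self.h5_file = h5py.File(h5_file_path, 'r')
        self.sequences = self.h5_file['sequences']
        self.targets = self.h5_file['targets']
        self.genome_dict = genome_dict
        self.df = df
        # Counters updated in __getitem__; they must exist before the first lookup
        self.equal_count = 0
        self.unequal_count = 0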

    def __del__(self):
        self.h5_file.close()

    @staticmethod
    def decode_one_hot(one_hot_encoded):
        """
        Decode one-hot encoded sequence to string.
        """
        mapping = np.array(['A', 'C', 'G', 'T', 'N'])
        indices = np.argmax(one_hot_encoded, axis=1)
        all_zeros = np.all(one_hot_encoded == 0, axis=1)
        indices[all_zeros] = 4 
        return ''.join(mapping[indices])

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):

        # Directly read sequences provided by Basenji
        sequence_h5 = self.decode_one_hot(self.sequences[idx])
        target = torch.tensor(self.targets[idx], dtype=torch.float32)

        # Retrieve sequence from genome with coordinates provided by Basenji
        row = self.df.iloc[idx]
        chrom, start, end = row["chrom"], row["start"], row["end"]
        sequence = str(self.genome_dict[chrom].seq[start:end]).upper()
        
        if sequence == sequence_h5:
            self.equal_count += 1
        else:
            self.unequal_count += 1

        # Extend sequence to match Enformer input length
        median = (start + end) // 2
        enformer_start = median - ENFORMER_INPUT_LENGTH // 2
        enformer_end = median + ENFORMER_INPUT_LENGTH // 2
        enformer_sequence = str(self.genome_dict[chrom].seq[enformer_start:enformer_end]).upper()

        return {
            'sequence': enformer_sequence,
            'target': target,
        }

if __name__ == '__main__':
    # Dataset paths
    bed_path = "/blob/Data/human/sequences.bed"
    genome_path = "/blob/Data/DNA/Caduceus/hg38/hg38.ml.fa"
    train_h5_path = "/home/aiscuser/data/enformer_h5/basenji/human_train.h5"
    valid_h5_path = "/home/aiscuser/data/enformer_h5/basenji/human_valid.h5"
    test_h5_path = "/home/aiscuser/data/enformer_h5/basenji/human_test.h5"
    # Load genome
    genome_dict = SeqIO.to_dict(SeqIO.parse(genome_path, "fasta"))
    # Load bed file
    df = pd.read_csv(bed_path, sep="\t", header=None)
    df.columns = ["chrom", "start", "end", "split"]
    train_df = df[df["split"] == "train"].reset_index(drop=True)
    valid_df = df[df["split"] == "valid"].reset_index(drop=True)
    test_df = df[df["split"] == "test"].reset_index(drop=True)
    # Load datasets
    train_dataset = H5Dataset(train_h5_path, genome_dict, train_df)
    valid_dataset = H5Dataset(valid_h5_path, genome_dict, valid_df)
    test_dataset = H5Dataset(test_h5_path, genome_dict, test_df)
    # Check if sequences are equal, tracking the target value range along the way
    test_max, test_min = float("-inf"), float("inf")
    for i, sample in enumerate(tqdm(test_dataset)):
        targets = sample["target"]
        max_value = targets.max().item()
        min_value = targets.min().item()
        test_max = max(test_max, max_value)
        test_min = min(test_min, min_value)
    print(f"Equal count: {test_dataset.equal_count}, Unequal count: {test_dataset.unequal_count}")

davek44 · 2024-06-30T22:37:39Z

I'm not sure that I completely understand what you're doing here. But I can assure you the sequences in the tfrecords match the hg38/mm10 reference genomes. My ".ml" versions only remove chromosomes that I don't want to train on. My best guess for why you're seeing a problem is that the tfrecords contain sequences of length 131072, but Ziga trained Enformer by further extending the sequences. If that's not it, I'd convert your sequences to DNA and BLAT them via UCSC to track down the discrepancy.
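
To make the length difference concrete, here is a minimal sketch of the interval arithmetic, assuming the extension is symmetric around the midpoint of the 131072 bp window (as the script in this thread does). The helper names `extend_interval` and `central_window` are hypothetical and not part of Basenji or Enformer.

```python
BASENJI_INPUT_LENGTH = 131_072
ENFORMER_INPUT_LENGTH = 196_608


def extend_interval(start, end, target_length=ENFORMER_INPUT_LENGTH):
    """Widen [start, end) symmetrically around its midpoint to target_length."""
    mid = (start + end) // 2
    new_start = mid - target_length // 2
    return new_start, new_start + target_length


def central_window(start, end, length=BASENJI_INPUT_LENGTH):
    """Return the centered sub-interval of the given length."""
    mid = (start + end) // 2
    new_start = mid - length // 2
    return new_start, new_start + length


# Flank added on each side when going from 131072 bp to 196608 bp
flank = (ENFORMER_INPUT_LENGTH - BASENJI_INPUT_LENGTH) // 2

# Illustrative coordinates only, not taken from sequences.bed
start, end = 1_000_000, 1_000_000 + BASENJI_INPUT_LENGTH
extended = extend_interval(start, end)
recovered = central_window(*extended)
```

Under that assumption, only the central 131072 bases of an extended sequence should be compared against the tfrecord sequence; `extend_interval` does not clamp at chromosome ends, so a negative start would need separate handling.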

yangzhao1230 · 2024-07-02T10:09:38Z

Thank you for your quick response. I would like to ask if the one-hot encoded sequences stored in your TFRecord are consistent with the encoding format described in the Enformer paper?

I am not sure if I am decoding these one-hot vectors correctly (I am using the 131k bed file and TFRecord you provided for comparison, so the lengths are consistent).
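
One way to rule out a decoding error is a round-trip check on a known string. This is a minimal sketch that assumes the channel order A, C, G, T with an all-zero row for N, which is what my `decode_one_hot` above assumes; whether the tfrecords use exactly this layout is the open question. The function names here are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order


def one_hot_encode(seq):
    """Encode DNA as an (L, 4) array; bases outside ACGT become all-zero rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def one_hot_decode(one_hot):
    """Decode an (L, 4) array back to a string, mapping all-zero rows to N."""
    mapping = np.array(list(ALPHABET) + ["N"])
    indices = np.argmax(one_hot, axis=1)
    indices[~one_hot.any(axis=1)] = 4
    return "".join(mapping[indices])


seq = "ACGTNNacgt"
roundtrip = one_hot_decode(one_hot_encode(seq))
```

If the round trip is clean but the decoded tfrecord sequences still differ from the genome, the decoding itself is unlikely to be the cause.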

yangzhao1230 · 2024-07-02T10:31:58Z

I may have found the issue. It appears that the order of the data in the bed file does not correspond with the order in the TFRecords. For example, the second training data entry in the TFRecords corresponds to the 256th line in the bed file.
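
Since the record order cannot be assumed to follow the BED order, a content-based lookup is one way to pair them up. The sketch below hashes the genome sequence of every BED row and then looks up each decoded record. It uses a toy genome stored as plain strings; with Biopython records the slice would be `str(record.seq[start:end])`. The names `seq_key`, `build_bed_lookup`, and `match_records` are hypothetical.

```python
import hashlib


def seq_key(seq):
    """Digest of the upper-cased sequence, used as a compact dictionary key."""
    return hashlib.sha1(seq.upper().encode()).hexdigest()


def build_bed_lookup(genome, bed_rows):
    """Map sequence digest -> list of BED row indices with that sequence."""
    lookup = {}
    for i, (chrom, start, end) in enumerate(bed_rows):
        lookup.setdefault(seq_key(genome[chrom][start:end]), []).append(i)
    return lookup


def match_records(record_seqs, lookup):
    """For each decoded record, return the BED row indices it matches."""
    return [lookup.get(seq_key(seq), []) for seq in record_seqs]


# Toy data standing in for the genome, sequences.bed, and decoded records
genome = {"chr1": "ACGTACGGTTCAGGCTAACG", "chr2": "TTGACCGATAGGCATCCGAT"}
bed_rows = [("chr1", 0, 8), ("chr1", 8, 16), ("chr2", 4, 12)]
record_seqs = ["CCGATAGG", "ACGTACGG", "TTCAGGCT"]  # deliberately shuffled

matches = match_records(record_seqs, build_bed_lookup(genome, bed_rows))
```

An empty list for a record would mean no BED row of the same split reproduces it exactly, which would point to a real sequence difference rather than an ordering problem.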
