Understanding patterns of X chromosome inactivation in full term human placenta
- Directory placenta
- Snakefile sexcheck.snakefile
- Generate config files
- python generate_json_config_dna_females.py: takes the input file female_sample_ids.csv and outputs the config file process_dna_females_config.json (for female placentas)
- python generate_json_config_dna_males.py: takes the input file male_sample_ids.csv and outputs the config file process_dna_males_config.json (for male placentas)
- Snakemake files:
- For mapping and genotyping variants: process_dna_females.snakefile and process_dna_males.snakefile
- Config file: asereadcounter_config.json
- Snakemake file: asereadcounter.snakefile
- Output directory: 02_run_asereadcounter/asereadcounter
- Subdirectories: scripts and results
- Calculate (unphased) allele balance:
- Python script calc_allele_balance.py
- Config file: analyze_ase_config.json
- Snakemake file: analyze_ase.snakefile
- Calculate median allele balance per individual:
- Python script calc_median_allele_balance_placenta_decidua.py (see the Bash script run_calc_median_allele_balance_placenta_decidua.sh)
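The two calculations above can be illustrated with a minimal sketch. It assumes that allele balance is the read count of the more highly expressed (biased) allele divided by the total read count at a heterozygous site; the function names are hypothetical and are not taken from calc_allele_balance.py or calc_median_allele_balance_placenta_decidua.py.

```python
from statistics import median


def allele_balance(ref_count, alt_count):
    """Unphased allele balance: reads of the biased allele over total reads.

    Assumption: the biased allele is whichever allele has more reads.
    """
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("variant has no reads")
    return max(ref_count, alt_count) / total


def median_allele_balance(counts):
    """Median allele balance over (ref_count, alt_count) pairs of one individual."""
    return median(allele_balance(ref, alt) for ref, alt in counts)


# Hypothetical read counts for three heterozygous variants in one sample.
example_counts = [(8, 2), (9, 1), (5, 5)]
example_median = median_allele_balance(example_counts)
```

With this definition the value always lies between 0.5 (both alleles equally expressed) and 1.0 (only one allele expressed).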
- Phasing strategy:
- For each pair of placenta samples (site A and site B):
- Subset to the shared expressed variants
- Use the site that has more variants with allele balance greater than 0.8 as the reference for phasing
- Generate a haplotype by joining the biased alleles together. If the allele balance at a variant is equal to 0.5, pick an allele at random
- Calculate allele balance using the phased data
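The phasing strategy can be sketched as follows. This is an illustration under stated assumptions, not the code in phase.snakefile: each site is assumed to be a dictionary mapping a variant position to a (ref_count, alt_count) pair with at least one read, and all function names are hypothetical.

```python
import random


def biased_fraction(ref, alt):
    # Unphased allele balance: reads of the more expressed allele over total reads.
    return max(ref, alt) / (ref + alt)


def count_skewed(site, threshold=0.8):
    # Number of variants whose unphased allele balance exceeds the threshold.
    return sum(1 for ref, alt in site.values() if biased_fraction(ref, alt) > threshold)


def build_haplotype(site, rng):
    # Join the biased allele at every variant; break exact ties at random.
    haplotype = {}
    for pos, (ref, alt) in site.items():
        if ref == alt:
            haplotype[pos] = rng.choice(["ref", "alt"])
        else:
            haplotype[pos] = "ref" if ref > alt else "alt"
    return haplotype


def phased_allele_balance(site, haplotype):
    # Fraction of reads carrying the haplotype allele at each variant.
    return {
        pos: (ref if haplotype[pos] == "ref" else alt) / (ref + alt)
        for pos, (ref, alt) in site.items()
    }


def phase_pair(site_a, site_b, seed=0):
    # Keep only variants expressed at both sites.
    shared = sorted(site_a.keys() & site_b.keys())
    a = {pos: site_a[pos] for pos in shared}
    b = {pos: site_b[pos] for pos in shared}
    # The site with more skewed variants defines the haplotype.
    reference = a if count_skewed(a) >= count_skewed(b) else b
    haplotype = build_haplotype(reference, random.Random(seed))
    return phased_allele_balance(a, haplotype), phased_allele_balance(b, haplotype)


# Hypothetical counts: positions 100 and 200 are shared between the two sites.
site_a = {100: (9, 1), 200: (1, 9), 300: (5, 5)}
site_b = {100: (2, 8), 200: (6, 4), 400: (3, 3)}
phased_a, phased_b = phase_pair(site_a, site_b)
```

Because both sites are scored against the same haplotype, a phased allele balance below 0.5 at one site indicates that it favours the opposite allele from the reference site at that variant.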
- Steps:
- For each pair of placenta (site A and B), find shared variants between site A and site B:
python subset_paired_placentas_for_shared_variants.py chrX > chrX_summary_stats.txt
python subset_paired_placentas_for_shared_variants.py chr8 > chr8_summary_stats.txt
- Results are in directory paired_placentas_shared_variants/
- Run the snakefile to compute allele balance for phased data:
snakemake --snakefile phase.snakefile
- Concatenate for plotting Figure 2:
cd 04_phasing/phased_allele_balance/
cat *chrX*allele_balance_summary.tsv | grep -v sample_id | sort -n -r -k 3,3 > all_placenta_chrX_phased_allele_balance.tsv
- Contains files for generating the PCA
- Directory gtex
- In this directory, we are analyzing the ASEReadCounter counts from GTEx version 8.
- Download the file participant.tsv from the AnVIL project website. This file has information about the sample ID.
- Download the file sample.tsv from the AnVIL project website. This file has information about the RNA ID and the tissue.
- Obtain a list of individuals
- There are 979 individuals
- Run the Python script obtain_individuals_list.py
- Download using the file download_asereadcounter_count.sh
- After downloading, I noticed that some files contain this message: No such object: fc-secure-ff8156a3-ddf3-42e4-9211-0fd89da62108/GTEx_Analysis_2017-06-05_v8_ASE_WASP_chrX_raw_counts_by_subject/GTEX-1J8EW.v8.readcounts.chrX.txt.gz. We want to remove these individuals from further analyses, so we first need to identify them.
- Run the Python script: python check_corrupted_files.py
- The output file is failed_files.txt. There are 147 individuals without ASEReadCounter results.
- Generate a config file from the file sample.tsv: python generate_config.py
- Remove the individuals listed in failed_files.txt
- Only keep the females
1. Download:
wget https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt
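The two filters applied when generating the config (drop failed downloads, keep females) might look like the sketch below. It assumes the subject phenotypes file is tab-delimited with SUBJID and SEX columns, where SEX is coded 1 for male and 2 for female. The format of failed_files.txt is not described here, so the sketch simply takes a set of failed individual IDs; the function names are hypothetical and not taken from generate_config.py.

```python
import csv
import io


def female_ids(phenotype_text):
    """Return subject IDs coded as female (SEX == 2) in the phenotype table."""
    reader = csv.DictReader(io.StringIO(phenotype_text), delimiter="\t")
    return {row["SUBJID"] for row in reader if row["SEX"] == "2"}


def keep_individuals(candidates, failed, females):
    """Keep candidates that are female and did not fail to download."""
    return sorted(i for i in candidates if i in females and i not in failed)


# Hypothetical, abbreviated phenotype table (the IDs are placeholders).
phenotypes = "SUBJID\tSEX\tAGE\nGTEX-AAAA\t2\t50-59\nGTEX-BBBB\t1\t60-69\nGTEX-CCCC\t2\t40-49\n"
failed = {"GTEX-CCCC"}  # placeholder for the IDs listed in failed_files.txt
kept = keep_individuals(["GTEX-AAAA", "GTEX-BBBB", "GTEX-CCCC"], failed, female_ids(phenotypes))
```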
- Snakemake file: analyze_gtex_counts.snakefile
- Subset each downloaded count file for each tissue
- Because each count file includes all of the tissues for an individual, I need to subset for each tissue
- Use the Python script subset_gtex_counts.py. See snakemake rule subset_gtex_counts (line 9)
- Calculate allele balance
- See snakemake rule calc_allele_balance (line 24)
- Calculate median allele balance per tissue:
- See snakemake rule calc_median_allele_balance_per_tissue (line 36)
- Find tissues where there are at least 10 samples per tissue:
python find_tissues_more_than_10_samples_per_tissue.py
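Two of the steps above, splitting one individual's count file by tissue and keeping tissues with enough samples, can be sketched as below. The column name TISSUE_ID is an assumption about the GTEx count table, the cut-off follows the "at least 10" wording, and the function names are hypothetical rather than taken from subset_gtex_counts.py or find_tissues_more_than_10_samples_per_tissue.py.

```python
from collections import defaultdict


def subset_by_tissue(rows, tissue_key="TISSUE_ID"):
    """Group the rows of one individual's count table by tissue."""
    by_tissue = defaultdict(list)
    for row in rows:
        by_tissue[row[tissue_key]].append(row)
    return dict(by_tissue)


def tissues_with_enough_samples(samples_by_tissue, min_samples=10):
    """Return tissues that have at least min_samples distinct samples."""
    return sorted(
        tissue
        for tissue, samples in samples_by_tissue.items()
        if len(set(samples)) >= min_samples
    )


# Hypothetical rows from one individual's count file (tissue codes are placeholders).
rows = [
    {"TISSUE_ID": "TISSUE_1", "POS": "100"},
    {"TISSUE_ID": "TISSUE_2", "POS": "100"},
    {"TISSUE_ID": "TISSUE_1", "POS": "200"},
]
per_tissue = subset_by_tissue(rows)

# Hypothetical sample lists per tissue.
samples = {"TISSUE_1": [f"S{i}" for i in range(10)], "TISSUE_2": ["S1", "S2"]}
kept_tissues = tissues_with_enough_samples(samples)
```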
- Directory heart
python find_samples_with_2_hearts.py
python calc_prop_variants_skewed_per_sample.py
python /scratch/tphung3/Placenta_XCI/heart/subset_paired_hearts_for_shared_variants.py chrX
- Results are in directory paired_hearts_shared_variants/
- Use the snakemake file phase.snakefile. Results are in directory phased_allele_balance/
cat *chrX* | grep -v sample_id | awk '{print$1"\t"$3"\t"$2}' | sort -n -r -k 3,3 > all_heart_chrX_phased_allele_balance.tsv
- Directory gene_level
- In this directory, I am analyzing genes that escape XCI in placenta and GTEx tissues, using the individuals that show skewed allele balance (median allele balance greater than 0.8)
- Sub-directories: gtex_counts and specific_gene_analysis
- Process the placenta skewed samples
- Convert skewed samples to bed file format
- Use the Python script convert_asereadcounter_to_bed.py
- See the snakemake rule convert_asereadcounter_to_bed_placenta
- Convert gtf file to bed file format
python convert_gtf_to_bed.py
- Output is /scratch/tphung3/Placenta_XCI/gene_level/wes_genotyping/gtf_bed/gencode.v29.annotation.chrX.bed
- Use bedtools to find where on the genes the variants are
- Snakemake rule bedtools_intersect
- Remove duplicates: snakemake rule find_unique_lines_after_bedtools
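The conversion to bed format matters because bedtools expects 0-based, half-open intervals, while ASEReadCounter reports 1-based positions. A minimal sketch of that conversion follows, assuming the standard GATK ASEReadCounter column names (contig, position, refCount, altCount, totalCount). Which columns convert_asereadcounter_to_bed.py actually carries over is not stated here, so the output layout is an assumption.

```python
import csv
import io


def asereadcounter_to_bed(table_text):
    """Convert ASEReadCounter rows to BED-style tuples.

    A 1-based position p becomes the interval [p - 1, p).
    The count columns kept after the interval are an illustrative choice.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    bed_rows = []
    for row in reader:
        pos = int(row["position"])
        bed_rows.append(
            (
                row["contig"],
                pos - 1,  # BED start (0-based)
                pos,      # BED end (exclusive)
                int(row["refCount"]),
                int(row["altCount"]),
                int(row["totalCount"]),
            )
        )
    return bed_rows


# Hypothetical single-variant table.
table = (
    "contig\tposition\tvariantID\trefAllele\taltAllele\trefCount\taltCount\ttotalCount\n"
    "chrX\t1000\t.\tA\tG\t8\t2\t10\n"
)
bed = asereadcounter_to_bed(table)
```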
- Find samples that are skewed in GTEx tissues, placenta, decidua females, and decidua males
python /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/scripts/find_samples_skewed.py
- The config file is /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/escape_genes_config.json
- There are 525 samples that are skewed in GTEx
- Tabulate how many individuals are highly skewed in each tissue
python /scratch/tphung3/PlacentaSexDiff/E_escape_genes/gtex_counts/scripts/tabulate_individuals.py
- The result file is tabulate_individuals.csv
- Convert skewed samples to bed file format
- Rule convert_asereadcounter_to_bed_gtex in escape_genes.snakefile
- Use bedtools to find where on the genes the variants are
- Snakemake rule bedtools_intersect
- Remove duplicates: snakemake rule find_unique_lines_after_bedtools
- Find genes that have at least one heterozygous and expressed variant across all skewed samples
- python /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/scripts/find_expressed_genes.py produces the output file /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/expressed_genes_all_samples.txt, which lists all of the genes with at least one heterozygous and expressed variant across all samples. There are 689 genes.
- For each skewed individual in GTEx, compute the mean allele balance for each gene. For the placenta and decidua samples, I have already done this step here: /scratch/tphung3/Placenta_XCI/gene_level/wes_genotyping/asereadcounter_geneinfo/chrX
- Use the Python script compute_allele_balance_per_gene.py
- See snakemake rule compute_allele_balance_per_gene_gtex
- Run python make_allele_count_per_gene.py
- Add genes to config: python add_genes_to_config.py
- See snakemake rule plot_per_gene_allele_balance_compare_gtex_placenta_decidua
- Categorize genes into inactivated, escape, or variable escape for GTEx, placenta, decidua females, and decidua males.
- Use the Python script categorize_genes.py
- This script categorizes the genes, removes NA values, and sorts them for plotting heatmaps
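The per-gene summary and the three-way categorization can be sketched as below. The actual rule used by categorize_genes.py is not documented here, so the cut-offs are placeholders chosen only for illustration: a gene is called inactivated when most samples show a mean allele balance above 0.8, escape when few do, and variable escape otherwise. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def mean_allele_balance_per_gene(variants):
    """variants: iterable of (gene, ref_count, alt_count) for one sample."""
    per_gene = defaultdict(list)
    for gene, ref, alt in variants:
        per_gene[gene].append(max(ref, alt) / (ref + alt))
    return {gene: sum(values) / len(values) for gene, values in per_gene.items()}


def categorize_gene(balances, skew_threshold=0.8, majority=0.7):
    """Categorize one gene from its per-sample mean allele balances.

    skew_threshold and majority are placeholder cut-offs, not the
    values used in the actual analysis. None entries are treated as NA.
    """
    values = [b for b in balances if b is not None]
    if not values:
        return "NA"
    frac_skewed = sum(b > skew_threshold for b in values) / len(values)
    if frac_skewed >= majority:
        return "inactivated"
    if frac_skewed <= 1 - majority:
        return "escape"
    return "variable escape"


# Hypothetical variants for one sample, with placeholder gene names.
per_gene = mean_allele_balance_per_gene(
    [("geneA", 9, 1), ("geneA", 8, 2), ("geneB", 5, 5)]
)
```

Filtering the NA entries before categorizing mirrors the NA removal mentioned above, so that genes without data in a sample do not count toward either category.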
- Directory manuscript_plots
- Figure 2: scripts/figure_2.R
- Figure 2C: scripts/figure_2C.R
- Figure 3: scripts/figure_3.R
- Figure 4: scripts/figure_4.R
- Supplementary figures:
- Figure S2: scripts/figure_s2.R
- Figure S3: scripts/figure_s3_pca.R
- Figure S5: ./scripts/run_figure_s5_determine_threshold.sh
- Figure S6: ./scripts/figure_s6_nonPAR_males.R
- Figure S7: ./scripts/figure_s7_xci_entire_X.R
- Figure S8: scripts/figure_s8.R
- Figure S9:
- Directory: /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/
python scripts/compute_escaping_samples_prop_per_gene.py chrX_escaping_samples_prop_per_gene.tsv
- Run R script /scratch/tphung3/Placenta_XCI/manuscript_plots/scripts/figure_s9_chrX_escaping_samples_prop_per_gene.R
- Figure S10:
- Directory: /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/
python scripts/generate_data_for_gene_heatmap.py
- Run R script /scratch/tphung3/Placenta_XCI/manuscript_plots/scripts/figure_s10.R
- Figure S11:
- Directory: /scratch/tphung3/Placenta_XCI/gene_level/female_male_log2ratio/
python find_log2ratio_genes.py
- Run R script scripts/figure_s11_log2ratio.R