Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a function for writing (filtered/modified) EIGENSTRAT data to a new location #91

Open
bodkan opened this issue May 8, 2024 · 1 comment

Comments

@bodkan
Copy link
Owner

bodkan commented May 8, 2024

Looking at the Reading/writing EIGENSTRAT data section of the reference, I see there's no write_eigenstrat() function.

I think the reason for this is that the only motivation for writing a new EIGENSRAT data set would be to save a filtered version of SNP data, such as after removing transitions, etc. However, handling filtered SNPs has, so far, been handled by assigning a modifier component to an EIGENSTRAT R object... with all analyses downstream from the filtering using the standard ADMIXTOOLS mechanism for excluding filtered SNPs via a "badsnpname" parfile parameter.

I.e., to use an example from the main vignette:

library(admixr)

# download small testing data
prefix <- download_data(dirname = tempdir())
snps <- eigenstrat(prefix)

# get a path to an example filter BED file
bed <- file.path(dirname(prefix), "regions.bed")

# the BED file contains regions to keep in an analysis -- remove SNPs which fall outside
new_snps <- filter_bed(snps, bed)

# BED file contains regions to remove from an analysis
new_snps <- filter_bed(snps, bed, remove = TRUE)

new_snps
#> EIGENSTRAT object
#> =================
#> components:
#>   ind file: /var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpzHxCwT/snps/snps.ind
#>   snp file: /var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpzHxCwT/snps/snps.snp
#>   geno file: /var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpzHxCwT/snps/snps.geno
#> 
#> modifiers:
#>   excluded sites:  /var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpzHxCwT/file5cb6227326a3.snp 
#>       (SNPs excluded: 100000, SNPs remaining: 400000)

(note the "modifier" item in the EIGENSTRAT object summary above).

Recently it has been pointed out to me that it would be useful to have a function which can save filtered EIGENSTRAT data (such as new_snps) above to a completely new location. Effectively, this write_eigenstrat() function would take the contents of the snp file, remove positions present in the excluded sites, and write out snp and geno files which are filtered down appropriately.

I suppose this would be useful in situations where the original EIGENSTRAT data set is quite huge, and loading it and filtering it before every analysis (as opposed to loading a smaller, filtered down version) would be wasteful. But having this function seems like a good idea in principle.

Is there something I am missing? Ah, right. There's also a relabel() function which can group individuals into renamed groups:

modif_snps <- relabel(
  snps,
  European = c("French", "Sardinian"),
  African = c("Dinka", "Yoruba", "Mbuti", "Khomani_San"),
  Archaic = c("Vindija", "Altai", "Denisova")
)

modif_snps
#> =================
#> components:
#>   ind file: /var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpCjcci5/snps/snps.ind
#>   snp file: /var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpCjcci5/snps/snps.snp
#>   geno file: /var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpCjcci5/snps/snps.geno
#> 
#> modifiers:
#>   groups:  /var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpCjcci5/file6b8b2c155cd1.ind 

relabel() creates a new ind file that we swap in for the original ind file in a parfile generated by ever every D/f3/f4/etc. wrapper. The putative write_eigenstrat() should -- if a group modifier is present -- write this modified ind file as well.

@bodkan
Copy link
Owner Author

bodkan commented May 8, 2024

Additionally, while I'm at it, we have filter_bed(), same for transversions-only filtering, so it also makes sense to implement filtering of individuals.

Maybe the putative filter_individuals()could add another "modifier" item in the EIGENSTRAT R object? Although filtering individuals automatically implies filtering columns of a geno file... so perhaps this will need something bit more complex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant