Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
drarwood authored Oct 4, 2018
1 parent d578f27 commit 477b0fa
Showing 1 changed file with 93 additions and 2 deletions.
95 changes: 93 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,93 @@
# gags
Genetic Algorithm for Group Selection
# GAGS
## Genetic Algorithm for Group Selection
Written by Andrew Wood - [email protected] - Genetics of Complex Traits - University of Exeter

### Purpose
GAGS is a C++ program created to sample data with specific means and standard deviations from large datasets. The original purpose was to select phenotypic distributions from large scale studies, such as the UK Biobank. The current version is able to sample up to two non-overlapping groups simultaneously with specific means and SDs in an efficient manner. However, creating additional groups simultaneously using the current version is not advised but updates will be made to cater for this.

### Compiling
GAGS can be compiled using the 'make' file provided
```
make
make install
```

#### Input Files

- Phenotype File (required). A tab delimited file containing ordered columns for ID and phenotypic values
```
ID PHENO
A 178
B 170
C 167
D 169
...
```
- Exclusion File (optional): A list of IDs (without header) for which you do not want the associated phenotypic values to be used by the algorithm and included in the output.

### Runnings GAGS
#### Options
A full list of options can be found by simply typing 'gags' at the command line:

```
gags --pheno -p [phenotype file]
--ns -n [list of group sizes]
--means -m [list of respective means]
--sds -s [list of respective SDs]
--seed -r [seed - default 10000]
--mrate -y [max mutation rate - default 0.01]
--popsize -z [initial solutions - default 10]
--exclusions -e [optional: file of IDs to exclude]
--iterations -i [max iterations - default 50000]
--out -o [output file prefix]
```

#### Running GAGS
To extract a subset of 1000 people from a phenotype file with mean = 169 and SD = 5:
```
gags --pheno MyPhenoFile.txt --ns 1000 --means 169 --sds 5 --out MyPhenoSubset.txt
```

To extract a subset of 1000 people with mean = 169 and SD = 5 AND a subset of 500 people with mean = 160 and SD = 4:
```
gags --pheno MyPhenoFile.txt --ns 1000,500 --means 169,160 --sds 5,4 --out MyPhenoSubset.txt
```
Note that if trying to create more than one group then the order of N, mean and SD values need to be respective of each other on the command line.

To create a series of non-identical solutions ensure to inclue the --seed flag:
```
gags --pheno MyPhenoFile.txt --ns 1000,500 --means 169,160 --sds 5,4 --out MyPhenoSubset1.txt --seed 10
gags --pheno MyPhenoFile.txt --ns 1000,500 --means 169,160 --sds 5,4 --out MyPhenoSubset2.txt --seed 20
gags --pheno MyPhenoFile.txt --ns 1000,500 --means 169,160 --sds 5,4 --out MyPhenoSubset3.txt --seed 30
...
```

Note that the algorithm will aim to match to the same precision as provided on the command line.

#### Output File
The output file will contain the group ID individual ID and phenotypic value
```
Group ID Phenotype
1 A 178
1 C 167
...
2 B 170
2 D 169
...
```
WARNING: if a solution is not found then the best solution is output once the maximum number of iterations has occured. You should check the program output as this will be flagged as a warning.

#### Fine-tuning the algorithm
Depending on the mean and SD required, it may be necessary to adjust the initial chromosome population size, maximum mutation rate and the maximum number of iterations if it becomes difficult to find a solution. The error for the best solution is output to the console (or log files for HPC). Therefore, you should run GAGS with several tunings of the parameters and compare the error to determine which seems to be the most effective in getting closer to the solution you want. Additional releases of this software will automate this.

#### Asking for the impossible
The ability to find a desired set that meets the distribution requirements is dependent on how far away the mean and SD are from the full dataset. Be aware that the further away, the smaller the N as a proportion needs to be in order to maximise the chances of obtaining a solution. For example, if the phenotype mean = 200 and SD = 5 (respectively) in a full dataset of 500,000, then it will be much harder to find a solution for a subset of 499,999 with mean=150 and SD=4. More guidance on this will be available here shortly.

### Contact
Please email me any questions or comments to [email protected]

### Reference
If you use GAGS, please cite:

*Tyrrell et al. Gene-obesogenic environment interactions in the UK Biobank study, Int J Epidemiol. 2017 Apr 1;46(2):559-575*

0 comments on commit 477b0fa

Please sign in to comment.