Thierry Schuepbach · 04ddba9e
--- a/documentation.md
+++ b/documentation.md
+# Documentation
+Although it relies on the Plink binary data format, ***FastEpistasis*** has been designed to match Plink's command-line options for epistasis as closely as possible. There are, of course, slight differences that may trip you up. The purpose here is to gather some common problems and to walk through a full procedure based on the data used for benchmarking. In any case, users should run the executables with the --help option to get an overview of the available options together with a brief description, especially as the following does NOT cover every option but focuses on the basics and the mandatory elements.
+
+For this tutorial we will need the [HapMap3](ftp://ftp.ncbi.nlm.nih.gov/hapmap/phase_3/hapmap3_r1_b36_fwd.qc.poly.tar.bz2) (~750 MB) data file, more precisely the files hapmap3_r1_b36_fwd.MKK.qc.poly.recode.map and hapmap3_r1_b36_fwd.MKK.qc.poly.recode.ped it contains. A working Plink executable is also required.
+
+A run of FastEpistasis consists of several stages, each with its own executable (or several in the compute stage, to distinguish MPI from SMP computer architectures).
+
+1. The first stage gathers the data into a single binary file that holds all the information required later on in a compact format. Most issues arise here, as the user may run into an unrecognized format. Similar to Plink, PreFastEpistasis has several mandatory arguments that point to the different files holding the data:
+   * the individual relations: Plink .fam file
+   * the overall genotype data: Plink .ped or .bed file
+   * the SNP information: Plink .map or .bim file
+   * the phenotype(s) file
+   * the SNP set file
+
+It is nevertheless worth mentioning that ***PreFastEpistasis*** imports Plink data in Plink's binary format, that is, .bed and .bim files rather than .ped and .map files. Furthermore, the family file is not always present but may be generated by Plink.
+Hence one has to run Plink in order to generate the .fam file and the binary versions of the .map and .ped files. In our example, this is done with the following commands:
+
+```bash
+tar xvfj hapmap3_r1_b36_fwd.qc.poly.tar.bz2 hapmap3_r1_b36_fwd.MKK.qc.poly.recode.ped hapmap3_r1_b36_fwd.MKK.qc.poly.recode.map
+plink --file hapmap3_r1_b36_fwd.MKK.qc.poly.recode --make-bed --out FastEpistasis.data.MKK
+```
+You should now have generated the binary files FastEpistasis.data.MKK.bed and FastEpistasis.data.MKK.bim as well as the family file FastEpistasis.data.MKK.fam.
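A quick sanity check at this point can be sketched as follows. The helper name `check_plink_fileset` is hypothetical and belongs neither to FastEpistasis nor to Plink; it only verifies that the three files exist and are non-empty, then counts the lines of the two text files (one line per individual in the .fam file, one line per SNP in the .bim file).

```bash
# Hypothetical helper, not shipped with FastEpistasis or Plink.
# Usage: check_plink_fileset <prefix>
check_plink_fileset() {
    prefix=$1
    for ext in bed bim fam; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$prefix.$ext" ]; then
            echo "missing or empty: $prefix.$ext" >&2
            return 1
        fi
    done
    # .fam holds one line per individual, .bim one line per SNP
    echo "individuals: $(wc -l < "$prefix.fam" | tr -d ' ')"
    echo "SNPs: $(wc -l < "$prefix.bim" | tr -d ' ')"
}
```

For the example above it would be called as `check_plink_fileset FastEpistasis.data.MKK`. The number of individuals it reports is the value M used in the description of the phenotype file.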
+Before one can run PreFastEpistasis, a phenotype file and a set file must be created.
+The phenotype file is space-separated text with 2+N columns and M+1 rows, where N is the number of phenotypes and M the number of individuals. As of version 2.0, there may not be more than 64 phenotypes because of the algorithm used to detect missing values. The header row should contain the column titles, that is, "fid" and "iid" for columns 1 and 2, followed by the desired phenotype names (names longer than 16 characters will be truncated).
+> PreFastEpistasis assumes the row order to be exactly the same as in the .ped or .fam file (one row per individual); any mismatch should trigger an error!
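The layout rules above can be checked mechanically before running PreFastEpistasis. The following is a minimal sketch, not a tool shipped with FastEpistasis: the function name `check_phenotype_file` is hypothetical, and it assumes that fields are separated by spaces or tabs (the default field splitting of awk).

```bash
# Hypothetical helper: checks header, phenotype count and column count.
# Usage: check_phenotype_file <phenotype file>
check_phenotype_file() {
    awk '
        NR == 1 {
            ncol = NF
            nphe = NF - 2          # columns after fid and iid
            if ($1 != "fid" || $2 != "iid") {
                print "header must start with: fid iid"; bad = 1; exit
            }
            if (nphe < 1 || nphe > 64) {
                print "expected 1 to 64 phenotypes, found " nphe; bad = 1; exit
            }
            for (i = 3; i <= NF; i++)
                if (length($i) > 16)
                    print "warning: " $i " is longer than 16 characters"
            next
        }
        NF != ncol {
            print "line " NR ": expected " ncol " columns, found " NF
            bad = 1; exit
        }
        END {
            if (NR == 0) { print "empty file"; bad = 1 }
            if (!bad) print "ok: " nphe " phenotype(s), " (NR - 1) " individual(s)"
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

The function returns a non-zero status on the first problem it finds, so it can be chained in a script, for instance `check_phenotype_file pheno.txt && echo "ready"` (the file name `pheno.txt` is only a placeholder).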
+
+Here is an example used in the benchmarks.
+```
+fid	iid	MyTestPhenotype
+2604	NA21443	0.105837759541085
+2572	NA21336	0.174993976989406
+NA21741	NA21741	-0.180997775252538
+2654	NA21576	0.124891672568321
+NA21740	NA21740	0.129005922520283
+2607	NA21451	-0.099288395141353
+NA21733	NA21733	0.366692612925268
+2667	NA21616	0.307536670256091
+2678	NA21650	0.0219941783329674
+NA21768	NA21768	-0.719218556314241
+2620	NA21582	-0.0321610724176173
+NA21776	NA21776	-0.0114452691614359
+2654	NA21575	0.300632967389373
+NA21825	NA21825	-0.110190915120632
+2700	NA21719	0.252044801460924
+2564	NA21306	0.0369835944263691
+2582	NA21378	-0.279937551495705
+```
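Because a row-order mismatch is reported as an error, it can be convenient to compare the phenotype file with the family file beforehand. The sketch below rests on an assumption: since each phenotype row describes one individual, it compares the first two columns (family and individual identifiers) with those of the .fam file generated by Plink. The function name `check_row_order` is hypothetical.

```bash
# Hypothetical helper: compares fid/iid of each phenotype row
# (header skipped) with the row at the same position in the .fam file.
# Usage: check_row_order <phenotype file> <.fam file>
check_row_order() {
    awk '
        BEGIN { n = 0; m = 0 }
        NR == FNR {                      # first file: phenotype file
            if (FNR > 1) id[FNR - 1] = $1 " " $2
            n = FNR - 1
            next
        }
        {                                # second file: .fam file
            m = FNR
            key = $1 " " $2
            if (id[FNR] != key) {
                print "row " FNR ": phenotype file has [" id[FNR] "], .fam has [" key "]"
                bad = 1
                exit
            }
        }
        END {
            if (!bad && n != m) {
                print "row counts differ: " n " vs " m
                bad = 1
            }
            if (!bad) print "row order matches"
            exit (bad ? 1 : 0)
        }
    ' "$1" "$2"
}
```

With the files of this tutorial the call would look like `check_row_order pheno.txt FastEpistasis.data.MKK.fam`, where `pheno.txt` is a placeholder name for the phenotype file.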
+
+The set file is also text-based and should contain a single SNP name or keyword per line. Recognized keywords are SET_A, SET_B and END. It is important to keep in mind that PreFastEpistasis searches for the best SNP pair correlations by taking each element of set A with respect to each element of set B. Consequently, one has to provide two sets even though they may be identical. There are thus three possibilities, each with its own dedicated computational phase for optimization:
+
+* A = B, yielding a so-called pure interaction run where symmetrical pairs are taken into account and computed only once. A reduction over the best match of each thread is then necessary at the end of the computation phase.
+* A and B share some SNPs, hence some interactions will be computed twice in order to avoid any post-reduction. One might suffer a great performance penalty if there are many shared SNPs.
+* A and B are disjoint: no reduction is necessary, as the best pairs may be allocated and owned by only one working thread.
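To see in advance which of the three cases a given set file falls into, the two sets can be compared with a few lines of awk. This is only an illustrative sketch: the function name `classify_sets` is hypothetical, and the parsing assumes the layout described above (a SET_A or SET_B keyword, one SNP name per line, then END).

```bash
# Hypothetical helper: reports whether sets A and B are identical,
# overlapping or disjoint.
# Usage: classify_sets <set file>
classify_sets() {
    awk '
        $1 == "SET_A" { cur = "A"; next }
        $1 == "SET_B" { cur = "B"; next }
        $1 == "END"   { cur = "";  next }
        NF == 0 || cur == "" { next }     # skip blank or stray lines
        cur == "A" { if (!($1 in a)) { a[$1] = 1; na++ } }
        cur == "B" { if (!($1 in b)) { b[$1] = 1; nb++ } }
        END {
            shared = 0
            for (s in a) if (s in b) shared++
            if (na > 0 && shared == na && shared == nb) print "identical"
            else if (shared > 0)                        print "overlapping"
            else                                        print "disjoint"
        }
    ' "$1"
}
```

An "overlapping" result is the one to watch for, since shared SNPs are where the performance penalty mentioned above comes from.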
+
+Here is an example of a set file used for the pure interaction benchmarks. 
+```
+SET_A
+rs17763185
+rs12544008
+rs16926871
+rs3864667
+rs3864668
+rs1835758
+rs4738868
+rs7001997
+rs1367975
+END
+
+SET_B
+rs17763185
+rs12544008
+rs16926871
+rs3864667
+rs3864668
+rs1835758
+rs4738868
+rs7001997
+rs1367975
+END
+```
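A set file of this shape does not have to be written by hand. As a minimal sketch, the hypothetical function `make_pure_set_file` below takes a text file with one SNP name per line and prints the same list twice, once as SET_A and once as SET_B, in the layout of the example above.

```bash
# Hypothetical helper: builds a pure interaction set file (A = B)
# from a list of SNP names, one per line.
# Usage: make_pure_set_file <SNP list> > <set file>
make_pure_set_file() {
    echo "SET_A"
    awk 'NF { print $1 }' "$1"   # skip blank lines, keep first field
    echo "END"
    echo
    echo "SET_B"
    awk 'NF { print $1 }' "$1"
    echo "END"
}
```

The SNP list itself could be taken from the second column of the .bim file, which holds the SNP identifiers, for example with `awk '{ print $2 }' FastEpistasis.data.MKK.bim | head -n 9 > snps.txt` followed by `make_pure_set_file snps.txt > sets.txt` (both output file names are placeholders).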
\ No newline at end of file