diff --git a/.github/ISSUE_TEMPLATE/task-issue-template.md b/.github/ISSUE_TEMPLATE/task-issue-template.md new file mode 100644 index 0000000..33eac1f --- /dev/null +++ b/.github/ISSUE_TEMPLATE/task-issue-template.md @@ -0,0 +1,20 @@ +--- +name: Task issue template +about: The actionable task template. +title: '' +labels: task +assignees: '' + +--- + +## Goal +What are we trying to achieve? + +## Proposed approach +(brief, not perfect) + +## Scope +What files/modules are affected? + +## Done when +Clear condition for completion diff --git a/docs/source/algorithm.rst b/docs/source/algorithm.rst new file mode 100644 index 0000000..1341c11 --- /dev/null +++ b/docs/source/algorithm.rst @@ -0,0 +1,308 @@ +================== +Algorithm Overview +================== + +Here are some notes about how the peeling algorithm works. +For further details read: + +.. [1] Whalen, A, Ros-Freixedes, R, Wilson, DL, Gorjanc, G, Hickey, JM. (2018). *Hybrid peeling for fast and accurate calling, phasing, and imputation with sequence data of any coverage in pedigrees*. Genetics Selection Evolution; doi: https://doi.org/10.1186/s12711-018-0438-2 + +Peeling basics +============== + +The idea behind multi-locus iterative peeling is +to estimate an individual's genotype based on information from their own SNP gentoype data, +the genotypes of their parents, and the genotypes of their offspring. +This can be challenging because we have to combine these different sources of information and +there can be feedback loops that need to be managed when doing this. +For example, if a child's genotypes were imputed based on the genotypes of their parents, +we would not want to use those genotypes directly to re-impute the parent. +Instead we would want to use only the "new" information from the genotypes of the child +(and potentially their offspring). + +To keep these sources of information separate, +we explicitly calculate and store three different values. +These are, + +- anterior: genotype probabilities for the individual based on their parent's genotypes, +- penetrance: genotype probabilities for an individual based on their own genotypes, and +- posterior: genotype probabilities for an individual based on the offspring's genotypes. + For the penetrance term you will often need the mate's genotypes as well. + +For each of these values we store genotype probabilities over the four, phased genotype states, +that is, ``aa``, ``aA``, ``Aa``, and ``AA``. +In all cases the first allele is the allele inherited from the father, +and the second is the allele that was inherited from the mother. + +One of the key things about inheritance and associated probability calculations +is that individual's inherit their chromosomes from their parents in large blocks. +If we knew which blocks an individual inherited, +we could use that information to help us determine what alleles the individual carried, +and also to determine which alleles their parents carried. +For each individual, we estimate these values using a segregation probabilities. + +We code segregation probabilities in two forms, + +- as the joint probability for both the paternal and maternal segregation + (given by four values, ordered ``pp``, ``pm``, ``mp``, and ``mm``, where the ``p`` denotes the paternal allele, + and the ``m`` denotes the maternal allele, while position is as before - the first allele inherited from father, + meaning that ``pp`` denotes that individual inherited paternal allele from the father and mother), +- or as the probability for either the father or the mother + (a pair of single values giving the probability that the individual inherits the maternal haplotype, + that is, a seg=1 means that the individual inherits the maternal haplotype for that parent). + +In a lot of places, we will rely heavily on calculating the "inheritance" or "transmission" probabilities. +These give the probability that a child inherits an allele based on the genotypes of their parents. +For some cases this is simple, for example, if the father is ``AA``, and the mother is ``aa`` +then their offspring will be ``Aa`` since we know that the father will transmit a copy of ``A`` and +the mother will transmit a copy of ``a`` so the resulting offspring will be ``a``. + +Other cases may be more challenging and involve multiple possible outcomes +with associated probabilities. +If the father is heterozygous, ``aA``, and the segregation is unknown, +then there will be +a 50% probability that the offspring will inherit an ``a`` and +a 50% probability they inherit a ``A``. +If we take such a father and the mother is ``AA`` +then the genotype probabilities of the offspring are: + +:: + + p(aa): 0.0 + p(aA): 0.5 + p(Aa): 0.0 + p(AA): 0.5 + +However we may know which haplotype the individual inherits. +If the segregation of the individual is ``mm`` +(it inherited the maternal allele from both parents) +then the resulting genotype probabilities of an offspring +from an ``aA`` father and an ``AA`` mother are: + +:: + + p(aa): 0.0 + p(aA): 0.0 + p(Aa): 0.0 + p(AA): 1.0 + +Segregation probabilities are really helpful for determining which alleles an individual inherits +from their parents and are used for both the peel down (anterior) and peel up (posterior) steps. +The following sections outline how the penetrance, anterior, posterior, and segregation terms are calculated. + +Penetrance +---------- + +The penetrance term gives the probability of each of the four phased genotypes +based on the direct information we have about an individual's genotype. +There are two ways to calculate this term: + +1. Via a helper function ``tinyhouse.ProbMath.getGenotypeProbabilities_ind``, which is what ``AlphaPeel``, ``AlphaFamImpute``, ``AlphaAssign`` use, and takes into account sequence data. + +2. Or via a somewhat hacky one line command, ``ind.peeling_view.setValueFromGenotypes(ind.peeling_view.penetrance)``. which is what currently is used. + +The basic idea is that we want to take a genotype and turn it into genotype probabilities. + +- If the individual is genotyped as a ``0``, that is ``aa``, we want to set ``p(aa)=1.0``. +- If the individual is genotyped ``1``, that is ``aA`` or ``Aa``, we want to set ``p(aA) = p(Aa) = 0.5``. +- If the individual is genotyped as a ``2``, that is ``AA``, we want to set ``p(AA) = 1.0``. + +The difference between the heterozygous and homozygous states is that +with the heterozygous state there are two possible allele combinations that produce the same genotype. +In all of the cases we also want to add a small amount of noise to the estimate to allow for genotyping errors. +This error is often called as "penetrance" error. + +Anterior +-------- + +The anterior term is calculated differently for the founders and non-founders. + +The anterior term for founders in the population (individuals without any parents) +are set using the population minor allele frequency. +To set the term, we use: + +:: + + pedigree.setMaf() + founder_anterior = ProbMath.getGenotypesFromMaf(pedigree.maf) + founder_anterior = founder_anterior*(1-0.1) + 0.1/4 + for ind in pedigree: + if ind.isFounder(): + ind.peeling_view.setAnterior(founder_anterior.copy()) + +For non-founders the anterior value is based on the genotypes of their parents. +The probability that the inherit an ``A`` allele from their father if their segregation is unknown is: + +:: + + P(a|father) = 0.5 p(aA) + 0.5 p(Aa) + 1 p(AA) + +where ``p(aA)`` relates to the genotype probabilities of their father. +If the segregation value is known, this is instead: + +:: + + P(a|father) = seg p(aA) + (1-seg) p(Aa) + 1 p(AA) + +We can generate genotype probabilities for the four phased genotype +as the product of the allele probabilities for each parent: + +:: + + p(aA) = p(a|father) p(A|mother) + +Posterior +--------- + +The posterior term is a bit trickier than the anterior term. +The idea is that for each full-sub family (that is, each father-mother mate pair) +we will use their children to estimate their genotype probabilities. +To do this, we construct the join probabilities of their parent's genotypes, +a 4x4 matrix of values. +We then calculate the posterior estimate for a single parent by +marginalizing the joint genotype probabilities by the genotypes of the other parents. + +The joint genotypes are estimated by + +.. math:: + p(g_{father}, g_{mother}|children) = \prod(p(g_{father}, g_{mother}|child)). + +Simulation results using AlphaPeel have suggested that accuracy may be increased +by using the called genotype probabilities. +Because of this we call the individual's genotypes, haplotypes, and segregation values. +This has the added benefit of allowing us to use a look-up table to +produce the joint genotype probabilities at a given locus. + +We then marginalize the genotypes for each parent by, + +.. math:: + p(g_{father}|children) = \sum(p(g_{father}, g_{mother}|children)p(g_{mother})). + +We assume that the posterior values for each family are independent. +This lets them calculate them separately for each family group, and +then sum them together to produce the final called genotype. +Because some individuals have a large number of offspring, +we calculate the joint genotypes on a log scale and then convert back in the marginalization step. + +Segregation +----------- + +We estimate the segregation values in a two step process. In the first step we create "point estimates" for the segregation values. In the second step we smooth the point estimates. + +1. In step 1 we look to see if the child's haplotype for a single parent matches one of the parent's haplotypes, but not the other. + + - For example, if the parent is ``aA``, and the child is ``aa`` we will set the segregation value of ``pp`` and ``pm`` to ``1-e`` since that is consistent with the child inheriting the grand paternal allele (first allele) from their sire. + + - We also can consider the case where the child is unphased and heterozygote. In this case we see if a particular combination of parental haplotypes will produce a heterozygous offspring. + + - We used called genotypes in this step because an individual (and their parent's) genotypes are not statistically independent from each other at each loci. Using genotype probabilities (particularly for the parents) can produce erroneous results. + +2. For the second step we use a standard forward-backward algorithm, lifted almost directly from AlphaPeel. The transmission rate determines how much uncertainty when moving from one loci to the next. + +Specific code comments +====================== + +General math +------------ + +**Exponential + normalization:** +When working with the posterior values it is important to handle overflow/underflow. +We do this by treating most of the posterior values on a log scale. +To convert back from a log scale to a normal scale requires an exponential operation. +Most of the cases where we do this, we also need to normalize the resulting values so that they sum to one. +In order to prevent issues with underflows, +we first calculate the maximum value in the slice of the array that we are dealing with. +We then subtract the maximum value and take the exponential. +This means that the greatest value in the array will be set to :math:`exp(0)=1`. +Very small values may be set to zero, but this just indicates that they have vanishingly small probability. + +Parallel +-------- + +In order to parallelize the code, we take this approach. +The overall idea is to find blocks of individuals +who can be updated at the same time (in parallel) and perform those updates. +In the context of peeling, we can break up individuals into discrete generations and +perform updates on all of the individuals in the same generation at the same time. +We use a ``ThreadPoolExecutor`` to perform tasks in parallel, and +use ``numba``'s ``nogil`` flag to get around python's global interpreter lock. + +Because of the overhead in crating threads, and calling ``numba`` functions, +we split out the tasks in groups of full-sub families, +which will be updated at the same time. +Because a given parent can be a parent of multiple families +(but an offspring can only be an offspring of one family), +we set the genotypes for the parents separately in a non-parallel mode. +A complete breakdown of the main steps (and parallelization) is: + +- **father and mother genotypes** (no parallel): + We set the genotypes of the fathers and mothers in a non-parallel mode. + Individual father and mothers may be used in multiple families within the same generation, + and so setting them within a family block may lead to conflicts. +- **Child genotypes** (parallel): + We set the genotypes of the children in parallel. + This is safe because children are only members of a single full sib family + (the family which contains their father and mother). + We also collapse the posterior terms for individuals with offspring in this step + when estimating the posterior term for their parents. + This is done because all of the posterior terms for the individual + should have been calculated by this point + (the offspring's generation is always greater than their parent's). +- **Child segregation** (parallel): + Individual child segregation values are re-estimated in parallel. + These values only depend on the genotypes of the child and their parents. + Their parents genotypes are fixed for the parallel blocks, and + their genotypes are set on a family-by-family basis. +- **Child anterior term** (parallel): + The anterior terms are estimated in parallel. + These depend only on the parent genotype probabilities which fixed for a given generation. +- **Parent posterior** (parallel, but outside of ``numba``): + We add the family posterior elements to each of the parents in parallel but outside of ``numba``. + I think this should be safe because the code is run in multi-threaded mode, + and so the GIL should make the append operations threadsafe. + +Speed +----- + +- Time seems to be split pretty evenly between the anterior and posterior computation terms. +- On the whole, the posterior term seems to dominate compared to the anterior term (usually by a factor of ~2). +- Estimating the segregation appears to be fairly low cost. +- There do not seem to be any obvious speed bottlenecks. + +Memory +------ + +The storage of values takes place inside ``jit_Peeling_Individual`` in ``ImputationIndividual``. The main costs are: + +- three ``4xnLoci float32`` s: + - anterior + - penetrance + - genotypeProbabilities +- For individuals with offspring there are an additional two ``4xnloci float32`` s: + - posterior + - newPosterior +- one ``2xnLoci float32``: + - segregation + +There are a lot of possible places to obtain substantial memory savings. + +- For all individuals + - Because the anterior terms segregate independently, this could be reduced down to ``2xnLoci float32``, + and re-calculated on the fly. + - All of the ``4xnLoci float32`` s could potentially be stored as ``int8`` s with some conversion from ``int8->float32``. + We don't actually need these terms to be very accurate (as long as we can still accurately call values). +- Individuals without offspring + - For individuals without offspring we only ever use their called genotypes + (with just the penetrance term) and + segregation estimates in the peeling. + - These terms come to play in the context of the posterior term for their parents. + - This means we could potentially just save their genotypes, haplotypes, and segregation estimate. + - If we need to call the individual in the future, we can re-calculate their anterior term on the fly, and + recombine with the individual's genotype. +- For individuals with offspring: + - We only ever use the penetrance+posterior or penetrance+anterior+posterior. + We could calculate and store these values explicitly instead of storing them independently and + re-calculating the genotype probabilities. + - We currently store the posterior estimates as a list and re-add. + We could instead store the values as a single matrix and just add each time. + We need to be careful with the parallel updates on this term though. diff --git a/docs/source/changelog.rst b/docs/source/changelog.rst new file mode 100644 index 0000000..c51944a --- /dev/null +++ b/docs/source/changelog.rst @@ -0,0 +1,22 @@ +========= +Changelog +========= + +[?.?.?] - 2026-03-?? +==================== + +Maintenance +----------- + +* Add and improve sections of documentation. + * Introduction + * Getting Started + * Usage + * Algorithm + * Changelog + + +[0.0.3] - 2025-12-04 +==================== + +First release on `PyPI `_. Have all core functionalities for pedigree-based method as well as population-based method. \ No newline at end of file diff --git a/docs/source/conf.py b/docs/source/conf.py index fa7cf0b..f8e6d84 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -69,6 +69,11 @@ # This pattern also affects html_static_path and html_extra_path. exclude_patterns = [] +# Global substitutions available to all .rst files +rst_epilog = """ +.. |Software| replace:: ``AlphaImpute2`` +""" + # The name of the Pygments (syntax highlighting) style to use. pygments_style = None diff --git a/docs/source/get_started.rst b/docs/source/get_started.rst new file mode 100644 index 0000000..42beac6 --- /dev/null +++ b/docs/source/get_started.rst @@ -0,0 +1,248 @@ +=============== +Getting Started +=============== + +Install |Software| +------------------ + +Here is a sample guide for installing |Software| as a Python package. + +For more information on installing Python packages, +visit `Python Packaging User Guide `_. + +Install via pip +=============== + +|Software| is available on `PyPI `_ +hence you can install it simply by running the following command in a terminal: + +.. code-block:: bash + + pip install AlphaImpute2 + +Install from the git repository +=============================== + +You can also install |Software| from the code in git repository. + +Clone the repository: + +.. code-block:: bash + + git clone --recurse-submodules https://github.com/AlphaGenes/AlphaImpute2.git + +Change the working directory to the one holding the repository: + +.. code-block:: bash + + cd AlphaImpute2 + +Create a virtual environment: + +.. code-block:: bash + + python -m venv AlphaImpute2_env + +Activate the environment: + + - If you are using macOS/Linux: + + .. code-block:: bash + + source AlphaImpute2_env/bin/activate + + - If you are using Windows: + + .. code-block:: console + + AlphaImpute2-env\Scripts\activate + +Upgrade pip: + +.. code-block:: bash + + python -m pip install --upgrade pip + +Upgrade the build: + +.. code-block:: bash + + python -m pip install --upgrade build + +Build the distributions: + +.. note:: + + If you are building on a branch other than the main branch, + you may need to first check if the submodule reference is correct before the build. + +.. code-block:: bash + + python -m build + +Install the package by using the built wheel distribution: + +.. code-block:: bash + + python -m pip install dist/alphaimpute2*.whl + +Run examples +============ +.. _run-examples: + +Change the working directory to the one holding the example: + +.. code-block:: bash + + cd example + +Run the example: + +.. code-block:: bash + + bash run_examples.sh + +If you have installed R, you can check the accuracy of the example code: + +.. code-block:: bash + + Rscript check_accuracy.R + +Deactivate the environment: + +.. code-block:: bash + + deactivate + +Install on Eddie +================ + +.. note:: + + This section is only for the users from the University of Edinburgh. + +To use |Software| on the Eddie, the University of Edinburgh's Research Compute Cluster, +you can find information to create an environment without causing the home directory to go over quota at +`Eddie wiki page `_. + +If you encountered the issue that the files show as modified directly after a git clone, +you may can try: + +.. code-block:: bash + + git config core.fileMode false + +An example +---------- + +The following is a very simple example to demonstrate +the principle of estimating unknown haplotypes and genotypes. +The example is deliberately simplistic for the demonstration. +Note that |Software| can handle much more complex examples. +The example contains two parents with known genotypes and +one progeny with unknown genotypes. + +.. image:: static/example_pedigree_with_missing.png + :width: 20em + +Our task is to estimate the haplotypes of all the individuals (we name this task "phasing") and +genotypes of the progeny (we name this task "imputation"). +Since the two parents are fully homozygous and the progeny has no information, +you can easily solve this problem yourself and check your solution with |Software|. + +We first prepare a genotype input file: + +.. code-block:: + + A 2 2 0 0 0 + B 0 0 2 2 2 + C 9 9 9 9 9 + +And a pedigree file: + +.. code-block:: + + A 0 0 + B 0 0 + C A B + +We named the two file as ``simple_genotype.txt`` and ``simple_pedigree.txt``. +You can create the files yourself. +They are also available in the ``example/simple_example/`` directory of the repository. +To use these files you have to change the working directory to the one holding the simple example: + +.. code-block:: bash + + cd simple_example + +We perform the phasing and imputation with the following command: + +.. code-block:: bash + + AlphaImpute2 \ + -genotypes simple_genotype.txt \ + -pedigree simple_pedigree.txt \ + -ped_only \ + -out simple_output + +Above we provided the two input files, asked for the pedigree-only method, and +requested the output filenames starting with ``simple_output``. + +If the run was successful, +you should see the following output: + +.. code-block:: bash + + ------------------------------------------ + AlphaImpute2 + ------------------------------------------ + Version: 0.0.3 + Email: alphagenes@roslin.ed.ac.uk + Website: http://alphagenes.roslin.ed.ac.uk + ------------------------------------------ + Reading in AlphaGenes format: simple_genotype.txt + Read in: 0.00 seconds + + Pedigree Imputation Only + ------------------------------------------ + Number of peeling cycles: 4 + Final cutoff: 0.1 + + Imputation cycle 1 + Peel down: 3.57 seconds + Peel up: 2.40 seconds + + Imputation cycle 2 + Peel down: 0.00 seconds + Peel up: 0.00 seconds + + Imputation cycle 3 + Peel down: 0.00 seconds + Peel up: 0.00 seconds + + Imputation cycle 4 + Peel down: 0.00 seconds + Peel up: 0.00 seconds + Core peeling cycles: 5.97 seconds + Heuristic Peeling: 7.03 seconds + + Writing Out Results + ------------------------------------------ + Write out: 0.00 seconds + Full Program Run: 7.18 seconds + +And the following output file: + +.. code-block:: bash + + simple_output.genotypes + +The ``simple_output.genotypes`` provides the called genotypes: + +.. code-block:: bash + + A 2 2 0 0 0 + B 0 0 2 2 2 + C 1 1 1 1 1 + +For more information about how to use |Software|, please see :ref:`usage`. \ No newline at end of file diff --git a/docs/source/index.rst b/docs/source/index.rst index 2cd96c6..b89ae1c 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -10,232 +10,12 @@ AlphaImpute2 .. toctree:: :maxdepth: 2 - :caption: Contents: -.. highlight:: none - -Introduction -~~~~~~~~~~~~ - -AlphaImpute2 is program to perform imputation in a range of animal and plant species. - -Availability ------------- - -AlphaImpute2 is available as open-source from `Github `_ and can be easily installed using ``pip install AlphaImpute2`` (the latest release). - -Please report any issues at ``_. - -Conditions of use ------------------ - -AlphaImpute2 is part of a suite of software that our group has developed. It is fully and freely available for all use under the MIT License. - -Suggested Citation: - -Whalen, A, Hickey, JM. (2020). *AlphaImpute2: Fast and accurate pedigree and population based imputation for hundreds of thousands of individuals in livestock populations*. BioRxiv https://doi.org/10.1101/2020.09.16.299677 - -Disclaimer ----------- - -While every effort has been made to ensure that AlphaImpute2 does what it claims to do, there is absolutely no guarantee that the results provided are correct. Use of AlphaImpute2 is entirely at your own risk. - -Program Options -~~~~~~~~~~~~~~~ - -AlphaImpute2 takes in a number of command line arguments to control the program's behavior. To view a list of arguments, run AlphaImpute2 without any command line arguments, i.e. ``AlphaImpute2`` or ``AlphaImpute2 -h``. - -There are four primary ways to run AlphaImpute2 which differ on whether population or pedigree imputation should be run. The default option is to run both population and pedigree imputation in an integrated algorithm. This will be the option most users will want if they have access to pedigree data on a majority of individuals. The second option is to run population imputation only with the ``-pop_only`` flag. This option should be used if no pedigree data is availible. The third option is to run only pedigree based imputation using the ``-ped_only`` flag. This option is not recommended for general use cases, but may be applicable if (1) there are more than five generations of pedigree data, (2) imputation is done only on the most recent generations, (3) speed is a priority. - -The fourth option is to run AlphaImpute2 with the ``-cluster_only``. This option performs AlphaImpute2's array clustering algorithm and outputs the results of the clustering. This option may be useful for debugging how individuals are clustered. - -Core Arguments --------------- - -:: - - Core arguments - -out prefix The output file prefix. - -The ``-out`` argument gives the output file prefix for where the outputs of AlphaImpute2 should be stored. By default, AlphaImpute2 outputs a file with imputed genotypes, ``prefix.genotypes`` and phased haplotypes ``prefix.phase``. For more information on which files are created, see "Output Arguments", below. - -Input Arguments ----------------- - -:: - - Input arguments: - -genotypes [GENOTYPES ...] - A file in AlphaGenes format. - -pedigree [PEDIGREE ...] - A pedigree file in AlphaGenes format. - -startsnp STARTSNP The first marker to consider. The first marker in the file is marker '1'. Default: 1. - -stopsnp STOPSNP The last marker to consider. Default: all markers considered. - -seed SEED A random seed to use for debugging. - -AlphaImpute2 requires a genotype file and an optional pedigree file to run the analysis. - -A pedigree file may be supplied using the ``-pedigree`` option. - -Use the ``-startsnp`` and ``-stopsnp`` comands to run the analysis only on a subset of markers. - -Imputation arguments: ---------------------- - -:: - - Impute options: - -maxthreads MAXTHREADS - Number of threads to use. Default: 1. - -binaryoutput Flag to write out the genotypes as a binary plink - output. - -phase_output Flag to write out the phase information. - -seg_output Flag to write out the segmentation information. - -pop_only Flag to run the population based imputation algorithm - only. - -ped_only Flag to run the pedigree based imputation algorithm - only. - -cluster_only Flag to just cluster individuals into marker arrays - and write out results. - -length LENGTH Estimated map length for pedigree and population - imputation in Morgans. Default: 1 (100cM). - -These options control how imputation is run. The ``-maxthreads`` argument can be used to allow multiple threads to be used for imputation. This argument can be set seperately from the ``-iothreads`` argument (above). The speed gains of using multiple threads is close to linear for population imputation, but is more limited for pedigree based imputation. - -The ``-length`` argument controls the assumed length of the chromosome (in Morgans). We have found that imputation is largely insensitive to this value so keeping this value at its default of 1, should work in many cases. There are additional options to control the assumed recombination used for population based imputation (below). - -The binary output option flags the program to write out files in plink binary format. Binary plink files require the package ``alphaplinkpython``. This can be installed via ``pip`` but is only stable for Linux. A fake map file is generated. - -The remaining options control how AlphaImpute2 is run. - -Pedigree imputation options ---------------------------- -:: + introduction + get_started + usage + algorithm + changelog - Pedigree imputation options: - -cycles CYCLES Number of peeling cycles. Default: 4 - -final_peeling_threshold FINAL_PEELING_THRESHOLD - Genotype calling threshold for final round of peeling. - Default: 0.1 (best guess genotypes). - -These options control how pedigree imputation is run for either the pedigree only algorithm, or the combined algorithm. ``-cycles`` controls the number of cycles of peeling that are perfromed. An additional very-high-confidence cycle is always performed in addition to the cycles specific here. We recommend using the default value of 4 cycles. Additional cycles seem to provide limited benifit in most pedigrees. - -The ``-final_peeling_threshold`` argument gives the genotype calling threshold for the final round of peeling. This applies to both the pedigree only or the combined algorithm. We recommend either using best guess genotypes (default with a cutoff of 0.1) or high confidence genotypes (with a cutoff of 0.95). Values that cannot be imputed with high enough confidence will be coded as missing. - -Population imputation options ------------------------------ - -:: - - Population imputation options: - -n_phasing_cycles N_PHASING_CYCLES - Number of phasing cycles. Default: 5 - -n_phasing_particles N_PHASING_PARTICLES - Number of phasing particles. Defualt: 40. - -n_imputation_particles N_IMPUTATION_PARTICLES - Number of imputation particles. Defualt: 100. - -hd_threshold HD_THRESHOLD - Percentage of non-missing markers for an individual be - classified as high-density when building the haplotype - library. Default: 0.95. - -min_chip MIN_CHIP Minimum number of individuals on an inferred low- - density chip for it to be considered a low-density - chip. Default: 0.05 - -phasing_loci_inclusion_threshold PHASING_LOCI_INCLUSION_THRESHOLD - Percentage of non-missing markers per loci for it to - be included on a chip for imputation. Default: 0.9. - -imputation_length_modifier IMPUTATION_LENGTH_MODIFIER - Increases the effective map length of the chip for - population imputation by this amount. Default: 1. - -phasing_length_modifier PHASING_LENGTH_MODIFIER - Increases the effective map length of the chip for - Phasing imputation by this amount. Default: 5. - -phasing_consensus_window_size PHASING_CONSENSUS_WINDOW_SIZE - Number of markers used to evaluate haplotypes when - creating a consensus haplotype. Default: 50. - -These options control how population imputation is run. This algorithm uses a particle-based imputation approach where a number of particles are used to explore genotype combinations with high posterior probability. Increasing the number of particles can increase accuracy. USe the options, ``-n_phasing_particles`` and ``n_imputation_particles`` to change the number of particles run for phasing and imputation. - -AlphaImpute2 uses a number of rounds of phasing in order to iteratively build a haplotype reference panel from the observed data. The argument ``-n_phasing_cycles`` controls the number of rounds that are used for phasing. In pilot testing we have found that the default value of 5 cycles tends to give good accuracy. Additional accuracy may be possible by slightly increasing this value. - -To perfrom phasing and imputation, AlphaImpute2 selects high-density individuals to form the haplotype reference panel. ``-hd_threshold`` gives the percentage of non-missing markers the individual needs to carry to be included in the high-density reference panel. - -Similar to ``-length`` the ``-imputation_length_modifier`` and ``-phasing_length_modifier`` control the assumed chromosome length for phasing and imputation. These values are applied multiplicatively to the ``-length`` option. We have found that imputation accuracy is not very sensitive to these values and recommend setting them to their default value. - -When AlphaImpute2 is run, multiple particles are merged based on the particle's score in a window centered around each marker. ``phasing_consensus_window_size`` controls the size of the window. Increasing this value can increase imputation accuracy if the low-density panel is very sparse compared to the high-density panel. - -Joint imputation options ------------------------- - -:: - - Joint imputation options: - -chip_threshold CHIP_THRESHOLD - Proportion more high density markers parents need to - be used over population imputation. Default: 0.95 - -final_peeling_threshold_for_phasing FINAL_PEELING_THRESHOLD_FOR_PHASING - Genotype calling threshold for first round of peeling - before phasing. This value should be conservative.. - Default: 0.9. - -These options control how population and pedigree imputation are combined. As part of the combined algorithm, AlphaImpute2 detects a small number of "pseudo-founders" to impute using the population imputation algorithm. These "pseudo-founders" are selected by finding individuals with higher genotyping densities than their parents. AlphaImpute2 tries to be conservative in which individuals are selected as a "pseudo-founder" and the ``-chip_threshold`` parameter tells algorithm how many more non-missing markers the individuals needs compared to their parents to be considered a "pseudo-founder". - -Similar to the ``-final_peeling_threshold`` argument, the ``-final_peeling_threshold_for_phasing`` argument gives the final peeling threshold for the initial round of pedigree imputation in the combined algorithm. - -Input file formats -~~~~~~~~~~~~~~~~~~ - -Genotype file -------------- - -Genotype files contain the input genotypes for each individual. The first value in each line is the individual's id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing). The following examples gives the genotypes for four individuals genotyped on four markers each. - -Example: :: - - id1 0 2 9 0 - id2 1 1 1 1 - id3 2 0 2 0 - id4 0 2 1 0 - -Pedigree file -------------- - -Each line of a pedigree file has three values, the individual's id, their father's id, and their mother's id. "0" represents an unknown id. - -Example: :: - - id1 0 0 - id2 0 0 - id3 id1 id2 - id4 id1 id2 - -Output file formats -~~~~~~~~~~~~~~~~~~~ - -Genotype file -------------- - -Genotype files contain the input genotypes for each individual. The first value in each line is the individual's id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing). The following examples gives the genotypes for four individuals genotyped on four markers each. - -Example: :: - - id1 0 2 9 0 - id2 1 1 1 1 - id3 2 0 2 0 - id4 0 2 1 0 - -Phase file ------------ - -The phase file gives the phased haplotypes (either 0 or 1) for each individual in two lines. For individuals where we can determine the haplotype of origin, the first line will provide information on the paternal haplotype, and the second line will provide information on the maternal haplotype. - -Example: :: +.. highlight:: none - id1 0 1 9 0 # Maternal haplotype - id1 0 1 9 0 # Paternal haplotype - id2 1 1 1 0 - id2 0 0 0 1 - id3 1 0 1 0 - id3 1 0 1 0 - id4 0 1 0 0 - id4 0 1 1 0 diff --git a/docs/source/introduction.rst b/docs/source/introduction.rst new file mode 100644 index 0000000..d6aa494 --- /dev/null +++ b/docs/source/introduction.rst @@ -0,0 +1,25 @@ +Introduction +~~~~~~~~~~~~ + +|Software| is program to perform imputation in a range of animal and plant species. + +Availability +------------ + +|Software| is available as open-source from `GitHub `_ and can be easily installed using ``pip install AlphaImpute2`` (the latest release). + +Please report any issues at ``_. + +Conditions of use +----------------- + +|Software| is part of a suite of software that our group has developed. It is fully and freely available for all use under the MIT License. + +Suggested Citation: + +.. [1] Whalen, A, Hickey, JM. (2020). *AlphaImpute2: Fast and accurate pedigree and population based imputation for hundreds of thousands of individuals in livestock populations*. BioRxiv https://doi.org/10.1101/2020.09.16.299677 + +Disclaimer +---------- + +While every effort has been made to ensure that |Software| does what it claims to do, there is absolutely no guarantee that the results provided are correct. Use of |Software| is entirely at your own risk. diff --git a/docs/source/static/example_pedigree_with_missing.png b/docs/source/static/example_pedigree_with_missing.png new file mode 100644 index 0000000..63c5ab2 Binary files /dev/null and b/docs/source/static/example_pedigree_with_missing.png differ diff --git a/docs/source/usage.rst b/docs/source/usage.rst new file mode 100644 index 0000000..74d91b1 --- /dev/null +++ b/docs/source/usage.rst @@ -0,0 +1,216 @@ +.. _usage: + +----- +Usage +----- + +=============== +Program options +=============== + +|Software| takes in a number of command line arguments to control the program's behavior. To view a list of arguments, run |Software| without any command line arguments, i.e. |Software| or ``AlphaImpute2 -h``. + +There are four primary ways to run |Software| which differ on whether population or pedigree imputation should be run. The default option is to run both population and pedigree imputation in an integrated algorithm. This will be the option most users will want if they have access to pedigree data on a majority of individuals. The second option is to run population imputation only with the ``-pop_only`` flag. This option should be used if no pedigree data is availible. The third option is to run only pedigree based imputation using the ``-ped_only`` flag. This option is not recommended for general use cases, but may be applicable if + +(1) there are more than five generations of pedigree data, + +(2) imputation is done only on the most recent generations, + +(3) speed is a priority. + +The fourth option is to run |Software| with the ``-cluster_only``. This option performs |Software|'s array clustering algorithm and outputs the results of the clustering. This option may be useful for debugging how individuals are clustered. + +Core Arguments +-------------- + +:: + + Core arguments + -out prefix The output file prefix. + +The ``-out`` argument gives the output file prefix for where the outputs of |Software| should be stored. By default, |Software| outputs a file with imputed genotypes, ``prefix.genotypes`` and phased haplotypes ``prefix.phase``. For more information on which files are created, see "Output Arguments", below. + +Input Arguments +---------------- + +:: + + Input arguments: + -genotypes [GENOTYPES ...] + A file in AlphaGenes format. + -pedigree [PEDIGREE ...] + A pedigree file in AlphaGenes format. + -startsnp STARTSNP The first marker to consider. The first marker in the file is marker '1'. Default: 1. + -stopsnp STOPSNP The last marker to consider. Default: all markers considered. + -seed SEED A random seed to use for debugging. + +|Software| requires a genotype file and an optional pedigree file to run the analysis. + +A pedigree file may be supplied using the ``-pedigree`` option. + +Use the ``-startsnp`` and ``-stopsnp`` comands to run the analysis only on a subset of markers. + +Imputation arguments: +--------------------- + +:: + + Impute options: + -maxthreads MAXTHREADS + Number of threads to use. Default: 1. + -binaryoutput Flag to write out the genotypes as a binary plink + output. + -phase_output Flag to write out the phase information. + -seg_output Flag to write out the segmentation information. + -pop_only Flag to run the population based imputation algorithm + only. + -ped_only Flag to run the pedigree based imputation algorithm + only. + -cluster_only Flag to just cluster individuals into marker arrays + and write out results. + -length LENGTH Estimated map length for pedigree and population + imputation in Morgans. Default: 1 (100cM). + +These options control how imputation is run. The ``-maxthreads`` argument can be used to allow multiple threads to be used for imputation. This argument can be set seperately from the ``-iothreads`` argument (above). The speed gains of using multiple threads is close to linear for population imputation, but is more limited for pedigree based imputation. + +The ``-length`` argument controls the assumed length of the chromosome (in Morgans). We have found that imputation is largely insensitive to this value so keeping this value at its default of 1, should work in many cases. There are additional options to control the assumed recombination used for population based imputation (below). + +The binary output option flags the program to write out files in plink binary format. Binary plink files require the package ``alphaplinkpython``. This can be installed via ``pip`` but is only stable for Linux. A fake map file is generated. + +The remaining options control how |Software| is run. + +Pedigree imputation options +--------------------------- +:: + + Pedigree imputation options: + -cycles CYCLES Number of peeling cycles. Default: 4 + -final_peeling_threshold FINAL_PEELING_THRESHOLD + Genotype calling threshold for final round of peeling. + Default: 0.1 (best guess genotypes). + +These options control how pedigree imputation is run for either the pedigree only algorithm, or the combined algorithm. ``-cycles`` controls the number of cycles of peeling that are perfromed. An additional very-high-confidence cycle is always performed in addition to the cycles specific here. We recommend using the default value of 4 cycles. Additional cycles seem to provide limited benifit in most pedigrees. + +The ``-final_peeling_threshold`` argument gives the genotype calling threshold for the final round of peeling. This applies to both the pedigree only or the combined algorithm. We recommend either using best guess genotypes (default with a cutoff of 0.1) or high confidence genotypes (with a cutoff of 0.95). Values that cannot be imputed with high enough confidence will be coded as missing. + +Population imputation options +----------------------------- + +:: + + Population imputation options: + -n_phasing_cycles N_PHASING_CYCLES + Number of phasing cycles. Default: 5 + -n_phasing_particles N_PHASING_PARTICLES + Number of phasing particles. Defualt: 40. + -n_imputation_particles N_IMPUTATION_PARTICLES + Number of imputation particles. Defualt: 100. + -hd_threshold HD_THRESHOLD + Percentage of non-missing markers for an individual be + classified as high-density when building the haplotype + library. Default: 0.95. + -min_chip MIN_CHIP Minimum number of individuals on an inferred low- + density chip for it to be considered a low-density + chip. Default: 0.05 + -phasing_loci_inclusion_threshold PHASING_LOCI_INCLUSION_THRESHOLD + Percentage of non-missing markers per loci for it to + be included on a chip for imputation. Default: 0.9. + -imputation_length_modifier IMPUTATION_LENGTH_MODIFIER + Increases the effective map length of the chip for + population imputation by this amount. Default: 1. + -phasing_length_modifier PHASING_LENGTH_MODIFIER + Increases the effective map length of the chip for + Phasing imputation by this amount. Default: 5. + -phasing_consensus_window_size PHASING_CONSENSUS_WINDOW_SIZE + Number of markers used to evaluate haplotypes when + creating a consensus haplotype. Default: 50. + +These options control how population imputation is run. This algorithm uses a particle-based imputation approach where a number of particles are used to explore genotype combinations with high posterior probability. Increasing the number of particles can increase accuracy. USe the options, ``-n_phasing_particles`` and ``n_imputation_particles`` to change the number of particles run for phasing and imputation. + +|Software| uses a number of rounds of phasing in order to iteratively build a haplotype reference panel from the observed data. The argument ``-n_phasing_cycles`` controls the number of rounds that are used for phasing. In pilot testing we have found that the default value of 5 cycles tends to give good accuracy. Additional accuracy may be possible by slightly increasing this value. + +To perfrom phasing and imputation, |Software| selects high-density individuals to form the haplotype reference panel. ``-hd_threshold`` gives the percentage of non-missing markers the individual needs to carry to be included in the high-density reference panel. + +Similar to ``-length`` the ``-imputation_length_modifier`` and ``-phasing_length_modifier`` control the assumed chromosome length for phasing and imputation. These values are applied multiplicatively to the ``-length`` option. We have found that imputation accuracy is not very sensitive to these values and recommend setting them to their default value. + +When |Software| is run, multiple particles are merged based on the particle's score in a window centered around each marker. ``phasing_consensus_window_size`` controls the size of the window. Increasing this value can increase imputation accuracy if the low-density panel is very sparse compared to the high-density panel. + +Joint imputation options +------------------------ + +:: + + Joint imputation options: + -chip_threshold CHIP_THRESHOLD + Proportion more high density markers parents need to + be used over population imputation. Default: 0.95 + -final_peeling_threshold_for_phasing FINAL_PEELING_THRESHOLD_FOR_PHASING + Genotype calling threshold for first round of peeling + before phasing. This value should be conservative.. + Default: 0.9. + +These options control how population and pedigree imputation are combined. As part of the combined algorithm, |Software| detects a small number of "pseudo-founders" to impute using the population imputation algorithm. These "pseudo-founders" are selected by finding individuals with higher genotyping densities than their parents. |Software| tries to be conservative in which individuals are selected as a "pseudo-founder" and the ``-chip_threshold`` parameter tells algorithm how many more non-missing markers the individuals needs compared to their parents to be considered a "pseudo-founder". + +Similar to the ``-final_peeling_threshold`` argument, the ``-final_peeling_threshold_for_phasing`` argument gives the final peeling threshold for the initial round of pedigree imputation in the combined algorithm. + +============ +File formats +============ + +Input file formats +------------------ + +Genotype file +============= + +Genotype files contain the input genotypes for each individual. The first value in each line is the individual's id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing). The following examples gives the genotypes for four individuals genotyped on four markers each. + +Example: :: + + id1 0 2 9 0 + id2 1 1 1 1 + id3 2 0 2 0 + id4 0 2 1 0 + +Pedigree file +============= + +Each line of a pedigree file has three values, the individual's id, their father's id, and their mother's id. "0" represents an unknown id. + +Example: :: + + id1 0 0 + id2 0 0 + id3 id1 id2 + id4 id1 id2 + +Output file formats +------------------- + +Genotype file +============= + +Genotype files contain the input genotypes for each individual. The first value in each line is the individual's id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing). The following examples gives the genotypes for four individuals genotyped on four markers each. + +Example: :: + + id1 0 2 9 0 + id2 1 1 1 1 + id3 2 0 2 0 + id4 0 2 1 0 + +Phase file +========== + +The phase file gives the phased haplotypes (either 0 or 1) for each individual in two lines. For individuals where we can determine the haplotype of origin, the first line will provide information on the paternal haplotype, and the second line will provide information on the maternal haplotype. + +Example: :: + + id1 0 1 9 0 # Maternal haplotype + id1 0 1 9 0 # Paternal haplotype + id2 1 1 1 0 + id2 0 0 0 1 + id3 1 0 1 0 + id3 1 0 1 0 + id4 0 1 0 0 + id4 0 1 1 0 diff --git a/example/simple_example/simple_genotype.txt b/example/simple_example/simple_genotype.txt new file mode 100644 index 0000000..87e2a2c --- /dev/null +++ b/example/simple_example/simple_genotype.txt @@ -0,0 +1,3 @@ +A 2 2 0 0 0 +B 0 0 2 2 2 +C 9 9 9 9 9 diff --git a/example/simple_example/simple_pedigree.txt b/example/simple_example/simple_pedigree.txt new file mode 100644 index 0000000..b3951a7 --- /dev/null +++ b/example/simple_example/simple_pedigree.txt @@ -0,0 +1,3 @@ +A 0 0 +B 0 0 +C A B diff --git a/pyproject.toml b/pyproject.toml index 372104e..0f1d3ff 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta" name = "AlphaImpute2" version = "0.0.3" authors = [ - { name="Andrew Whalen", email="awhalen@roslin.ed.ac.uk" }, + { name="Andrew Whalen", email="alphagenes.dev@gmail.com" }, ] description = "An imputation software for massive livestock populations." # readme corresponds to the long_description