ibdutils is a small Python package designed to facilitate identity-by-descent (IBD) analysis. It offers a set of tools for genetic researchers and bioinformaticians working with IBD data.
- Filter IBD segments by TMRCA, mutation, IBD segment length and samples
- Allow haploid-to-diploid IBD conversion and flattening of diploid IBD segments
- Remove highly related samples/isolates biased on pairwise total IBD network
- Calculate and visualize IBD coverage over sampling sites
- Identify (via IBD coverage threholding method) and validate (via
iHS-based statistics) IBD peak regions - Remove IBD located within identified IBD peak regions for selection correction
- Split genomes into contigs of non-zero IBD coverage, important for preparing IBDNe input after removing IBD from peak regions
- Prepare pairwise total IBD sharing matrix and run hierarchical clustering over it
- Provide wrappers for IBD-based downstream analyses, including:
igraph-pythonInfomap community detection algorithm for population structure analysisIBDNeto infer the trajectory of effective population size for recent times
- Provide an IbdComparator class to allow benchmarking inferred IBD set against a true IBD set
- Provide helper classes like GeneticMap (base pair and centimorgan coordinate conversion) and Genome (chromosome sizes, and annotations like drug resistance genes, multigene family)
- Fast dump and load of IBD objects
This package has been tested on MacOS and Linux operating systems. Software dependencies are specified in the pyproject.toml file. These dependencies will be automatically installed when this package is installed via pip.
The specific versions of the software dependencies for ibdutils are left blank, so they are more flexible and will not conflict with other software environments, such as the ones used for posseleff_simulations and posseleff_empirical. In fact, ibdutils is part of the Conda environments of the posseleff_simulations and posseleff_empirical pipelines and is known to work as expected.
The package can be easily installed via pip within an existing python environment or a newly created Conda environment, such as python=3.10. Installation time can be as short as 30 seconds.
git clone https://github.com/bguo068/ibdutils.git cd ibdutils # optional: git checkout [specific version/branch/commit] # Some pip version are unknow to have issue, work around: # use a conda environment with python=3.10 specified pip install .Note:
- Installation of the dependency
pybedtoolsmay need c compilers and python3 developer packages. So you need to have them installed before installingibdutils.- On Redhat-based systems, you might want to install
sudo dnf install python3-devel gccpackage before you install ibdutils packages. On Debian-based systems, you can installsudo apt install python3-dev build-essential. - If you don't have root previledge, you can use conda to install
pybedtools(or other packages that need compilation) by runningconda install pybedtools -c bioconda.
- On Redhat-based systems, you might want to install
- Alternatively, if you want a conda recipe to construct a full conda environment including
ibdutilsitself, you can
wget https://raw.githubusercontent.com/bguo068/posseleff_empirical/main/env.yaml conda env create -f env.yaml conda activate empirical- Prepare input data
#! /usr/bin/env bash tar xf testdata.tgzNote for preparing *.ibd files for your emprical analysis, the following columns "Ancestor", "Tmrca", and "HasMutation" are unknown, and can be set to some values that would be compatible with downstream analysis, such as Ancestor = 9999, Tmrca = 100, HasMutation = 0. Please check the complete code used to prepare ibd files from hmmIBD output files in this script
- Example 1
import ibdutils.utils.ibdutils as ibdutils import ibdutils.runner.ibdne as ibdne # ---------------- prepare IBDNe input files ------------ # input files (simulated from single population model with selection) sp_ibd_files = [f"testdata/single_pop/tskibd/{i}.ibd" for i in range(1, 15)] sp_vcf_files = [f"testdata/single_pop/vcf/{i}.vcf.gz" for i in range(1, 15)] # parameters label_str = "tmp_sp_sim" ibdne_flatmeth = "none" ibdne_mincm = 2 # read ibd genome_14_100 = ibdutils.Genome.get_genome("simu_14chr_100cm") ibd = ibdutils.IBD(genome=genome_14_100, label=f"{label_str}_orig") ibd.read_ibd(ibd_fn_lst=sp_ibd_files) # remove highly related samples mat = ibd.make_ibd_matrix() unrelated_samples = ibd.get_unrelated_samples(ibdmat=mat) ibd.subset_ibd_by_samples(subset_samples=unrelated_samples) # prepare input for IBDNe # remove ibd with tmrca < 1.5 (required according to IBDNe paper) ibd.filter_ibd_by_time(min_tmrca=1.5) ibd.filter_ibd_by_length(min_seg_cm=ibdne_mincm) # calculate iHS ibd.calc_ihs(vcf_fn_lst=sp_vcf_files, min_maf=0.01) # calculate IBD coverage and find peaks ibd.calc_ibd_cov() ibd.find_peaks() # only keep peaks that contain a ihs hit ibd.filter_peaks_by_ihs() # plot IBD coverage with peaks marked with red shading ax = ibd.plot_coverage(which="ihsfilt") fig = ax.get_figure() fig.savefig("cov.png") # save IBD before remove peaks # of_orig_ibdne_obj = f"{label_str}_orig.ibdne.ibdobj.gz" # ibd.pickle_dump(of_orig_ibdne_obj) # saved ibd.obj file can be loaded by # ibd = ibdutils.IBD.pickle_load("xxx.ibd.obj.gz") # USEFUL information saved in the IBD object: ibd._df # IBD segment dataframe ibd._cov_df # IBD coverage dataframe ibd._peaks_df # IBD peaks and annotations # remove peaks ibd2 = ibd.duplicate(f"{label_str}_rmpeaks") ibd2.remove_peaks() ibd2._df = ibd2.cut_and_split_ibd() # USEFUL information saved in the IBD object: # IBD segment dataframe after peak removal ibd2._df # IBD coverage dataframe. Note this is not updated by `remove_peaks()` so now it # still reflects coverage before peak removal; if needed, you can call # `calc_ibd_cov()` to update _cov_df. ibd2._cov_df # IBD peaks and annotations. Note this is not updated by `remove_peaks()` so it # still reflects peaks before peak removal) ibd2._peaks_df # save IBD after remove peaks # of_rmpeaks_ibdne_obj = f"{label_str}_rmpeaks.ibdne.ibdobj.gz" # ibd2.pickle_dump(of_rmpeaks_ibdne_obj) ibdne_runner1 = ibdne.IbdNeRunner(ibd, ".", ".") ibdne_runner1.run(dry_run=True) # This generate three files *.ibd.gz/*.map/*.sh for running IBDNe ibdne_runner2 = ibdne.IbdNeRunner(ibd2, ".", ".") ibdne_runner2.run(dry_run=True) # This generate three files *.ibd.gz/*.map/*.sh for running IBDNe- Example 2
import ibdutils.utils.ibdutils as ibdutils import ibdutils.runner.ibdne as ibdne import pandas as pd import numpy as np # ---------------- call infomap ------------ # input files mp_ibd_files = [f"testdata/multiple_pop/tskibd/{i}.ibd" for i in range(1, 15)] mp_vcf_files = [f"testdata/multiple_pop/vcf/{i}.vcf.gz" for i in range(1, 15)] # parameters label_str = "tmp_mp_sim" nsam = 200 # no. of isolates per subpopulation npop = 5 # no. of subpopulations transform = "square" # transformation of IBD matrix ntrials = 1000 # parameter of infomap algorithm # meta information with optional true labels (for purpose of comparison) meta = pd.DataFrame( { "Sample": np.arange(nsam * npop), # use haploid here "Population": np.repeat(np.arange(npop), nsam), } ) # read ibd from files genome_14_100 = ibdutils.Genome.get_genome("simu_14chr_100cm") ibd = ibdutils.IBD(genome=genome_14_100, label=f"{label_str}_orig") ibd.read_ibd(ibd_fn_lst=mp_ibd_files) # remove highly relatedness samples # NOTE: for empirical analysis, you might want to remove some short IBD segments # and mask ibd matrix elements of low values by changing arguments of the # `make_ibd_matrix` method, e.g.: # ibd.make_ibd_matrix(min_seg_cm=2, max_gw_ibd_cm=5.0) mat = ibd.make_ibd_matrix() unrelated_samples = ibd.get_unrelated_samples(ibdmat=mat) ibd.subset_ibd_by_samples(subset_samples=unrelated_samples) # calculate IBD coverage and find peaks ibd.calc_ibd_cov() ibd.find_peaks() # calculate iHS and filter peaks ibd.calc_ihs(vcf_fn_lst=mp_vcf_files, min_maf=0.01) ibd.filter_peaks_by_ihs() # plot IBD coverage with peaks marked with red shading ibd.plot_coverage(which="ihsfilt") # USEFUL information saved in the IBD object: ibd._df # IBD segment dataframe ibd._cov_df # IBD coverage dataframe ibd._peaks_df # IBD peaks and annotations # make a copy of the IBD object and remove IBD within peaks on the copy ibd2 = ibd.duplicate("rmpeak") ibd2.remove_peaks() # USEFUL information saved in the IBD object: # IBD segment dataframe after peak removal ibd2._df # IBD coverage dataframe. Note this is not updated by `remove_peaks()` so now it # still reflects coverage before peak removal; if needed, you can call # `calc_ibd_cov()` to update _cov_df. ibd2._cov_df # IBD peaks and annotations. Note this is not updated by `remove_peaks()` so it # still reflects peaks before peak removal) ibd2._peaks_df # run infomap on IBD object without peak removal mat = ibd.make_ibd_matrix() # run infomap on IBD object WITH peaks not removed member_df = ibd.call_infomap_get_member_df( mat, meta, trials=ntrials, transform=transform ) # run infomap on IBD object WITH peak removal mat2 = ibd2.make_ibd_matrix() member_df2 = ibd2.call_infomap_get_member_df( mat2, meta, trials=ntrials, transform=transform ) # Infomap identifies 2 main groups member_df.Rank.value_counts().iloc[:6] # 0 640 # 1 120 # 2 15 # 3 11 # 4 9 # 5 9 # Name: Rank, dtype: int64 # Infomap identifies 4 main groups when selection correction is performed member_df2.Rank.value_counts().iloc[:6] # 0 203 # 1 150 # 2 149 # 3 108 # 4 13 # 5 9 # Name: Rank, dtype: int64 # NOTE about Infomap result dataframe # column "Membership" contains the inferred labels (unsorted version) # column "Rank" contains the sorted version of inferred labelsMore examples can be found for simulated and and emprical dataset:
- for simulated data:
- for emprical data:
- The documentation for each function or method is currently only partially complete.
- The existing implementation is based solely on Python. A separate implementation in Rust is in progress. The Rust version is expected to offer enhanced computational efficiency, particularly for coverage calculation, and peak removal steps.
If you find this tool useful, please cite our paper:
Guo, B., Borda, V., Laboulaye, R. et al. Strong positive selection biases identity-by-descent-based inferences of recent demography and population structure in Plasmodium falciparum. Nat Commun 15, 2499 (2024). https://doi.org/10.1038/s41467-024-46659-0
Other citations:
IBDNe
Browning, S. R., & Browning, B. L. (2015). Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. American journal of human genetics, 97(3), 404–418. https://doi.org/10.1016/j.ajhg.2015.07.012
Infomapalgorithm
Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America, 105(4), 1118–1123. https://doi.org/10.1073/pnas.0706851105
iHSstatistics and calculation
iHS calculation via scikit-allel:
Miles, A. et al. cggh/scikit-allel: v1.3.7. (2023) doi:10.5281/ZENODO.8326460.
iHS statistics:
Voight, B. F., Kudaravalli, S., Wen, X. & Pritchard, J. K. A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).
For packaging:
