Pipeline to construct species phylogenies using BUSCO.
- Alignment: PRANK, MAFFT.
- Trimming: GBlocks, TrimAl.
- Phylogenetic tree construction: IQTree, MrBayes, ASTRAL III, RapidNJ, PHYLIP.
- Visualization: Etetoolkit, Matplotlib.
To use this workflow, you can either download and extract the latest release or clone the repository:
    git clone https://github.com/tomarovsky/BuscoClade.git
Place your unpacked FASTA genome assemblies into the genomes/ directory. Keep in mind that the file prefixes will influence the output phylogeny. Ensure that your files have a .fasta extension.
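As an illustration of the staging step above, here is a minimal sketch that copies assemblies into genomes/ and normalizes their extension to .fasta. The function name stage_genomes, the source directory argument, and the set of accepted extensions (.fa, .fna, .fas) are assumptions made for this example and are not part of BuscoClade.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with BuscoClade).
# Usage: stage_genomes SRC_DIR [DEST_DIR]
# Copies unpacked FASTA files from SRC_DIR into DEST_DIR (default: genomes),
# renaming them so that every file ends in .fasta. The part of the file name
# before the extension is kept unchanged.
stage_genomes() {
    local src="$1" dest="${2:-genomes}"
    local f base prefix
    mkdir -p "$dest"
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        base="$(basename "$f")"
        case "$base" in
            *.fasta) prefix="${base%.fasta}" ;;
            *.fna)   prefix="${base%.fna}" ;;
            *.fas)   prefix="${base%.fas}" ;;
            *.fa)    prefix="${base%.fa}" ;;
            *)
                # Anything else (including compressed files) is left alone.
                echo "skipping $base: not an unpacked FASTA file" >&2
                continue
                ;;
        esac
        cp "$f" "$dest/$prefix.fasta"
    done
    return 0
}
```

For example, `stage_genomes /path/to/assemblies` would copy `Ailurus_fulgens.fa` to `genomes/Ailurus_fulgens.fasta`. Compressed files are skipped on purpose, since the workflow expects unpacked assemblies.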
To set up the workflow, modify config/default.yaml. I recommend copying the config file and making all modifications in the copy. Some options (all non-nested options from default.yaml) can also be set on the command line using the --config flag. Sections of the config file:
- Pipeline Configuration: This section outlines the workflow. By default, it includes alignment and subsequent filtering of nucleotide sequences, and all tools for phylogeny reconstruction except MrBayes (it is recommended to run the GPU-compiled version separately). To disable a tool, set its value to False or comment out the corresponding line.
- Tool Parameters: Specify parameters for each tool. To run BUSCO, it is important to specify:
  - busco_dataset_path: Download the BUSCO dataset beforehand and specify its path here.
  - busco_params: Use the --offline flag and the --download_path parameter, indicating the path to the busco_downloads/ directory.
- Directory structure: Defines the output file structure in the results/ directory. It is recommended to leave it unchanged.
- Resources: Specify Slurm queue, threads, memory, and runtime for each tool.
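The copy-then-edit approach and the offline BUSCO requirement can be sketched as two small shell helpers. Both function names (copy_config, check_busco_dataset) are hypothetical and not part of BuscoClade. The check for a dataset.cfg file inside the dataset directory is an assumption about how BUSCO lineage datasets are laid out, so adjust it if your dataset differs.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with BuscoClade).

# Usage: copy_config NAME [CONFIG_DIR]
# Copies CONFIG_DIR/default.yaml to CONFIG_DIR/NAME.yaml (CONFIG_DIR defaults
# to "config") without overwriting an existing copy, then prints the path.
copy_config() {
    local conf_dir="${2:-config}" target
    target="$conf_dir/$1.yaml"
    if [ -e "$target" ]; then
        echo "$target already exists, leaving it untouched" >&2
    else
        cp "$conf_dir/default.yaml" "$target" || return 1
    fi
    echo "$target"
}

# Usage: check_busco_dataset DATASET_DIR
# Returns 0 if DATASET_DIR looks like a downloaded BUSCO dataset.
# Assumption: a lineage dataset directory contains a dataset.cfg file.
check_busco_dataset() {
    local dataset="$1"
    if [ ! -d "$dataset" ]; then
        echo "BUSCO dataset directory not found: $dataset" >&2
        return 1
    fi
    if [ ! -f "$dataset/dataset.cfg" ]; then
        echo "no dataset.cfg in $dataset, is this a BUSCO dataset?" >&2
        return 1
    fi
    return 0
}
```

A possible use is to run `copy_config my_run`, edit config/my_run.yaml, and call `check_busco_dataset` on the path you put in busco_dataset_path before starting the workflow. The copy is then passed to Snakemake through --configfile in place of config/default.yaml.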
Install snakemake:
    mamba create -c conda-forge -c bioconda -c nodefaults -n snakemake snakemake snakemake-executor-plugin-cluster-generic
    mamba activate snakemake
For a dry run:
    snakemake --profile profile/slurm/ --configfile config/default.yaml --dry-run
Snakemake will print all the rules that will be executed. Remove --dry-run to initiate the actual run.
How do I run the workflow if I already have completed BUSCO results?
First, move the genome assemblies to the genomes/ directory or create empty files with the corresponding names. Then, create a results/busco/ directory and move the BUSCO output directories into it. Note that the BUSCO output must follow the layout the workflow expects. For example, for Ailurus_fulgens.fasta the BUSCO output should look like this:
    results/
        busco/
            Ailurus_fulgens/
                busco_sequences/
                    fragmented_busco_sequences/
                    multi_copy_busco_sequences/
                    single_copy_busco_sequences/
                hmmer_output/
                logs/
                metaeuk_output/
                full_table_Ailurus_fulgens.tsv
                missing_busco_list_Ailurus_fulgens.tsv
                short_summary_Ailurus_fulgens.txt
                short_summary.json
                short_summary.specific.mammalia_odb10.Ailurus_fulgens.json
                short_summary.specific.mammalia_odb10.Ailurus_fulgens.txt
Please email me at: andrey.tomarovsky@gmail.com for any questions or feedback.
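For the case of already completed BUSCO runs described above, the staging steps could be scripted roughly as follows. This is only a sketch: the function name stage_busco_results and its arguments are hypothetical, and it assumes that each BUSCO output directory is already named after the genome prefix (for example Ailurus_fulgens/). It does not rename the files inside the BUSCO directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with BuscoClade).
# Usage: stage_busco_results SRC_DIR [WORKFLOW_DIR]
# For every BUSCO output directory SRC_DIR/<prefix>/ this
#   1. creates an empty WORKFLOW_DIR/genomes/<prefix>.fasta if none exists,
#   2. moves the directory to WORKFLOW_DIR/results/busco/<prefix>/.
# Directories without busco_sequences/single_copy_busco_sequences/ are skipped.
stage_busco_results() {
    local src="$1" root="${2:-.}"
    local d name
    mkdir -p "$root/genomes" "$root/results/busco"
    for d in "$src"/*/; do
        [ -d "$d" ] || continue
        name="$(basename "$d")"
        if [ ! -d "${d%/}/busco_sequences/single_copy_busco_sequences" ]; then
            echo "skipping $name: no busco_sequences/single_copy_busco_sequences/" >&2
            continue
        fi
        if [ -e "$root/results/busco/$name" ]; then
            echo "skipping $name: results/busco/$name already exists" >&2
            continue
        fi
        # An empty placeholder is created only when no assembly is present.
        [ -e "$root/genomes/$name.fasta" ] || : > "$root/genomes/$name.fasta"
        mv "${d%/}" "$root/results/busco/$name"
    done
    return 0
}
```

Running `stage_busco_results /path/to/finished_buscos` from the repository root would populate genomes/ and results/busco/. A dry run of the workflow afterwards is a reasonable way to confirm that the BUSCO step is no longer scheduled.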
