PhySpeTree: an automated pipeline for reconstructing phylogenetic species trees

PhySpeTree is implemented in Python (Python 2.7+ and Python 3+ are supported) and is designed for Linux systems; on Windows or macOS it can be run through Docker.

Documentation: see the PhySpeTree documentation.

Understanding phylogenetic relationships between species is crucial for evolutionary studies. A phylogenetic species tree, a branching diagram, is particularly useful for inferring these relationships; the tree of life, for example, provides a remarkable view of the organizing principles of the biological world. An accurate species tree is therefore necessary, but reconstructing a species or gene tree is a very tedious process.

Here, we developed an easy-to-use package named PhySpeTree that reconstructs species trees with a single command line. It includes two independent pipelines, based on the widely adopted small subunit ribosomal RNA (SSU rRNA) and on concatenated highly conserved proteins (HCP), respectively. A distinct advantage is that users only need to provide species names; PhySpeTree automatically downloads and analyzes SSU rRNA or HCP sequences from about 4,000 organisms.

https://raw.githubusercontent.com/yangfangs/physpetools/master/docs/docs/img/PhySpeTree_work_follow.png

PhySpeTree workflow includes the following steps:

  • ① Automatic tree reconstruction.
  • ② Processing user-defined FASTA files for unannotated organisms.
  • ③ Reconstructing species trees with unannotated organisms.

Main features:

  • Inputs only include species names.
  • One command line to build trees.
  • HCP and SSU rRNA methods.
  • Combine trees.
  • View trees with iTOL.
  • Versatile software with adjustable parameters.

Install from PyPI:

$ pip install PhySpeTree

or download PhySpeTree and install:

$ pip install PhySpeTree-*.tar.gz

To upgrade to the latest version:

$ pip install --upgrade PhySpeTree

Install from GitHub:

$ git clone git@github.com:yangfangs/physpetools.git
$ cd physpetools
$ python setup.py install

or download and install:

$ pip install physpetools-*.tar.gz

The input of the autobuild module is a TXT file containing abbreviated species names (for example, the organism example list).
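
As a rough illustration, such a list can be produced with a few lines of Python. This is only a sketch: it assumes one abbreviation per line (the layout is not spelled out here), the helper name `write_organism_list` is hypothetical, and the codes mentioned in the usage note are ordinary KEGG organism abbreviations picked as examples.

```python
from pathlib import Path


def write_organism_list(codes, path):
    """Write abbreviated species names to a TXT file.

    Assumption: one abbreviation per line. Blank entries and
    duplicates are dropped so each organism appears once.
    """
    cleaned = []
    for code in codes:
        code = code.strip()
        if code and code not in cleaned:
            cleaned.append(code)
    Path(path).write_text("\n".join(cleaned) + "\n")
    return cleaned
```

For example, `write_organism_list(["hsa", "mmu", "eco"], "organism_example_list.txt")` would write three KEGG codes (human, mouse, and E. coli K-12) to the file used in the commands that follow.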

Use autobuild on the command line like this:

$ PhySpeTree autobuild -i organism_example_list.txt [options]*

-h          Print help message and exit.
-i          Input a TXT file containing abbreviated species names.
-o          A directory to store outputs. The default is "Outdata".
-t          Number of processing threads (CPUs). The default is 1.
-e          FASTA-format files used to extend the tree with the --ehcp or --esrna option.
--hcp       HCP (highly conserved protein) method (default).
--ehcp      HCP method with extended HCP sequences.
--srna      SSU rRNA method.
--esrna     SSU rRNA method with extended SSU rRNA sequences.

Advanced options of the internal software called by PhySpeTree can also be set. Each option string is enclosed in single quotes and starts with a space.

Here is an example of setting RAxML advanced options by --raxml_p:

$ PhySpeTree autobuild -i organism_example_list.txt -o test --srna --raxml --raxml_p ' -f a -m GTRGAMMA -p 12345 -x 12345 -# 100 -n T1'

--muscle        Multiple sequence alignment by MUSCLE (default).
--muscle_p      Set MUSCLE advanced parameters. The default is -maxiter 100; please see the MUSCLE Manual.
                  -maxiter   Maximum number of iterations to run; set to 100.
--clustalw      Multiple sequence alignment by clustalw2.
--clustalw_p    Set clustalw2 advanced parameters. The clustalw default parameters are used here; please see Clustalw Help.
--mafft         Multiple sequence alignment by mafft.
--mafft_p       Set mafft advanced parameters. The mafft default parameters are used here; please see mafft algorithms.
--gblocks       Trim by Gblocks (default).
--gblocks_p     Set Gblocks advanced parameters; please see the Gblocks documentation.
                  -t   Choice of sequence type (default).
                  -e   Generic file extension. The PhySpeTree default is "-gbl1".
--trimal        Trim by trimal.
--trimal_p      Set trimal advanced parameters; please see the trimal command line.
--raxml         Reconstruct phylogenetic tree by RAxML (default).
--raxml_p       Set RAxML advanced parameters. The default is -f a -m PROTGAMMAJTTX -p 12345 -x 12345 -# 100 -n T1; please see the RAxML Manual.
                  -f   Select algorithm. The PhySpeTree default is a: rapid bootstrap analysis and search for the best-scoring ML tree in one program run.
                  -m   Model of binary (morphological), nucleotide, multi-state, or amino acid substitution. The PhySpeTree default is PROTGAMMAJTTX.
                  -p   Random number seed for the parsimony inferences. The PhySpeTree default is 12345.
                  -x   Integer random seed; turns on rapid bootstrapping. The PhySpeTree default is 12345.
                  -N   Same as -#: the number of alternative runs on distinct starting trees. The PhySpeTree default is 100.
--fasttree      Reconstruct phylogenetic tree by FastTree.
--fasttree_p    Set FastTree advanced parameters; please see FastTree.
--iqtree        Reconstruct phylogenetic tree by iqtree.
--iqtree_p      Set iqtree advanced parameters; please see IQ-TREE.
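
When PhySpeTree is launched from a script, the quoting rule for advanced options (a single argument that starts with a space) is easy to get wrong. The sketch below assembles the RAxML example from above as an argument list; `autobuild_command` is a hypothetical helper, not part of PhySpeTree, and only flags that appear in this README are used.

```python
import shlex


def autobuild_command(organism_list, out_dir, raxml_options=None):
    """Assemble an autobuild command line as a list of arguments.

    Hypothetical helper for illustration; it only builds the list and
    does not run PhySpeTree.
    """
    cmd = ["PhySpeTree", "autobuild", "-i", organism_list,
           "-o", out_dir, "--srna", "--raxml"]
    if raxml_options:
        # The whole option string is ONE argument and must begin with a space.
        if not raxml_options.startswith(" "):
            raxml_options = " " + raxml_options
        cmd += ["--raxml_p", raxml_options]
    return cmd


cmd = autobuild_command(
    "organism_example_list.txt",
    "test",
    "-f a -m GTRGAMMA -p 12345 -x 12345 -# 100 -n T1",
)
# shlex.join (Python 3.8+) quotes the option string for pasting into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting altogether, since each list element reaches the program as exactly one argument.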

The build module is used to reconstruct species trees from manually prepared sequences. Advanced options are the same as for the autobuild module.

Use build on the command line to reconstruct a phylogenetic tree:

  • Build a phylogenetic tree with the concatenated HCP method (--multiple):
$ PhySpeTree build -i example_hcp -o output --multiple
  • Build a phylogenetic tree with the SSU rRNA method (--single):
$ PhySpeTree build -i example_16s_ssurna.fasta -o output --single

-h          Print help message and exit.
-i          Input a TXT file containing abbreviated species names.
-o          A directory to store outputs. The default is "Outdata".
-t          Number of processing threads (CPUs). The default is 1.
--multiple  Use the concatenated highly conserved protein (HCP) method to reconstruct the phylogenetic tree (default).
--single    Use SSU rRNA data to reconstruct the phylogenetic tree.

The combine module is used to combine trees generated by different methods. It takes two steps. First, merge the different tree files into a single file; on Linux you can use the cat command, for example:

$ cat tree1.tree tree2.tree > combineTree.tree

Then, use combine:

$ PhySpeTree combine -i combineTree.tree [options]*

-h           Print help message and exit.
-i           Input PHYLIP format file containing multiple trees.
-o           Output directory. The default is "combineTree".
--mr         Majority rule trees.
--mre        Extended majority rule trees.
--strict     Strict consensus trees.
--supertree  Use Spr_Supertree to combine conflicting evolutionary histories that are due to lateral gene transfer (LGT).
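
Where cat is not available, the merge step can be approximated in Python. This is a minimal sketch rather than part of PhySpeTree: `merge_tree_files` is a hypothetical helper, and unlike plain cat it trims surrounding whitespace and skips empty files so that each tree ends up on its own line.

```python
from pathlib import Path


def merge_tree_files(tree_paths, merged_path):
    """Concatenate several tree files into one, roughly like
    `cat tree1.tree tree2.tree > combineTree.tree`.

    Returns the number of non-empty input files that were merged.
    """
    parts = []
    for path in tree_paths:
        text = Path(path).read_text().strip()
        if text:  # ignore empty files
            parts.append(text)
    Path(merged_path).write_text("\n".join(parts) + "\n")
    return len(parts)
```

The merged file can then be passed to the combine module with -i, as in the command above.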

PhySpeTree provides the iview module to annotate output trees with taxonomic information (kingdom, phylum, class, or order) and to generate configuration files for iTOL.

Use iview on the command line like this:

$ PhySpeTree iview -i organism_example_list.txt --range

-h    Print help message and exit.
-i    Input a TXT file containing abbreviated species names.
-o    A directory to store outputs. The default is "iview".
-r    Annotate labels with ranges by kingdom, phylum, class, or order. The default is phylum.
-c    Annotate labels without ranges by kingdom, phylum, class, or order. The default is phylum.
-a    Color ranges as assigned by the user; choose from kingdom, phylum, class, and order.
-l    Change species labels from abbreviated names to full names.

The check module is used to check whether input organisms are in pre-built databases.

$ PhySpeTree check -i organism_example_list.txt -o check --ehcp

-h      Print help message and exit.
-i      Input a TXT file containing abbreviated species names.
-o      A directory to store outputs. The default is "check".
--hcp   Check whether organisms are supported in the KEGG database.
--ehcp  Check input organisms in preparation for extending the tree with the autobuild module.
--srna  Check whether organisms are supported in the SILVA database.

1. What is the input of PhySpeTree?

Users only need to prepare a TXT file containing KEGG abbreviated species names, for example the organism example list.

2. How should PhySpeTree outputs be interpreted?

PhySpeTree returns two folders: Outdata contains the output species tree, and temp contains temporary data. Files in temp can be used to check the quality of the output at each step. If the HCP method (--hcp) is selected, the temp folder includes:

  • conserved_protein: highly conserved proteins retrieved from the KEGG database.
  • alignment: aligned sequences.
  • concatenate: concatenated sequences and conserved blocks.

If the SSU rRNA method (--srna) is selected, the temp folder includes:

  • rna_sequence: SSU rRNA sequences retrieved from the SILVA database.
  • rna_alignment: aligned sequences and conserved blocks.
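
A quick way to see which intermediate steps produced output is to count the files in each of these sub-folders. The sketch below is an illustration only: the sub-folder names come from the lists above, while the helper `summarize_temp`, the location of the temp folder, and the assumption that files sit directly inside each sub-folder are hypothetical.

```python
from pathlib import Path

# Sub-folder names as listed above for each method.
EXPECTED = {
    "hcp": ["conserved_protein", "alignment", "concatenate"],
    "srna": ["rna_sequence", "rna_alignment"],
}


def summarize_temp(temp_dir, method):
    """Return {sub-folder: number of files}, or None where a folder is missing."""
    temp = Path(temp_dir)
    summary = {}
    for name in EXPECTED[method]:
        sub = temp / name
        if sub.is_dir():
            summary[name] = sum(1 for p in sub.iterdir() if p.is_file())
        else:
            summary[name] = None  # this step left no folder behind
    return summary
```

Calling `summarize_temp("temp", "hcp")` after an HCP run would report a file count per step; a `None` or `0` entry points to the step worth inspecting first.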

3. Which classes of HCP are selected?

PhySpeTree uses 31 HCP, excluding horizontally transferred genes, according to Ciccarelli et al.

Cite:

Ciccarelli F D, Doerks T, Von Mering C, et al. Toward automatic reconstruction of a highly resolved tree of life[J]. Science, 2006, 311(5765): 1283-1287.

The 31 HCP and their corresponding KEGG KO numbers are shown in the following table:

Protein Names                                      Eukaryotes KO  Prokaryotes KO
DNA-directed RNA polymerase subunit alpha          K03040         K03040
Ribosomal protein L1                               K02865         K02863
Leucyl-tRNA synthetase                             K01869         K01869
Metal-dependent proteases with chaperone activity  K01409         K01409
Phenylalanine-tRNA synthetase alpha subunit        K01889         K01889
Predicted GTPase probable translation factor       K06942         K06942
Preprotein translocase subunit SecY                K10956         K10956
Ribosomal protein L11                              K02868         K02867
Ribosomal protein L13                              K02873         K02871
Ribosomal protein L14                              K02875         K02874
Ribosomal protein L15                              K02877         K17437
Ribosomal protein L16/L10E                         K02866         K02872
Ribosomal protein L18                              K02883         K02882
Ribosomal protein L22                              K02891         K02890
Ribosomal protein L3                               K02925         K02906
Ribosomal protein L5                               K02932         K02931
Ribosomal protein L6P/L9E                          K02940         K02939
Ribosomal protein S11                              K02949         K02948
Ribosomal protein S15P/S13E                        K02958         K02956
Ribosomal protein S17                              K02962         K02961
Ribosomal protein S2                               K02981         K02967
Ribosomal protein S3                               K02985         K02982
Ribosomal protein S4                               K02987         K02986
Ribosomal protein S5                               K02989         K02988
Ribosomal protein S7                               K02993         K02992
Ribosomal protein S8                               K02995         K02994
Ribosomal protein S9                               K02997         K02996
Seryl-tRNA synthetase                              K01875         K01875
Arginyl-tRNA synthetase                            K01887         K01887
DNA-directed RNA polymerase beta subunit           K03043         K03043
Ribosomal protein S13                              K02953         K02952

4. How are the SSU rRNA sequences created?

The SSU rRNA sequences are taken from the SILVA database (release 123.1). The sequences have been truncated, which means that unaligned nucleotides have been removed.

5. How do I use PhySpeTree when I can't connect to the Internet?

Users who cannot connect to the Internet can download the HCP or SSU rRNA database and reconstruct species trees locally.

  • SSU rRNA database: database16s.tar.gz
  • HCP database: databasehcp.tar.gz

Decompress the downloaded database, for example:

$ tar -zxvf database16s.tar.gz

Then use the -db option to set the absolute path to the decompressed directory.
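
The same preparation can be scripted. The following sketch unpacks an archive with Python's standard tarfile module and returns an absolute path suitable for -db. The helper name `unpack_database` and the destination folder are hypothetical, and whether -db should point at the extraction folder itself or at a folder inside the archive depends on the archive layout, which is not described here.

```python
import tarfile
from pathlib import Path


def unpack_database(archive, dest):
    """Extract a downloaded .tar.gz database and return the absolute path of dest."""
    dest = Path(dest).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Only extract archives obtained from a trusted source.
        tar.extractall(dest)
    return dest
```

For instance, `db_path = unpack_database("database16s.tar.gz", "physpe_db")` gives a path that can be supplied as `-db <db_path>` on the PhySpeTree command line.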
