
eLetters is an online forum for ongoing peer review. Submission of eLetters is open to all. Please read our guidelines before submitting your own eLetter.
- Authors' response to Harish eLetter (20 April 2016)
Response to Harish et al.: No ‘small genome attraction’ artifact
Arshan Nasir, Kyung Mo Kim and Gustavo Caetano-Anollés
In their eLetter, Harish, Abroi, Gough and Kurland criticize our structural phylogenomic methods, which are supported by large-scale structural and functional data and well-established comparative genomics, phylogenomics, and multidimensional scaling approaches (1). Harish et al. would like to see the origin of Eukarya at the base of the Tree of Life (ToL) (2). So, despite an invalidating critique (3), they go on to challenge the early cellular origin of viruses. Their claims include a warning to “nonspecialists” that the rooting of our trees is distorted by what they dub “a small genome attraction artifact of genome content-based phylogenetic analysis”. Here we examine their reasoning, which is mingled with misunderstandings and misinterpretations of cladistic methodology (see Table 1).
1. Their claim that our rooting approach uses outgroup taxa is incorrect. We do not “use a hypothetical pseudo-outgroup, … an artificial all-zero taxon … to root the ToL”. No outgroup taxon (presumably extant or artificial) was ever used or defined in our study (1). Their confusion of outgroups with ancestors showcases a misunderstanding of cladistics (Table 1). Contrary to their claims, our rooting method is grounded in early and well-established cladistic formalizations (4, 5) and is direct because it polarizes character transformations with information solely present in the taxa being studied (the ingroup), distinguishing ancestral from derived character states and rooting the trees a posteriori. Our rooted trees also comply with Weston’s generality criterion (6, 7), which states that as long as ancestral characters are preponderantly retained in descendants, ancestral character states will always be more general than their derivatives given their nested hierarchical distribution in rooted phylogenies. Biologically, structural domains spread by recruitment in evolution when genes duplicate and diversify, genomes rearrange, and genetic information is exchanged. This is a process of accumulation and retention of iterative homologies, such as serial homologues and paralogous genes (7), which is global, universal and largely unaffected by proteome size. The Lundberg method (5), which does not attach outgroup taxa to the ingroup as they claim, simply enables rooting by the generality criterion (8).
2. Their confusion of a priori and a posteriori character polarization calls into question their understanding of cladistic methodology. Their claim that we use a pseudo-outgroup to polarize character state changes a priori is inconsistent with our methodology of first reconstructing an undirected phyletic network and then polarizing character transformations with a direct method (1). They miss the fact that rooting is not a neutral procedure. While the length of the most parsimonious trees is unaffected by the position of the root, making a priori polarization unnecessary (4), rooting impacts the homology statements of the undirected networks (4, 5). “The length of a tree is unaffected by the position of the root but is certainly not unaffected by the inclusion of a root” (9).
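The statement that tree length does not depend on root position can be checked on a toy case. The sketch below is only an illustration, assuming a single ordered (Wagner) character, a four-taxon tree, and an interval-based optimization in the spirit of Farris (4); the taxon names, character states, and the function name `wagner_length` are hypothetical and are not taken from the study.

```python
def wagner_length(tree, states):
    """Minimum number of steps for one ordered (Wagner) character.

    `tree` is a rooted binary tree written as nested 2-tuples of leaf names;
    `states` maps each leaf name to an integer character state.
    Each node carries an interval of optimal states: overlapping child
    intervals are intersected at no cost, otherwise the gap is paid.
    """
    def visit(node):
        if isinstance(node, str):
            s = states[node]
            return (s, s), 0
        (lo1, hi1), cost1 = visit(node[0])
        (lo2, hi2), cost2 = visit(node[1])
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo <= hi:  # intervals overlap: keep the intersection
            return (lo, hi), cost1 + cost2
        # disjoint intervals: keep the gap between them and pay its width
        return (hi, lo), cost1 + cost2 + (lo - hi)

    return visit(tree)[1]


# Hypothetical states for four taxa (illustrative values only).
states = {"A": 0, "B": 2, "C": 5, "D": 3}

# Three rootings of the same unrooted tree ((A,B),(C,D)).
rootings = [
    (("A", "B"), ("C", "D")),   # root on the internal branch
    ("A", ("B", ("C", "D"))),   # root on the branch leading to A
    ("C", ("D", ("A", "B"))),   # root on the branch leading to C
]
lengths = [wagner_length(t, states) for t in rootings]
```

In this toy case all three rootings give the same length, which is the sense in which the root position leaves tree length unchanged; the sketch says nothing about how polarization alters homology statements.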
3. Their claim that small genome size affects rooting and induces attraction artifacts is conceptually and empirically debunked. Their assertion that organisms and viruses with small genomes (irrespective of their taxonomic affiliation) would be attracted to basal branches of our trees is incorrect. During searches of tree space and prior to rooting, we optimize individual character change in unrooted networks. This allows unrestricted gains and losses of domain occurrence or abundance throughout their branches. Thus, rooting plays no role in defining unrooted tree topology and cannot be distorted by genome size, which is a property of taxa (not individual characters changing in trees). This was already described in our supplementary text (1) and made explicit in a recent study (10). Polarization is only applied a posteriori by: (a) considering character spread in nested branches while accounting unproblematically for homoplasy (Weston’s rule), (b) searching for the most parsimonious solutions with Lundberg while treating homologies as taxic hypotheses, and (c) allowing gradual build-up of evolutionary innovation (including loss and punctuation) that complies with the principle of spatiotemporal continuity (PC). These three mutually supportive technical and biological axiomatic criteria were confirmed experimentally by Venn group distributions in ToDs and by visualizing clouds of proteomes in temporal space [see Figs. 5 and 8 of (1)]. Even Felsenstein’s suggestion of inverse polarization (11) of our ordered (Wagner) characters, which can be polarized in only two directions, produced suboptimal trees [see Figs. 3 and 4 of (3)]. The fiction is also debunked empirically. Plotting the node distance (nd) for each terminal node (i.e., taxon) from the root node of a ToL, on a scale from 0 (most basal) to 1 (most recent), against the genome sizes of taxa showed substantial scatter (rho = 0.80), poor linear fits (several peaks and troughs with 10 iterations of the LOWESS fitting method and a smoothing of q = 0.05), shallow monotonic increases (flat lines in Archaea and Bacteria), and no distortions/mixing of taxa among the four supergroups: viruses, Archaea, Bacteria and Eukarya. Genomes of similar sizes were scattered throughout the nd axis, suggesting that genome size was not a significant determinant of taxa position in our trees. Similarly, genome size scatter for individual nd increased towards the base of the trees, including scatter for supergroups (rho values of 0.55, 0.63, 0.75 and 0.85 for 266 viruses, 30 Archaea, 31 Bacteria and 28 Eukarya, respectively), debunking the alleged basal attraction artifact. The order of appearance of supergroups matched their proteomic complexity, from simple to complex, which also matched scaling patterns of use and reuse of structural domains in proteomes (3). This emerging property of trees supports evolution’s PC. Labeling the phylogenetic positions of the smallest proteomes in our trees showed that the smallest genomes were neither attracted towards the root nor caused distortions in 3-supergroup or 4-supergroup views of life. Their claim that “the rooting in viral lineages is an inevitable consequence of pre-specifying ‘0’ or ‘absence’ as the ancestral state” is therefore conceptually and empirically false.
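The kind of nd-versus-genome-size comparison described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the analysis behind the reported values: nd is taken here as the number of branching steps from the root to a leaf divided by the largest such count, rho is computed as Spearman's rank correlation with average ranks for ties, and the toy tree, taxon labels, and genome sizes are all hypothetical.

```python
from statistics import mean


def leaf_depths(tree, depth=0):
    """Yield (leaf, depth) pairs for a rooted tree written as nested tuples."""
    if isinstance(tree, str):
        yield tree, depth
    else:
        for child in tree:
            yield from leaf_depths(child, depth + 1)


def node_distance(tree):
    """Leaf depth divided by the maximum leaf depth (assumed nd definition)."""
    depths = dict(leaf_depths(tree))
    deepest = max(depths.values())
    return {leaf: d / deepest for leaf, d in depths.items()}


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical rooted tree and genome sizes (arbitrary units).
toy_tree = ("v1", ("v2", ("a1", ("b1", "e1"))))
genome_size = {"v1": 5, "v2": 400, "a1": 30, "b1": 50, "e1": 900}

nd = node_distance(toy_tree)
taxa = sorted(nd)
rho = spearman_rho([nd[t] for t in taxa], [genome_size[t] for t in taxa])
```

With real data, the same rho could be computed separately for each supergroup, as in the per-group values quoted above.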
4. Confusion of characters and taxa bootstraps their preconceptions. In rushing their unsupported claim that the rooting of trees of structural domains (ToDs) is also unreliable and affected by small genome size, they wrongly considered ToDs as being “uninterpretable in terms of the definition of the (domain) superfamilies which it comprises”, because homology “within” superfamilies “can be ascertained based on similarity of sequence, structure and function”. But superfamilies are the taxa and proteomes the characters, and definitions of taxa (superfamily hidden Markov models) do not need to follow either definitions of characters (superfamily growth in proteomes) or statements of homology tested in ToDs. What is ‘uninterpretable’ however is the putative effect of genome size on ToDs, since each proteome embodies a character, which by definition (Kluge’s Auxiliary Principle) is independent of others. So fiction bootstraps their preconceptions, including the idea that domains, the evolutionary units of proteins, do not evolve. Are 1,200 structural folds fortuitous findings or the makings of intelligent cause? Where does significant evolutionary signal of the ToDs, including a match to the geological record (12), come from? Even an exploration of the mapping of functions in genotype space shows the centrality of structure in defining evolutionary constraints (13).
5. They fictionalized comparative genomic research. They state that our Venn diagrams and summary statistics of domain distributions in supergroups are unconvincing in light of other comparative genomic analyses (14, 15). However, Abroi and Gough (14) argue that viruses may be a source of new protein fold architectures, a conclusion strongly supported by our Venn analysis, and Abroi (15) showcases the distribution and sharing of superfamilies between viral replicon types and cells, which is largely consistent with our analysis (3). There are no irreconcilable differences between these studies. Instead, our phylogenomic analysis dissects the alternative evolutionary scenarios that can be posed with the comparative method (1, 15).
Conclusions. It is ironic that attempts to controvert our direct methods of character polarization come from authors who are themselves proponents of the use of polarized characters, but with arbitrary transformation costs carefully engineered a priori to attract large eukaryotic genomes to the base of their trees (2). These characters violate the ‘triangle inequality’ (16), a fundamental property conferring metricity to phylogenetic distances. Its violation invalidates phylogenetic reconstruction (17). Their asymmetric step matrices require that they be solely optimized on a rooted tree, making them even prone to genome size attraction artifacts. In addition to this self-inconsistency, transformation costs also violate genomic scaling and processes responsible for scale-free behavior of proteins, challenging evolution’s PC and artificially forcing biological innovations to the base of universal trees (3).
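For readers unfamiliar with the triangle inequality as applied to transformation costs, a minimal check might look as follows. The sketch assumes a step matrix `cost[i][j]` giving the cost of changing state i into state j; both example matrices are hypothetical and are not the matrices used in (2).

```python
from itertools import permutations


def triangle_violations(cost):
    """Return every (i, j, k) for which going directly from state i to state k
    costs more than going from i to k through the intermediate state j."""
    n = len(cost)
    return [
        (i, j, k)
        for i, j, k in permutations(range(n), 3)
        if cost[i][k] > cost[i][j] + cost[j][k]
    ]


# Ordered (Wagner) costs |i - j| for three states: a metric, so no violations.
wagner = [[abs(i - j) for j in range(3)] for i in range(3)]

# Hypothetical asymmetric matrix in which the direct change 0 -> 2 is far more
# expensive than the two-step path 0 -> 1 -> 2.
skewed = [
    [0, 1, 10],
    [1, 0, 1],
    [1, 1, 0],
]

wagner_bad = triangle_violations(wagner)
skewed_bad = triangle_violations(skewed)
```

Here `wagner_bad` is empty, while `skewed_bad` reports the single triple (0, 1, 2), the case in which the direct change costs more than the detour.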
Arshan Nasir1, Kyung Mo Kim2, Gustavo Caetano-Anollés3*
1Department of Biosciences, COMSATS Institute of Information Technology, Islamabad 45550, Pakistan
2Microbial Resource Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, South Korea
3Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
References
1. A. Nasir, G. Caetano-Anollés, A phylogenomic data-driven exploration of viral origins and evolution. Sci. Adv. 1, e1500527 (2015).
2. A. Harish, A. Tunlid, C. G. Kurland, Rooted phylogeny of the three superkingdoms. Biochimie 95, 1593-1604 (2013).
3. K. M. Kim, A. Nasir, G. Caetano-Anollés, The importance of using realistic evolutionary models for retrodicting proteomes. Biochimie 99, 129-137 (2014).
4. J. S. Farris, Methods for computing Wagner trees. Syst. Zool. 19, 83-92 (1970).
5. J. Lundberg, Wagner networks and ancestors. Syst. Zool. 18, 1-32 (1972).
6. P. H. Weston, Indirect and direct methods in systematics, in Ontogeny and Systematics, C. J. Humphries, Ed. (Columbia University Press, New York, 1988), pp. 27–56.
7. P. H. Weston, Methods for rooting cladistic trees, in Models in Phylogeny Reconstruction, Systematics Association Special Volume No. 52, D. J. Siebert, R. W. Scotland, D. M. Williams, Eds. (Clarendon Press, Oxford, 1994), pp. 125–155.
8. H. N. Bryant, Hypothetical ancestors and rooting in cladistic analysis. Cladistics 13, 337-348 (1997).
9. A. V. Brower, M. C. de Pinna, Homology and errors. Cladistics 28, 529-538 (2012).
10. A. Nasir, K. M. Kim, G. Caetano-Anollés, Global patterns of protein domain gain and loss in superkingdoms. PLoS Comput. Biol. 10, e1003452 (2014).
11. J. Felsenstein, The statistical approach to inferring phylogeny and what it tells us about parsimony and character compatibility, in Cladistics: Perspectives on the Reconstruction of Evolutionary History, T. Duncan, T. F. Stuessy, Eds. (Columbia University Press, New York, 1984), pp. 169-191.
12. M. Wang, Y.-Y. Jiang, K. M. Kim, G. Qu, H.-F. Ji, H.-Y. Zhang, G. Caetano-Anollés, A molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation. Mol. Biol. Evol. 28, 567-582 (2011).
13. E. Ferrada, A. Wagner, Evolutionary innovations and the organization of protein functions in genotype space. PLoS ONE 5, e14172 (2010).
14. A. Abroi, J. Gough, Are viruses a source of new protein folds for organisms? - Virosphere structure space and evolution. Bioessays 33, 626-635 (2011).
15. A. Abroi, A protein domain-based view of the virosphere–host relationship. Biochimie 119, 231-243 (2015).
16. W. C. Wheeler, Systematics: a course of lectures (John Wiley & Sons, Hoboken, New Jersey, 2012).
17. W. C. Wheeler, The triangle inequality and character analysis. Mol. Biol. Evol. 10, 707–712 (1993).
Table 1. Fact-checking the narrative of Harish et al.
Fiction: We root trees with the “outgroup comparison method” that uses an “hypothetical pseudo-outgroup”, i.e. “an artificial all-zero taxon”.
Fact: We do not use indirect character polarization methods of rooting that rely on outgroup taxa. Instead, we root trees with direct methods, which do not use outgroups or pseudo-outgroups and avoid auxiliary and ad hoc assumptions (1). We note that the root branch serves as a basal character state vector for the time arrow (9). Inclusion of outgroups or pseudo-outgroups results in an optimized state-set derived from a more inclusive set of taxon terminals, which changes tree topology, tests of character homology, and the results of phylogenetic analyses.
See Section 1 in main text.
Fiction: Our rooting approach uses “a hypothetical pseudo-outgroup: an artificial construct based on presumed ancestral states, … an ‘all-zero’ or ‘all-absent’ hypothetical ancestor. The Lundberg method involves estimating an unrooted tree for the ingroup taxa only, and then attaching an outgroup … to the tree a posteriori to determine the position of the root node”.
Fact: We do not combine outgroups and ancestors during tree reconstruction. “(The) use of one hypothetical ancestor that combines inferences based on outgroup comparison with those based on other methods of polarizing character transformations to root a cladogram is invalid (8).” Inferences regarding ancestral states apply to the outgroup or ingroup nodes, respectively, and “cannot be combined into a single hypothetical construct (8)”.
Section 1
Fiction: “The state '0' or 'absence' of a superfamily is assumed to be ‘ancestral’ a priori”.
Fact: We first optimize character change in unrooted networks using the Wagner algorithm and then polarize character transformations a posteriori, most parsimoniously, and complying with Weston’s generality criterion. Lundberg simply enables the task.
Section 2
Fiction: “We can expect the ‘all-zero’ pseudo-outgroup to cluster with proteomes described by the largest number of ‘0s’ in the data matrix”.
Fact: During searches of tree space, optimal trees are retained by minimization of character state changes in branches of competing trees (4). In the absence of computational optimization, the topology of rooted trees cannot be predicted from patterns in character state vectors of ingroup or outgroup taxa.
Section 3
Fiction: “The Lundberg method distorts phylogenetic analysis” in ToLs and ToDs. “The rooting in viral lineages is an inevitable consequence of pre-specifying ‘0’ or ‘absence’ as the ancestral state”.
Fact: Character polarization plays no role in defining unrooted tree topology, which cannot be distorted by genome size. Their claim is “erroneous since polarization also involves spread in the nested lineages of the uToLs and is only applied a posteriori, allowing gains and losses throughout branches of the tree” [supplementary text in (1)]. If their claim were true, we should observe a comb-like pattern of organism appearance in the ToL, which is clearly not the case.
Section 3
Fiction: The tree of structural domains (ToD) is “uninterpretable in terms of the definition of the superfamilies which it comprises” and its rooting is “unreliable”.
Fact: Harish et al. confuse the meaning of characters and taxa and extend their unsubstantiated claims to ToDs. To build ToDs, domain homologies (defined with robust hidden Markov models) define families and superfamilies that are used as taxa, while their proteomic abundances are used as characters. ToDs contain significant phylogenetic signal that follows a molecular clock linking the molecular and geological records (12).
Section 4
Fiction: “Inferences based on statistical distributions of superfamilies alone are unconvincing, especially in light of other recent analyses”.
Fact: The other recent analyses that are mentioned (14, 15) failed to couple the comparative genomic studies to phylogenomic reconstructions, which in our case (1) helped weed out evolutionary scenarios that were non-significant or incompatible with reconstructed history.
Section 5
Competing Interests: None declared.
- eLetter Concerning Nasir and Caetano-Anollés’ Conclusions About the Origins of Viruses (15 April 2016)
Nasir and Caetano-Anollés (1) conclude that viruses are an ancient lineage that diverged in parallel with their cellular hosts from a universal common ancestor (UCA). They reiterate their earlier claim (2) that viruses are a unique lineage, which predated or coevolved with the last UCA of cellular lineages (LUCA) through reductive evolution as opposed to recent and multiple origins. The claims are based on statistical and phylogenetic analyses of the genomic distribution patterns of protein domains, classified as superfamilies (SFs) in Structural Classification of Proteins (SCOP) (3). We highlight issues with the phylogenomic approach that is at the heart of these analyses. In particular, we believe that nonspecialists should be aware that conclusions in (1) are subject to a technical artifact of genome content-based phylogenetic analysis dubbed the "small genome attraction" artifact. Key to the conclusion that viruses are an ancient lineage (1) is the specific rooting method, the so-called Lundberg rooting method, used to convert unrooted trees into rooted phylogenies, a necessary step in reconstructing the hypothetical common ancestors at the nodes of branching points in a tree. The Lundberg rooting method employed in (1) distorts the phylogenetic analyses in (i) the tree of proteomes (ToP), which represents the universal tree of life (ToL), and (ii) the tree of domains (ToD), which is used to determine the relative age of SCOP-SFs.
(i) Rooting the ToP or the ToL: The most common rooting method is the outgroup comparison method, which is based on the premise that homologous features (character-states) common to the ingroup (study group) and a closely related sister-group (the outgroup) are likely to be ancestral. Therefore, in an unrooted tree the root is on the branch that connects the outgroup to the ingroup and the tree is directed a posteriori (4). However, there are no known outgroups for the ToL. In the absence of natural outgroups, pseudo-outgroups are used to root the ToL (4). Nasir and Caetano-Anollés use a hypothetical pseudo-outgroup: an artificial construct based on presumed ancestral states. For each character (SF), the state '0' or 'absence' of a SF is assumed to be “ancestral” a priori. This artificial “all-zero” taxon is used as an outgroup to root the ToL. Further, they use the Lundberg rooting method, where outgroups are not included in the tree construction. The Lundberg method involves estimating an unrooted tree for the ingroup taxa only, and then attaching outgroup(s) (when available) or a hypothesized ancestor to the tree a posteriori to determine the position of the root node. Unrooted trees describe relatedness of taxa based on graded compositional similarities of characters (and states). Accordingly, we can expect the “all-zero” pseudo-outgroup to cluster with genomes (proteomes) in which the fewest SFs are present (proteomes described by the largest number of ‘0s’ in the data matrix). The use of artificial outgroups is not uncommon in rooting experiments when rooting is ambiguous (5). Artificial taxa are either an all-zero outgroup or outgroups constructed by randomizing characters and/or character-states from real taxa. Although rooting experiments with multiple real outgroups, or randomized artificial outgroups that simulate loss of phylogenetic signal, can minimize the ambiguity in rooting (5), the all-zero outgroup has proved to be of little use (4). Conclusions based on an all-zero outgroup are often refuted when empirically grounded analyses with real taxa are carried out (4).
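The compositional-similarity expectation stated above can be illustrated with a toy presence/absence matrix. The sketch below only compares raw character vectors with a Manhattan distance; it performs no tree search or character optimization, and the proteome names and SF profiles are hypothetical.

```python
def manhattan(u, v):
    """Number of character-state differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical presence (1) / absence (0) profiles over five SFs.
proteomes = {
    "small": [1, 0, 0, 0, 0],
    "medium": [1, 1, 1, 0, 0],
    "large": [1, 1, 1, 1, 1],
}
all_zero = [0] * 5  # the artificial "all-zero" taxon

distances = {name: manhattan(vec, all_zero) for name, vec in proteomes.items()}
nearest = min(distances, key=distances.get)
```

For binary characters, the distance to the all-zero vector is simply the number of SFs present, so the smallest profile is the nearest one in this raw comparison.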
(ii) Rooting the ToD: Rooting the ToD (1), which describes the evolution of individual protein domain-SFs, is also unreliable. In addition to the rooting practice described above, the foundational assumption behind the ToD (that all SFs are related to one another by common ancestry in the same way that organismal species are related) ignores and is contradicted by empirical data describing the evolution of protein folding and tertiary structure (6). Indeed, this assumption contradicts the basis upon which SFs are classified in the SCOP hierarchy (3). The existence of evolutionary relationships between SFs with different folds in SCOP is not established (3, 7). Homology of members ‘within' a SCOP SF or family can be ascertained based on similarity of sequence, structure and function. The ToD is therefore uninterpretable in terms of the definition of the SFs which it comprises.
(iii) Statistical distribution patterns of SFs in genomes: Unlike phylogenetic trees that describe the evolution of individual proteomes, Venn diagrams and summary statistics of SF sharing among groups of proteomes only depict generalized trends. Multiple scenarios can be invoked a priori to explain the trends without any phylogenetic analyses. Although such patterns are suggestive, they are speculative (8, 9). Thus the authors’ inferences (1) based on statistical distributions of SFs alone are unconvincing, especially in light of other recent analyses of the trends in SF distributions (8, 9).
In summary, the models of SF evolution used in (1) bring into question the level of support for the conclusions. The rooting approach and phylogenomic analyses are subject to technical artifacts; the “all-zero” or “all-absent” hypothetical ancestor is neither empirically grounded nor biologically meaningful. Consequently, the phylogenomic analysis in (1) is neither a test nor a confirmation of hypotheses and scenarios that can be invoked a priori based on statistical distribution patterns of protein-domain SFs (8, 9). The rooting in viral lineages is an inevitable consequence of pre-specifying ‘0’ or ‘absence’ as the ancestral state. Although we hope, as the authors do, that the proposal for an ancient viral supergroup might initiate fruitful debate (1), we fear that it may risk adding mud to the waters of an already cloudy field.
Ajith Harish1* Aare Abroi2 Julian Gough3 Charles Kurland4 1Structural and Molecular Biology Group, Department of Cell and Molecular Biology, Biomedical Center, Uppsala University, 751 24 Uppsala, Sweden; 2Estonian Biocentre, Riia 23, Tartu, 51010, Estonia; 3Computational Genomics Group, Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, UK; 4Microbial Ecology, Department of Biology, Lund University, Ecology Building, SE-223 62 Lund, Sweden.
References
1. A. Nasir, G. Caetano-Anollés, A phylogenomic data-driven exploration of viral origins and evolution. Science Advances 1, e1500527 (2015).
2. A. Nasir, K. Kim, G. Caetano-Anollés, Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya. BMC Evolutionary Biology 12, 156 (2012).
3. A. G. Murzin, S. E. Brenner, T. Hubbard, C. Chothia, SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247, 536-540 (1995).
4. W. C. Wheeler, Systematics: A course of lectures. (John Wiley & Sons, 2012).
5. S. W. Graham, R. G. Olmstead, S. C. H. Barrett, Rooting Phylogenetic Trees with Distant Outgroups: A Case Study from the Commelinoid Monocots. Molecular Biology and Evolution 19, 1769-1781 (2002).
6. P. G. Wolynes, Evolution, energy landscapes and the paradoxes of protein folding. Biochimie 119, 218-230 (2015).
7. J. Gough, K. Karplus, R. Hughey, C. Chothia, Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology 313, 903-919 (2001).
8. A. Abroi, J. Gough, Are viruses a source of new protein folds for organisms? - Virosphere structure space and evolution. BioEssays 33, 626-635 (2011).
9. A. Abroi, A protein domain-based view of the virosphere-host relationship. Biochimie 119, 231-243 (2015).
Competing Interests: None declared.