Instances of Orphan Coding Sequences (“New Genes” and New Exons) Discovered in Sequenced Genomes
(Long, Deutsch et al. 2003; Kaessmann 2010; Guerzoni and McLysaght 2011; Tautz and Domazet-Loso 2011; Carvunis, Rolland et al. 2012; Chen, Krinsky et al. 2013; Long, VanKuren et al. 2013; Neme and Tautz 2013; Light, Basile et al. 2014; Andersson, Jerlstrom-Hultqvist et al. 2015; McLysaght and Guerzoni 2015; Basile, Sachenkova et al. 2017; Schmitz and Bornberg-Bauer 2017)

Genome Source


Viruses (bacteriophages) “Almost one-third of all ORFs in 1,456 complete virus genomes correspond to ORFans, a figure significantly larger than that observed in prokaryotes… 38.4% of phage ORFs have no homologs in other phages, and 30.1% have no homologs neither in the viral nor in the prokaryotic world…”

(Yin and Fischer 2008)

“Virus de novo genes…in which an existing gene has been "overprinted" by a new open reading frame, a process that generates a new protein-coding gene overlapping the ancestral gene… young de novo genes have a different codon usage from the rest of the genome…evolve rapidly and are under positive or weak purifying selection...In contrast to young de novo genes, older de novo genes have a codon usage that is similar to the rest of the genome.”

(Sabath, Wagner et al. 2012; Pavesi, Magiorkinis et al. 2013)

Prokaryotes (20,000 orphan sequences) “…only 2.8% of all microbial ORFans have detectable homologs in viruses, while the percentage of non-ORFans with detectable homologs in viruses is 7.9%, a significantly higher figure.”

(Yin and Fischer 2006)

Escherichia coli O157:H7 (EHEC): “72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo”…“origin of a new gene through overprinting in Escherichia coli K12.”

(Delaye, Deluna et al. 2008; Neuhaus, Landstorfer et al. 2016)

Yeast – “BSC4 in Saccharomyces cerevisiae…encoding a 132-amino-acid-long peptide… no homologous ORF in…closely related species …Because the corresponding noncoding sequences in S. paradoxus, S. mikatae, and S. bayanus also transcribe, we propose that a new de novo protein-coding gene may have evolved from a previously expressed noncoding sequence.”

(Cai, Zhao et al. 2008)


Yeast and Drosophila protein C-termini - the co-option of short segments of noncoding sequence into the C-termini of existing proteins via the loss of a stop codon “…54 examples of C-terminal extensions in Saccharomyces and 28 in Drosophila, all of them recent enough to still be polymorphic…Four of the Saccharomyces C-terminal extensions (to ADH1, ARP8, TPM2, and PIS1)…are predicted to lead to significant modification of a protein domain structure.”

(Andreatta, Levine et al. 2015)

Green algae Chlamydomonas reinhardtii and Volvox carteri: PHD domain added to condensin II by exonization of mobile DNA sequences; “…141 retrogene candidates in total in both genomes, with their fraction being significantly higher in the multicellular Volvox. Majority of the retrogene candidates showed signatures of functional constraints, thus indicating their functionality. Detailed analyses of the identified retrogene candidates, their parental genes, and homologs of both, revealed that most of the retrogene candidates were derived from ancient retroposition events in the common ancestor of the two algae and that the parental genes were subsequently lost from the respective lineages, making many retrogenes 'orphan'.”

(Jakalski, Takeshita et al. 2016; Philippsen, Avaca-Crusca et al. 2016)

Plasmodium vivax “…recent de novo origin of at least 13 protein-coding genes in the genome of Plasmodium vivax… five of the genes identified in our analysis contain introns…likely evolved from previously intergenic regions together with the coding sequences.”

(Yang and Huang 2011)

Nematode Pristionchus pacificus: “3818-7545 (39-76 %) of orphan genes are under negative selection”

(Prabh and Rodelsperger 2016)

Drosophila melanogaster: “There is a significant excess of retrogenes that originate from the X chromosome and retropose to autosomes; new genes retroposed from autosomes are scarce. Further, we found that most of these X-derived autosomal retrogenes had evolved a testis expression pattern.”

(Betran, Thornton et al. 2002)

Drosophila melanogaster – “142 segregating and 106 fixed testis-expressed de novo genes in a population sample of Drosophila melanogaster…appear to derive primarily from ancestral intergenic, unexpressed open reading frames (ORFs), with natural selection playing a significant role in their spread.”

(Levine, Jones et al. 2006; Zhao, Saelao et al. 2014)

Drosophila melanogaster – “…six putatively protein-coding de novo genes described in Drosophila melanogaster…two de novo genes emerged from novel long non-coding RNAs…four other de novo genes evolved a translated open reading frame and transcription…suggesting that nascent open reading frames (proto-ORFs)…can contribute to the emergence of a new de novo gene... Sequence and structural evolution of de novo genes was rapid compared to nearby genes...”

(Reinhardt, Wanjiru et al. 2013)


Insects: “…comparing 30 arthropod genomes, focusing in particular on seven recently sequenced ant genomes…comparison between social Hymenoptera (ants and bees) and nonsocial Diptera (flies and mosquitoes)…recently split lineages undergo accelerated genomic reorganization, including the rapid gain of many orphan genes…between the two insect orders Hymenoptera and Diptera, orphan genes are more abundant and emerge more rapidly in Hymenoptera, in particular, in leaf-cutter ants. With respect to intragenomic localization, we find that ant orphan genes show little clustering…”

(Wissler, Gadau et al. 2013)

Entelegyne spiders (Araneae, Entelegynae): “…transcriptomes of six entelegyne spider species from three genera (Cicurina travisae, C. vibora, Habronattus signatus, H. ustulatus, Nesticus bishopi, and N. cooperi)… between ~ 550 and 1,100 unique orphan genes were found in each genus.”

(Carlson and Hedin 2017)

Rodent: “75 murine genes (69 mouse genes and 6 rat genes) for which there is good evidence of de novo origin since the divergence of mouse and rat. Each of these genes is only found in either the mouse or rat lineages, with no candidate orthologs nor evidence for potentially-unannotated orthologs in the other lineage…For 11 of the 75 candidate novel genes we could identify a mouse-specific mutation that led to the creation of the open reading frame (ORF) specifically in mouse…A large number of them (51 out of 69 mouse genes and 3 out of 6 rat genes) also overlap with other genes, either within introns, or on the opposite strand.”

(Murphy and McLysaght 2012)

Mouse and human: “over 5000 new genes were integrated into the ancestral GGI {gene-gene interaction} networks of human and mouse; new genes gradually acquire increasing number of gene partners; some human-specific genes evolved into hub structure with critical phenotypic effects.”

(Zhang, Landback et al. 2015)

Primates: “an unexpected important role of transposable elements in the formation of novel protein-coding genes in the genomes of primates.”

(Toll-Riera, Castelo et al. 2009)

Human and Chimpanzee: “…retrocopies of coding transcripts to generate proteins with novel N-terminal domains. Examples include thymopoietin beta (TMPO), eukaryotic translation initiation factor 3 subunit 5 (EIF3F), and the 5'-inverted retrocopy of small nuclear ribonucleoprotein polypeptide N (SNRPN).”

(Kojima and Okada 2009)

Humans: “…one human-specific de novo protein-coding gene, FLJ33706 (alternative gene symbol C20orf203)…originated from noncoding DNA sequences: insertion of repeat elements especially Alu contributed to the formation of the first coding exon and six standard splice junctions on the branch leading to humans and chimpanzees, and two subsequent substitutions in the human lineage escaped two stop codons and created an open reading frame of 194 amino acids.”

(Li, Zhang et al. 2010)

Humans: “…we identify 60 new protein-coding genes that originated de novo on the human lineage since divergence from the chimpanzee…RNA-seq data indicate that these genes have their highest expression levels in the cerebral cortex and testes…”

(Wu, Irwin et al. 2011)

Humans: “we identified 24 hominoid-specific de novo protein-coding genes with precise origination timing in vertebrate phylogeny… most of the hominoid-specific de novo protein-coding genes encoded polyadenylated non-coding RNAs in rhesus macaque or chimpanzee with a similar transcript structure and correlated tissue expression profile…the majority of these hominoid-specific de novo protein-coding genes appear to have acquired a regulated transcript structure and expression profile before acquiring coding potential…the coding genes in human often showed higher transcriptional abundance than their non-coding counterparts in rhesus macaque.”

(Xie, Zhang et al. 2012)

Humans: “the de novo origin of at least three human protein-coding genes since the divergence with chimp…chimp, gorilla, gibbon, and macaque share the same disabling sequence difference, supporting the inference that the ancestral sequence was noncoding…We estimate that 0.075% of human genes may have originated through this mechanism leading to a total expectation of 18 such cases in a genome of 24,000 protein-coding genes.”

(Knowles and McLysaght 2009)

Humans: “We have found 426 different annotated young domains, totalling 995 domain occurrences, which represent about 12.3% of all human domains. We have observed that 61.3% of them arose in newly formed genes, while the remaining 38.7% are found combined with older domains…Young domains are preferentially located at the N-terminus of the protein… young domains show significantly higher non-synonymous to synonymous substitution rates than older domains…”

(Toll-Riera and Alba 2013)

Arabidopsis: “…lineage-specific genes within the nuclear (1761 genes) and mitochondrial (28 genes) genomes are identified…Almost a quarter of lineage-specific genes originate from non-lineage-specific paralogs, while the origins of ~10% of lineage-specific genes are partly derived from DNA exapted from transposable elements (twice the proportion observed for non-lineage-specific genes). Lineage-specific genes are also enriched in genes that have overlapping CDS, which is consistent with such novel genes arising from overprinting. Over half of the subset of the 958 lineage-specific genes found only in Arabidopsis thaliana have alignments to intergenic regions in Arabidopsis lyrata, consistent with either de novo origination or differential gene loss and retention… lineage-specific genes have high tissue specificity and low expression levels across multiple tissues and developmental stages. Finally, stress responsiveness is identified as a distinct feature of Brassicaceae-specific genes; where these LSGs are enriched for genes responsive to a wide range of abiotic stresses…”

(Donoghue, Keshavaiah et al. 2011)

Arabidopsis: “…we find that new genes in plants show a bias in expression to mature pollen, and are also enriched in a gene co-expression module that correlates with mature pollen in Arabidopsis thaliana. Transposable elements are significantly enriched in the new genes, and the high activity of transposable elements in the vegetative nucleus, compared with the germ cells, suggests that new genes are most easily generated in the vegetative nucleus in the mature pollen. We propose an "out of pollen" hypothesis for the origin of new genes in flowering plants.”

(Wu, Wang et al. 2014)





Andersson, D. I., J. Jerlstrom-Hultqvist, et al. (2015). "Evolution of new functions de novo and from preexisting genes." Cold Spring Harb Perspect Biol 7(6).

Andreatta, M. E., J. A. Levine, et al. (2015). "The Recent De Novo Origin of Protein C-Termini." Genome Biol Evol 7(6): 1686-1701.

Basile, W., O. Sachenkova, et al. (2017). "High GC content causes orphan proteins to be intrinsically disordered." PLoS Comput Biol 13(3): e1005375.

Betran, E., K. Thornton, et al. (2002). "Retroposed new genes out of the X in Drosophila." Genome Res 12(12): 1854-1859.

Cai, J., R. Zhao, et al. (2008). "De novo origination of a new protein-coding gene in Saccharomyces cerevisiae." Genetics 179(1): 487-496.

Carlson, D. E. and M. Hedin (2017). "Comparative transcriptomics of Entelegyne spiders (Araneae, Entelegynae), with emphasis on molecular evolution of orphan genes." PLoS One 12(4): e0174102.

Carvunis, A. R., T. Rolland, et al. (2012). "Proto-genes and de novo gene birth." Nature.

Chen, S., B. H. Krinsky, et al. (2013). "New genes as drivers of phenotypic evolution." Nat Rev Genet 14(9): 645-660.

Delaye, L., A. Deluna, et al. (2008). "The origin of a novel gene through overprinting in Escherichia coli." BMC Evol Biol 8: 31.

Donoghue, M. T., C. Keshavaiah, et al. (2011). "Evolutionary origins of Brassicaceae specific genes in Arabidopsis thaliana." BMC Evol Biol 11: 47.

Guerzoni, D. and A. McLysaght (2011). "De novo origins of human genes." PLoS Genet 7(11): e1002381.

Jakalski, M., K. Takeshita, et al. (2016). "Comparative genomic analysis of retrogene repertoire in two green algae Volvox carteri and Chlamydomonas reinhardtii." Biol Direct 11: 35.

Kaessmann, H. (2010). "Origins, evolution, and phenotypic impact of new genes." Genome Res 20(10): 1313-1326.

Knowles, D. G. and A. McLysaght (2009). "Recent de novo origin of human protein-coding genes." Genome Res 19(10): 1752-1759.

Kojima, K. K. and N. Okada (2009). "mRNA retrotransposition coupled with 5' inversion as a possible source of new genes." Mol Biol Evol 26(6): 1405-1420.

Levine, M. T., C. D. Jones, et al. (2006). "Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression." Proc Natl Acad Sci U S A 103(26): 9935-9939.

Li, C. Y., Y. Zhang, et al. (2010). "A human-specific de novo protein-coding gene associated with human brain functions." PLoS Comput Biol 6(3): e1000734.

Light, S., W. Basile, et al. (2014). "Orphans and new gene origination, a structural and evolutionary perspective." Curr Opin Struct Biol 26: 73-83.

Long, M., M. Deutsch, et al. (2003). "Origin of new genes: evidence from experimental and computational analyses." Genetica 118(2-3): 171-182.

Long, M., N. W. VanKuren, et al. (2013). "New gene evolution: little did we know." Annu Rev Genet 47: 307-333.

McLysaght, A. and D. Guerzoni (2015). "New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation." Philos Trans R Soc Lond B Biol Sci 370(1678): 20140332.

Murphy, D. N. and A. McLysaght (2012). "De novo origin of protein-coding genes in murine rodents." PLoS One 7(11): e48650.

Neme, R. and D. Tautz (2013). "Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution." BMC Genomics 14: 117.

Neuhaus, K., R. Landstorfer, et al. (2016). "Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC)." BMC Genomics 17: 133.

Pavesi, A., G. Magiorkinis, et al. (2013). "Viral proteins originated de novo by overprinting can be identified by codon usage: application to the "gene nursery" of Deltaretroviruses." PLoS Comput Biol 9(8): e1003162.

Philippsen, G. S., J. S. Avaca-Crusca, et al. (2016). "Distribution patterns and impact of transposable elements in genes of green algae." Gene 594(1): 151-159.

Prabh, N. and C. Rodelsperger (2016). "Are orphan genes protein-coding, prediction artifacts, or non-coding RNAs?" BMC Bioinformatics 17(1): 226.

Reinhardt, J. A., B. M. Wanjiru, et al. (2013). "De novo ORFs in Drosophila are important to organismal fitness and evolved rapidly from previously non-coding sequences." PLoS Genet 9(10): e1003860.

Sabath, N., A. Wagner, et al. (2012). "Evolution of viral proteins originated de novo by overprinting." Mol Biol Evol 29(12): 3767-3780.

Schmitz, J. F. and E. Bornberg-Bauer (2017). "Fact or fiction: updates on how protein-coding genes might emerge de novo from previously non-coding DNA." F1000Res 6: 57.

Tautz, D. and T. Domazet-Loso (2011). "The evolutionary origin of orphan genes." Nat Rev Genet 12(10): 692-702.

Toll-Riera, M. and M. M. Alba (2013). "Emergence of novel domains in proteins." BMC Evol Biol 13: 47.

Toll-Riera, M., R. Castelo, et al. (2009). "Evolution of primate orphan proteins." Biochem Soc Trans 37(Pt 4): 778-782.

Wissler, L., J. Gadau, et al. (2013). "Mechanisms and dynamics of orphan gene emergence in insect genomes." Genome Biol Evol 5(2): 439-455.

Wu, D. D., D. M. Irwin, et al. (2011). "De novo origin of human protein-coding genes." PLoS Genet 7(11): e1002379.

Wu, D. D., X. Wang, et al. (2014). ""Out of pollen" hypothesis for origin of new genes in flowering plants: study from Arabidopsis thaliana." Genome Biol Evol 6(10): 2822-2829.

Xie, C., Y. E. Zhang, et al. (2012). "Hominoid-Specific De Novo Protein-Coding Genes Originating from Long Non-Coding RNAs." PLoS Genet 8(9): e1002942.

Yang, Z. and J. Huang (2011). "De novo origin of new genes with introns in Plasmodium vivax." FEBS Lett 585(4): 641-644.

Yin, Y. and D. Fischer (2006). "On the origin of microbial ORFans: quantifying the strength of the evidence for viral lateral transfer." BMC Evol Biol 6: 63.

Yin, Y. and D. Fischer (2008). "Identification and investigation of ORFans in the viral world." BMC Genomics 9: 24.

Zhang, W., P. Landback, et al. (2015). "New genes drive the evolution of gene interaction networks in the human and mouse genomes." Genome Biol 16(1): 202.

Zhao, L., P. Saelao, et al. (2014). "Origin and Spread of de Novo Genes in Drosophila melanogaster Populations." Science.