I decided to deviate from the regular fluorescent content and discuss a new paper just published in Nature, entitled “Proto-genes and de novo gene birth”.
Once upon a time, life was simple. We knew that we had X genes; that a gene is transcribed into a messenger RNA (mRNA), and that this mRNA is then translated into a protein. The rest of the genome was considered “junk DNA”. That view was the dogma not too long ago. I still learned about “junk DNA” in my undergrad studies, just 15 years ago.
The birth of a gene was attributed mostly to gene duplications, recombination events or horizontal transfer. On rare occasions, a gene with a new function is born out of… the junk? It was not clear.
Then, we learned that this “junk” actually contains a lot of regulatory sequences. But these were still “just” DNA sequences, no more. Only genes are transcribed and translated. Until we started to realize that, actually, a very large proportion of the genome is transcribed. Some people even said that every nucleotide is transcribed into RNA. This RNA was first called “transcriptional noise”. Then, we began to realize that at least some of this noise, or “non-coding” RNAs (ncRNAs), has regulatory roles: they can regulate transcription, they can regulate translation, they can regulate RNA degradation, they can regulate localization etc…
Still, they remained “non-coding” (i.e. not coding for a protein), with researchers concentrating mostly on the “long non-coding” RNAs (usually >300 nucleotides) rather than on short sequences. Very rarely, some ncRNA was shown to associate with ribosomes, or was suggested to code for short peptides with some biological function.
Now, Carvunis et al. go the extra kilometer and suggest a new evolutionary model of how new genes are created.
They looked at the budding yeast (S. cerevisiae) genome and found that the yeast has ~6000 known annotated genes, and ~261,000 unannotated open reading frames (ORFs). An ORF is characterized by a start codon, a stop codon and at least one codon between them. The original cut-off when the yeast genome was annotated was ~300 nucleotides (nt). Though most of the unannotated ORFs are shorter than 300 nt, ~1800 of them are longer than 300 nt.
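To make the ORF definition above concrete, here is a minimal Python sketch of an ORF scan. It is only an illustration, not the pipeline used in the paper: the function name `find_orfs` and its parameter are my own, it assumes ATG as the only start codon, and it scans only the three forward-strand reading frames (a real scan would also cover the reverse complement).

```python
# Minimal ORF scan: start codon, at least one codon, then a stop codon.
# Assumptions (not from the paper): ATG is the only start codon, only the
# forward strand is scanned, and the first ATG after a stop opens the ORF.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons_between=1):
    """Return (start, end, frame) tuples; `end` is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open start codon, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i
            elif codon in STOP_CODONS:
                # Number of codons strictly between the start and the stop.
                n_between = (i - start) // 3 - 1
                if n_between >= min_codons_between:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next start
    return orfs


if __name__ == "__main__":
    # ATG AAA TAG sits in frame 2 of this toy sequence.
    for start, end, frame in find_orfs("CCATGAAATAGCC"):
        print(f"frame {frame}: {start}-{end}, {end - start} nt")
```

Under this definition the shortest possible ORF is 9 nt (start, one codon, stop), so a sequence like ATGTAA does not count. Raising `min_codons_between` gives a length cut-off of the kind used for annotation.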
They further classify the ORFs based on evolutionary conservation between 10 yeast species. Classification is as follows: ORFs0 (~108,000 ORFs) refers to ORFs of 30 nt or more that do not overlap annotated genes and that were found only in S. cerevisiae. ORFs1 are annotated ORFs (~2% of annotated ORFs) that are unique to S. cerevisiae, and ORFs2-4 are conserved only in the Saccharomyces sensu stricto species. ORFs5-10 are conserved among different species of the Ascomycota phylum. Most of the annotated ORFs (88%) belong to ORFs5-10, and 12% belong to ORFs1-4.
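The grouping described above can be sketched as a simple binning of a conservation level from 0 to 10. This is only a rough illustration: how the paper actually assigns a level to each ORF is not modeled here, the function names are hypothetical, and the levels in the usage example are made-up toy values, not data from the paper.

```python
# Bin a conservation level (0-10) into the groups described in the post.
# The level itself is taken as given; assigning it from sequence
# comparisons across species is outside this sketch.
def orf_group(level):
    """Map a conservation level to its group label."""
    if not 0 <= level <= 10:
        raise ValueError("conservation level must be between 0 and 10")
    if level == 0:
        return "ORFs0"      # unannotated, found only in S. cerevisiae
    if level == 1:
        return "ORFs1"      # annotated, unique to S. cerevisiae
    if level <= 4:
        return "ORFs2-4"    # conserved only within Saccharomyces sensu stricto
    return "ORFs5-10"       # conserved more broadly within Ascomycota


def annotated_shares(levels):
    """Fraction of annotated ORFs (level >= 1) in ORFs1-4 and in ORFs5-10."""
    annotated = [lv for lv in levels if lv >= 1]
    young = sum(1 for lv in annotated if lv <= 4)
    return {
        "ORFs1-4": young / len(annotated),
        "ORFs5-10": (len(annotated) - young) / len(annotated),
    }


if __name__ == "__main__":
    toy_levels = [0, 0, 1, 3, 5, 6, 7, 8, 9, 10]  # hypothetical values
    print([orf_group(lv) for lv in toy_levels])
    print(annotated_shares(toy_levels))
```

Run on the real conservation levels, `annotated_shares` would be expected to give roughly the 12% vs. 88% split quoted above.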
They then go on to compare ORFs0-4 to ORFs5-10. They show that, on average, ORFs0-4 are shorter, less expressed, and show less codon usage bias. ORFs1-4 show intermediate amino-acid composition (compared to ORFs0 and ORFs5-10), and their hypothetical protein products are more hydrophobic than those of ORFs5-10. They then show that ~1% of ORFs0 are associated with ribosomes (which suggests translation, though they did not prove that).
Importantly, they see differential expression and translation in rich vs. starvation conditions.
I will not go on to describe more of their data – they have plenty of it and all of their data support a model of continuous evolution from non-genic ORFs, through proto-genes to genes. Genes then “die” as pseudo-genes (the common dogma suggests that genes that lost their coding ability become pseudo-genes, which over time will either revert to genes, or become indistinguishable from non-genic ORFs).
What does this mean?
According to the new model, any region in the genome can become a proto-gene, i.e. it can, by chance, acquire start and stop codons and get expressed and/or translated. If the expression of that proto-gene has some advantage, maybe under some unique environmental conditions, this ORF will be retained over generations, and conserved when species diverge. By chance, short ORFs can become longer (e.g. by elimination of a stop codon, by a frameshift etc…). The more advantageous it is, the more likely it is to gain other advantageous mutations that will increase its expression, its codon usage bias, its translatability, the stability of its protein etc…
At some point, this proto-gene becomes a “gene”, by our standards.
Their analysis suggests that more new genes are formed by this process than by gene duplication.
If their model is further developed and strengthened, this could be the “missing link” in molecular evolution. If laboratory experiments show that such proto-genes can actually become bona fide genes, it will be a blow to a lot of anti-evolutionists' claims that proteins such as we see today could not have been created by evolution because they are too complex and too efficient, and that there are no known “proto-genes” they could have evolved from.
Not only that. We molecular and cell biologists will now have to take this new type of “genes” into consideration whenever we do an experiment. Maybe a certain proto-gene contributed to the phenotype?
The future is going to be much more exciting!
Carvunis AR, Rolland T, Wapinski I, Calderwood MA, Yildirim MA, Simonis N, Charloteaux B, Hidalgo CA, Barbette J, Santhanam B, Brar GA, Weissman JS, Regev A, Thierry-Mieg N, Cusick ME, & Vidal M (2012). Proto-genes and de novo gene birth. Nature, 487, 370-374 PMID: 22722833