Yesterday a paper (Exonic Transcription Factor Binding Directs Codon Choice and Affects Protein Evolution) from John Stamatoyannopoulos’s lab at the University of Washington was published in Science. The authors claim: “We found that ~15% of human codons are dual-use codons (‘duons’) that simultaneously specify both amino acids and TF recognition sites. Duons are highly conserved and have shaped protein evolution, and TF-imposed constraint appears to be a major driver of codon usage bias.” For the non-scientist reader, this means they claim that a portion of the human genome that has one function (to code for proteins) also has a second, unrelated function (to be bound by a special class of proteins called transcription factors (TFs), which control which regions of the genome are activated). They call these dual-function regions “duons”, and also claim that the second function constrains how the first function is achieved and how it can evolve. When I read the paper, my gut reaction was outrage.
I was happy to see that most of the Twitter community seemed to share my outrage, although it slowly dawned on me that this outrage was mainly due to the complete über-hype of the UW press release (nicely summarized here), as well as of the paper itself: in essence, the finding that “duons” exist is not novel. It was shown before in viruses, and over the last few years a couple of papers have demonstrated regulatory regions overlapping with coding regions in mammals as well (e.g. these papers by Lin et al. in 2011 or Birnbaum et al. in 2012). To be fair, the Stergachis paper and the accompanying perspective do reference these earlier findings, though maybe not very prominently. The perspective also points out that there’s a whole range of other constraints that can affect the coding sequence.
But for me, the problem is more fundamental. It probably has to do with a phenomenon called “cargo cult science” or confirmation bias, which basically means that we see what we want to see. Now, personally, I believe that mammalian genomes are actually quite tolerant of all kinds of alterations. I think that there’s a lot going on at the molecular level that has little consequence for physiology. This makes it possible for a huge amount of variation to occur, most of which has no function most of the time, although in adverse conditions some of these variants may be beneficial. I think there’s also plenty of circumstantial evidence in support of this view. We know that a large proportion of the genome is transcribed in one tissue or another (and for many of the transcripts it’s unclear whether they have any function; see e.g. this review). We know that for most TFs, binding is primarily determined by whether or not the binding region is accessible. It has also been shown that most TF binding sites show pretty rapid evolutionary turnover, and it is arguable whether TF binding per se has any function.
Using that as a starting point, my read of the Stergachis paper is that it simply shows that many TF binding sites overlap coding regions. All the rest is questionable. Here are only a few examples:
(1) They state that “Approximately 14% of all human coding bases contact a TF in at least one cell type” and “[t]he exonic TF footprints we observed likely underestimate the true fraction of protein-coding bases that contact TFs”. They also report that a different method detects 7-12-fold more coding footprints per cell type than their conservative measure. But then they examine at great length how TF binding constrains codon usage (i.e. how the first function is achieved) and evolutionary conservation, relative to regions not bound by TFs. However, if they have only recovered a fraction of the coding TF sites, wouldn’t that suggest that most of the coding sequence is also a TF binding site in one tissue or another, and thus that it’s difficult to find unbound regions to compare to? The authors briefly address this question, but their reasoning is mainly speculative.
(2) They fail to show any actual functional consequences of having TF binding sites in coding sequence. They show that some single-nucleotide variants (SNVs) exist which disrupt TF occupancy, but they do not show that this actually has any effect on gene regulation (remember, that’s what TFs are supposed to do). In technical terms, they do not show that SNVs overlapping “duons” are also eQTLs, that is, variants associated with differences in gene expression. And even if they did, it would still be tough to establish a link from expression to physiological function.
(3) They show that TF binding “parallels the extent of CpG methylation at their binding site” and that there’s also a correlation with the expression level of the gene. Now, CpG methylation is known to correlate well with gene expression and with DNA accessibility, and as I mentioned earlier, it’s been shown that TFs will bind to regions as long as they are accessible. So these correlations may simply reflect accessibility rather than any regulatory role of the coding TF sites.
(4) In some places I also simply don’t understand the logic of the paper. For example, they show that TFs avoid stop codons. Their explanation: “If TFs, through selective recognition sequences, could impose changes in protein sequence, deleterious consequences could arise if such changes resulted in a nonsense substitution”. But surely, if TF binding imposes additional constraint on the coding sequence, and you have an important region like a stop codon, wouldn’t it be beneficial to have a TF bind there and increase the constraint?
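To make point (2) concrete, here is a minimal sketch of the kind of eQTL test I would have liked to see: regress expression on genotype at a single SNV and ask, by permutation, whether the slope is larger than expected by chance. The function names and all numbers below are invented for illustration and are not taken from the paper; a real eQTL analysis would also correct for covariates, population structure and multiple testing.

```python
import random


def regression_slope(genotypes, expression):
    """Least-squares slope of expression on genotype (0, 1 or 2 alternate alleles)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("all individuals share one genotype; nothing to test")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    return sxy / sxx


def permutation_p_value(genotypes, expression, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the genotype-expression slope."""
    rng = random.Random(seed)
    observed = abs(regression_slope(genotypes, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling expression breaks any genotype-expression link.
        rng.shuffle(shuffled)
        if abs(regression_slope(genotypes, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: 12 individuals, one SNV inside a "duon",
# expression of the host gene in arbitrary units.
example_genotypes = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
example_expression = [5.1, 4.8, 5.3, 5.0, 6.2, 5.9, 6.1, 6.4, 7.0, 7.3, 6.8, 7.1]

if __name__ == "__main__":
    print("slope:", regression_slope(example_genotypes, example_expression))
    print("p-value:", permutation_p_value(example_genotypes, example_expression))
```

An SNV that disrupts TF occupancy but shows no such association with expression would be hard to call functional in the regulatory sense.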
This is only a subset of the issues that caught my eye. Collectively, I have a much less glamorous explanation for their findings: TFs have a certain repertoire of preferred binding sites, to which they will bind as long as the sites are accessible, regardless of whether or not they lie in a coding region. Considering that the paper shows that TFs bind to the same kind of sequence inside and outside of coding regions, and that the regions inside of genes seem to follow similar rules to the regions outside, I see no reason why the sites inside of genes should be different from (or more important than) the ones outside. And if TF binding sites outside of genes show large turnover, why should there be any increased requirement to maintain them inside of genes? And if their turnover within genes is no more restricted than outside, why would they in turn restrict codon usage? Sure, there is definitely additional pressure on codon usage, but whether TF binding contributes as much as the authors state is debatable.
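To put rough numbers on the background-set worry from point (1), here is a toy calculation. It rests on assumptions that are mine, not the paper’s: footprints fall independently in each cell type, a k-fold increase in footprints per cell type means a k-fold increase in the fraction of coding bases covered, and the number of cell types (20) is purely hypothetical. Only the 14% figure and the 7-12-fold range come from the paper.

```python
def union_coverage(per_cell_fraction, n_cell_types):
    """Fraction of bases footprinted in at least one cell type.

    Assumes footprints are placed independently in each cell type,
    which is a deliberate oversimplification.
    """
    p = min(max(per_cell_fraction, 0.0), 1.0)  # clamp to a valid fraction
    return 1.0 - (1.0 - p) ** n_cell_types


def per_cell_fraction_from_union(union_fraction, n_cell_types):
    """Invert union_coverage: per-cell-type fraction implied by a given union."""
    return 1.0 - (1.0 - union_fraction) ** (1.0 / n_cell_types)


def rescaled_union(union_fraction, n_cell_types, fold_increase):
    """Union coverage if every cell type had fold_increase times more coverage."""
    p = per_cell_fraction_from_union(union_fraction, n_cell_types)
    return union_coverage(p * fold_increase, n_cell_types)


if __name__ == "__main__":
    N_CELL_TYPES = 20  # hypothetical; chosen only for illustration
    for fold in (1, 7, 12):
        covered = rescaled_union(0.14, N_CELL_TYPES, fold)
        print(f"{fold:>2}-fold: {covered:.2f} of coding bases covered")
```

Under these toy assumptions, both the 7-fold and the 12-fold case put well over half of all coding bases under a footprint in at least one cell type. The numbers should not be taken literally, but they illustrate why a clean set of “non-TF” coding bases may be hard to come by.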
Please consider my analysis critically. I am likely subject to cargo cult science just as much as Stergachis et al., and I guess it’ll take some more research to figure out whose take on genome organisation and evolutionary constraints is closer to the truth.
Andrew B. Stergachis, Eric Haugen, Anthony Shafer, Wenqing Fu, Benjamin Vernot, Alex Reynolds, Anthony Raubitschek, Steven Ziegler, Emily M. LeProust, Joshua M. Akey, & John A. Stamatoyannopoulos (2013). Exonic Transcription Factor Binding Directs Codon Choice and Affects Protein Evolution. Science. DOI: 10.1126/science.1243490