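In 2001 the sequencing of the human genome was declared "complete" (in fact it was only largely complete at the time), revealing the order in which the four letters A, T, G and C are arranged in the human genome. But faced with an English novel, knowing the 26 letters and the order in which they appear in the book is still a long way from understanding what it says. Some further work was done at the time, of course: most of the protein-coding genes in the genome were annotated. In terms of the novel, that is like working out which strings of letters are chapter titles. Reading through the chapter titles gives a rough idea of what the novel is about, but nobody faced with a great book is satisfied with reading only the chapter titles. The same goes for the human genome: knowing the protein-coding genes is far from enough. We also need to know how these genes are regulated, and which other regions of the genome take part in that regulation.
On regulation, here is another example. Human hair grows on the scalp and not on the tongue, yet the cells of both the tongue and the scalp contain the genes for growing hair. Knowing which gene is responsible for growing hair is only the first step; more important is knowing why hair does not grow on the tongue, or in the lungs. Many diseases arise precisely because a gene is expressed in a tissue or cell type where it should not be. Protein-coding DNA sequence makes up only 1.5% of the human genome. How much of the remaining DNA consists of regulatory elements, and how much is useless junk DNA or parasitic genetic elements? To find out, the scientific community launched the ENCODE project in 2003. The name stands for Encyclopedia of DNA Elements, an encyclopedia of the DNA elements in the human genome, and the hope was that building it would comprehensively reveal the functional regions of the human genome.
More than 30 research groups around the world, with over 400 researchers, spent about ten years and several hundred million US dollars to finish the first edition of this encyclopedia (many questions remain unresolved and will need supplements and revised editions in the future). Today the results were published as 30 papers across three journals, Nature, Genome Research and Genome Biology, and further papers are reportedly to follow in journals such as Science and Cell.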
By the numbers
Researchers already knew that 1.5 percent of the genome codes for proteins. ENCODE found that an additional 8.5 percent consists of regions where proteins stick to DNA, presumably regulating gene transcription. And, because ENCODE hasn’t looked at every possible type of cell or every possible protein that sticks to DNA, this figure is likely conservative. Birney estimates that the total proportion of the genome that either creates a protein or sticks to one is around 20 percent.
The rest of the functional elements in the ENCODE analysis cover other classes of sequence that were thought to be essentially functionless, including introns. “The idea that introns are definitely deadweight isn’t true,” said Birney. Even some repetitive sequences—small chunks of DNA that have the ability to copy themselves and are typically viewed as parasites—are likely to be functional, often containing sequences where proteins can bind to influence the activity of nearby genes. Perhaps their spread across the genome represents not the invasion of a parasite, but a way of spreading control. “These parasites can be subverted sometimes,” Birney said.
Birney expects that many skeptics will argue about the exact proportion—the 80 percent of the genome that ENCODE estimates to be doing something—and about the definition of “functional.” But, he said, “no matter how you cut it, we’ve got to get used to the fact that there’s a lot more going on with the genome than we knew.”
What’s in a gene?
The simplistic view of a gene is that it’s a stretch of DNA that is transcribed to make a protein. But with ENCODE’s data, this definition no longer makes sense. There are a lot of transcripts, probably more than anyone had realized, some of which connect two previously unconnected genes. This means that the boundaries for those genes have to widen, and the gaps between them shrink or disappear.
Gingeras says that this “intergenic” space has shrunk by a factor of four. “A region that was once called Gene X is now melded to Gene Y,” he says. With such blurring boundaries, Gingeras thinks that it no longer makes sense to think of a gene as a specific point in the genome, or as its basic unit. Instead, that honor falls to the RNA transcript. “The atom of the genome is the transcript,” says Gingeras. “They are the basic unit that’s affected by mutation and selection.”
New disease leads
For the last decade, geneticists have run a seemingly endless stream of genome-wide association studies (GWAS), and have thrown up a long list of single nucleotide polymorphisms (SNPs) that correlate with the risk of different conditions. The ENCODE team has mapped all of these GWAS-identified SNPs to their data.
The researchers found that just 12 percent of known SNPs lie within protein-coding areas. They also showed that compared to random SNPs, the disease-associated ones are 60 percent more likely to lie within the non-coding but functional regions that ENCODE identified, especially in promoters and enhancers. This suggests that many of these variants are controlling the activity of different genes, and provides many fresh leads for understanding how they affect our risk of disease. “It was one of those too good to be true moments,” said Birney. “Literally, I was in the room [when they got the result] and I went: Yes!”
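To make the "60 percent more likely" comparison concrete, here is a minimal Python sketch of how such an enrichment could be computed: take the fraction of disease-associated SNPs that fall inside annotated regions and divide it by the same fraction for a background set of SNPs. This is not the ENCODE team's actual pipeline. The function names, the coordinates and the SNP lists are all hypothetical, chosen only so that the toy ratio comes out at 1.6; a real analysis would also handle multiple chromosomes and test for statistical significance.

```python
from bisect import bisect_right

def fraction_in_regions(positions, regions):
    """Fraction of positions that fall inside any half-open [start, end) region.

    Assumes a single chromosome and non-overlapping regions (a simplification).
    """
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    ends = [end for _, end in regions]
    hits = 0
    for pos in positions:
        # Index of the last region whose start is <= pos, or -1 if none.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < ends[i]:
            hits += 1
    return hits / len(positions)

def fold_enrichment(test_positions, background_positions, regions):
    """Ratio of the in-region fraction for test SNPs to that for background SNPs."""
    return (fraction_in_regions(test_positions, regions)
            / fraction_in_regions(background_positions, regions))

# Hypothetical toy data: three "functional" regions on one chromosome.
functional_regions = [(100, 200), (500, 600), (900, 1000)]
# Hypothetical disease-associated SNP positions: 8 of 10 lie inside a region.
disease_snps = [120, 150, 180, 550, 580, 910, 950, 990, 300, 700]
# Hypothetical background SNP positions: 5 of 10 lie inside a region.
background_snps = [110, 190, 520, 560, 930, 10, 250, 400, 650, 800]

ratio = fold_enrichment(disease_snps, background_snps, functional_regions)
print(f"fold enrichment: {ratio:.2f}")
```

A ratio of 1.6 in this toy data corresponds to the test SNPs being 60 percent more likely than the background SNPs to land in the annotated regions.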
The ENCODE researchers also found new links between disease-associated SNPs and specific DNA elements. For example, they found five SNPs that increase the risk of Crohn’s disease, and that are recognized by a group of transcription factors called GATA2. “That wasn’t something that the Crohn’s disease biologists had on their radar,” Birney said. “Suddenly we’ve made an unbiased association between a disease and a piece of basic biology.”
“We’re now working with lots of different disease biologists looking at their data sets,” he added. “In some sense, ENCODE is working from the genome out, while GWAS studies are working from disease in.” So far, the team has identified 400 such hotspots that are worth looking into.
The 3-D genome
Writing the genome out as a string of letters invites a common fallacy: that it’s a two-dimensional, linear entity. In reality, DNA is wrapped around proteins called histones like beads on a string. These are then twisted, folded and looped in an intricate three-dimensional way. In this way, distant parts of the genome can actually be physical neighbors, and can affect each other’s activity.
Job Dekker, a bioinformaticist at the University of Massachusetts Medical School, used ENCODE data to map these long-range interactions across just 1 percent of the genome in three different types of cell, and discovered more than 1,000 of them. “I like to say that nothing in the genome makes sense, except in 3D,” said Dekker. The availability of the new ENCODE data is “really a teaser for the future of genome science,” he added.