The dark matter of the genome

The human genome has three billion base pairs, but only about two per cent of it codes for proteins. In a two-part series, Pawan Dhar tries to understand what the remaining bulk of the human genome is doing. Is it a genetic graveyard or a cryptic instruction manual that ensures the survival of the species?

doi:10.1038/nindia.2013.10 Published online 24 January 2013

The recent ENCODE (Encyclopedia of DNA Elements) project has thrown new light on the dark matter of the genome, traditionally labelled junk. It reports that about 80 per cent of the human genome, the vast majority of it non-protein-coding, shows some biochemical activity, much of it affecting the expression of genes in the neighbourhood.

The dark matter

Susumu Ohno introduced the term 'junk DNA' in 1972 to describe DNA sequences that did not encode proteins. Over time, 'junk' was reworded to 'non-coding', a label for sequences that may code for RNA but not for protein molecules.

Despite advances in genome sequencing, most of the human genome remained uncharacterized. Only a tiny two per cent of the genome was found to 'do something'. The bulk, approximately 98 per cent of the genome, had no clear function and came to be called 'dark matter'.

This view of the genome persisted for many years. Owing to a paucity of relevant data, the 'dark matter of the genome' was considered an evolutionary relic. Things changed when the Human Genome Project was launched.

The Human Genome Project

The project resulted in a high-quality map of the human genome showing about 20,500 protein-coding regions. The genomic landscape had an extensive arrangement of duplicated regions. The dark matter had just started to become visible.

However, another major effort was needed to make sense of the large non-protein-coding tracts of the human genome. This led to the launch of ENCODE, the "Encyclopedia of DNA Elements".

The ENCODE project

The goal was to catalogue every nucleotide in the genome that is doing "something"! This was the first major effort to systematically study the dark matter at very high resolution.

It involved more than 440 scientists from 32 groups around the world, who studied 147 cell types in 1,648 experiments over the past five years¹. The findings appeared in more than 30 scientific papers. Some of the ENCODE results were unexpected, while others were along expected lines. In summary:

1. A large number of non-protein-coding regions were found to control protein-coding genes. These regions also showed strong association with disease outcomes.

2. Based on experimental evidence, at least 10,000 highly conserved elements were believed to be involved in regulating protein synthesis.

3. More than 1,000 new families of functionally distinct RNA secondary structures were reported.

4. Two million new potential targets for transcription factors were identified.

5. Pseudogenes, historically regarded as the fossils of the genome, seemed to produce many non-coding RNAs.

6. Regulatory controls of gene expression that were carefully crafted millions of years ago still seemed to be active in human cells.


Though the entire breadth and depth of human genome biology is still unclear, the ENCODE resource provides a higher-resolution map than was previously available. It is interesting that a large part of the genome appears to be made of switches that control protein-coding gene expression. In future, the scientific community will use this treasure trove to understand the origin and progression of disease.

The ENCODE findings also tell us that the genetic inheritance of traits is not a straightforward story of dominance and recessiveness!


We now appreciate that the mysterious dark matter of the genome is the ultimate key to understanding how the whole cell operates. With the ENCODE results, the functionality of the dark matter is clearer. In future, the "gene expression-phenotype association" stories are likely to get greater representation from the RNA-coding elements.

On a different note, the massive presence of non-coding DNA does not explain its role. For example, it is not clear why non-coding DNA so far exceeds protein-coding DNA in the genome. Why are there far more switches than protein-coding elements?

In future, the ENCODE map may help identify evolutionarily conserved mutations that are most likely responsible for the origin and progression of disease. This has direct implications for the clinical setting and for finding novel drug targets.

Pharma companies are now likely to look keenly at the non-coding genome as a potential source of new drug targets!


  1. Dunham, I. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).