Junk DNA and ENCODE: Part 1

Jon Peters
Sep 7, 2023
Updated: Aug 15, 2024

"Dear Francis [Crick], I am sure that you realize how frightfully angry a lot of people will be if you say that much of the DNA is junk. The geneticists will be angry because they think that DNA is sacred. The Darwinian evolutionists will be outraged
because they believe every change in DNA that is accepted in evolution is necessarily an adaptive change. To suggest anything else is an insult to the sacred memory of Darwin." ~ Thomas Jukes, 1979

Introduction

Does our genome contain a lot of junk DNA, none, or only some? Or is it mostly junk? To the anti-evolutionist, there must be little to no junk DNA because a creator would not create us that way. So "there is no junk DNA" becomes another assertion that must be defended at any cost in their religious or intelligent design views. I assert that our genome is mostly junk and that we know why and how that happened. The denial of junk DNA/RNA thus joins other claims that rest on a mostly religious presupposition: that there are no transitional fossils, that human evolution did not happen, and that there must have been a historical Adam & Eve to found the human race. All are demonstrably wrong according to scientific findings and other arguments.

In this case, however, the voices crying "no junk DNA" on religious grounds are joined by much of the scientific community, which agrees with the anti-evolutionists on this point. The topic of junk DNA is indeed controversial, unlike evolution. In Part 1 of this two-part blog, I will attempt to summarize the history of the controversy and the components of our genome. In Part 2 I will write about the publications in 2012 that appeared to show we had little genomic junk DNA/RNA and why those were wrong. Both parts are primarily based on Laurence Moran's 2023 book.

DNA/RNA - we need to discuss some basic biology. This is important

Just about everyone has heard of DNA. A related molecule is RNA, which stands for ribonucleic acid. RNA can be thought of as one side of a ladder, whereas DNA is a whole ladder that has been twisted along its central axis. Besides being a double helix rather than a single strand like RNA, DNA has one less oxygen in its sugar, hence its name, deoxyribonucleic acid. The four bases that make up the steps of the twisted ladder are A, T, C, and G. Due to structural constraints and bonding, A bonds with T and C bonds with G to form each “step” of the “ladder”. Additionally, in RNA uracil (U) replaces thymine (T).

In human cells that contain DNA (red blood cells have none, which saves space), the genome is about 3.2 billion base pairs long. When it’s time for a cell to divide, human DNA is packaged like luggage into 46 chromosomes. Moran notes that Chromosome 20, for example, one of the smaller chromosomes, has 60 million base pairs. If we unwind DNA we get a sequence of bases, and the opposite side will have the bases that pair with them as discussed above. For example, a sequence might be …ATCGGATTC… The other side would read …TAGCCTAAG… and thus the sides are said to be complementary. The sides of the ladder, the backbones, run in opposite directions. This was worked out by Watson and Crick and published in 1953, especially after they saw an X-ray photograph of DNA by Rosalind Franklin. By convention, biologists and biochemists write the code in one direction, from what is called the 5’ end to the 3’ end, but that is not important for our purposes.

Not all of your DNA is in the nucleus. Most of your cells contain hundreds of mitochondria, and since these are derived from ancient bacteria that set up a symbiotic relationship with our single-celled ancestors more than a billion years ago (see mitochondria and your mom blog), they have their own DNA. Their DNA is not included when we talk about the DNA in an organism. Most plants have the same situation with chloroplasts, which were derived from cyanobacteria roughly a billion or more years ago.
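The base-pairing rule above is mechanical enough to write down as a few lines of Python. This is only an illustrative sketch: the function names are made up for this post, and the example sequence is the one used in the paragraph above.

```python
# Watson-Crick pairing rules described in the text: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the partner base for each position, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5'-to-3' direction.

    The two backbones run in opposite directions, so the partner strand
    is the complement read backwards.
    """
    return complement(strand)[::-1]

print(complement("ATCGGATTC"))          # TAGCCTAAG
print(reverse_complement("ATCGGATTC"))  # GAATCCGAT
```

The first line of output matches the "other side" given in the text. The second shows the same strand as a biochemist would normally write it, 5' to 3'.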

History

As early as the 1950s scientists knew from staining DNA how much was present, and then it was an easy calculation to find the number of base pairs - 3.2 billion. By 1991, scientists had worked out the approximate amount of human DNA in each chromosome, and the total amount was 3.23 Gb in females and 3.17 Gb in males; the Y chromosome is very small compared to the X (1). Much of the genome consists of highly repetitive DNA, which is difficult to sequence. This is why the first announcements that scientists had sequenced the human genome really referred to very good drafts that in general did not include the highly repetitive DNA. Repetitive DNA is especially common at the ends of chromosomes, called telomeres, and in chromosome areas called centromeres, where spindle fibers attach when duplicated chromosomes are pulled apart during the production of new daughter cells. Although the sequencing of the human genome was announced to great fanfare in 2003, it really wasn’t fully sequenced until 2022.

DNA Makes Various RNAs

When the cell needs to make products, it unwinds some of the double-helical DNA, and on one side the four bases are “read” to make a complementary single strand of RNA. If the RNA is destined to code for a protein, it is called messenger RNA (mRNA) and goes to a factory that assembles a protein from amino acids. Those little factories are called ribosomes and are built largely from ribosomal RNA (rRNA) together with proteins. Amino acids that will make up proteins are brought to the ribosomes by other RNAs called transfer RNAs (tRNA). There are exceptions to the DNA-to-RNA direction. For example, a retrovirus is a type of virus that infects animals and carries its instructions in RNA rather than DNA. To infect a host, it must first convert its RNA to DNA and then insert that DNA into the host’s DNA. HIV is an example of a retrovirus (it goes from RNA to DNA rather than the more common route). Since retroviruses insert randomly into DNA, when we find thousands of identical ones in exactly the same locations in us and the other great apes, especially chimps, they are great proof of human evolution. See the section on ERVs.

There are also other RNAs that do not make proteins. We don’t generally need to know their functions here, but they are also non-coding RNAs. They include 7SL RNA, whose gene gave rise to the ALUs that we will discuss later, as well as snRNAs, snoRNAs, miRNAs, siRNAs, piRNAs and, especially important in the discussion of junk DNA, the lncRNAs. All the non-coding RNA genes, however, add up to only about 5,000 genes in the genome.

Gene

The reading of DNA to produce RNA is called transcription and this is a huge issue with the controversy surrounding junk DNA, especially with the ENCODE researchers which will be discussed in Part 2. To start transcription the cell needs a section that tells an enzyme to start reading the DNA. This binding site is called a promoter. Transcription involves initiation, elongation, and termination. The promoter site is not part of the gene. Sites that control transcription initiation are together called regulatory sequences and can also enhance or inhibit transcription. This will also become important when we discuss the controversy around junk DNA.

At the ribosome, the factory reads the mRNA sequence three bases at a time. An AUG means start making the protein, and three codes (UAA, UAG or UGA) mean stop adding amino acids to the protein.
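To make the start and stop signals concrete, here is a small Python sketch of transcription followed by the ribosome's search for AUG and a stop signal. It is a toy under stated assumptions: the template sequence is invented for illustration, strand direction is ignored (as in the rest of this post), and the function names are hypothetical. Real cells involve much more machinery.

```python
# RNA base that pairs with each DNA base on the template strand (U replaces T in RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

# The three stop signals named in the text.
STOP_CODONS = {"UAA", "UAG", "UGA"}

def transcribe(template_strand: str) -> str:
    """Build the complementary RNA strand from a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template_strand)

def coding_triplets(mrna: str) -> list[str]:
    """Return the three-base groups from the first AUG up to (not including) a stop."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start signal, so nothing is made
    triplets = []
    for i in range(start, len(mrna) - 2, 3):
        triplet = mrna[i:i + 3]
        if triplet in STOP_CODONS:
            break
        triplets.append(triplet)
    return triplets

# Hypothetical template strand, invented for this example.
mrna = transcribe("CCTACCGGAAAATTCG")
print(mrna)                   # GGAUGGCCUUUUAAGC
print(coding_triplets(mrna))  # ['AUG', 'GCC', 'UUU']
```

Reading stops at UAA, so this toy message would specify a chain of just three amino acids.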

What is a gene? Believe it or not biologists unfortunately use different definitions, which has caused all kinds of problems, as we shall see. The best definition and the one used by biochemists for decades is a DNA sequence that is transcribed to produce a functional product. There are two types of genes. One type codes mRNA to make proteins. Recall that DNA can also make other RNAs and the genes that produce these are called non-coding genes because they don’t code for proteins. In humans about 20% of genes produce functional RNAs and about 80% of our genes produce proteins (1).

Gene processing

One more aspect needs to be mentioned, and that is RNA processing. It turns out that the transcripts produced from eukaryotic genes (non-bacteria; us for example) are much larger than the finished product. There are sections in the genes called exons and introns. After transcription is completed, the introns are removed and discarded and the exons are spliced together before the RNA goes to the ribosomes. "Intron sequences account for about 30% of the genome. Most of these sequences qualify as junk and are littered with defective transposable elements." (2)

Early observations

Over 50 years ago scientists were comparing genome sizes between various species and groups of related organisms when they were confronted with facts that were counterintuitive to the idea that more complexity should equal a larger genome and more genes. As species became more complex surely genomes would track with a size increase. It turned out that genome size did not reflect the number of genes however (1).

For example, the genome of the lungfish turned out to be one of the largest vertebrate genomes ever measured at 133 billion base pairs (133 Gb) - about 40 times larger than our 3.2 Gb. What was it doing with all that DNA? And this non-correlation with apparent complexity held up within groups also. The clawed frog Xenopus sp. has a genome about the same size as ours, but the green frog Rana sp. has a genome size of 10 Gb. How can it be that one frog has a genome 3X the size of another? It is hard to believe that the Rana sp. frog is so much more complex than another frog. This was called the C-Value Paradox; there was no correlation between genome size and complexity (1).

Beginning in the late 1960s, studies of mammalian genomes showed that they consisted of highly repetitive DNA (about 10%), a lot of moderately repetitive DNA (about 40%) and, for the rest, unique sequence DNA (about 50%). Larger genomes just had more repetitive DNA, and mRNA hybridization studies showed that in eukaryotic cells only a few percent of the genome was typically involved in protein-coding genes. These studies established that large eukaryotic genomes contained a great deal of repetitive DNA and that there were fewer than 30,000 genes (1). It became apparent by the late 1960s that the C-Value Paradox could be resolved by assuming that much of the genome is composed of non-functional repetitive DNA - junk DNA (1). Thus, all mammals have pretty much the same genes, including about 10,000 “housekeeping genes”, and the differences among species lie in the developmental control of when genes are turned on and off, not in large numbers of unique genes for more complex species (1). In 1972 the geneticist Susumu Ohno coined the term Junk DNA. Notice that it is not garbage that you put at the curb for pick-up; rather, it refers to the used stuff in our garages, attics and that ubiquitous kitchen junk drawer that is no longer being used or is broken.

T. Ryan Gregory proposed the Onion Test for those who want to say genome size is correlated with function and complexity.
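The size comparisons in this paragraph are simple ratios, and a few lines of Python make them explicit. The genome sizes are the ones quoted above; the dictionary and function names are just illustrative.

```python
# Genome sizes in billions of base pairs (Gb), as quoted in the text.
genome_gb = {"human": 3.2, "lungfish": 133.0, "Rana frog": 10.0}

def fold_difference(species: str, reference: str = "human") -> float:
    """How many times larger one genome is than the reference genome."""
    return genome_gb[species] / genome_gb[reference]

for name in ("lungfish", "Rana frog"):
    print(f"{name}: {fold_difference(name):.1f}x the human genome")
# lungfish: 41.6x the human genome
# Rana frog: 3.1x the human genome
```

Since Xenopus is described as having a genome about the size of ours, the second ratio is also roughly the frog-to-frog difference mentioned in the text.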

"The onion test is a simple reality check for anyone who thinks they have come up with a universal function for non-coding DNA. Whatever your proposed function, ask yourself this question: Can I explain why an onion needs about five times more non-coding DNA for this function than a human?" (3)

He notes that some non-coding DNA, like the RNA genes discussed earlier, is functional. But that’s only 5% of the genome and does not rescue all the other non-coding DNA. He notes also that members of the onion genus Allium have genome sizes in the range of 7 pg to 31.5 pg. Can one onion species really make do with only one fifth as many instructions as another if it’s all functional?

It should also be strongly noted that all the early biologists working in genomics knew that not all the non-coding DNA was junk; the regulatory sequences including promoters and RNA genes were known to be scattered in the non-coding DNA. No one ever said, despite the current false narrative, that it was all junk. A Wikipedia article on junk DNA offers a short and apparently accurate overview of the history of the junk DNA controversy (4).

In 2024 the genome of the South American lungfish was fully sequenced and found to have about 90 billion base pairs to our 3 billion. Perhaps the Onion Test should be renamed. (5)

Genes, Genes, and Genes

Recall that about a half century ago geneticists predicted that humans would be found to have about 30,000 genes. Today we know the total is closer to 25,000, with 20,000 protein-producing genes and about 5,000 non-coding genes (RNAs mainly, including regulatory genes). That is about the same number as the worm Caenorhabditis elegans. Thus, the earlier scientific predictions were remarkably close. When the first draft of the human genome was published in 2001, the media proclaimed that science was shocked that humans had so few genes compared to other species, especially given our complex brains. Not true. It was predicted decades before. The bruised egos of many humans were not a problem for many of the scientists studying genomes; it was just what nature was presenting.

The protein-producing genes (coding genes) thus only make up about 1% of the human genome, and the functional sequence of all genes together is no more than 2% of the genome. Introns in the protein-coding genes account for about 37% of the genome and are mostly junk DNA. Introns in the non-coding genes account for about another 6%, also mostly junk (1). “The total amount of the genome devoted to genes is close to 45%. Of this total, less than 2% is functional, and the rest is junk DNA in introns” (1).

What does it mean to be functional? Moran defines functional DNA as any stretch of DNA that cannot be deleted from the genome without reducing the fitness of the individual. Basically, functional DNA is constrained by purifying selection. A good way to infer function is to check whether the sequence also exists in other species and is being transcribed. If it does, this is called sequence conservation, and it is perhaps the strongest method of inferring function.
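Here is one way to check that the figures in this paragraph add up to the quoted 45%. This is my own reading of the numbers, not Moran's table: it assumes the 37% and 6% intron figures are fractions of the whole genome and uses 2% as the upper bound for functional gene sequence. The variable names are invented for the sketch.

```python
GENOME_BP = 3.2e9  # approximate size of the human genome in base pairs

# Assumed reading of the figures above: each value is a fraction of the whole genome.
gene_fractions = {
    "functional gene sequence (upper bound)": 0.02,
    "introns in protein-coding genes": 0.37,
    "introns in non-coding genes": 0.06,
}

total = sum(gene_fractions.values())

for label, fraction in gene_fractions.items():
    million_bp = fraction * GENOME_BP / 1e6
    print(f"{label}: {fraction:.0%} of the genome, about {million_bp:,.0f} million bp")

print(f"total devoted to genes: {total:.0%}")
```

On this reading the three pieces sum to 45%, which matches the quote, and nearly all of that gene territory is intron.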

Not Genes (from Moran)

1. Pseudogenes. About 5%. These are broken genes that resemble functional genes but have too many mutations to work. However, a tiny number can take on new functions. See the blog on pseudogenes on this site for how they can be used to essentially prove human evolution.

2. Regulatory sequences. About 1.8%. Promoters and DNA sequences that bind various transcription factors. These have been known since the 1960s.

3. Centromeres. About 6%. Consist of millions of base pairs of repetitive DNA where spindle fibers attach to pull chromosomes apart during cell reproduction. Much of it is non-essential, since some people have 2% and others 10%. To be very conservative, assume 1% is functional and the rest is redundant.

4. Telomeres. Only 0.1%.

5. Scaffold Attachment Regions. 0.3%. DNA wraps around proteins called histones to package the DNA when not being “read”. DNA sequences called SARs function to maintain the organization.

6. Viruses. About 9%. Defective viruses that invaded our ancestral line but are now non functional (good for us!). We have co-opted many for our own use however.

7. Transposons. About 47%. Includes SINEs (ALUs mostly - 13%), LINEs (21%), LTRs (9%), DNA transposons (4%).

Table 1 summarizes the totals according to Moran. Although functional DNA totals about 4%, the real total based on sequence conservation is probably closer to 8% (1).

Table 1. Functional and junk DNA in the human genome according to Moran, 2023. From: Moran, Laurence A. 2023. What’s In Your Genome?: 90% of your genome is junk. Aevo UTP. University of Toronto Press. 372pp. Page 133. Table 5.1. See text for explanations of terms. Fair use attribution. For educational purposes only.

Thus, the real amount of junk DNA in our genomes is probably closer to 90% according to Moran. The missing 7% in the table could be either functional DNA or junk, or a combination of the two. Although these figures may change some in the coming years Moran notes that it won’t be enough to change from that 10:90 ratio. I have read that other scientists like Shubin have claimed 75% junk and still others 50%.

Figure 1 shows some of these categories in pie form. Note that these figures are from 2013 and have been revised some, but the overall ratios and relationships are similar. Even in 2013 about 45 - 50% of the human genome was felt to be junk by these textbook authors.

Figure 1. From Reece et al. 2013. Campbell Biology. No copyright infringement intended. Fair use permitted. [Exons are the coding parts of DNA. Introns are non-coding and rarely have functions. Moran lists introns as 30% of the coding genome before splicing out. Transposons and repetitive DNA are often functionless and much is junk left over from evolution.]

Will "Dark Matter" DNA reveal in the future that most of the non coding DNA is functional? (2024)

"Tom Cech won a Nobel Prize for discovering one example of a catalytic RNA. He recently published an article in the New York Times extolling the virtues of RNA and non-coding genes [The Long-Overlooked Molecule That Will Define a Generation of Science]. There's a fair amount of hype in the article but the main point is quite valid—over the past fifty years we have learned about dozens of important non-coding RNAs that we didn't know about at the beginning of molecular biology [see: Non-coding RNA, Non-coding DNA].

The main issue in this field concerns the number of non-coding genes in the human genome. I cover the available data in my book and conclude that there are fewer than 1000 (p.214). Those scientists who promote the importance of RNA (e.g. Tom Cech) would like you to believe that there are many more non-coding genes; indeed, most of those scientists believe that there are more non-coding genes than coding genes (i.e. > 20,000). They rarely present evidence for such a claim beyond noting that much of our genome is transcribed. Let's dissect this to see where the bias lies. The first thing you note is the use of the term "dark matter" to make it sound like there's a lot of mysterious DNA in our genome. This is not true. We know a heck of a lot about our genome, including the fact that it's full of junk DNA. Only 10% of the genome is under purifying selection and assumed to be functional. The rest is full of introns, pseudogenes, and various classes of repetitive sequences made up mostly of degraded transposons and viruses. The entire genome has been sequenced—there's not much mystery there. I don't know why anyone refers to this as "dark matter" unless they have a hidden agenda.

The second thing you notice is the statement that 75% of the genome is transcribed at some time or another and, according to Tom Cech, these transcripts have an unknown function. That's strange since protein-coding genes take up roughly 40% of our genome and we know a great deal about coding DNA, UTRs, and introns. If you add in the known examples of non-coding genes, this accounts for an additional 2-3% of the genome.

Almost all the rest of the transcripts come from non-conserved DNA and those transcripts are present at less than one copy per cell. As the ENCODE researchers noted in 2014, they are likely to be junk RNA resulting from spurious transcription. I'd say we know a great deal about the fraction of the genome that's transcribed and there's not much indication that it's hiding a plethora of undiscovered functional RNAs." https://sandwalk.blogspot.com/2024/06/tom-cech-writes-about-dark-matter-of.html?m=0

Is this Hancock video from 2024 the best explanation and defense for junk DNA?

Intro to why this video: Discovery Institute, functional
Junk DNA - A complete history 3:37
The Ecology of Parasites 15:50
The History of the Junk DNA Hypothesis 39:24
Mutational Load, functional genes 42:40 - 44:45
CoT Analysis 44:55 - 46:37
Junk DNA term, introns, nearly neutral theory, transposons 48:23 - 59:35
Conclusion to date 59:55; ENCODE & why junk DNA won't be accepted by many 1:00:50
Closing thoughts 1:24:40

Summary, Part 1

The point is that anti-evolutionists who claim there can’t be any junk DNA because a Creator would not create genomes with junk are wrong. Certain researchers who initially claimed 80% function in the human genome in 2012 were wrong, as will be discussed in Part 2. Many other scientists who may not be religious but continue to claim that nearly all or all non-coding DNA must be functional are certainly wrong. Many scientists appear to be unable to accept that most of the human genome is junk, the result of millions of years of duplications, deletions, insertions, and transposons jumping around the genome. Certainly creationists and other anti-evolutionists cannot face the genomic facts due to their religious allegiances. Others may be afflicted with the ego deflation of human exceptionalism; many of us just can’t admit that our genome is filled with that much junk DNA. We’re the "top species and too complex" to have a genome smaller than those of some worms and many plants, and about the same functional genes as other mammals. Each of those two sides pins much of its hopes on future discoveries of function.

A large set of studies published in 2012 still has the majority of scientists and creationists believing today that the human genome contains little to no junk DNA. Of course the anti-evolutionists celebrate that the majority of mainstream science appears to reject the idea that we have lots of junk DNA in our genome. That research came mainly from ENCODE, which will be discussed in Part 2...

Obviously this blog is based on Dr. Moran's book. Please get a copy of it for yourself and see what you think about his thesis.

In March of 2024 Dr. Moran wrote a nine-part blog analysis of a 2024 paper by Nils Walter, PhD, Professor of Chemistry at the University of Michigan, who supports the view that there is little junk DNA in the human genome. This will help focus the discussion on the various issues that repeatedly arise in the controversy over junk DNA. https://sandwalk.blogspot.com/2024/03/nils-walter-disputes-junk-dna-9.html?fbclid=IwAR1KtPMKrm67N1dCwZdZBD2yTqA3QK8q7otie9Lb2R0t4aMI4D3VgV7CaUE

Citations

1. Moran, Laurence A. 2023. What’s In Your Genome?: 90% of your genome is junk. Aevo UTP. University of Toronto Press. 372pp.

2. What’s in Your Genome? May 08, 2011. Sandwalk. Strolling with a skeptical biochemist.

https://sandwalk.blogspot.com/2011/05/whats-in-your-genome.html

3. The onion test. April 25, 2007. Genomicron. Exploring genomic diversity and evolution.

https://www.genomicron.evolverzone.com/2007/04/onion-test.html

4. https://en.wikipedia.org/wiki/Junk_DNA

5. https://arstechnica.com/science/2024/08/the-fish-with-the-genome-30-times-larger-than-ours-gets-sequenced/