These sources provide an extensive overview of eukaryotic transcription and of the history and methodology of DNA sequencing that culminated in the complete human genome assembly.

The first source focuses on the three specialized multisubunit RNA polymerases (Pol I, Pol II, and Pol III) in eukaryotes, detailing their distinct functions, structures, and mechanisms for transcription initiation, elongation, and termination. The second and third sources document the evolution of sequencing technology through three generations: from Sanger sequencing, used in the original Human Genome Project (HGP), to Next-Generation Sequencing (NGS), and finally to Third-Generation long-read sequencing (PacBio and Nanopore), which was essential for resolving complex, repetitive regions such as centromeres and rDNA arrays and so completing the telomere-to-telomere (T2T) human genome. These two sources also emphasize the composition and complexity of the human genome, noting that gene number is not proportional to organism complexity and that most of the genome consists of non-coding, repetitive DNA.

The sources provide a detailed narrative of the Human Genome Projects (HGP) and subsequent efforts, highlighting how advances in genome sequencing and assembly technologies were necessary to tackle the size, complexity, and highly repetitive nature of the human genome.

The Original Human Genome Project (HGP) and First-Generation Sequencing

The sequencing of the human genome officially began with the Public Human Genome Project (HGP), which was conceived in 1984 and started in 1990. The ultimate goal was to sequence the entire human genome, a process whose cost was put at approximately $3 billion in 1990, although this cost dropped dramatically to less than $100 million within a decade due to technological improvements, a decline the sources liken to Moore's Law.
Sequencing Technology and Strategy: The HGP primarily utilized First-Generation Sequencing (Sanger sequencing). This technology, developed in 1977, was capable of sequencing only about 500 "letters" (base pairs) at a time. To sequence the massive human genome, the HGP employed a specialized assembly strategy called hierarchical shotgun sequencing:
- Large fragments of the human genome were cloned into Bacterial Artificial Chromosomes (BACs), which could accommodate fragments over 300 kb.
- These BACs were selected, ordered based on physical and genetic maps, fragmented, and then subcloned.
- The DNA from individual subclones was then sequenced using automated Sanger sequencing.

This hierarchical, clone-based approach was critical because it helped avoid problems associated with repeat sequences: each read could be traced back to a BAC of known map position, so repeats had to be resolved only within a single clone rather than across the whole genome. In contrast, the concurrent Private Human Genome Project (started in 1998) used a Whole Genome Shotgun approach. The sources note that while the Whole Genome Shotgun strategy was reasonable for a draft, the sheer pervasiveness of repetitive sequences meant it did not yield the same high-quality reference as the HGP’s clone-based approach.

The Initial Outcome and the Reference Genome: A "complete" draft of the human genome was achieved in 2001 by both projects, covering 83–84% of the entire genome.

The resulting Human Reference Genome (HRG), which descends from the HGP, is a composite sequence derived from the DNA of multiple individuals (the original HGP used thirteen anonymous volunteers). It is defined as a haploid sequence and does not
correspond to any actual or "ideal" human individual. Even after the HGP's completion in 2003, continuous work is performed by the Genome Reference Consortium (GRC) to correct misrepresented regions and close remaining gaps.

The Challenge of Gaps and Assembly

The initial completion in 2003 still left sequence gaps. While 99% of the euchromatic portion was finished, two types of gaps remained:
- Heterochromatic gaps (estimated 200 Mb), which were never intended to be sequenced by the HGP.
- Euchromatic gaps (24.4 Mb).
These unsequenced regions were highly problematic because large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Shotgun sequencing methods struggled when repetitive sequences were broken into fragments, because fragments that fall entirely within a repeat cannot be placed uniquely; reassembly was therefore difficult and often omitted parts of the repetitive region or misconnected pieces of chromosomes. Furthermore, approximately 5% of the genome consists of segmental duplications (fragments of 1 kb to >100 kb with high sequence identity), which often led to misassembly and collapsed regions in the original HGP assembly.
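
The effect of repeats on shotgun reads can be illustrated with a small sketch. The toy sequence, the repeat unit, and the function name `fraction_unique_reads` are illustrative assumptions, not data or methods from the sources. The code counts what fraction of error-free reads of a given length occur exactly once in a sequence containing a tandem repeat array; a read that occurs more than once cannot be placed unambiguously.

```python
from collections import Counter


def fraction_unique_reads(genome: str, read_len: int) -> float:
    """Fraction of all length-read_len substrings that occur exactly once."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    unique = sum(1 for read in reads if counts[read] == 1)
    return unique / len(reads)


# Hypothetical toy genome: unique flanks around a 70-base tandem repeat array.
LEFT = "ACGTTGCAAGCTTAGGCTAACCGT"
RIGHT = "TTGACCGATAGGCATCGGATCCAA"
REPEAT_UNIT = "GATTACA"
toy_genome = LEFT + REPEAT_UNIT * 10 + RIGHT

# Compare reads shorter than the repeat array with reads longer than it.
for read_len in (8, 20, 60, 80):
    print(read_len, round(fraction_unique_reads(toy_genome, read_len), 2))
```

In this toy case every 80-base read is unique, because each one is longer than the 70-base array and therefore reaches into a flank, whereas most 8-base reads fall inside the array and are indistinguishable from one another.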
Regions that defied early assembly included:
• Alpha-satellite and centromeric transition regions.
• The short arms of acrocentric chromosomes (13, 14, 15, 21, and 22), which contain large tracts of satellite sequences and tandem arrays of ribosomal DNA (rDNA).

Genome Re-sequencing and Second-Generation Sequencing

The advent of Second (Next)-Generation Sequencing (NGS) around 2005 marked a shift toward massively parallel sequencing, vastly increasing the number of de novo assemblies.

NGS is characterized by "short-read" sequencing (typically 50–400 base pairs). The difficulty of assembling repetitive genomes from short reads was partially managed using new assembly algorithms based on de Bruijn graphs. However, NGS assemblies of larger genomes were generally of poor quality when compared to the HGP's clone-based approach.

An example of a project utilizing NGS is the 1000 Genomes Project, which aimed to find common genetic variants in diverse populations. This project sequenced 2,504 individuals and identified 88 million variants, showing that a typical genome differs from the reference at 4.1 to 5.0 million sites, with structural variants affecting ~20 million bases of sequence.

The Telomere-to-Telomere (T2T) Completion and Third-Generation Sequencing

The final push to sequence the last remaining 8% of the human genome was the Telomere-to-Telomere (T2T) completion effort. This was achieved only after technological limitations were removed by Third-Generation Sequencing ("long-read" sequencing), which produces reads of several kilobases (kb).
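
To make the role of read length concrete, the sketch below builds a de Bruijn graph from the k-mers of a toy sequence and counts branching nodes, meaning nodes with more than one successor. The sequence and the helper names `de_bruijn_graph` and `branching_nodes` are hypothetical, and real assemblers also deal with sequencing errors, reverse complements, and coverage, all of which are omitted here. When k is no longer than a repeated segment the graph branches, so the path through it is ambiguous; a larger k removes the branch, which is loosely analogous to what longer reads provide.

```python
from collections import defaultdict


def de_bruijn_graph(seq: str, k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph: dict[str, set[str]] = defaultdict(set)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph: dict[str, set[str]]) -> int:
    """Count nodes that have more than one distinct successor."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Hypothetical toy sequence: the 6-base segment "GGATCC" occurs twice,
# each time followed by a different unique sequence.
REPEAT = "GGATCC"
toy_seq = "ACTGA" + REPEAT + "TTCAG" + REPEAT + "AAGTC"

for k in (4, 8):
    graph = de_bruijn_graph(toy_seq, k)
    print(f"k={k}: {len(graph)} nodes, {branching_nodes(graph)} branching")
```

With k = 4 the node `TCC` has two successors, so the graph cannot say which flank follows which copy of the repeat. With k = 8 every node has a single successor and the sequence is a single path.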
Key technological innovations for T2T assembly included:
• PacBio HiFi sequencing: This provided high-accuracy long reads (e.g., 20 kb reads with a 0.1% error rate) which were essential for assembling long, near-identical repeat arrays. PacBio HiFi reads formed the basis of the T2T-CHM13 assembly string graph.
• Oxford Nanopore Sequencing (ONT): These ultralong reads were used to guide the correct path ("walk") through the assembly graph when high-resolution paths were ambiguous.
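
A rough back-of-the-envelope sketch shows why read accuracy matters for near-identical repeats. The 20 kb length and 0.1% error rate come from the figures quoted above; the 99.5% identity between two repeat copies and both function names are assumptions chosen purely for illustration. The idea is to compare the expected number of sequencing errors in one read with the expected number of true differences between two repeat copies over the same length.

```python
def expected_errors(read_len_bp: int, error_rate: float) -> float:
    """Expected number of miscalled bases in one read."""
    return read_len_bp * error_rate


def expected_differences(read_len_bp: int, identity: float) -> float:
    """Expected number of true differences between two repeat copies
    over a stretch of read_len_bp bases at the given sequence identity."""
    return read_len_bp * (1.0 - identity)


READ_LEN_BP = 20_000        # read length quoted in the text
HIFI_ERROR_RATE = 0.001     # 0.1% error rate quoted in the text
ASSUMED_IDENTITY = 0.995    # hypothetical identity between two repeat copies

print("expected errors per read:", expected_errors(READ_LEN_BP, HIFI_ERROR_RATE))
print("expected true differences:", expected_differences(READ_LEN_BP, ASSUMED_IDENTITY))
```

Under these assumptions a read carries fewer expected errors than there are true differences between the copies, so the differences remain usable as a signal; at a much higher error rate they would be swamped by noise.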