Pan-Genome Analysis: Overview, Workflow, Application and Recent Advances
A pan-genome is the sum of all genomic information within a species. Pan-genomes have potential applications in crop improvement, evolution and biodiversity research.
What is a pan-genome?
A pan-genome is the sum of all genomic information within a species. With the development of genomic technology, researchers have found that a single reference genome can no longer meet the needs of genomic data analysis. For a growing number of species, including humans, researchers are constructing a pan-genome instead of relying on a single reference genome.
Pan-genomes reflect structural variation (SV) and polymorphisms in the genome, allowing in-depth comparisons of variation at the species level or at higher taxonomic levels. Pan-genomes have potential applications in crop improvement, evolution and biodiversity research. To fully exploit the value of pan-genomes, a broader range of information such as phenotypic, environmental and expression data needs to be integrated to provide insight into the role of variable regions in the genome.
How to build a pan-genome?
There is extensive genomic diversity within species, and a pan-genome needs to capture this diversity while removing redundancy, so that the result is a single integrated representation. Three approaches are commonly used:
- Map-to-pan starts with de novo assembly of each individual, aligns the assembled sequences to the reference genome, collects the sequences that do not match, and uses them together with the reference to build the pan-genome.
- Iterative mapping and assembly starts from a single reference genome and then complements it with non-redundant sequences from other individuals to build a pan-genome.
- De novo assembly requires the individual genomes to be assembled separately, followed by whole-genome comparison.
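The idea of complementing a reference with non-redundant sequences can be illustrated with a toy example. This is a minimal sketch, not a real pipeline: it assumes sequences are short strings, uses shared k-mers as a stand-in for alignment, and the function names, `k` and the `max_shared` threshold are hypothetical choices made only for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_pan_genome(reference, contigs, k=5, max_shared=0.5):
    """Toy iterative construction: keep only contigs that are mostly novel.

    A contig counts as novel when at most max_shared of its k-mers already
    occur in the growing pan-genome (hypothetical threshold, not a standard).
    """
    pan = [reference]
    seen = kmers(reference, k)
    for contig in contigs:
        contig_kmers = kmers(contig, k)
        if not contig_kmers:
            continue  # contig shorter than k, nothing to compare
        shared = len(contig_kmers & seen) / len(contig_kmers)
        if shared <= max_shared:
            pan.append(contig)      # non-redundant: add it
            seen |= contig_kmers    # later contigs are compared against it too
    return pan


reference = "ACGTACGTTAGCCGATAGGCTTAACG"
contigs = [
    "CGTTAGCCGATAGG",        # contained in the reference: redundant
    "TTTTGGGGCCCCAAAATTGG",  # absent from the reference: novel
    "GGGGCCCCAAAATT",        # contained in the previous contig: redundant
]
pan_genome = build_pan_genome(reference, contigs)
print(pan_genome)
```

Real workflows replace the k-mer test with read mapping or whole-genome alignment, but the control flow is the same: each new individual contributes only what the current pan-genome does not already contain.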
What is the difference between the pan-genome and the core genome?
Pan-genomic analysis clusters genes by their occurrence across individuals and usually divides them into three categories: core genes, present in all individuals (accessions or strains); dispensable genes, present in some but not all individuals; and private genes, present in only one individual. The pan-genome is the union of all three categories, whereas the core genome is only the part shared by every individual.
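Given a gene presence/absence table, the categories above can be assigned by counting carriers. The sketch below assumes that private genes are reported separately from dispensable ones and that every individual carries at least one listed gene; the gene and accession names are made up for illustration.

```python
def classify_genes(presence):
    """Label each gene as core, dispensable, or private.

    presence maps a gene name to the set of individuals carrying it.
    """
    individuals = set().union(*presence.values())
    n = len(individuals)
    labels = {}
    for gene, carriers in presence.items():
        if len(carriers) == n:
            labels[gene] = "core"         # present in every individual
        elif len(carriers) == 1:
            labels[gene] = "private"      # present in exactly one individual
        else:
            labels[gene] = "dispensable"  # present in some, but not all
    return labels


presence = {
    "geneA": {"acc1", "acc2", "acc3"},
    "geneB": {"acc1", "acc3"},
    "geneC": {"acc2"},
}
labels = classify_genes(presence)
core_genome = sorted(g for g, label in labels.items() if label == "core")
print(labels)
print(core_genome)
```

In this example the pan-genome contains all three genes, while the core genome contains only `geneA`.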
Applications of the pan-genome
Pan-genomic analysis helps to understand the characteristics of species, while the complex genomic variation provided by pan-genome mapping helps to resolve the diversity of crop phenotypes and agronomic traits.
- Selecting different subspecies for pan-genome sequencing allows the study of important biological questions such as the origin and evolution of species
- Selecting germplasm resources with different characteristics, such as wild and cultivated species, for pan-genome sequencing can uncover genetic resources related to important traits and provide guidance for scientific breeding
- Selecting germplasm resources of different ecogeographic types for pan-genome sequencing makes it possible to address widely studied questions such as the adaptive evolution of species and the invasiveness of exotic species
- Using crop pan-genomes to advance QTL mapping and GWAS can help identify genomic regions associated with desired phenotypes
- Using pan-genomes to advance genomic prediction. With SNPs as predictors, important agronomic traits such as grain yield, grain moisture, grain quality, biomass traits, and stem and root lodging can be predicted with reasonable accuracy
Application of pan-genome in crop improvement (Della Coletta R et al., 2021)
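To make the link between pan-genome variation and phenotypes concrete, the following toy example contrasts a trait between individuals that carry a gene and those that do not. It is only a sketch under strong assumptions: it is not a GWAS or QTL method, it ignores population structure and statistical significance, and the data and function name are invented for illustration.

```python
from statistics import mean


def pav_effects(presence, phenotype):
    """Difference in mean phenotype between carriers and non-carriers.

    presence:  gene -> set of accessions carrying the gene
    phenotype: accession -> trait value
    Genes present in all or none of the accessions are skipped,
    because no contrast can be computed for them.
    """
    effects = {}
    for gene, carriers in presence.items():
        with_gene = [v for acc, v in phenotype.items() if acc in carriers]
        without_gene = [v for acc, v in phenotype.items() if acc not in carriers]
        if with_gene and without_gene:
            effects[gene] = mean(with_gene) - mean(without_gene)
    return effects


phenotype = {"acc1": 10.0, "acc2": 6.0, "acc3": 11.0, "acc4": 5.0}
presence = {
    "geneA": {"acc1", "acc2", "acc3", "acc4"},  # core gene, skipped
    "geneB": {"acc1", "acc3"},
    "geneC": {"acc2", "acc3"},
}
effects = pav_effects(presence, phenotype)
# Rank genes by the absolute size of the contrast, largest first.
ranked = sorted(effects, key=lambda g: abs(effects[g]), reverse=True)
print(effects)
print(ranked)
```

Only variable (dispensable or private) genes can show such a contrast, which is one reason a single reference genome that lacks them limits association studies.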
Advances in pan-genome assembly technologies
The reduced cost of Illumina sequencing and improvements in assembly algorithms have facilitated the use of low-cost short-read data (e.g., for the maize, rice and soybean genomes). While this approach has generated highly complete and contiguous assemblies of low-copy gene regions, the more repetitive, TE-rich regions of the genome have proven difficult to assemble with short reads, resulting in large gaps and partial assemblies in these regions. Recently, the maturation of long-read sequencing technologies, especially PacBio HiFi sequencing, has facilitated more contiguous and complete assemblies of crop genomes and, in some cases, multiple long-read-based assemblies within a single species. Advances in PacBio pan-genome sequencing technologies are described below.
Impact of sequencing technology on polyploid assembly (Della Coletta R et al., 2021)
PacBio HiFi sequencing, a good solution for pan-genome construction
Nowadays, pan-genome construction generally uses third-generation long-read sequencing to assemble multiple samples of a population de novo. The two platforms now commonly used for third-generation sequencing are PacBio's HiFi sequencing and ONT's Nanopore sequencing. HiFi sequencing combines long read length with very high accuracy, which makes it well suited to de novo genome assembly.
- HiFi reads are more accurate
The higher accuracy of HiFi reads allows the assembly algorithm to assemble through more repeats even at shorter read lengths and to extend contigs into flanking centromeric regions with high confidence, improving the completeness of the centromeric and telomeric regions.
- Reduces the complexity of polyploid genome assembly
High-quality genome assembly in polyploid species has been difficult to achieve because these genomes contain multiple closely related subgenomes, which makes it hard to distinguish homologous sequences and to create non-mosaic subgenome scaffolds. Long-read sequencing with low error rates (e.g., PacBio HiFi reads) has enabled high-quality polyploid genome assembly, with recent assemblies containing fewer gaps and resolved homologous scaffolds. As polyploid pan-genomes of more species become available, more novel structural variants and markers are likely to be observed.
References
- Della Coletta R, Qiu Y, Ou S, et al. How the pan-genome is changing crop genomics and improvement. Genome Biology, 2021, 22(1): 1-19.
- Leonard A S, et al. Structural variant-based pangenome construction has low sensitivity to variability of haplotype-resolved bovine assemblies. Nature Communications, 2022, 13(1): 3012.