An integrated platform for genome assembly, comparative genomics and management of genomic variation databases
Conference contribution (Published version)
© 2023 Institute of Molecular Genetics and Genetic Engineering, University of Belgrade
Metadata
Abstract
The use of long-read DNA sequencing technologies is producing an explosion of high-quality
de novo genome assemblies. The availability of these genomes represents a major step
forward for evolutionary studies, population genomics and epidemiology, among other applications. A major
bottleneck for many research groups continues to be the availability of tools to build and
analyze the large datasets of genomes that can be produced using these technologies. In this
talk, I summarize the functionalities developed by my research group in version 4 of
the Next Generation Sequencing Experience Platform (NGSEP) to perform a comprehensive
analysis of long and short DNA sequencing reads. First, we designed new algorithms for
assembly of haploid and diploid samples from long DNA sequencing reads. A minimizers table
is constructed from the reads, using k-mer hash codes calculated from rankings relative to
the mode of the k-mer counts distribution. Statistics collected during this process are used as
features to build layout paths. For diploid samples, we integrated a reimplementation of the
ReFHap algorithm to perform molecular phasing. Benchmark experiments using PacBio HiFi
and Nanopore sequencing data for different species show that our solution has competitive
contiguity and efficiency, as well as superior accuracy in some cases, compared to other
currently used software. We also developed a functionality to perform ortholog identification
and gene-based alignment of assembled genomes. Proteomes for each genome are extracted
and homology relationships are efficiently predicted by building indexes of amino acid sequences
by k-mer occurrence. Then, genes are clustered in orthogroups based on the topology of the
graph induced by the predicted relationships. Gene presence/absence matrices are derived
from these orthogroups. If genome assemblies are provided as input, synteny relationships
are identified for each pair of genomes. We also implemented algorithms to perform alignment
of short and long reads to a reference genome. Based on aligned long reads, we improved the
classical variant detector to detect long structural variants. Taken together, these developments
make NGSEP a comprehensive tool for de novo and reference-based analysis of DNA
sequencing reads in a wide variety of experimental settings and for different research goals.
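As an illustration of the comparative genomics step described in the abstract, the following minimal Python sketch clusters genes into orthogroups and derives a gene presence/absence matrix. It rests on stated assumptions: orthogroups are taken to be the connected components of the homology graph, which is a simplification, since the abstract only says that clustering is based on the topology of that graph. The function names, gene identifiers and homology pairs are hypothetical and are not part of NGSEP.

```python
from collections import defaultdict


def orthogroups(genes, homology_pairs):
    """Cluster genes into connected components of the homology graph.

    Assumption: one component = one orthogroup (a simplification).
    """
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in homology_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for g in genes:
        groups[find(g)].append(g)
    # Sort members and groups so the output is deterministic.
    return sorted(sorted(members) for members in groups.values())


def presence_absence(groups, genome_of, genomes):
    """One row per orthogroup, one column per genome (1 = present)."""
    matrix = []
    for members in groups:
        present = {genome_of[g] for g in members}
        matrix.append([1 if genome in present else 0 for genome in genomes])
    return matrix


# Hypothetical toy input: five genes from three genomes.
genome_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}
homology_pairs = [("A1", "B1"), ("B1", "C1"), ("A2", "B2")]
genomes = ["A", "B", "C"]

groups = orthogroups(sorted(genome_of), homology_pairs)
matrix = presence_absence(groups, genome_of, genomes)
for members, row in zip(groups, matrix):
    print(members, row)
```

In this toy input the second orthogroup has no member in genome C, so its row carries a 0 in the last column. A production tool would typically split large components further before reporting orthogroups.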
Keywords:
bioinformatics / algorithms / DNA sequencing / software / genome assembly
Source:
4th Belgrade Bioinformatics Conference, 2023, 4, 15-15
Publisher:
- Belgrade : Institute of molecular genetics and genetic engineering
Funding / projects:
- This work was supported by the Colombian Ministry of Sciences research fund “Patrimonio Autónomo Fondo Nacional de Financiamiento Para la Ciencia, la Tecnología Y la Innovación Francisco José de Caldas” through the grant with contract number 80740-441-2020, awarded to J Duitama. We also wish to acknowledge the support of the IT Services Department and the ExaCore-IT Core-facility of the Vice Presidency for Research & Creation at the Universidad de Los Andes, which allowed us to perform the computational analysis.
Note:
- Book of abstracts: 4th Belgrade Bioinformatics Conference, June 19-23, 2023
Collections
Institution/group
Institut za molekularnu genetiku i genetičko inženjerstvo
TY - CONF
AU - Duitama, Jorge
PY - 2023
UR - https://belbi.bg.ac.rs/
UR - https://imagine.imgge.bg.ac.rs/handle/123456789/1950
AB - The use of long read DNA sequencing technologies is producing an explosion of high-quality de-novo genome assemblies. The availability of these genomes represents a major step forward for evolution, population genomics, epidemiology, among other applications. A major bottleneck for many research groups continues to be the availability of tools to build and analyze the large datasets of genomes that can be produced using these technologies. In this talk, I summarize the functionalities developed by my research group in the version four of the Next Generation Sequencing Experience Platform (NGSEP) to perform a comprehensive analysis of long and short DNA sequencing reads. First, we designed new algorithms for assembly of haploid and diploid samples from long DNA sequencing reads. A minimizers table is constructed from the reads, using k-mer hash codes calculated from rankings relative to the mode of the k-mer counts distribution. Statistics collected during this process are used as features to build layout paths. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. Benchmark experiments using PacBio HiFi and Nanopore sequencing data for different species show that our solution has competitive contiguity and efficiency, as well as superior accuracy in some cases, compared to other currently used software. We also developed a functionality to perform ortholog identification and gene-based alignment of assembled genomes. Proteomes for each genome are extracted and homology relationships are efficiently predicted by building indexes of amino acid sequences by k-mer occurrence. Then, genes are clustered in orthogroups based on the topology of the graph induced by the predicted relationships. Gene presence/absence matrices are derived from these orthogroups. If genome assemblies are provided as input, synteny relationships are identified for each pair of genomes. We also implemented algorithms to perform alignment of short and long reads to a reference genome. Based on aligned long reads, we improved the classical variants detector to detect long structural variants. Adding up these developments, NGSEP is a comprehensive tool to perform de-novo and reference-based analysis of DNA sequencing reads in a wide variety of experimental settings to solve different research goals.
PB - Belgrade : Institute of molecular genetics and genetic engineering
C3 - 4th Belgrade Bioinformatics Conference
T1 - An integrated platform for genome assembly, comparative genomics and management of genomic variation databases
EP - 15
SP - 15
VL - 4
UR - https://hdl.handle.net/21.15107/rcub_imagine_1950
ER -
Duitama, J. (2023). An integrated platform for genome assembly, comparative genomics and management of genomic variation databases. In 4th Belgrade Bioinformatics Conference. Belgrade: Institute of molecular genetics and genetic engineering, 4, 15-15. https://hdl.handle.net/21.15107/rcub_imagine_1950
Duitama J. An integrated platform for genome assembly, comparative genomics and management of genomic variation databases. In: 4th Belgrade Bioinformatics Conference. 2023;4:15-15. https://hdl.handle.net/21.15107/rcub_imagine_1950
Duitama, Jorge, "An integrated platform for genome assembly, comparative genomics and management of genomic variation databases," in 4th Belgrade Bioinformatics Conference, 4 (2023): 15-15, https://hdl.handle.net/21.15107/rcub_imagine_1950