Pangenomic Alignment: Strings plus Graphs
Conference contribution (Published version)
© 2023 Institute of Molecular Genetics and Genetic Engineering, University of Belgrade
Abstract
The use of only one or a few reference genomes for DNA alignment is known to bias
research results and medical diagnoses, but aligning against many reference genomes
has been problematic. If we represent such a pangenomic reference as a set of strings,
then each seed we find in a DNA read may occur in many of the genomes, so even
reporting all those occurrences can be slow, and extending and chaining seeds can be
infeasible. On the other hand, if we represent it as a graph then, even apart from
the significant technical challenges of indexing graphs, we may find many chimeric
matches. The more of humanity’s genetic diversity we try to represent in the graph, the
fuzzier it becomes, and the greater the probability of spurious results.
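To make the notion of a chimeric match concrete, here is a minimal sketch in Python. It assumes two toy genomes of equal length that are already aligned column by column, and it builds a deliberately simplified graph with one node per (column, base) pair. The function names and the toy data are illustrative assumptions, not part of any tool mentioned in this abstract.

```python
def build_column_graph(genomes):
    """Collect edges between (column, base) nodes of equal-length, pre-aligned genomes."""
    edges = set()
    for g in genomes:
        for i in range(len(g) - 1):
            edges.add(((i, g[i]), (i + 1, g[i + 1])))
    return edges


def occurs_in_graph(pattern, genomes, edges):
    """Return True if some path in the toy graph spells `pattern`."""
    if not pattern:
        return True
    length = len(genomes[0])
    nodes = {(i, g[i]) for g in genomes for i in range(length)}
    for start in range(length - len(pattern) + 1):
        if (start, pattern[0]) not in nodes:
            continue
        # Nodes are identified by (column, base), so the path is fixed by its start.
        if all(
            ((start + k, pattern[k]), (start + k + 1, pattern[k + 1])) in edges
            for k in range(len(pattern) - 1)
        ):
            return True
    return False


def occurs_in_strings(pattern, genomes):
    """Return True if `pattern` is a substring of at least one genome."""
    return any(pattern in g for g in genomes)


def is_chimeric(pattern, genomes):
    """A match is chimeric if the graph spells it but no single genome contains it."""
    edges = build_column_graph(genomes)
    return occurs_in_graph(pattern, genomes, edges) and not occurs_in_strings(pattern, genomes)


genomes = ["GATTACA", "GACTAGA"]  # hypothetical toy data
print(is_chimeric("TTAG", genomes))  # True: the path switches from genome 1 to genome 2
print(is_chimeric("TTAC", genomes))  # False: "TTAC" occurs in the first genome
```

In this toy graph, "TTAG" follows the first genome for "TT" and the second for "AG", so the graph reports a match that neither genome contains.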
Most research on pangenomic alignment uses either a string representation or a graph
representation, but not both. In this talk we first describe how a tool called MONI indexes
a pangenomic reference as a set of strings in small space such that later, for each maximal
exact match in a given read, we can quickly find that match’s length, the position of one of
its occurrences in the set of strings, and the lexicographic rank of the suffix starting with
that occurrence. We then describe how a tool called MARIA will, when fully implemented,
store a pangenomic reference as a graph in small space such that, given MONI’s output
about a maximal exact match, we can quickly report all the non-chimeric occurrences of
that match in the graph.
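The three values reported for each maximal exact match (its length, the position of one occurrence, and the lexicographic rank of the suffix starting there) can be illustrated with a brute-force sketch in Python. This is not how MONI works internally: it is a quadratic-time toy that assumes the genomes are joined with a "$" separator and sorts all suffixes explicitly. All function names and the toy data are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def find_mems(genomes, read):
    """Return (read_offset, length, text_position, suffix_rank) for each maximal exact match."""
    text = "$".join(genomes) + "$"  # assumption: "$" never occurs in reads
    sa = suffix_array(text)

    # For each read offset, find the longest prefix of read[i:] occurring in the text.
    stats = []
    for i in range(len(read)):
        best = (0, None, None)
        for rank, pos in enumerate(sa):
            length = common_prefix_len(text[pos:], read[i:])
            if length > best[0]:
                best = (length, pos, rank)
        stats.append(best)

    # A match is left-maximal if the match one offset earlier was not longer.
    mems = []
    for i, (length, pos, rank) in enumerate(stats):
        if length > 0 and (i == 0 or stats[i - 1][0] <= length):
            mems.append((i, length, pos, rank))
    return mems


genomes = ["GATTACA", "GACTAGA"]  # hypothetical toy data
for offset, length, pos, rank in find_mems(genomes, "TTAG"):
    print(offset, length, pos, rank)
# Reports two matches of length 3, starting at read offsets 0 and 1.
```

The read "TTAG" does not occur in either toy genome, so the sketch splits it into "TTA" (found in the first genome) and "TAG" (found in the second). A graph-side component could then use each position and rank to look up occurrences in the graph.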
Combining MONI and MARIA will give us the advantages of working with both strings and
graphs: we index the set of reference genomes, the whole set of reference genomes, and
nothing but the set of reference genomes, but for each maximal exact match we output
relatively few occurrences in the graph, which are easy to use later in a pipeline.
Keywords:
pangenomic alignment / reference genomes / data structures / indexing
Source:
4th Belgrade Bioinformatics Conference, 2023, 4, 20-20
Publisher:
- Belgrade : Institute of molecular genetics and genetic engineering
Funding / projects:
- This talk covers results obtained in collaboration with many other researchers, in particular Christina Boucher and Marco Oliva at the University of Florida, Ben Langmead at Johns Hopkins University and Massimiliano Rossi at Illumina, for MONI; and Andrej Baláž, Adrián Goga and Alessia Petescia at Comenius University, Simon Heumos at the University of Tübingen and Jouni at the UCSC Genomics Institute, for MARIA. The author was funded by NSERC grant RGPIN-07185-2020, NSF/BIO grant DBI-2029552 to Christina Boucher, and NIH/NHGRI grant R01HG011392 to Ben Langmead.
Note:
- Book of abstracts: 4th Belgrade Bioinformatics Conference, June 19-23, 2023
Collections
Institution/group
Institut za molekularnu genetiku i genetičko inženjerstvo
TY  - CONF
AU  - Gagie, Travis
PY  - 2023
UR  - https://belbi.bg.ac.rs/
UR  - https://imagine.imgge.bg.ac.rs/handle/123456789/1956
PB  - Belgrade : Institute of molecular genetics and genetic engineering
C3  - 4th Belgrade Bioinformatics Conference
T1  - Pangenomic Alignment: Strings plus Graphs
EP  - 20
SP  - 20
VL  - 4
UR  - https://hdl.handle.net/21.15107/rcub_imagine_1956
ER  -
Gagie, T. (2023). Pangenomic Alignment: Strings plus Graphs. In 4th Belgrade Bioinformatics Conference. Belgrade : Institute of molecular genetics and genetic engineering, 4, 20-20. https://hdl.handle.net/21.15107/rcub_imagine_1956
Gagie T. Pangenomic Alignment: Strings plus Graphs. In 4th Belgrade Bioinformatics Conference. 2023;4:20-20. https://hdl.handle.net/21.15107/rcub_imagine_1956
Gagie, Travis, "Pangenomic Alignment: Strings plus Graphs" in 4th Belgrade Bioinformatics Conference, 4 (2023):20-20, https://hdl.handle.net/21.15107/rcub_imagine_1956