Clustering and classification of SARS-COV-2 isolates using RSCU
Authors
Malkov, S.
Pavlović Lažetić, G.
Stojanović, B.
Maljković, M.
Veljković, A.
Kapunac, S.
Mitić, N.
Other contributors
Morić, Ivana
Đorđević, Valentina
Conference contribution (Published version)
© 2023 Institute of Molecular Genetics and Genetic Engineering, University of Belgrade
Abstract
The existence of a large number of sequenced SARS-COV-2 isolates provides an
opportunity to observe genomic variability in a massive sample. The goal of our research
was to use data mining techniques to study a possible correlation between codon usage
and classification by WHO-labels in a certain period of time.
The material includes 745,533 isolates with 12,236,672 coding sequences (proteins) from
NCBI (10.08.2022.). Relative synonymous codon usage (RSCU) was used as a measure of
codon usage. Samples are associated with WHO-labels (based on Pango_Id) and time
intervals. For some isolates, the WHO-label was inconsistent with the period in which
the respective strain was actually present; these isolates were excluded from the
sample. Isolates without assigned WHO-labels were also excluded. In addition, individual coding sequences
containing ambiguous nucleotide codes were eliminated.
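
The RSCU measure and the ambiguity filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names `has_ambiguous_bases` and `rscu` are hypothetical, the standard genetic code is assumed, and setting RSCU to 0.0 for amino acids absent from a sequence is an assumption made here for simplicity. For a codon j of amino acid i, RSCU is the observed count divided by the mean count over all synonymous codons of that amino acid.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def has_ambiguous_bases(cds):
    """Return True if the sequence contains anything other than A, C, G, T."""
    return any(base not in "ACGT" for base in cds.upper())


def rscu(cds):
    """Compute RSCU for the 61 sense codons of one coding sequence.

    RSCU(codon) = n_synonyms * count(codon) / total count of the synonymous family.
    Assumption: families that do not occur in the sequence get 0.0.
    """
    cds = cds.upper()
    if has_ambiguous_bases(cds) or len(cds) % 3 != 0:
        raise ValueError("sequence must be unambiguous and a multiple of 3 long")

    counts = Counter(cds[i:i + 3] for i in range(0, len(cds), 3))

    # Group codons into synonymous families by encoded amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, codons in families.items():
        if aa == "*":  # stop codons are left out of the feature vector
            continue
        total = sum(counts[c] for c in codons)
        for c in codons:
            values[c] = len(codons) * counts[c] / total if total else 0.0
    return values


example = rscu("GCTGCTGCCAAA")  # toy sequence: Ala, Ala, Ala, Lys
```

In this toy sequence, GCT accounts for two of the three alanine codons, so its RSCU is 4 × 2 / 3; a value of 1.0 would mean the codon is used exactly as often as expected under uniform synonymous usage.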
Clustering was performed for each of the 12 common types of coding sequences
(proteins), with multiple methods and different numbers of clusters. Neural clustering
gave the best results. For different protein types, different degrees of RSCU variability
are observed. In the case of proteins with a small variation in nucleotide contents, over
95% of the material belongs to a single cluster, while the other clusters are of negligible
size. In the case of proteins with more variations, a higher number of pure clusters (by
WHO-labels) is obtained, with a small number of heterogeneous clusters (about 10% of
the material). These heterogeneous clusters contain isolates with different WHO-labels
that were present in parallel at some point, as a kind of transitional form between
two strains.
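
The notions of pure and heterogeneous clusters used above can be made concrete with a small purity calculation. The sketch below is illustrative only: the helper names, the purity threshold of 1.0 and the toy cluster assignments are assumptions, not data or code from the study.

```python
from collections import Counter, defaultdict


def summarize_clusters(cluster_ids, who_labels, purity_threshold=1.0):
    """Per cluster: size, majority WHO-label, purity, and whether it counts as pure."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, who_labels):
        by_cluster[cluster_id][label] += 1

    summary = {}
    for cluster_id, label_counts in by_cluster.items():
        size = sum(label_counts.values())
        majority_label, majority_count = label_counts.most_common(1)[0]
        purity = majority_count / size
        summary[cluster_id] = {
            "size": size,
            "majority_label": majority_label,
            "purity": purity,
            "pure": purity >= purity_threshold,
        }
    return summary


def heterogeneous_fraction(summary):
    """Share of all samples that fall into clusters not counted as pure."""
    total = sum(s["size"] for s in summary.values())
    mixed = sum(s["size"] for s in summary.values() if not s["pure"])
    return mixed / total


# Made-up assignments: cluster 2 mixes two labels, the others are pure.
clusters = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
labels = ["Alpha", "Alpha", "Alpha", "Delta", "Delta",
          "Delta", "Omicron", "Omicron", "Omicron", "Omicron"]
summary = summarize_clusters(clusters, labels)
mixed_share = heterogeneous_fraction(summary)
```

Lowering `purity_threshold` below 1.0 would treat nearly homogeneous clusters as pure, which may be a more practical choice on a sample of this size.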
Different classification models were created on the same sample. Models based on
protein types with higher diversity between coding sequences are highly accurate
(96-100%). The classification models were then used to assign WHO-labels to isolates
that previously had none.
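
The abstract does not say which classification models were used. As a stand-in, the sketch below shows how a very simple nearest-centroid classifier could assign WHO-labels to unlabeled isolates from RSCU-style feature vectors. The function names, the two-dimensional toy vectors and the choice of classifier are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Average the feature vectors of each label into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Toy training data: two made-up RSCU components per isolate.
train_vectors = [[1.8, 0.2], [1.7, 0.3], [0.4, 1.6], [0.5, 1.5]]
train_labels = ["Alpha", "Alpha", "Delta", "Delta"]
centroids = fit_centroids(train_vectors, train_labels)

# Isolates without a WHO-label receive the label of the nearest centroid.
predicted = [predict(centroids, v) for v in [[1.6, 0.4], [0.3, 1.7]]]
```

A real feature vector would hold one RSCU value per sense codon for the chosen protein type, and any reported accuracy would need to be measured on held-out labeled isolates.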
Keywords:
SARS-COV-2 / RSCU / clustering / classification
Source:
4th Belgrade Bioinformatics Conference, 2023, 4, 39-39
Publisher:
- Belgrade : Institute of molecular genetics and genetic engineering
Note:
- Book of abstracts: 4th Belgrade Bioinformatics Conference, June 19-23, 2023
Collections
Institution/group
Institut za molekularnu genetiku i genetičko inženjerstvo
URL: https://belbi.bg.ac.rs/
URL: https://imagine.imgge.bg.ac.rs/handle/123456789/1977
Malkov, S., Pavlović Lažetić, G., Stojanović, B., Maljković, M., Veljković, A., Kapunac, S., & Mitić, N. (2023). Clustering and classification of SARS-COV-2 isolates using RSCU. In 4th Belgrade Bioinformatics Conference, 4, 39-39. Belgrade: Institute of molecular genetics and genetic engineering. https://hdl.handle.net/21.15107/rcub_imagine_1977