Publication
Title
Clustering genomic words in human DNA using peaks and trends of distributions
Author
Abstract
In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
Language
English
Source (journal)
Advances in data analysis and classification. - Berlin, 2007, currens
Publication
Berlin : Springer, 2020
ISSN
1862-5347 [print]
1862-5355 [online]
DOI
10.1007/S11634-019-00362-X
Volume/pages
14:1 (2020), p. 57-76
ISI
000518730900004
Record
Identifier
Creation 27.02.2024
Last edited 07.03.2024