Title: Clustering genomic words in human DNA using peaks and trends of distributions
Author
Abstract: In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
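The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not name the robust fitting method, so a running median is swapped in as a simple outlier-resistant baseline, and peaks are taken to be residuals exceeding a MAD-based threshold. The function names, the window size, and the cutoff `k` are all hypothetical choices made for illustration, and the lag is assumed to be the distance between consecutive occurrences of a word.

```python
from statistics import median


def lag_counts(sequence, word, max_lag):
    """Count distances between consecutive occurrences of `word` (overlaps allowed)."""
    positions = [i for i in range(len(sequence) - len(word) + 1)
                 if sequence.startswith(word, i)]
    counts = [0] * (max_lag + 1)  # counts[k] = number of lags equal to k
    for a, b in zip(positions, positions[1:]):
        if b - a <= max_lag:
            counts[b - a] += 1
    return counts


def running_median(values, half_window):
    """Assumed baseline estimator: sliding-window median, truncated at both ends."""
    n = len(values)
    return [median(values[max(0, i - half_window):min(n, i + half_window + 1)])
            for i in range(n)]


def split_trend_and_peaks(values, half_window=3, k=3.0):
    """Return (baseline, peaks); peaks is zero except where a residual is large."""
    baseline = running_median(values, half_window)
    residuals = [v - b for v, b in zip(values, baseline)]
    center = median(residuals)
    # Median absolute deviation, rescaled to be comparable to a standard deviation.
    mad = median([abs(r - center) for r in residuals])
    threshold = k * 1.4826 * mad  # if mad is 0, any residual above center counts
    peaks = [r if r - center > threshold else 0.0 for r in residuals]
    return baseline, peaks


# Toy lag histogram: a smooth decay with one spike at lag index 6.
toy = [20, 19, 18, 17, 16, 15, 40, 13, 12, 11, 10, 9, 8]
baseline, peaks = split_trend_and_peaks(toy, half_window=2)
peak_lags = [i for i, p in enumerate(peaks) if p != 0]
```

Under these assumptions, each word would be summarized by a smooth baseline vector and a sparse peak vector, and words could then be compared on either component. How those components are combined into a clustering is not specified in the abstract and is left out of the sketch.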
Language: English
Source (journal): Advances in data analysis and classification. - Berlin, 2007, currens
Publication: Berlin : Springer, 2020
ISSN: 1862-5347 [print]; 1862-5355 [online]
DOI: 10.1007/S11634-019-00362-X
Volume/pages: 14:1 (2020), p. 57-76
ISI: 000518730900004