Generating, sampling and counting subclasses of regular tree languages

Antonopoulos, Timos; Geerts, Floris; Martens, Wim; Neven, Frank

doi:10.1007/S00224-012-9428-X

Title

Generating, sampling and counting subclasses of regular tree languages

Author

Antonopoulos, Timos

Geerts, Floris

Martens, Wim

Neven, Frank

Abstract

To experimentally validate learning and approximation algorithms for XML Schema Definitions (XSDs), we need algorithms that generate a corpus of XSDs uniformly at random, as well as a similarity measure that quantifies how closely a generated XSD resembles the target schema. In this paper, we provide the formal foundation for such a testbed. We adopt similarity measures based on counting the number of common and different trees in the two languages, and we develop the necessary machinery for computing them. We use the formalism of extended DTDs (EDTDs) to represent the unranked regular tree languages. In particular, we obtain an efficient algorithm to count the number of trees up to a certain size in the language of an unambiguous EDTD. The class of unambiguous EDTDs encompasses the more familiar classes of single-type, restrained-competition and bottom-up deterministic EDTDs. The single-type EDTDs correspond precisely to the core of XML Schema, while the others are strictly more expressive. We also show how constraints on the shape of allowed trees can be incorporated. As we make use of a translation into a well-known formalism for combinatorial specifications, we obtain for free a sampling procedure to draw trees from the language of any unambiguous EDTD. When dropping the restriction to unambiguous EDTDs, i.e., taking the full class of EDTDs into account, we show that the counting problem becomes #P-complete and provide an approximation algorithm. Finally, we discuss uniform generation of single-type EDTDs, i.e., the formal abstraction of XSDs. To this end, we provide an algorithm to generate k-occurrence automata (k-OAs) uniformly at random and show how this leads to the uniform generation of single-type EDTDs.

Language

English

Source (journal)

Theory of computing systems. - New York, N.Y.

Source (book)

International Conference on Database Theory (ICDT) held Jointly with the International Conference on Extending Database Technology (EDBT), March 21-25, 2011, Uppsala, Sweden

Publication

New York, N.Y. : 2013

ISSN

1432-4350

DOI

10.1007/S00224-012-9428-X

Volume/pages

52:3 (2013), p. 542-585

ISI

000316087900007

Full text (Publisher's DOI)

https://doi.org/10.1007/S00224-012-9428-X

Full text (publisher's version - intranet only)

https://repository.uantwerpen.be/docman/iruaauth/38a4c7/a343856.pdf

Faculty/Department

Faculty of Sciences. Mathematics and Computer Science

Research group

ADReM Data Lab (ADReM)

Publication type

A1 Journal article

Subject

Mathematics Computer. Automation

Affiliation

Publications with a UAntwerp address

Identifier

Creation

05.06.2013

Last edited

09.10.2023

To cite this reference

https://hdl.handle.net/10067/1082720151162165141