ABSTRACT

Over the last several years, graph mining has emerged as a new field within contemporary data mining. One of the central tasks is the search for subgraphs, called patterns, that occur frequently in either a collection of graphs (e.g., databases of molecules [6], game positions [15], scene descriptions) or in a single large graph

(e.g., the Internet, citation networks [16], social networks [12], protein interaction networks [10]). In the literature, the terms frequency and support have been used interchangeably to denote the measure to quantify the prevalence of a pattern. In the single-graph setting, however, the notion of frequency is not at all straightforward to define. For example, the obvious definition of taking the number of instances of a pattern as its frequency has the undesirable property that extending the pattern (i.e., making it more restrictive) may actually increase its frequency. Indeed, consider for instance, the unlabeled k-clique Kk as the single graph in which we want to find patterns. There are

) different embeddings under subgraph isomorphism of the unla-

beled path of length 1 in Kk, whereas there are 3 (k 3

) embeddings of the path of length

2 in Kk. In fact, the number of different embeddings may increase exponentially in the size of the pattern. Hence, as pointed out by Vanetik et al. [17], a good frequency measure must be such that the frequency of a superpattern is always at most as high as that of a subpattern. This property is called the antimonotonicity. Also, for reasons of efficiency, antimonotonicity of the frequency measure is highly desirable, as it allows for pruning large parts of the search space in a general-to-specific exploration. The efficiency and correctness of most existing graph pattern miners relies critically on the antimonotonicity of the frequency measure being used.