
InfoSift: A Novel, Mining-Based Framework for Document Classification

Sharma Chakravarthy, Aravind Venkatachalam, Aditya Telang, Manu Aery


A number of approaches, including machine learning, probabilistic, and information retrieval techniques, have been proposed for classifying/retrieving documents. These approaches mainly use words from the documents without considering any potential structural properties of the document. In particular, they do not exploit structural information that may be present in the documents, or the importance of groups of terms that co-occur in different parts of the documents and their relationships. However, many documents, such as emails, web pages, and text documents, have a basic structure that can be beneficially leveraged for classification/retrieval. This paper proposes a novel, graph-based mining framework for document classification that takes the structure of a document into account. Our approach is based on the intuition that representative structures or patterns (common and recurring ones, not just words) can be extracted from a pre-classified document class, and that similarity with these extracted patterns can be effectively used for classifying incoming/new documents. To the best of our knowledge, this approach has not been tried for the classification of text, email, or web pages (documents in general). First, we establish the applicability of this approach by identifying a suitable graph mining technique. Next, we establish relevant parameters and their derivation from the corpus characteristics. The notion of inexact graph match is critical for our approach, both for extracting substructures and for identifying similar substructures. Second, we extend our approach to multi-class classification, which is essential for real-world applications. Ranking of substructures globally (across classes) is needed for this purpose. A TF-IDF-like formula is proposed for global ranking of substructures. Approaches proposed for the computation of the components of the ranking formula are discussed along with their computational challenges. Finally, extensive experimental analysis is carried out for three diverse document types (emails, text, and web pages) to demonstrate the generality and effectiveness of the proposed framework. We believe that we have established the efficacy of this framework for the classification of any input that exhibits some structure, and that it can be extended to other forms of input.
