
An Application of MDL Principle for Indian Resource Poor Language

Miral Pritesh Patel, Apurva Shah

Abstract


A stemmer is an essential module of any morphological system. Stemming, which separates a given word into stem and suffix, is language dependent. Despite notable growth in the field, work at the morphological level for resource-poor Indian languages such as Sanskrit, Assamese, Bengali, Bishnupriya, Manipuri, and Bodo has received little attention. Standard resources (corpora, data sets) for experiments are very scarce for such languages. Many well-known unsupervised approaches have been tested on European languages only, so it is necessary to see how well such approaches work for other inflective, resource-poor languages. In this study, the Minimum Description Length (MDL) principle is applied to Sanskrit, a resource-poor and inflective language. Initially, every word in the corpus lexicon is split into substrings, and the frequency and length of each substring are calculated. The split with the highest probability is taken as the best split into stem and suffix. Multiple iterations are then run until the result stops improving. With a result of 72%, MDL works well for an Indian language. The MDL principle is then extended with a rule-based approach to improve the performance of the Sanskrit stemmer; this MDL-based hybrid approach improves the result by 17%. As no Sanskrit stemmer or evaluation is available for direct comparison, we compare our work with the Lovins, Porter, and Paice stemmers. The word stemmed factor of our stemmer is the highest of the stemmers compared. Our results are also comparable to those of Gujarati and Punjabi stemmers. The stemmer is stronger because it reduces under-stemming errors.
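The split-count-score-iterate procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the scoring function (substring length times the log of its smoothed frequency), the `min_stem` parameter, the function names, and the toy transliterated word list are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def candidate_splits(word, min_stem=2):
    """All (stem, suffix) splits of a word; the suffix may be empty."""
    start = min(min_stem, len(word))
    return [(word[:i], word[i:]) for i in range(start, len(word) + 1)]


def count_parts(splits_per_word):
    """Count how often each stem and each suffix occurs in the given splits."""
    stems, suffixes = Counter(), Counter()
    for splits in splits_per_word.values():
        for stem, suffix in splits:
            stems[stem] += 1
            suffixes[suffix] += 1
    return stems, suffixes


def score(stem, suffix, stems, suffixes):
    """Assumed heuristic: long, frequent substrings score higher.

    Counts are smoothed with +1 so unseen substrings score 0 instead of
    raising a math domain error.
    """
    return (len(stem) * math.log(1 + stems[stem])
            + len(suffix) * math.log(1 + suffixes[suffix]))


def segment(words, max_iters=10):
    """Choose a best split per word, re-counting until the choice is stable."""
    words = sorted(set(words))
    # First pass: every candidate split contributes to the counts.
    current = {w: candidate_splits(w) for w in words}
    best = {}
    for _ in range(max_iters):
        stems, suffixes = count_parts(current)
        new_best = {
            w: max(candidate_splits(w),
                   key=lambda s: score(s[0], s[1], stems, suffixes))
            for w in words
        }
        if new_best == best:  # no change since the last pass: stop
            break
        best = new_best
        # Later passes: only the chosen splits contribute to the counts.
        current = {w: [split] for w, split in best.items()}
    return best


# Toy word list, invented for illustration only.
toy_lexicon = ["ramah", "ramam", "ramena", "devah", "devam", "devena"]
for word, (stem, suffix) in segment(toy_lexicon).items():
    print(word, "->", stem, "+", suffix)
```

A true MDL model would compare the total description length of the stem and suffix lexicons plus the corpus encoded with them; the frequency-and-length score here is a simpler stand-in chosen to keep the sketch short.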


References


Amaresh Kumar Pandey, T. J. S. 2008. No Title. In Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data (AND ’08). 99–105.

Ameta, J., Joshi, N., and Mathur, I. 2011. A Lightweight Stemmer for Gujarati. In Proceedings of the 46th Annual National Convention of Computer Society of India.

Bhadra, M., Singh, S. K., Kumar, S., Subash, Agrawal, M., Chandrasekhar, R., Mishra, S. K., and Jha, G. N. 2009. Sanskrit Analysis System (SAS). In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 1–20.

Bhamidipati, N. L. and Pal, S. K. 2007. Stemming via distribution-based word segregation for classification and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 37, 2, 350–360.

Briggs, R. 1985. Knowledge Representation in Sanskrit and Artificial Intelligence. AI Magazine 6, 1, 32–39.

Caumanns, J. 1999. A Fast and Simple Stemming Algorithm for German Words. Technical Report B 99/16, 10.

Dolamic, L. and Savoy, J. 2009. Indexing and stemming approaches for the Czech language. Information Processing & Management 45, 6 (nov), 714–720.

Goldsmith, J. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27, 2 (jun), 153–198.

Goyal, P., Huet, G., Kulkarni, A., Scharf, P., and Bunker, R. 2012. A Distributed Platform for Sanskrit Processing. In Proceedings of COLING 2012: Technical Papers. Mumbai, 1011–1028.

Hammarström, H. and Borin, L. 2011. Unsupervised Learning of Morphology. Computational Linguistics 37, 2 (jun), 309–350.

Huet, G. 2009. Formal structure of Sanskrit text: Requirements analysis for a mechanical Sanskrit processor. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).

Jha, G. N., Agrawal, M., Subash, Mishra, S. K., Mani, D., Mishra, D., Bhadra, M., and Singh, S. K. 2009. Inflectional morphology analyzer for Sanskrit. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).

Kumar, D. and Rana, P. 2010. Design and Development of a Stemmer for Punjabi. International Journal of Computer Applications 11, 12, 18–23.

Lovins, J. B. 1968. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11, 22–31.

Majumder, P., Mitra, M., and Pal, D. 2008. Bulgarian, Hungarian and Czech Stemming Using YASS. In Advances in Multilingual and Multimodal Information Retrieval. Vol. 5152. Springer Berlin Heidelberg, Berlin, Heidelberg, 49–56.

Mayfield, J. and McNamee, P. 2003. Single n-gram stemming. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’03. Vol. 1. ACM Press, New York, New York, USA, 415–416.

McNamee, P. and Mayfield, J. 2007. N-Gram Morphemes for Retrieval. Working Notes for the CLEF 2007 Workshop, 19-21 September, Budapest, Hungary.

Nehar, A., Ziadi, D., Cherroun, H., and Guellouma, Y. 2012. An efficient stemming for Arabic Text Classification. In 2012 International Conference on Innovations in Information Technology, IIT 2012. Abu Dhabi, 328–332.

Paice, C. 2006. Stemming. In Encyclopedia of Language & Linguistics, K. Brown, Ed. Elsevier, 149–150.

Paik, J. H., Mitra, M., Parui, S. K., and Järvelin, K. 2011. Gras. ACM Transactions on Information Systems 29, 4 (nov), 1–24.

Porter, M. 2001. Snowball: A language for stemming algorithms.

Porter, M. F. 1980. The Porter Stemmer Algorithm. 14, 3, 130–137.

Ramanathan, A. and Rao, D. D. 2003. A Lightweight Stemmer for Hindi. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), on Computational Linguistics for South Asian Languages. BU.

Saharia, N. 2010. A Suffix-based Noun and Verb Classifier for an Inflectional Language. 19–22.

Saharia, N., Konwar, K. M., Sharma, U., and Kalita, J. K. 2013. An Improved Stemming Approach Using HMM for a Highly Inflectional Language.

Sheth, J. R. and Patel, B. C. 2012. Stemming Techniques and Naïve Approach for Gujarati Stemmer. In International Conference in Recent Trends in Information Technology and Computer Science (ICRTITCS - 2012) Proceedings published in International Journal of Computer Applications. IJCA, Chennai, 975–8887.

Smirnov, I. 2008. Overview of stemming algorithms. Mechanical Translation, 1–8.

Suba, K., Jiandani, D., and Bhattacharyya, P. 2011. Hybrid Inflectional Stemmer and Rule-based Derivational Stemmer for Gujarati. In Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), IJCNLP. Chiang Mai, 1–8.