Tamil Text Processing using Python

There are plenty of tools and libraries to process and analyze text in English, but only a few to process, clean and analyze Tamil text.

In this blog we will explore some of the libraries that help us process text in the Tamil language.


Some of the libraries that can process not only Tamil but also other Indic languages are:

 Indic-NLP

    The goal of the Indic NLP Library is to build Python-based libraries for common text processing and Natural Language Processing in Indian languages.

 It provides a wide range of functionality; some highlights are:




                        Image Credits : https://anoopkunchukuttan.github.io/indic_nlp_library/
  • Word Tokenization and Detokenization
  • Sentence Splitting
  • Word Segmentation
  • Indicization
  • Translation

    There are even more functions available. This library supports multiple languages; we pass lang='ta' for Tamil.
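Before applying any of these functions, it can be useful to confirm that the input really is Tamil. This check needs nothing beyond the standard library: Tamil occupies the Unicode block U+0B80–U+0BFF. The function names below are my own illustration, not part of Indic NLP:

```python
# Detect Tamil text via its Unicode block (U+0B80 - U+0BFF).
# A stdlib-only sketch; not part of the Indic NLP Library.

TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF

def is_tamil_char(ch):
    """Return True if the character falls in the Unicode Tamil block."""
    return TAMIL_START <= ord(ch) <= TAMIL_END

def tamil_ratio(text):
    """Fraction of non-space characters that are Tamil."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_tamil_char(c) for c in chars) / len(chars)

print(tamil_ratio('தமிழ்'))   # 1.0 - every character is in the Tamil block
print(tamil_ratio('hello'))   # 0.0 - no Tamil characters at all
```

A ratio threshold (say, above 0.5) is a cheap way to decide whether to route a string through the lang='ta' pipeline.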

    Tokenization
    A trivial tokenizer which simply tokenizes on punctuation boundaries. This also covers punctuation used in Indian language scripts (the purna virama and the deergha virama). It returns a list of tokens.
    from indicnlp.tokenize import indic_tokenize

    tamil_text = '''பெண்கள் இறப்பதும், பிறந்தபின் குழந்தைகள் இறப்பதும் சர்வ சாதாரணம். லேசான சிராய்ப்புகளும் கீறல்களும் கூட மரணத்திற்கு இட்டுச் சென்றன. ஒரு நுண்ணுயிரை வைத்து இன்னொன்றைக் கொல்லமுடிகிற பெனிஸிலின் போன்ற நச்சுமுறி மருந்துகள்'''

    print('Tokens: ')
    for t in indic_tokenize.trivial_tokenize(tamil_text):
        print(t)
    This will return a list of all tokens from the given text.
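To see what "tokenizing on punctuation boundaries" means, here is a rough stdlib-only sketch of the idea. This is not the library's implementation (which knows about more punctuation classes); it just peels punctuation off whitespace-separated chunks:

```python
import string

def simple_tokenize(text):
    """Rough sketch of punctuation-boundary tokenization:
    split on whitespace, then peel leading/trailing punctuation
    (including the Indic danda marks) into their own tokens."""
    puncts = set(string.punctuation) | {'\u0964', '\u0965'}  # danda, double danda
    tokens = []
    for chunk in text.split():
        lead, trail = [], []
        while chunk and chunk[0] in puncts:
            lead.append(chunk[0])
            chunk = chunk[1:]
        while chunk and chunk[-1] in puncts:
            trail.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(lead)
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trail))
    return tokens

print(simple_tokenize('சர்வ சாதாரணம்.'))  # ['சர்வ', 'சாதாரணம்', '.']
```

Note that the Tamil virama (புள்ளி, U+0BCD) is not punctuation, so words like சாதாரணம் stay intact; only the sentence-final full stop is split off.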


    Sentence Tokenization 

    A smart sentence splitter which uses a two-pass rule-based system to split the text into sentences. It knows of common prefixes in Indian languages.
    from indicnlp.tokenize import sentence_tokenize

    Text2 = '''பெண்கள் இறப்பதும், பிறந்தபின் குழந்தைகள் இறப்பதும் சர்வ சாதாரணம். லேசான சிராய்ப்புகளும் கீறல்களும் கூட மரணத்திற்கு இட்டுச் சென்றன. ஒரு நுண்ணுயிரை வைத்து இன்னொன்றைக் கொல்லமுடிகிற பெனிஸிலின் போன்ற நச்சுமுறி மருந்துகள்'''

    sentences = sentence_tokenize.sentence_split(Text2, lang='ta')
    for t in sentences:
        print(t)
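The core idea behind rule-based sentence splitting can be sketched in a few lines of plain Python. This naive single-pass version splits after sentence-ending punctuation and ignores the abbreviation/prefix handling that the library's two-pass system provides; it is an illustration, not the library's code:

```python
import re

def naive_sentence_split(text):
    """Naive single-pass sentence splitter: break after '.', '!', '?'
    or the danda (U+0964) when followed by whitespace. A real splitter
    must also avoid breaking after abbreviations and initials."""
    parts = re.split(r'(?<=[.!?\u0964])\s+', text.strip())
    return [p for p in parts if p]

for s in naive_sentence_split('சர்வ சாதாரணம். இட்டுச் சென்றன.'):
    print(s)
```

The lookbehind keeps the terminating punctuation attached to its sentence, which matches the behaviour shown by sentence_split above.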

    Text Similarity 
     
    Here, text similarity is calculated from lexical overlap using the longest common subsequence ratio (LCSR).
    from indicnlp.script import indic_scripts as isc
    from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

    lang1_str = tamil_text
    lang2_str = tamil_text
    lang1 = 'ta'
    lang2 = 'ta'

    lcsr, len1, len2 = isc.lcsr_indic(lang1_str, lang2_str, lang1, lang2)
    # print('{} string: {}'.format(lang2, UnicodeIndicTransliterator.transliterate(lang2_str, lang2, lang1)))
    print('LCSR: {}'.format(lcsr))

    This will return 1, since we passed the same string as both inputs.
    These are the most commonly used processing functions; there are a few others as well that go deeper into text processing.
    Feel free to check them out here
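LCSR stands for longest common subsequence ratio: the length of the longest common subsequence divided by the length of the longer string, so identical strings score 1.0 and unrelated strings score near 0. A minimal pure-Python sketch of the metric (an illustration of the idea, not the library's implementation, which also normalises Indic scripts before comparing):

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcsr(a, b):
    """Longest common subsequence ratio: LCS length over the longer string."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))

print(lcsr('abc', 'abc'))    # 1.0 - identical strings
print(lcsr('abcd', 'abxd'))  # 0.75 - LCS is 'abd', length 3, over max length 4
```

This explains the result above: passing the same text as both arguments makes the LCS the whole string, so the ratio is exactly 1.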

    I will do part 2 of this blog covering the other libraries mentioned above. Do follow my blog to get more posts like this.
    Do follow me on LinkedIn
