There are plenty of tools and libraries to process and analyze English text, but only a few to process, clean, and analyze Tamil text.
In this blog we will explore some of the libraries that help us process text in the Tamil language.
Some libraries that can process not only Tamil but also other Indic languages are:
- Indic-NLP
- Open-Tamil (Tamil only)
- iNLTK
- spaCy
We will go through them one by one and see what each offers. First, we will start with Indic-NLP.
Indic-NLP
The goal of the Indic NLP Library is to build Python-based libraries for common text processing and natural language processing in Indian languages.
It offers a lot of functionality; some of its features are:
- Word Tokenization and Detokenization
- Sentence Splitting
- Word Segmentation
- Indicization
- Translation
and there are even more functions available.
This library supports multiple languages; we use lang='ta' for the Tamil language.
Tokenization
A trivial tokenizer that simply tokenizes on punctuation boundaries. It also handles punctuation marks used in Indian-language scripts (the purna virama and the deergha virama) and returns a list of tokens.

```python
from indicnlp.tokenize import indic_tokenize

tamil_text = '''பெண்கள் இறப்பதும், பிறந்தபின் குழந்தைகள் இறப்பதும் சர்வ சாதாரணம். லேசான சிராய்ப்புகளும் கீறல்களும் கூட மரணத்திற்கு இட்டுச் சென்றன. ஒரு நுண்ணுயிரை வைத்து இன்னொன்றைக் கொல்லமுடிகிற பெனிஸிலின் போன்ற நச்சுமுறி மருந்துகள்'''

print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(tamil_text):
    print(t)
```
This returns a list of all tokens from the given text.
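To make the idea concrete, here is a minimal pure-Python sketch of what tokenizing on punctuation boundaries means. This is only an illustration of the concept, not the library's actual implementation:

```python
import re

def trivial_punct_tokenize(text):
    """Split on whitespace, then separate punctuation characters
    (including the Devanagari danda and double danda) into their
    own tokens -- a toy version of punctuation-boundary tokenization."""
    tokens = []
    for chunk in text.split():
        # the capture group keeps the punctuation marks as tokens
        tokens.extend(t for t in re.split(r'([.,!?;:।॥])', chunk) if t)
    return tokens

print(trivial_punct_tokenize('வணக்கம், உலகம்.'))
# ['வணக்கம்', ',', 'உலகம்', '.']
```

Note how the comma and full stop become separate tokens instead of staying attached to the words.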
Sentence Tokenization
A smart sentence splitter that uses a two-pass rule-based system to split text into sentences. It knows about common prefixes and abbreviations in Indian languages.

```python
from indicnlp.tokenize import sentence_tokenize

text2 = '''பெண்கள் இறப்பதும், பிறந்தபின் குழந்தைகள் இறப்பதும் சர்வ சாதாரணம். லேசான சிராய்ப்புகளும் கீறல்களும் கூட மரணத்திற்கு இட்டுச் சென்றன. ஒரு நுண்ணுயிரை வைத்து இன்னொன்றைக் கொல்லமுடிகிற பெனிஸிலின் போன்ற நச்சுமுறி மருந்துகள்'''

sentences = sentence_tokenize.sentence_split(text2, lang='ta')
for s in sentences:
    print(s)
```
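For intuition, a naive one-pass splitter can be sketched in a few lines of plain Python. This is deliberately simpler than the library's two-pass system, which also handles abbreviations and honorific prefixes that would fool the regex below:

```python
import re

def naive_sentence_split(text):
    """Toy sentence splitter: break after sentence-final marks
    ('.', '!', '?' and the danda '।') followed by whitespace.
    Unlike the real two-pass splitter, this will wrongly split
    after abbreviations like 'Dr.'."""
    parts = re.split(r'(?<=[.!?।])\s+', text.strip())
    return [p for p in parts if p]

for s in naive_sentence_split('முதல் வாக்கியம். இரண்டாம் வாக்கியம்.'):
    print(s)
```

Comparing its output with sentence_split on real text shows why the rule-based handling of abbreviations matters.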
Text Similarity
Here text similarity is computed at the surface (lexical) level, using the longest common subsequence ratio (LCSR).

```python
from indicnlp.script import indic_scripts as isc
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

lang1_str = tamil_text
lang2_str = tamil_text
lang1 = 'ta'
lang2 = 'ta'

lcsr, len1, len2 = isc.lcsr_indic(lang1_str, lang2_str, lang1, lang2)
# print('{} string: {}'.format(lang2, UnicodeIndicTransliterator.transliterate(lang2_str, lang2, lang1)))
print('LCSR: {}'.format(lcsr))
```
This will return 1.0, since we passed the same string as both inputs. These are the most commonly used processing functions; there are a few others as well that go deeper for normal text processing.
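To see what an LCSR-style score measures, here is a small pure-Python sketch: the longest common subsequence length divided by the length of the longer string. This illustrates the idea only; the library's own version additionally works over normalized Indic characters across scripts:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcsr_sketch(a, b):
    """LCS ratio: 1.0 for identical strings, 0.0 for fully disjoint ones."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))

print(lcsr_sketch('பெண்கள்', 'பெண்கள்'))  # 1.0 -- identical inputs, as in the example above
```

Identical inputs give 1.0, matching the result of the snippet above; partially overlapping strings give a value between 0 and 1.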
Feel free to check out the library here.
Do follow me on LinkedIn.
I will do part 2 of this blog covering the other libraries mentioned above. Do follow my blogs for more posts like this.