There are plenty of tools and libraries to process and analyze English text, but only a few to process, clean, and analyze Tamil text.
In this blog we will explore some of the libraries that help us process text in the Tamil language.
Some libraries that can process not only Tamil but also other Indic languages are:
- Indic-NLP
- Open-Tamil (Tamil only)
- iNLTK
- spaCy
We will go through them one by one and see what each offers. First, we will start with Indic-NLP.
Indic-NLP
The goal of the Indic NLP Library is to build Python-based libraries for common text processing and natural language processing in Indian languages.
It offers a lot of functionality; some of its features are:
- Word Tokenization and Detokenization
- Sentence Splitting
- Word Segmentation
- Indicization
- Translation
and there are even more functions available.
This library supports multiple languages; we use lang='ta' for the Tamil language.
Tokenization
A trivial tokenizer that simply tokenizes on punctuation boundaries. It also handles punctuation marks used in Indian-language scripts (the purna virama and the deergha virama) and returns a list of tokens.

```python
from indicnlp.tokenize import indic_tokenize

tamil_text = '''பெண்கள் இறப்பதும், பிறந்தபின் குழந்தைகள் இறப்பதும் சர்வ சாதாரணம். லேசான சிராய்ப்புகளும் கீறல்களும் கூட மரணத்திற்கு இட்டுச் சென்றன. ஒரு நுண்ணுயிரை வைத்து இன்னொன்றைக் கொல்லமுடிகிற பெனிஸிலின் போன்ற நச்சுமுறி மருந்துகள்'''

print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(tamil_text):
    print(t)
```
This returns a list of all tokens from the given text.
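To make the idea concrete, here is a minimal pure-Python sketch of what tokenizing on punctuation boundaries means. This is only an illustration of the concept, not the library's actual implementation:

```python
import re

def trivial_punct_tokenize(text):
    """Split on whitespace, then separate punctuation characters
    (including the Devanagari danda and double danda) into their
    own tokens -- a toy version of punctuation-boundary tokenization."""
    tokens = []
    for chunk in text.split():
        # the capture group keeps the punctuation marks as tokens
        tokens.extend(t for t in re.split(r'([.,!?;:।॥])', chunk) if t)
    return tokens

print(trivial_punct_tokenize('வணக்கம், உலகம்.'))
# ['வணக்கம்', ',', 'உலகம்', '.']
```

Note how the comma and full stop become separate tokens instead of staying attached to the words.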
Sentence Tokenization
A smart sentence splitter that uses a two-pass rule-based system to split text into sentences. It knows about common prefixes and abbreviations in Indian languages.

```python
from indicnlp.tokenize import sentence_tokenize

text2 = '''பெண்கள் இறப்பதும், பிறந்தபின் குழந்தைகள் இறப்பதும் சர்வ சாதாரணம். லேசான சிராய்ப்புகளும் கீறல்களும் கூட மரணத்திற்கு இட்டுச் சென்றன. ஒரு நுண்ணுயிரை வைத்து இன்னொன்றைக் கொல்லமுடிகிற பெனிஸிலின் போன்ற நச்சுமுறி மருந்துகள்'''

sentences = sentence_tokenize.sentence_split(text2, lang='ta')
for s in sentences:
    print(s)
```
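For intuition, a naive one-pass splitter can be sketched in a few lines of plain Python. This is deliberately simpler than the library's two-pass system, which also handles abbreviations and honorific prefixes that would fool the regex below:

```python
import re

def naive_sentence_split(text):
    """Toy sentence splitter: break after sentence-final marks
    ('.', '!', '?' and the danda '।') followed by whitespace.
    Unlike the real two-pass splitter, this will wrongly split
    after abbreviations like 'Dr.'."""
    parts = re.split(r'(?<=[.!?।])\s+', text.strip())
    return [p for p in parts if p]

for s in naive_sentence_split('முதல் வாக்கியம். இரண்டாம் வாக்கியம்.'):
    print(s)
```

Comparing its output with sentence_split on real text shows why the rule-based handling of abbreviations matters.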
Text Similarity
Here text similarity is computed at the surface (lexical) level, using the longest common subsequence ratio (LCSR).

```python
from indicnlp.script import indic_scripts as isc
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

lang1_str = tamil_text
lang2_str = tamil_text
lang1 = 'ta'
lang2 = 'ta'

lcsr, len1, len2 = isc.lcsr_indic(lang1_str, lang2_str, lang1, lang2)
# print('{} string: {}'.format(lang2, UnicodeIndicTransliterator.transliterate(lang2_str, lang2, lang1)))
print('LCSR: {}'.format(lcsr))
```
This will return 1.0, since we passed the same string as both inputs. These are the most commonly used processing functions; there are a few others as well that go deeper for normal text processing.
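To see what an LCSR-style score measures, here is a small pure-Python sketch: the longest common subsequence length divided by the length of the longer string. This illustrates the idea only; the library's own version additionally works over normalized Indic characters across scripts:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcsr_sketch(a, b):
    """LCS ratio: 1.0 for identical strings, 0.0 for fully disjoint ones."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))

print(lcsr_sketch('பெண்கள்', 'பெண்கள்'))  # 1.0 -- identical inputs, as in the example above
```

Identical inputs give 1.0, matching the result of the snippet above; partially overlapping strings give a value between 0 and 1.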
Feel free to check out the library here.
Do follow me on LinkedIn.
I will do part 2 of this blog covering the other libraries mentioned above. Do follow my blogs for more posts like this.