Exploratory data analysis (EDA) is one of the most widely used steps in Machine Learning and NLP.
Many EDA techniques are available for Machine Learning in general, but far fewer exist for NLP, i.e. for text data.
A word cloud is a data visualization technique for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites, as well as many other kinds of text data.
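To make the "size indicates frequency" idea concrete, here is a minimal sketch that counts words with the standard library and maps each count to a font size. The helper name `word_sizes`, the linear scaling, and the `min_size`/`max_size` defaults are illustrative assumptions, not the layout algorithm the wordcloud package actually uses.

```python
from collections import Counter

def word_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size that grows with its frequency (hypothetical helper)."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    low, top = min(counts.values()), max(counts.values())
    span = (top - low) or 1  # avoid dividing by zero when all counts are equal
    return {
        word: min_size + (count - low) * (max_size - min_size) / span
        for word, count in counts.items()
    }

sizes = word_sizes("trump senate trump budget senate trump")
print(sizes)
```

The most frequent word gets `max_size`, the rarest gets `min_size`, and everything else falls in between.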
Libraries used
pip install matplotlib
pip install pandas
pip install wordcloud
pip install nltk
Suppose you have text data like this.
df.head() shows the table below:
| | title | text | subject | date |
---|---|---|---|---|
0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |
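Apply a simple preprocessing step:
import nltk
from nltk.corpus import stopwords

# One-time setup: nltk.download('punkt') and nltk.download('stopwords')
stop = set(stopwords.words('english'))

def tokenizeandstopwords(text):
    tokens = nltk.word_tokenize(text)
    # keep only alphabetic tokens (drops punctuation and numbers)
    token_words = [w for w in tokens if w.isalpha()]
    # drop English stop words; the NLTK list is lowercase, so compare in lowercase
    meaningful_words = [w for w in token_words if w.lower() not in stop]
    return " ".join(meaningful_words)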
Calling the Pre-Processing Function
df['text'] = df['text'].apply(tokenizeandstopwords)
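To see what this kind of cleaning does to a single sentence without downloading any NLTK data, here is a small standard-library sketch. It is a stand-in, not the function above: the regex tokenizer replaces `nltk.word_tokenize` (and will split some tokens differently), and `SAMPLE_STOP` is a tiny made-up stop-word list used only for illustration.

```python
import re

# Tiny illustrative stop-word list; the NLTK English list is much larger.
SAMPLE_STOP = {"the", "of", "a", "will", "be", "to", "in", "on"}

def clean_text(text, stop_words=SAMPLE_STOP):
    # Regex stand-in for tokenizing: keep runs of letters only,
    # which drops punctuation and numbers in one step.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Remove stop words, comparing in lowercase.
    kept = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)

sample = "The senators will vote on the budget in January, 2018!"
print(clean_text(sample))
```

Only the content-bearing words survive, which is what we want the word cloud to be built from.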
To generate a word cloud, we can use the function below:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_word_cloud(text):
    # Build the cloud on a large black canvas
    wordcloud = WordCloud(width=3000, height=2000,
                          background_color='black').generate(str(text))
    # Draw it with matplotlib, without axes or padding
    plt.figure(figsize=(40, 30), facecolor='k', edgecolor='k')
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()
After preprocessing the data, call the function to generate the word cloud:
# Join all rows into one string; str() of a large NumPy array is abbreviated with '...'
text = " ".join(df['text'])
generate_word_cloud(text)
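A quick way to sanity-check the picture is to list the most frequent words as plain text and compare them with the largest words in the cloud. The sketch below assumes the documents are already cleaned strings; `top_words` is a hypothetical helper and the three sample documents are made up.

```python
from collections import Counter

def top_words(documents, n=5):
    """Return the n most frequent words across an iterable of cleaned strings."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts.most_common(n)

docs = ["budget fight looms", "budget vote delayed", "senate budget vote"]
print(top_words(docs, n=2))
```

With the DataFrame from this post, the same check would be `top_words(df['text'])`.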
Feel free to use the code; all of it is available in my Kaggle kernel.
Link to the kernel: https://www.kaggle.com/vpkprasanna/basic-text-cleaning-wordcloud-and-n-gram-analysis
Follow me on
LinkedIn: https://www.linkedin.com/in/vpkprasanna/