Exploratory data analysis is one of the most fundamental and widely used steps in Machine Learning and NLP.

Many exploratory techniques are available for Machine Learning in general, but far fewer exist for NLP, i.e. text data.

A word cloud is a data visualization technique for text data in which the size of each word indicates its frequency or importance. It makes the most prominent terms in a body of text easy to spot. Word clouds are widely used for analyzing data from social networking websites, as well as many other kinds of text data.
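To make the "size indicates frequency" idea concrete, here is a minimal sketch that counts word frequencies using only the Python standard library. The sample sentence and the `word_frequencies` helper are made up for illustration; a word cloud library performs a broadly similar count before scaling each word.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Return a Counter mapping each lowercase word to how often it occurs."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

# Made-up sample sentence, not taken from the dataset
sample = "Data science needs data. Clean data makes good models."
freqs = word_frequencies(sample)

# The most frequent word is the one a word cloud would draw largest
print(freqs.most_common(1))  # [('data', 3)]
```

In a word cloud built from this sentence, "data" would be drawn larger than every other word because it occurs three times while the rest occur once.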

Libraries used

pip install matplotlib
pip install pandas
pip install wordcloud
pip install nltk

Suppose you have text data like the following.

Calling df.head() shows the table below:

	title	text	subject	date
0	As U.S. budget fight looms, Republicans flip t...	WASHINGTON (Reuters) - The head of a conservat...	politicsNews	December 31, 2017
1	U.S. military to accept transgender recruits o...	WASHINGTON (Reuters) - Transgender people will...	politicsNews	December 29, 2017
2	Senior U.S. Republican senator: 'Let Mr. Muell...	WASHINGTON (Reuters) - The special counsel inv...	politicsNews	December 31, 2017
3	FBI Russia probe helped by Australian diplomat...	WASHINGTON (Reuters) - Trump campaign adviser ...	politicsNews	December 30, 2017
4	Trump wants Postal Service to charge 'much mor...	SEATTLE/WASHINGTON (Reuters) - President Donal...	politicsNews	December 29, 2017

Apply a simple preprocessing step

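import nltk
from nltk.corpus import stopwords

# One-time downloads (newer NLTK releases may also need 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
stop = set(stopwords.words('english'))  # assumes English text

def tokenizeandstopwords(text):
    tokens = nltk.word_tokenize(text)
    # keep only alphabetic tokens (drops punctuation and numbers)
    token_words = [w for w in tokens if w.isalpha()]
    meaningful_words = [w for w in token_words if w.lower() not in stop]  # stop list is lowercase
    return " ".join(meaningful_words)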

Call the preprocessing function

df['text'] = df['text'].apply(tokenizeandstopwords)
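To see what this kind of cleaning does without downloading any NLTK resources, the sketch below applies the same idea (keep alphabetic tokens, drop stop words) using only the standard library. The regex tokenizer, the tiny `SAMPLE_STOP_WORDS` set and the sample sentence are simplifications assumed for illustration; they are not NLTK's tokenizer or its stop word list, so results on real text will differ.

```python
import re

# Hypothetical, deliberately tiny stop word set; NLTK's English list is much longer
SAMPLE_STOP_WORDS = {"the", "of", "a", "on", "will", "be", "to", "in"}

def simple_clean(text):
    """Keep alphabetic tokens only and drop stop words (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    kept = [w for w in tokens if w.lower() not in SAMPLE_STOP_WORDS]
    return " ".join(kept)

cleaned = simple_clean("The head of a conservative group spoke on Sunday, 2017.")
print(cleaned)  # head conservative group spoke Sunday
```

Punctuation and the year are dropped because they are not alphabetic, and the stop words disappear regardless of their capitalization.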

To generate a word cloud, we can use the function below:

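import matplotlib.pyplot as plt
from wordcloud import WordCloud

def generate_word_cloud(text):
    # Join a list/array of documents into one string; str() on a large
    # NumPy array abbreviates it with "..." and would drop most of the text
    if not isinstance(text, str):
        text = " ".join(map(str, text))
    wordcloud = WordCloud(
        width = 3000,
        height = 2000,
        background_color = 'black').generate(text)
    fig = plt.figure(
        figsize = (40, 30),
        facecolor = 'k',
        edgecolor = 'k')
    plt.imshow(wordcloud, interpolation = 'bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()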

After preprocessing the data, call the word cloud function to generate the word cloud:

text = df.text.values

generate_word_cloud(text)

Feel free to use the code. All of it is available in my Kaggle kernel.

Link to the kernel: https://www.kaggle.com/vpkprasanna/basic-text-cleaning-wordcloud-and-n-gram-analysis

Follow me on

LinkedIn: https://www.linkedin.com/in/vpkprasanna/

Build Word Cloud for Text Analysis
