Build Word Cloud for Text Analysis

Exploratory data analysis is one of the most widely used steps in Machine Learning and NLP.
There are many exploratory techniques available for general Machine Learning data, but far fewer for NLP, i.e. text data.

A word cloud is a data visualization technique for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites and many other kinds of text data.
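
To make the "size indicates frequency" idea concrete, here is a minimal sketch using only the Python standard library. The function names word_frequencies and relative_sizes are made up for this illustration, and the linear scaling is an assumption; the wordcloud library uses its own layout and sizing logic.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Count how often each alphabetic word occurs, ignoring case."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

def relative_sizes(freqs, max_size=100):
    """Scale counts linearly so the most frequent word gets max_size."""
    if not freqs:
        return {}
    top = max(freqs.values())
    return {word: max_size * count / top for word, count in freqs.items()}

sample = "budget fight budget senate budget fight"
freqs = word_frequencies(sample)
sizes = relative_sizes(freqs)

print(freqs.most_common(2))  # [('budget', 3), ('fight', 2)]
print(sizes["budget"])       # 100.0
```

The more often a word appears, the larger its size value, which is the same intuition a word cloud conveys visually.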


Libraries used
pip install matplotlib
pip install pandas
pip install wordcloud
pip install nltk
Suppose you have text data like this, loaded into a pandas DataFrame named df.

df.head() will show the table below


title text subject date
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews December 31, 2017
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews December 29, 2017
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews December 31, 2017
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews December 30, 2017
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews December 29, 2017


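Do a simple pre-processing step
import nltk
from nltk.corpus import stopwords
# One-time downloads; newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'
nltk.download('punkt')
nltk.download('stopwords')
stop = set(stopwords.words('english'))  # assumed definition of `stop`; the post does not show it
def tokenizeandstopwords(text):
    tokens = nltk.word_tokenize(text)
    # keep only alphabetic tokens (drops punctuation and numbers)
    token_words = [w for w in tokens if w.isalpha()]
    # compare in lowercase, since NLTK's English stop-word list is lowercase
    meaningful_words = [w for w in token_words if w.lower() not in stop]
    joined_words = " ".join(meaningful_words)
    return joined_words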

Calling the Pre-Processing Function
df['text'] = df['text'].apply(tokenizeandstopwords)

To generate a word cloud, we can use the function below

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def generate_word_cloud(text):
    wordcloud = WordCloud(
        width = 3000,
        height = 2000,
        background_color = 'black').generate(str(text))
    fig = plt.figure(
        figsize = (40, 30),
        facecolor = 'k',
        edgecolor = 'k')
    plt.imshow(wordcloud, interpolation = 'bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()

After pre-processing the data, call the generate_word_cloud function to generate the word cloud
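
The post does not show the call itself, so the following is only a sketch of one way to do it. Because generate_word_cloud applies str() to its argument, passing df['text'] directly would feed it pandas' abbreviated printout of the column rather than the full text. Joining the cleaned rows into a single string first avoids that. The helper name combine_documents is hypothetical and not part of the original kernel.

```python
def combine_documents(texts):
    """Join an iterable of cleaned documents into one space-separated string,
    skipping rows that are empty after cleaning."""
    return " ".join(str(t) for t in texts if str(t).strip())

# Small stand-in for the cleaned df['text'] column
cleaned = ["budget fight looms Republicans", "", "military accept transgender recruits"]
corpus = combine_documents(cleaned)
print(corpus)

# With the DataFrame from this post, the call would look like:
# generate_word_cloud(combine_documents(df['text']))
```

The last line is commented out because it needs the DataFrame and the function defined earlier, along with wordcloud and matplotlib installed.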




Feel free to use the code. All of it is available in my Kaggle kernel.
Link to the kernel: https://www.kaggle.com/vpkprasanna/basic-text-cleaning-wordcloud-and-n-gram-analysis

Follow me on
LinkedIn: https://www.linkedin.com/in/vpkprasanna/
