Clustering Text Documents Using K-Means in Scikit Learn

Clustering text documents is a common problem in natural language processing (NLP): related documents are to be grouped based on their content. The k-means clustering algorithm is a popular solution to this problem. In this article, we'll demonstrate how to cluster text documents with k-means using Scikit Learn.

K-means clustering algorithm

The k-means algorithm is a popular unsupervised learning algorithm that organizes data points into groups based on similarity. The algorithm works by iteratively assigning each data point to its nearest cluster centroid and then recalculating the centroids based on the newly formed clusters.
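The assign-then-recompute loop described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation Scikit Learn uses; the function name `kmeans_sketch`, its parameters, and the toy points are assumptions introduced for the example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain k-means loop: assign points to the nearest centroid, then recompute."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

Which group receives label 0 and which receives label 1 depends on the random starting centroids, so only the grouping itself is meaningful.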


Preprocessing describes the procedures used to get data ready for machine learning or analysis. It typically involves cleaning, transforming, and reformatting raw data and, for text, vectorizing it into a numeric format suitable for further analysis or modeling.
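As a rough illustration of cleaning followed by vectorization, the sketch below lowercases a few headlines, strips non-letter characters, and turns the result into a TF-IDF matrix. The `clean_text` helper and the sample headlines are hypothetical and only stand in for whatever cleaning a real dataset needs.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(text):
    """Lowercase, replace anything that is not a letter or space, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical raw headlines, for illustration only.
raw_docs = [
    "Scientists Discover NEW planet!!!",
    "Local man wins 3rd chili cook-off",
    "New planet may host water, scientists say",
]
cleaned_docs = [clean_text(doc) for doc in raw_docs]

# Turn the cleaned text into a sparse TF-IDF matrix (one row per document).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(cleaned_docs)

print(cleaned_docs[0])
print(tfidf_matrix.shape)
```

TfidfVectorizer already lowercases and tokenizes by default, so a separate cleaning step like this matters mostly when the raw text contains markup, digits, or other noise that should not become features.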


  1. Loading or preparing the dataset [dataset link:]
  2. Preprocessing the text, if it is loaded from a file rather than added manually in the code
  3. Vectorizing the text using TfidfVectorizer
  4. Reducing the dimensionality using PCA
  5. Clustering the documents
  6. Plotting the clusters using matplotlib


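import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load the dataset into a DataFrame
df = pd.read_json('sarcasm.json')

# Keep only the headline text
sentence = df.headline

# Vectorize the headlines with TF-IDF, ignoring English stop words
vectorizer = TfidfVectorizer(stop_words='english')
vectorized_documents = vectorizer.fit_transform(sentence)

# Reduce the data to two dimensions with PCA (used only for plotting)
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(vectorized_documents.toarray())

# Cluster the documents with k-means on the full TF-IDF matrix
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, n_init=5,
                max_iter=500, random_state=42)
kmeans.fit(vectorized_documents)

# Store each document together with its cluster label
results = pd.DataFrame()
results['document'] = sentence
results['cluster'] = kmeans.labels_

# Print a random sample of the results
print(results.sample(5))

# Plot the clusters in the PCA-reduced space.
# Cluster ids are arbitrary, so these names are only illustrative.
colors = ['red', 'green']
cluster = ['Not Sarcastic', 'Sarcastic']
for i in range(num_clusters):
    plt.scatter(reduced_data[kmeans.labels_ == i, 0],
                reduced_data[kmeans.labels_ == i, 1],
                s=10, color=colors[i],
                label=f' {cluster[i]}')
plt.legend()
plt.show()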



       document                                                      cluster
16263  research study discovers bulk of u.s. currency has touc ...  0
5318   an open and individual e-mail to hillary clinton ...         0
12994  it's not simply a muslim restriction, it's much even worse   0
5395   princeton trainees challenge university preside ...          0
24591  why getting wed might assist individuals consume less        0

Figure: Text clustering using KMeans

Last Updated: 09 Jun, 2023
