K-means clustering for text dataset

An innovative way of using k-means clustering for text dataset

Clustering is the grouping of data points based on their characteristics, so that similar points end up in the same group. K-means clustering is one of the most popular clustering algorithms in machine learning. In this post, I am going to write about a way I was able to perform clustering on a text dataset.

First, we will need to make a gensim model to convert our text data to a vector representation. For this step, I used a topic modeling toolkit named Gensim on the text-8 dataset. To begin, you will need to download the dataset. Open the terminal and type:

wget http://mattmahoney.net/dc/text8.zip

Extract the dataset using:

unzip text8.zip

Now you will need to create the gensim model for the text-8 dataset. You will need to have gensim installed for this. If you don’t, just go to the terminal and type:

pip3 install gensim

Now make a Python file. I will be using the nano text editor, and the filename is make_gensim_model.py.

nano make_gensim_model.py

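Paste the following code into the editor and save it:

import gensim

#train a 100-dimensional Word2Vec model on the text-8 corpus
file = gensim.models.word2vec.Text8Corpus('./text8')
#in gensim 4.0 and later, the size parameter is called vector_size
model = gensim.models.Word2Vec(file, size=100)
#save the model so that it can be loaded later
model.save('text-8_gensim')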

Now, execute the code using the command:

python3 make_gensim_model.py

It’s going to create a file named text-8_gensim in your current directory. We will use this saved model later to convert textual data into vector representation.

Now for the dataset, we are going to use the YouTube Spam Collection dataset provided by the UCI Machine Learning Repository. The collection is composed of one CSV file per dataset, where each line has the following attributes:

  • COMMENT_ID
  • AUTHOR
  • DATE
  • CONTENT
  • CLASS

For our purpose, we only need the CONTENT and CLASS columns. We are going to perform K-means clustering on the CONTENT column with the number of clusters set to 2, and later compare our cluster labels with the CLASS attribute. The best way to choose the number of clusters would be silhouette analysis, but we will not go into that for now. Since we want two clusters (spam and non-spam) for our dataset, we simply set the number of clusters to 2.
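
For readers who do want to try the silhouette analysis mentioned above, here is a rough sketch of how it could look. It scores a few candidate cluster counts with scikit-learn's silhouette_score and keeps the highest-scoring one. The data is two synthetic blobs made up for illustration, not the YouTube comments, and best_k_by_silhouette is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, candidate_ks=(2, 3, 4, 5)):
    """Return (best_k, scores), where scores maps each k to its mean silhouette."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in data: two well-separated 100-dimensional blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 100)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 100)),
])

best_k, scores = best_k_by_silhouette(X_demo)
print(best_k, scores)
```

A silhouette score close to 1 means the clusters are compact and far apart. On real comment vectors the scores will usually be much lower and closer together than on this toy data.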

Open the terminal and type:

wget https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip

To extract data, type:

unzip YouTube-Spam-Collection-v1.zip

There are five files in the compressed .zip file. You could use any of them for this purpose. However, I will be using Youtube04-Eminem.csv.

We will need the following dependencies. I am going to assume you have them installed. If not, install them:

  • numpy
  • scikit-learn
  • pandas

Now that we have everything we need, we are ready to perform the actual clustering. Create a new file named cluster_text.py and paste the code below.

#import dependencies
import numpy as np
from sklearn.cluster import KMeans
import pandas as pd
import gensim
import warnings

#hide runtime warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

#load gensim model
fname = "text-8_gensim"
model = gensim.models.Word2Vec.load(fname)
print("Gensim model load complete...")

#read the csv file and drop unnecessary columns
df = pd.read_csv('./Youtube04-Eminem.csv', encoding="latin-1")
df = df.drop(['COMMENT_ID', 'AUTHOR', 'DATE'], axis=1)
original_df = pd.DataFrame(df)
df = df.drop(['CLASS'], axis=1)

#prepare the data in correct format for clustering
final_data = []
for i, row in df.iterrows():
    comment_vectorized = []
    comment = row['CONTENT']
    comment_all_words = comment.split(sep=" ")

    #look up each word's vector; skip words the gensim model has never seen
    for comment_w in comment_all_words:
        try:
            comment_vectorized.append(model.wv[comment_w])
        except KeyError:
            pass

    #column-wise mean of the word vectors; all zeros if no word was found
    if comment_vectorized:
        comment_vectorized_mean = np.mean(np.asarray(comment_vectorized), axis=0)
    else:
        comment_vectorized_mean = np.zeros(100)

    final_data.append(comment_vectorized_mean)

X = np.asarray(final_data)
print('Conversion to array complete') 
print('Clustering Comments')

#perform clustering
clf = KMeans(n_clusters=2, max_iter=50000, random_state=1)
clf.fit(X)
print('Clustering complete')

#If you want to save the pickle file for later use, uncomment the lines below
#import joblib
#joblib.dump(clf, './cluster_news.pkl')
#print('Pickle file saved')

#Put the cluster label in original dataframe beside CLASS label for comparison and save the csv file
comment_label = clf.labels_
comment_cluster_df = pd.DataFrame(original_df)
comment_cluster_df['comment_label'] = np.nan
comment_cluster_df['comment_label'] = comment_label

print('Saving to csv')
comment_cluster_df.to_csv('./comment_output.csv', index=False)

The code is pretty much self-explanatory. However, the data preparation step might be a little confusing, so I will try to explain it through the image below.

[Image: K-means clustering data preparation]

Since comments can contain different numbers of words, we cannot perform clustering unless we find some way to convert each input into the same dimension. There are two possible approaches:

  • Taking mean of each column
  • Selecting n-number of words as input and applying padding/trimming.
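
For comparison, the second option in the list could be sketched roughly as below. Every comment is cut or zero-padded to a fixed number of words and the word vectors are then laid end to end. The helper name pad_or_trim, the choice of n_words and the 3-dimensional toy vectors are all assumptions for illustration.

```python
import numpy as np


def pad_or_trim(word_vectors, n_words, dim):
    """Return a flat vector of length n_words * dim.

    Comments longer than n_words are cut off; shorter ones are
    padded with all-zero rows.
    """
    fixed = np.zeros((n_words, dim))
    kept = np.asarray(word_vectors[:n_words], dtype=float).reshape(-1, dim)
    fixed[:len(kept)] = kept
    return fixed.ravel()


# Toy 3-dimensional "word vectors" (made-up numbers, not from the gensim model).
short_comment = [[1.0, 2.0, 3.0]]
long_comment = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

short_fixed = pad_or_trim(short_comment, n_words=2, dim=3)
long_fixed = pad_or_trim(long_comment, n_words=2, dim=3)
print(short_fixed)
print(long_fixed)
```

With 100-dimensional word vectors, this gives n_words × 100 input features per comment, and any words past the cutoff are lost.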

We are going with the first approach, i.e. taking the mean of each column. Here, for the sake of example, I take the comment “I love this song”. First, we convert each word to a 100-dimensional vector using the gensim model we created earlier. Then we take the column-wise mean over all the rows (one row per word) to get a single 100-dimensional vector for the comment.
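
To make the column-wise mean concrete, here is a small sketch that uses made-up 3-dimensional vectors in place of the 100-dimensional ones from the gensim model. The toy_vectors dictionary and the comment_to_vector helper are hypothetical and only stand in for the model lookup.

```python
import numpy as np


def comment_to_vector(comment, word_vectors, dim):
    """Average the vectors of the known words in a comment (column-wise mean)."""
    rows = [word_vectors[w] for w in comment.split(" ") if w in word_vectors]
    if not rows:
        # No known words: fall back to an all-zero vector.
        return np.zeros(dim)
    return np.mean(np.asarray(rows), axis=0)


# Hypothetical 3-dimensional vectors; the real model gives 100 dimensions.
toy_vectors = {
    "I": np.array([1.0, 0.0, 2.0]),
    "love": np.array([3.0, 2.0, 0.0]),
    "this": np.array([0.0, 2.0, 2.0]),
    "song": np.array([0.0, 4.0, 4.0]),
}

vec = comment_to_vector("I love this song", toy_vectors, dim=3)
print(vec)  # [1. 2. 2.]
```

Four words go in and one vector comes out, so comments of any length end up with the same number of features.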

Finally, load comment_output.csv and look at the top 10 rows for both the spam and non-spam classes.

[Image: K-means clustering for text dataset results]
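
K-means numbers its clusters arbitrarily, so cluster 0 may correspond to spam in one run and to non-spam in the next. One way to compare comment_label with CLASS despite that is to count how often the two columns agree under both possible mappings and keep the better one. The sketch below does this for two clusters on made-up labels. cluster_agreement is a hypothetical helper and is not part of the script above.

```python
def cluster_agreement(classes, labels):
    """Fraction of rows where a 2-cluster labeling matches the binary class,
    taking the better of the two possible cluster-to-class mappings."""
    if len(classes) != len(labels) or not classes:
        raise ValueError("classes and labels must be non-empty and equal length")
    same = sum(1 for c, l in zip(classes, labels) if c == l)
    direct = same / len(classes)
    # With two clusters, swapping the label names gives 1 - direct.
    return max(direct, 1.0 - direct)


# Made-up example: the clustering found almost the same groups, with swapped names.
classes = [1, 1, 0, 0, 1, 0]
labels = [0, 0, 1, 1, 0, 0]

agreement = cluster_agreement(classes, labels)
print(round(agreement, 3))
```

On the real output, classes and labels would be the CLASS and comment_label columns of comment_output.csv, converted to lists.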

Clustering is a method of unsupervised learning, and it is not right to assume that clusters will be formed according to class labels. However, this is just a demo to show how clustering for a text dataset can be done. The approach used might not be the best way to cluster text data, and I am open to any suggestions, but I was able to achieve surprisingly good results with it. If it worked for you too, please comment below. If it didn’t and you think something else is better, please comment as well. Cheers!


6 Comments on K-means clustering for text dataset

  1. Hello, it’s great work, but I have a question: at the end, did you mean that comment_label 1 represents spam, and vice versa? If so, please tell me why. Thanks!

    • Hello Paul,
      While clustering, the labels assigned are not necessarily always the same. You could label spam as 1 when running once and non-spam as 1 when you run the code again.

      • Thanks for your reply, it’s really helpful. I also want to compute the cluster accuracy. Could I treat the comment label (1 or 0) that most often corresponds to class 1 as spam?

  2. Hi, I tried this example but ran into the error below at line #46:
    ValueError: n_samples=1 should be >= n_clusters=2

    I have tried several solutions but nothing worked. Thanks.
