Clustering is the grouping of data points based on the similarity of their characteristics. K-means is one of the most popular clustering algorithms in machine learning. In this post, I am going to describe how I performed clustering on a text dataset.
First, we need a gensim model to convert our text data into a vector representation. For this step, I used Gensim, a topic modeling toolkit, to train a Word2Vec model on the text-8 dataset. To begin, download the dataset. Open the terminal and type:
wget http://mattmahoney.net/dc/text8.zip -O text8.zip
Extract the dataset using:
unzip text8.zip
Now you will need to create the gensim model for the text-8 dataset. You will need to have gensim installed for this. If you don’t, just open the terminal and type:
pip3 install gensim
Now create a Python file. I will be using the nano text editor, and the filename is make_gensim_model.py.
nano make_gensim_model.py
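Paste the following code into the editor and save it:
import gensim

# Text8Corpus reads the extracted text8 file as a stream of sentences
corpus = gensim.models.word2vec.Text8Corpus('./text8')
# train 100-dimension word vectors (gensim 4.0+ names this argument vector_size; older versions call it size)
model = gensim.models.Word2Vec(corpus, vector_size=100)
model.save('./text-8_gensim')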
Now, execute the code using the command:
python3 make_gensim_model.py
It’s going to create a file named text-8_gensim in your current directory. We will use this saved model later to convert textual data into vector representation.
Now for the dataset, we are going to use the YouTube Spam Collection dataset provided by the UCI Machine Learning Repository. The collection is composed of one CSV file per dataset, where each line has the following attributes:
- COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
For our purposes, we only need the CONTENT and CLASS columns. We are going to perform K-means clustering on the CONTENT column with the number of clusters equal to 2 and later compare our cluster labels with the CLASS attribute. The best way to choose the number of clusters would be silhouette analysis, but we will not go into that for now. Since we want two clusters (spam and non-spam) for our dataset, we set the number of clusters to 2.
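As a rough sketch of what silhouette analysis could look like, the snippet below scores a few candidate cluster counts with scikit-learn's silhouette_score and keeps the best one. The synthetic two-blob data and the helper name best_k_by_silhouette are assumptions made for illustration; in this post the input would be the matrix of comment vectors built later.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, candidates=(2, 3, 4, 5)):
    """Return (best_k, scores), where scores maps each k to its mean silhouette."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data (assumption): two well-separated blobs standing in for comment vectors.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(50, 2)),
])

best_k, scores = best_k_by_silhouette(X_demo)
print(best_k, scores)
```

A silhouette score close to 1 means points sit well inside their own cluster, so the k with the highest mean score is a reasonable candidate.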
Open the terminal and type:
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
To extract data, type:
unzip YouTube-Spam-Collection-v1.zip
There are five files in the compressed .zip file. You could use any of them for this purpose. However, I will be using:
Youtube04-Eminem.csv
We will need the following dependencies. I am going to assume you have them installed. If not, install them:
- numpy
- scikit-learn
- pandas
Now that we have everything we need, we are ready to perform the actual clustering. Create a new file named cluster_text.py and paste the code below.
# import dependencies
import warnings

import gensim
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# hide runtime warnings
warnings.filterwarnings("ignore")

# load the gensim model
fname = "text-8_gensim"
model = gensim.models.Word2Vec.load(fname)
print("Gensim model load complete...")

# read the csv file and drop unnecessary columns
df = pd.read_csv('./Youtube04-Eminem.csv', encoding="latin-1")
df = df.drop(['COMMENT_ID', 'AUTHOR', 'DATE'], axis=1)
original_df = df.copy()  # keeps CLASS for the comparison at the end
df = df.drop(['CLASS'], axis=1)

# prepare the data: one 100-dimension mean vector per comment
final_data = []
for i, row in df.iterrows():
    comment_vectorized = []
    for comment_w in row['CONTENT'].split(sep=" "):
        try:
            # word vectors live under model.wv (required in gensim 4.0+)
            comment_vectorized.append(model.wv[comment_w])
        except KeyError:
            # skip words that are not in the model vocabulary
            pass
    if comment_vectorized:
        comment_vectorized_mean = np.mean(np.asarray(comment_vectorized), axis=0)
    else:
        # no known words in this comment: fall back to a zero vector
        comment_vectorized_mean = np.zeros(100)
    final_data.append(comment_vectorized_mean)

X = np.asarray(final_data)
print('Conversion to array complete')
print('Clustering Comments')

# perform clustering
# (the n_jobs argument was removed from KMeans in scikit-learn 1.0, so it is omitted here)
clf = KMeans(n_clusters=2, max_iter=50000, random_state=1)
clf.fit(X)
print('Clustering complete')

# If you want to save the pickle file for later use, uncomment the lines below
# import joblib
# joblib.dump(clf, './cluster_news.pkl')
# print('Pickle file saved')

# put the cluster label in the original dataframe beside the CLASS label for comparison and save the csv file
comment_cluster_df = original_df.copy()
comment_cluster_df['comment_label'] = clf.labels_
print('Saving to csv')
comment_cluster_df.to_csv('./comment_output.csv', index=False)
The code is pretty much self-explanatory. However, the data preparation step might be a little confusing, so I will try to explain it through the image below.
Since comments can contain different numbers of words, we cannot perform clustering unless we find some way to convert each input into the same dimension. There are two possible approaches for this:
- Taking mean of each column
- Selecting n-number of words as input and applying padding/trimming.
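For contrast, the second approach could be sketched roughly as follows. The helper name pad_or_trim and the tiny 3-dimension vectors are assumptions made for illustration and are not part of the script above.

```python
import numpy as np


def pad_or_trim(word_vectors, n_words, dim):
    """Return a fixed-length vector of size n_words * dim for one comment."""
    fixed = np.zeros((n_words, dim))
    kept = word_vectors[:n_words]        # trim comments longer than n_words
    if len(kept) > 0:
        fixed[:len(kept)] = kept         # shorter comments stay zero-padded
    return fixed.reshape(-1)


short_comment = np.ones((2, 3))   # a 2-word comment with 3-dimension word vectors
long_comment = np.ones((6, 3))    # a 6-word comment

short_fixed = pad_or_trim(short_comment, n_words=4, dim=3)
long_fixed = pad_or_trim(long_comment, n_words=4, dim=3)
print(short_fixed.shape, long_fixed.shape)
```

Both comments end up with the same length, at the cost of discarding words beyond the first n and of making the result depend on word position.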
We are going with the first approach, i.e. taking the mean of each column. Here, for the sake of example, I take the comment “I love this song”. First, we convert each word to a 100-dimension vector representation using the gensim model we created earlier, which gives one row per word. Then, we take the column-wise mean over all the rows of the comment to generate a single 100-dimension vector representation for each comment.
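The averaging step can be illustrated with a small, self-contained sketch. The 3-dimension toy_vectors dictionary and the helper name comment_to_vector are made up for illustration (the real model produces 100 dimensions), and the lowercasing is an addition here that the script above does not do.

```python
import numpy as np

# Toy 3-dimension "embeddings" (assumption); the real model uses 100 dimensions.
toy_vectors = {
    "i":    np.array([1.0, 0.0, 2.0]),
    "love": np.array([3.0, 2.0, 0.0]),
    "this": np.array([0.0, 4.0, 2.0]),
    "song": np.array([4.0, 2.0, 0.0]),
}


def comment_to_vector(comment, vectors, dim):
    """Average the vectors of the known words; zero vector if none are known."""
    rows = [vectors[w] for w in comment.lower().split() if w in vectors]
    if not rows:
        return np.zeros(dim)
    return np.mean(np.asarray(rows), axis=0)


comment_vector = comment_to_vector("I love this song", toy_vectors, dim=3)
print(comment_vector)   # column-wise mean of the four rows: [2. 2. 1.]
```

Whatever the number of words in a comment, the result always has the same dimension as a single word vector, which is what K-means needs.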
Finally, load the comment_output.csv and see the top 10 elements for both Spam and Non-Spam classes.
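One possible way to do this with pandas is sketched below. The in-memory sample_csv and its rows are made up so the snippet runs on its own; with the real file you would read './comment_output.csv' instead. The crosstab at the end is an extra step I am adding to make the comparison easier to read.

```python
import io

import pandas as pd

# Stand-in for comment_output.csv (assumption): same columns, made-up rows.
sample_csv = io.StringIO(
    "CONTENT,CLASS,comment_label\n"
    "check out my channel,1,0\n"
    "great song,0,1\n"
    "subscribe to me please,1,0\n"
    "love this track,0,1\n"
)

out = pd.read_csv(sample_csv)   # in practice: pd.read_csv('./comment_output.csv')

# First 10 rows of each class (in this dataset CLASS 1 is spam, CLASS 0 is non-spam).
top_per_class = out.groupby("CLASS").head(10)
print(top_per_class.sort_values("CLASS"))

# Cross-tabulate CLASS against the cluster label to see how well they line up.
table = pd.crosstab(out["CLASS"], out["comment_label"])
print(table)
```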
Clustering is a method of unsupervised learning, and it is not right to assume that clusters will form according to class labels. However, this is just a demo to show how clustering can be done on a text dataset, and it produced surprisingly good results for me. The approach used might not be the best way to cluster text data, and I am open to any suggestions. If it worked for you too, please comment below. If it didn’t and you think something else is better, please comment as well. Cheers!
Is there a way we can get the top words from each cluster and output them in a word cloud?
You can create a dictionary for each class and find the word counts in each class. However, I am not sure about the word cloud.
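A minimal sketch of that idea, using only the standard library, might look like this. The helper name top_words_per_cluster and the sample comments are made up for illustration; in practice the inputs would be the CONTENT and comment_label columns of comment_output.csv.

```python
from collections import Counter, defaultdict


def top_words_per_cluster(comments, labels, n=3):
    """Map each cluster label to its n most frequent (word, count) pairs."""
    counts = defaultdict(Counter)
    for comment, label in zip(comments, labels):
        counts[label].update(comment.lower().split())
    return {label: counter.most_common(n) for label, counter in counts.items()}


# Made-up comments and labels (assumption) standing in for the CSV columns.
comments = ["subscribe subscribe now", "love this song", "subscribe please", "love it"]
labels = [0, 1, 0, 1]

top_words = top_words_per_cluster(comments, labels, n=1)
print(top_words)   # {0: [('subscribe', 3)], 1: [('love', 2)]}
```

Word and count pairs like these are the kind of input a word cloud tool would typically need, though that part is not covered here.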
Hello, it’s great work, but I have a question: at the end, did you mean that comment_label 1 represents spam, and vice versa? If so, please tell me why. Thanks!
Hello Paul,
When clustering, the assigned labels are not necessarily the same every time. Spam could be labeled 1 in one run, and non-spam could be labeled 1 when you run the code again.
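Because of that, one way to compare cluster labels with CLASS is to score both possible mappings and keep the better one. The sketch below assumes exactly two clusters labeled 0 and 1; the helper name two_cluster_agreement and the sample lists are made up for illustration.

```python
def two_cluster_agreement(classes, cluster_labels):
    """Fraction of rows that match under the better of the two label mappings."""
    same = sum(c == l for c, l in zip(classes, cluster_labels))
    total = len(classes)
    # With two clusters, swapping 0 and 1 turns every match into a mismatch.
    return max(same, total - same) / total


classes = [1, 1, 0, 0, 1]
run_a = [1, 1, 0, 0, 0]   # cluster 1 happens to line up with spam
run_b = [0, 0, 1, 1, 1]   # same grouping, labels swapped

print(two_cluster_agreement(classes, run_a))   # 0.8
print(two_cluster_agreement(classes, run_b))   # 0.8
```

Both runs describe the same grouping, so they get the same score regardless of which cluster was numbered 1.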
Thanks for your reply, it’s really helpful. I also want to compute the cluster accuracy. Could I treat whichever comment label (1 or 0) mostly corresponds to class 1 as spam?
Hi, I tried this example but ran into the error below at line #46:
ValueError: n_samples=1 should be >= n_clusters=2
I have tried several solutions but nothing worked. Thanks.