Using K-Means Clustering to Categorize Text Data

How to use unsupervised learning to categorize text entry fields in a dataset.

Rachael Friedman
5 min read · Aug 26, 2021

In this post, I will share an example of using K-Means Clustering to better categorize text fields in a dataset. I was recently working on a modeling project with my peers to predict adoptions at an animal shelter in Austin, Texas. The dataset consisted of about 150,000 entries with around 35 different features. Among these features were the breed and color of each animal, but the entries were nowhere close to uniform.

In the animal shelter dataset, there were about 2,100 different breeds and 375 different colors listed. Many of the breed and color fields were closely related to each other, just entered slightly differently, for example “pitbull” vs. “bull pit” or “white black” vs. “black white.” There were far too many categories to sort manually, but my peers and I thought that grouping the breeds and colors could add useful information to the modeling process, so we looked for other ways to consolidate the entries. That’s when I remembered that K-Means Clustering could be a good fit for this kind of unsupervised learning exercise. I volunteered to test it out, and my process and results are shown below. I was pleasantly surprised at how well K-Means ended up working for this dataset.

The first step in my process was to clean up the breed and color text fields to make them more uniform and easier to work with. I lowercased the text, removed all punctuation, and then sorted the words in each entry alphabetically, so that entries like “white black” and “black white” became identical. After the breed and color fields were cleaned, I used CountVectorizer to transform the text into vectors for analysis and then fit a K-Means model on the vectorized fields, as shown below.
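
Here is a minimal sketch of that cleaning step, assuming the raw data lives in a pandas DataFrame; the file name and the “Breed” and “Color” column names are placeholders rather than the project’s actual ones.

import re
import pandas as pd

# hypothetical file and column names for the raw shelter data
shelter = pd.read_csv("austin_shelter_outcomes.csv")

def clean_text(entry):
    # lowercase, keep only letters and spaces, then alphabetize the words
    letters_only = re.sub(r"[^a-z\s]", " ", str(entry).lower())
    return " ".join(sorted(letters_only.split()))

breed = shelter["Breed"].apply(clean_text)
color = shelter["Color"].apply(clean_text)

With breed and color cleaned, the vectorizing and clustering step is below.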

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

# instantiate count vectorizer
cvec = CountVectorizer()
# transform the cleaned breed and color fields into document-term matrices
c_breed = cvec.fit_transform(breed)
c_color = cvec.fit_transform(color)
# instantiate k-means with 30 clusters for breed
km = KMeans(n_clusters=30)
# fit on vectorized breed
km.fit(c_breed)
# print out the silhouette score
silhouette_score(c_breed, km.labels_)
> 0.6378088211888134
# repeat for color: instantiate k-means with 20 clusters
km = KMeans(n_clusters=20)
# fit on vectorized color
km.fit(c_color)
# print out the silhouette score
silhouette_score(c_color, km.labels_)
> 0.666093700481587

In the code above, you can see that I specified 30 clusters for breed and 20 clusters for color. Choosing these numbers can take some trial and error depending on the data: play around and evaluate how many clusters work best using a combination of silhouette scores, plots of the different clusters, and your best judgment. The closer the silhouette score is to 1, the better. A quick sweep over candidate values of k, like the sketch below, can help narrow things down. The printouts that follow show an example breed cluster and an example color cluster so you can see how well K-Means did!
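
Here is a rough sketch of that trial-and-error loop for the breed field, reusing the imports and the c_breed matrix from above; the range of k values and the random_state are just placeholder choices.

# sweep over candidate cluster counts and compare silhouette scores
for k in range(10, 45, 5):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(c_breed)
    print(k, round(silhouette_score(c_breed, labels), 3))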

# print out of unique values in breed cluster 8:
(8, 'basenji shepherd german'),
(8, 'beagle australian shepherd'),
(8, 'beagle shepherd german'),
(8, 'belgian malinois anatol shepherd'),
(8, 'belgian malinois australian shepherd'),
(8, 'belgian malinois shepherd german'),
(8, 'belgian shepherd tervuren german'),
(8, 'brittany shepherd australian'),
(8, 'coonhound shepherd english german'),
(8, 'doberman shepherd german pinsch'),
(8, 'german shepherd bulldog american'),
(8, 'great dane shepherd german'),
(8, 'hound plott shepherd german'),
(8, 'hound shepherd german afghan'),
(8, 'hound shepherd german basset'),
(8, 'hound shepherd ibizan german'),
(8, 'hound shepherd redbone german'),
(8, 'kelpie shepherd australian'),
(8, 'kelpie shepherd australian dutch'),
(8, 'kelpie shepherd australian german'),
(8, 'mastiff shepherd anatol'),
(8, 'mastiff shepherd german'),
(8, 'mix shepherd dutch'),
(8, 'mix shepherd german'),
(8, 'poodle shepherd australian standard'),
(8, 'saluki anatol shepherd'),
(8, 'setter english australian shepherd'),
(8, 'sharpei shepherd anatol chinese'),
(8, 'sharpei shepherd german chinese'),
(8, 'sheepdog shepherd shetland german'),
(8, 'shepherd akita anatol'),
(8, 'shepherd akita german'),
(8, 'shepherd alaskan australian malamute'),
(8, 'shepherd alaskan malamute german'),
(8, 'shepherd anatol'),
(8, 'shepherd anatol australian'),
(8, 'shepherd anatol corgi welsh pembroke'),
(8, 'shepherd anatol dutch'),
(8, 'shepherd anatol german'),
(8, 'shepherd anatol kangal'),
(8, 'shepherd anatol mix'),
(8, 'shepherd anatol pointer'),
(8, 'shepherd australian'),
(8, 'shepherd australian corgi welsh cardigan'),
(8, 'shepherd australian corgi welsh pembroke'),
(8, 'shepherd australian feist'),
(8, 'shepherd australian german'),
(8, 'shepherd australian greyhound'),
(8, 'shepherd australian mix'),
(8, 'shepherd australian pointer'),
(8, 'shepherd australian smooth collie'),
(8, 'shepherd australian spaniel cocker'),
(8, 'shepherd australian unknown'),
(8, 'shepherd bernard anatol smooth coat st.'),
(8, 'shepherd border german collie'),
(8, 'shepherd border german terrier'),
(8, 'shepherd boxer german'),
(8, 'shepherd bull german terrier'),
(8, 'shepherd chow anatol'),
(8, 'shepherd chow australian'),
(8, 'shepherd chow german'),
(8, 'shepherd dalmatian australian'),
(8, 'shepherd dog german carolina'),
(8, 'shepherd english mix'),
(8, 'shepherd eskimo australian american'),
(8, 'shepherd field australian spaniel'),
(8, 'shepherd finnish australian spitz'),
(8, 'shepherd finnish german spitz'),
(8, 'shepherd german'),
(8, 'shepherd german airedale terrier'),
(8, 'shepherd german anatol shorthair pointer')
# print out of unique values in color cluster 7:
(7, 'brown agouti tabby'),
(7, 'brown blue tabby'),
(7, 'brown calico tabby'),
(7, 'brown chocolate tabby'),
(7, 'brown cream tabby'),
(7, 'brown gray tabby'),
(7, 'brown lynx point tabby'),
(7, 'brown merle tabby'),
(7, 'brown orange tabby'),
(7, 'brown tabby'),
(7, 'brown tabby black'),
(7, 'brown tabby black smoke'),
(7, 'brown tabby red'),
(7, 'brown tabby silver'),
(7, 'brown tiger cream tabby'),
(7, 'brown torbie tabby'),
(7, 'brown tortie tabby')

Looking at the different clusters, K-Means did a decent job of grouping breeds and colors. Obviously, some assignments are not perfect. For example, every entry with a breed listed simply as “mix” landed in the same breed cluster, even though that could mean anything from “sheep mix” to “beagle mix” given the variety of animals in the shelter. Still, this was much faster and more efficient than anything I could have done manually. My peers and I were now able to narrow the 2,100 different breeds and 375 different colors down to 30 breed clusters and 20 color clusters, which we could attach to each entry as new features for our predictive model.
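
As a small sketch of that last step: the example code above reuses km for both fields, so here I use the hypothetical names km_breed and km_color for the two fitted models, and the shelter DataFrame and its new column names are placeholders as well.

# refit one model per field so both sets of labels are available (hypothetical names)
km_breed = KMeans(n_clusters=30).fit(c_breed)
km_color = KMeans(n_clusters=20).fit(c_color)

# attach each entry's cluster label as a new categorical feature
shelter["breed_cluster"] = km_breed.labels_
shelter["color_cluster"] = km_color.labels_

# reproduce the kind of printout above: unique breed entries assigned to cluster 8
sorted(set(b for b, label in zip(breed, km_breed.labels_) if label == 8))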

I hope this quick tutorial on using K-Means Clustering to categorize text fields in your data was helpful!
