So an SVM, a GBM and a KNN algorithm walk into a bar…
My experience is that computers don’t really have “insights” – people do. Recognizing an insight that might make an engaging, funny, impactful ad takes the kind of intuition that only comes from years of brainstorming with creative folks. My goal here isn’t to demonstrate cutting-edge algorithms or write award-winning creative, but to give you some sense of how we can acquire very valuable data, and then use text mining techniques and algorithms to help turn that data into ads and creative strategies.

A while back, I was fortunate to pitch some new business with a wonderful shop in Santa Monica named Tiny Rebellion. The pitch was for CustomInk, a large custom t-shirt maker ($200 million in annual revenue). While they will make you just one t-shirt, in general, they are the kind of company you would use if you needed 50 custom t-shirts (and other tchotchkes) with your logo, for a corporate event, or a charity 5K run – that sort of thing.

I am always looking for “Creative Data opportunities”. While working on the pitch, I discovered a section, deep within the CustomInk website, for Uncensored Customer Comments. CustomInk is very proud of their customer service, and they should be: they consistently have a 97%-plus Customer Satisfaction Rating. In fact, anyone who purchases their products is automatically directed to this Uncensored Customer Comments section. Some customers simply pass on this, but most write detailed comments, good and bad. I thought this would make a great opportunity to experiment with text mining. However, in the heat of battle and the usual craziness of a new-biz pitch, I didn’t feel confident enough yet to pitch the Creative Director on the idea. I decided later that it would make a great spec/Capstone project. What you see here is just one small part of that project.

Where to start? I needed to get the Comments (data) from the website. I used a commercial tool to write an Extractor and used it to scrape over 10,000 comments. Many of these commercial tools are great for web scraping if the site has well-formed HTML. Nowadays I would probably write a web crawler script using Python’s Scrapy framework.
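I can’t reproduce CustomInk’s actual markup here, but the core of any extractor is the same: walk the HTML and keep only the nodes you care about. Here is a standard-library-only sketch, assuming a hypothetical `<div class="comment">` structure:

```python
from html.parser import HTMLParser

class CommentExtractor(HTMLParser):
    """Collect the text inside <div class="comment"> blocks (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_comment = False   # are we inside a comment div?
        self.depth = 0            # track nested tags inside the comment
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if self.in_comment:
            self.depth += 1
        elif tag == "div" and ("class", "comment") in attrs:
            self.in_comment = True
            self.depth = 0
            self.comments.append("")

    def handle_endtag(self, tag):
        if self.in_comment:
            if self.depth == 0:
                self.in_comment = False
            else:
                self.depth -= 1

    def handle_data(self, data):
        if self.in_comment:
            self.comments[-1] += data

page = ('<div class="comment">Great shirts!</div>'
        '<p>site navigation</p>'
        '<div class="comment">Fast shipping.</div>')
parser = CommentExtractor()
parser.feed(page)
print(parser.comments)  # ['Great shirts!', 'Fast shipping.']
```

A real Scrapy spider would replace the hand-rolled parser with CSS or XPath selectors, but the filtering logic is the same idea.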

The text mining workflow goes something like this… Clean out everything that isn’t text using a Python library called Beautiful Soup (bs4), built-in Python string methods, and/or Regular Expressions. (Regular Expressions are the most compact code you may ever see.)
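As an example of that compactness, one short (and deliberately simplified) pattern can pull every email address out of a block of text:

```python
import re

text = "Contact jane@example.com or bob.smith@mail.example.org for reorders."

# Simplified email pattern -- good enough for mining, not for strict validation.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
print(emails)  # ['jane@example.com', 'bob.smith@mail.example.org']
```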
Next, I used a Python library called NLTK (the Natural Language Toolkit) to “tokenize” the text, meaning to break it down into individual word tokens.
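NLTK’s `word_tokenize` handles the subtleties (punctuation, contractions, abbreviations); for illustration, here is a dependency-free approximation of the same idea:

```python
import re

def tokenize(text):
    # Lowercase, then keep runs of letters/apostrophes and drop everything else.
    # A crude stand-in for NLTK's word_tokenize, which is far more careful.
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("The shirts arrived on time -- and they look great!")
print(tokens)
# ['the', 'shirts', 'arrived', 'on', 'time', 'and', 'they', 'look', 'great']
```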

Some operations you may choose to do at this point involve further “cleaning”. You can remove words that don’t add any real value to the counts. These words are called “stop words” – words like “the” and “of”. (“The” accounts for 7% of the words we use; “of”, 3.5%.)
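A sketch of that filtering step, using a tiny hand-rolled stop-word set (in practice you would use NLTK’s fuller list via `nltk.corpus.stopwords`):

```python
# Tiny illustrative stop-word set -- NLTK's English list has ~180 entries.
STOP_WORDS = {"the", "of", "a", "an", "and", "on", "they", "is", "it", "to"}

tokens = ["the", "shirts", "arrived", "on", "time", "and", "they", "look", "great"]
content_words = [t for t in tokens if t not in STOP_WORDS]
print(content_words)  # ['shirts', 'arrived', 'time', 'look', 'great']
```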

You can also run TF-IDF (Term Frequency – Inverse Document Frequency) to actually score how important a word is within a series of documents. (Each comment can be considered its own document.) There are also other pre-processing techniques like “stemming” and “lemmatizing” that I won’t go into here.
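In practice a library like scikit-learn (`TfidfVectorizer`) does this for you, but the scoring itself is simple enough to sketch from scratch on a few made-up comments:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each word per document: term frequency (count in the doc)
    times inverse document frequency (log of total docs over the number
    of docs containing the word). Words unique to one doc score highest."""
    n = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

# Three toy "documents" (each comment is its own document).
docs = [["great", "shirts", "great", "service"],
        ["slow", "shipping"],
        ["great", "shipping", "speed"]]
scores = tf_idf(docs)

# "shirts" appears only in doc 0, so it outscores "great",
# which shows up in two of the three comments.
print(scores[0]["shirts"] > scores[0]["great"])  # True
```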

*NOTE: One very popular and useful approach to working with text is called the “Bag of Words” model. You basically throw out all the syntax and order that exists in sentences and you are left with, well, a bag of words. This allows you to do a whole series of operations on the words, mostly involving counting them in different ways. (See the Text Mining Techniques page for a deeper dive.)
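A minimal bag-of-words pass over a couple of made-up comments, using nothing but `collections.Counter`:

```python
from collections import Counter

comments = [
    "great shirts and great service",
    "shipping was slow but the shirts look great",
]

# Bag of words: ignore order and syntax entirely, just count tokens
# across all the comments.
bag = Counter(word for c in comments for word in c.lower().split())
print(bag.most_common(2))  # [('great', 3), ('shirts', 2)]
```

Those raw counts (after stop-word removal and TF-IDF weighting) are what surface the themes customers keep repeating.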