Two things I generally want to accomplish with Text Mining: 1) I want to better understand how customers think, so I can write to them more as a person. 2) I want the text to pitch me specific phrases and concepts to brainstorm from (just like a creative partner would). Text Mining (Natural Language Processing) can be especially useful when the volume of text is too large to read, absorb, and get ideas from.

When most people talk about “data” they usually mean “numbers and stats”. However, text is also data. In fact, the world’s largest database, the Internet, is 80% unstructured text. That’s why I love being able to write web scrapers in the Python programming language. We can also mine Twitter and social media, online focus groups, eBrochures, eBooks, PDFs, lyrics, speeches, and company “gray material”. I created a transcription web app using Node.js and IBM Watson’s Artificial Intelligence that has about 90-95% accuracy. So we can transcribe and text mine TED talks, YouTube videos, podcasts, focus group recordings, video interviews, etc.

Below are some of the techniques and linguistic markers I find worth exploring with text.

Beliefs

Advertising is a LOT about understanding customers’ beliefs and working to frame and reframe them. (I’m currently writing a book, "Framing & Reframing for Advertising", based on the brilliant works of Robert Dilts, L. Michael Hall, et al.)

Some useful linguistic markers for text mining beliefs are words/phrases like “makes”, “forces”, “means”, “is”, “isn’t”, “it’s” (and other verbs of being), “cause”, “causes”, and “because”, just to name a few. We can never have ALL the data/info. (Even Google can’t have next week’s data.) Instead, we have beliefs about what we think is true. We then update our beliefs as we get more info. (This is at the core of Bayesian Statistics and every creative brief I’ve ever read. “What does the customer believe now?” “What do we want them to believe?” “What info can we give them to help them get there?”)


“Having a cool car causes people to notice you.”

“Having a point system makes it easier to stay on my diet.”

“Fish-scale-free lipstick is better for your health.”
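As a rough sketch of mining for these markers (the comments and the trimmed-down marker list here are hypothetical stand-ins), a few lines of Python can pull out every comment that contains one:

```python
import re

# Hypothetical customer comments (in practice, scraped or transcribed)
comments = [
    "Having a cool car causes people to notice you.",
    "Having a point system makes it easier to stay on my diet.",
    "I skipped the gym because it was raining.",
    "The koozie arrived right on time.",
]

# A few of the belief markers discussed above
markers = ["makes", "forces", "means", "cause", "causes", "because"]
pattern = re.compile(r"\b(" + "|".join(markers) + r")\b", re.IGNORECASE)

# Keep only the comments that state a belief
beliefs = [c for c in comments if pattern.search(c)]
for b in beliefs:
    print(b)
```

The last comment drops out because it states a fact, not a cause-and-effect belief.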


Values

Values are a measure of what a customer believes is important (value-able), and they are often a strong driver of behavior. Values look like nouns – Love, Security, Humor, Money, Style, Uniqueness, Boredom, Creativity. In fact, they are usually verbs that have been turned into nouns, known as gerunds or nominalizations. This can make your customer’s values hard to track down. Python’s NLTK Concordance can be a big help with that.
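One crude way to surface candidate nominalizations is a suffix scan. This is only a heuristic (the sample text and suffix list are made up for illustration), not real grammatical analysis, so expect some noise:

```python
import re

# Hypothetical customer comment
text = ("I value the freedom and creativity this brand gives me. "
        "Their dedication to innovation shows real uniqueness.")

# Common noun-forming suffixes; crude, so expect false positives
suffixes = ("tion", "ness", "ity", "dom", "ment", "ance", "ence")

words = re.findall(r"[a-z]+", text.lower())
candidates = sorted({w for w in words if w.endswith(suffixes) and len(w) > 5})
print(candidates)
```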


Metaphors

Linguistic Markers – “like”, “as if”, “think of it as”, “is a”

Some social psychologists believe we learn everything by metaphor. “That” is like “some other thing” is one way of understanding something new. Metaphors can also tell us a lot about a customer’s world view. I once wrote some positioning strategies for a weight loss program and heard focus group respondents say things like “Most diets are like torture” and “My new diet is a blessing.” Very different perspectives on a diet. Since many commercials are metaphorical themselves (“Our car is like a bullet.”), it can be helpful to mine them. Unfortunately, the word “like” is like totally overused nowadays, if you know, like, what I mean. This is where NLTK’s Concordance can come in handy, to see “like” or “as if” in context.
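A quick sketch of pulling “X is like Y” patterns out of text (the focus-group lines are hypothetical):

```python
import re

# Hypothetical focus-group lines
lines = [
    "Most diets are like torture.",
    "My new diet is a blessing.",
    "I like the blue package better.",
    "Our car is like a bullet.",
]

# "X is/are like Y" is a cheap metaphor pattern; it deliberately skips
# "like" used as a plain verb, which plagues a raw keyword search
metaphor_pattern = re.compile(r"\b(is|are)\s+like\s+(.+)", re.IGNORECASE)

metaphors = []
for line in lines:
    match = metaphor_pattern.search(line)
    if match:
        metaphors.append((line, match.group(2)))
print(metaphors)
```

Note how “I like the blue package” is correctly ignored, which is exactly the filtering a bare search for “like” can’t do.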

Parts-of-Speech Tagging / Verb Tenses / Pronouns

One of the ways we construct our models of the world is what journalism calls “The 5 W’s” (Who, What, When, Where, and Why), and what Robert Dilts calls “Logical Levels”, with each level up having a progressively more powerful effect on your neurology. Notice the different level of commitment in your body between saying “I run.” (What = Behavioral Level) and “I am a runner.” (Who = Identity Level). Surprisingly, most “stopword sets” (words of little value that you remove before Text Mining) will have you strip out pronouns.

Parts-of-Speech tagging can get crazy in-depth, but parts I find most useful are “nouns”, “verbs”, “adjectives” and “adverbs”. Nouns can tell you the “things” your customers are interested in. Verbs can tell you what they like to “do”. Adjectives and adverbs can tell you how (and the degree to which) they feel about some thing or some action. Google recently open-sourced a super-accurate app for POS tagging called Parsey McParseface (I’m serious.)
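Here's a sketch of bucketing tagged words by part of speech. The hand-tagged tuples below are stand-ins for what a tagger like nltk.pos_tag (or Parsey McParseface) would return:

```python
from collections import defaultdict

# Hand-tagged sample, in the (word, tag) shape nltk.pos_tag returns
tagged = [
    ("customers", "NNS"), ("really", "RB"), ("love", "VBP"),
    ("the", "DT"), ("soft", "JJ"), ("koozie", "NN"),
    ("and", "CC"), ("order", "VBP"), ("quickly", "RB"),
]

# Penn Treebank tag prefixes: NN* nouns, VB* verbs, JJ* adjectives, RB* adverbs
prefixes = {"NN": "nouns", "VB": "verbs", "JJ": "adjectives", "RB": "adverbs"}
buckets = defaultdict(list)
for word, tag in tagged:
    for prefix, name in prefixes.items():
        if tag.startswith(prefix):
            buckets[name].append(word)

print(dict(buckets))
```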

Here's an interesting holistic brand experiment ~ Apple vs Samsung:

Think about Apple and all you know about it as a Brand.

Logical Levels: Who? (Steve Jobs). Why? (To be the most innovative). What? (iPhones, iTunes, iPhone 7 release, etc.) Where? (Cupertino) When? (1977, 1984, Now)

Verb Tenses:  Do you have knowledge of Apple’s Past, Present, Future?

Pronouns: Do you know what “I” (Myself), “You” (My friend), “We” (Us as a group), “they” (Millennials) think about Apple?

 Notice how richly Apple’s powerful brand can be represented in your mind with language.

Now ask yourself the same thing about Samsung. Uhhhh….

What? (they make a cellphone - one that exploded).  Where? (South Korea.)

Comparatives & Superlatives

Since advertisers are often looking for a competitive advantage, I’d like to know how a customer, writer, commenter or reviewer is comparing things. Or what they think is the best. Comparatives and Superlatives can give us some insights into that.

A Comparative is the name for the grammar used when comparing two things. We can mine a corpus of text for them by searching “as … as” or “than”. You can also spot them by searching for the suffix “…er” or the word “more”.

“Generic brands are just as good as expensive brands.”

“Imports are better built than American cars.”


A Superlative is about one thing and how it is the best/worst/hottest, etc.

Superlatives can be found by searching the text for “…est” suffixes or the word “most”.

 Again, the NLTK Concordance is useful for letting us see the sentence/context around those words.
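A rough sketch of spotting both by their surface cues (the reviews are hypothetical, and suffix matching will produce some noise):

```python
import re

# Hypothetical reviews
reviews = [
    "Imports are better built than American cars.",
    "Generic brands are just as good as expensive brands.",
    "This is the cheapest and most reliable option.",
]

words = re.findall(r"[a-z]+", " ".join(reviews).lower())

# Crude surface cues; expect noise (e.g. "never" ends in "er"),
# which is why checking matches in context still matters
comparatives = [w for w in words if w.endswith("er") or w in ("more", "than")]
superlatives = [w for w in words if w.endswith("est") or w == "most"]
print(comparatives, superlatives)
```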

Emotion-filled Words

A more specific use of the word list used in Sentiment Analysis is to search the text corpus for words that contain emotion. Wouldn’t you like to know what people love, like, hate, adore, etc.?

Modal Operator verbs

These are verbs that can tell us a little about whether a person thinks something is possible (“can”, “may”, “could”, etc.) or impossible (“can’t”, “won’t”, “couldn’t”). Or mandatory, required – a necessity (“have to”, “should”, “shouldn’t”, “ought to”, “must”, “mustn’t”). So they can tell you about the mindset of the people talking. Their usefulness is limited, but if one form dominates the text, it’s worth considering when writing copy. “You have to have car insurance.” “This is a diet you can stick with.”
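A sketch of counting the three flavors of modal operators (the transcript line and the modal lists are hypothetical and far from complete):

```python
import re

# Hypothetical focus-group transcript snippet
transcript = ("I won't buy a new car right now, but I have to get insurance. "
              "Maybe I could lease one. You should shop around. "
              "I must decide soon.")

# Small, hand-picked modal lists; extend as needed
modal_groups = {
    "possible": ["can", "may", "could", "might"],
    "impossible": ["can't", "won't", "couldn't"],
    "necessary": ["have to", "should", "ought to", "must"],
}

# Count occurrences of each modal per group
counts = {
    group: sum(len(re.findall(r"\b" + re.escape(m) + r"\b", transcript.lower()))
               for m in modals)
    for group, modals in modal_groups.items()
}
print(counts)
```

If “necessary” dominates a corpus like this, that obligation-heavy mindset is worth reflecting in the copy.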


Word Counts

This allows you to count the number of times a single word appears in a text. This can be useful as you iterate through the Text Mining process. For instance, after web scraping 10,000 customer comments from a website for a company that puts custom logos on items, I was able to determine that twice as many customers commented on “koozie” as they did “mug”, leading me to believe that if you were only going to advertise one of the products, the “koozie” was probably the more popular item.


Concordance

Python’s NLTK (Natural Language Toolkit) Concordance allows you to search all your text for the context around a word. Text often doesn’t mean anything until it has a context. Take the word “love”: “I love that movie.” and “I don’t love that movie.” have very different meanings. Being able to search a word or phrase and see it in context can be extremely helpful throughout the Text Mining process.
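NLTK's Text.concordance prints this view for you; here's a minimal homemade version of the same idea, run on a hypothetical sentence:

```python
def concordance(tokens, keyword, window=4):
    """Show each occurrence of keyword with `window` tokens of context."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{token}] {right}")
    return hits

tokens = "I love that movie but I don't love the sequel at all".split()
for line in concordance(tokens, "love"):
    print(line)
```

The two hits land in very different contexts, which is the whole point of looking at concordance lines instead of raw counts.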

Frequency Distribution

FreqDist counts all the words in your text and lists them in order of popularity. This is often used in Word Clouds. It’s a good starting place but there are often better tools and algorithms. Partly because a lot of words are what are called “Stopwords”, meaning they don’t add much to your understanding. For instance, “The” is a word we see in text 7% of the time. “Of” over 3%. “Stopwords” have very little information in them, so it’s often a good practice to remove them before Text Mining.
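A sketch of the same idea with Python's built-in Counter and a toy stopword list (NLTK ships a much fuller list in nltk.corpus.stopwords; the review text is hypothetical):

```python
import re
from collections import Counter

# Hypothetical review text
text = ("The koozie keeps the drink cold and the logo looks great. "
        "The koozie is the best gift of all.")

# A tiny stopword list; real lists run to a few hundred words
stopwords = {"the", "and", "is", "of", "all", "a", "to"}

words = re.findall(r"[a-z]+", text.lower())
freq = Counter(w for w in words if w not in stopwords)
print(freq.most_common(3))
```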

Tf-idf: Term Frequency-Inverse Document Frequency

This is usually a better measure of the important words in a document within a corpus of documents. It’s a statistical accounting of how often a word shows up in a document, weighted by how rare that word is across all the documents. (Technically, a “document” can be just one customer comment/review/tweet, etc.) It lets you see which words carry the most informative weight in a text. For instance, the word “the” gives you almost no information. “Orgasm” gives you a lot.
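The formula itself is short. Here's a bare-bones sketch on a toy corpus, showing why “the” scores zero (it appears in every document, so its inverse document frequency is zero):

```python
import math

# Toy corpus; each string is one "document"
docs = [
    "the koozie keeps the beer cold",
    "the mug keeps the coffee hot",
    "the koozie has a custom logo",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tfidf(word, doc_tokens):
    tf = doc_tokens.count(word) / len(doc_tokens)        # term frequency
    df = sum(1 for d in tokenized if word in d)          # document frequency
    idf = math.log(N / df)                               # rarity across corpus
    return tf * idf

print(tfidf("the", tokenized[0]))     # 0.0
print(tfidf("koozie", tokenized[0]))  # > 0
```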


N-grams

Often a writer’s goal is to end up with a 5 to 7-word headline, so this can be helpful for seeing the most popular phrases. (Click to see how this can be used to write email Subject Headings.) “N” stands for the number of words in a phrase. N=2 (a bigram) means all the text has been divided into two-word phrases. NLTK will take all the sentences in a corpus of text and return them as groups of the designated number of words. Research shows that given two words, people can usually guess the third word. N=3 (trigram). “Oaky red _____”, for instance. NLTK will also return 4-grams and 5-grams, as well as sort N-grams by their frequency. Google famously scanned millions of books and made 450 million words available with their n-gram viewer so you can track the use of phrases over the last 200 years.
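Generating n-grams takes only a line or two; this sketch mimics what nltk.ngrams returns:

```python
def ngrams(tokens, n):
    """Slide an n-token window across the text, like nltk.ngrams."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "oaky red wine with a smooth finish".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[0], trigrams[0])
```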


Collocations

Collocations are words that are often seen together, though that doesn’t mean there are many of them in the text (unlike counting bigrams). An example of a collocation is a phrase like “powerful computer”. You could say “strong computer” and it wouldn’t be wrong. It just wouldn’t feel right. Some other examples of collocations include “crystal clear”, “middle management”, and “maple syrup”. (Click to see an example of how I mined a collocation and turned it into a banner ad.)
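NLTK's collocation finders score candidate pairs with association measures like PMI (pointwise mutual information) after filtering out one-off pairs; here's a homemade sketch of that idea on made-up text:

```python
import math
from collections import Counter

tokens = ("powerful computer with crystal clear display and maple syrup "
          "on the side crystal clear screen powerful computer deal").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def pmi(pair):
    """Pointwise mutual information: how much more often the pair
    appears than chance would predict from the individual words."""
    w1, w2 = pair
    p_pair = bigrams[pair] / (total - 1)
    return math.log2(p_pair / ((unigrams[w1] / total) * (unigrams[w2] / total)))

# Keep only pairs seen at least twice (like NLTK's apply_freq_filter),
# since PMI over-rewards pairs of one-off words
collocations = sorted((b for b in bigrams if bigrams[b] >= 2),
                      key=pmi, reverse=True)
print(collocations)
```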

Similar words and Associated Words

In some ways, this is like a Thesaurus. (Something I’ve used quite a bit to spark ideas when concepting.) Only this gives you similar words within the corpus of the text you’re mining. If you search for a similar noun, it will return similar nouns. Same for verbs, adjectives, and adverbs. (The R programming language will actually return words with a score for how associated the words are.) This can be helpful for concepting by helping you understand an overall theme and mindset in a body of text.
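Here's a toy version of the idea, ranking words by how many (previous, next) contexts they share, which is roughly what NLTK's Text.similar does under the hood (the token stream is hypothetical):

```python
from collections import defaultdict

tokens = ("i love this koozie i love this mug i hate this koozie "
          "we bought this mug we bought this koozie").split()

# Map each word to the set of (previous, next) contexts it appears in
contexts = defaultdict(set)
for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
    contexts[word].add((prev, nxt))

def similar(target):
    """Rank other words by how many contexts they share with target."""
    return sorted(
        (w for w in contexts if w != target and contexts[w] & contexts[target]),
        key=lambda w: len(contexts[w] & contexts[target]),
        reverse=True,
    )

print(similar("koozie"))
```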

Hapaxes (legomena)

Hapaxes are words that appear only one time in a corpus. At first this might lead you to believe that since a word isn’t popular, it’s not important. But the opposite can also be true. I heard a story that the wonderful Toyota “Swaggerwagon” campaign came from someone seeing that awesome word(s) in their Social Listening. Because a hapax appears only one time, it can be hard to sort for. In the 10,000-customer-comment spec project I web scraped, there were 8,000 hapaxes. Sometimes, though, hapaxes can be worth the time, in order to fish out unique expressions to brainstorm from. I’ve also played with writing a For Loop that randomly selected seven hapaxes, grouped them into a cluster, and pitched them to me like they belonged together.
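Finding hapaxes (and pitching a random cluster of them) is straightforward with a frequency count; the comment words here are hypothetical:

```python
import random
from collections import Counter

# Hypothetical tokenized comments
comments = ("love this koozie love the swaggerwagon vibe great koozie "
            "fast shipping great logo").split()

freq = Counter(comments)
hapaxes = [w for w, count in freq.items() if count == 1]
print(hapaxes)

# Pitch a random cluster of up to 7 hapaxes as brainstorming fodder
random.seed(42)  # fixed seed just so the demo is reproducible
cluster = random.sample(hapaxes, min(7, len(hapaxes)))
print(cluster)
```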

Sentiment Analysis

Sentiment Analysis is a way of determining whether a text is more positive or negative. It’s used a lot to gauge reactions to brands on Social Media because it can take large amounts of text, beyond what you could really read, and compare it to a list of several thousand words that contain emotion. It will then tell you how people feel (positive, negative, or neutral) on a scale of 1-100. I did a project recently where a researcher was doing online focus groups, and by using Sentiment Analysis, we were able to give the agency and the client a tangible, numerical score for the winning and losing campaign. It was very helpful.
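Real tools (like NLTK's VADER) ship lexicons of thousands of scored words; this is a deliberately tiny sketch of the idea, with made-up word lists and a 0-100 scale like the one described above:

```python
# Tiny hand-made lexicons; real lexicons run to thousands of words
positive = {"love", "like", "great", "adore", "blessing"}
negative = {"hate", "torture", "awful", "boring"}

def sentiment_score(text):
    """Return a 0-100 score: 50 is neutral, higher is more positive."""
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    if pos + neg == 0:
        return 50
    return round(100 * pos / (pos + neg))

print(sentiment_score("I love this campaign it is great"))
print(sentiment_score("this diet is torture and I hate it"))
```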

Regular Expressions for extracting “quotes” and (parentheticals)

A Regular Expression (or “Regex”) is a sequence of characters that defines a search pattern. It may be the densest code you ever see. It’s used a lot for extracting, and often replacing, text within a document. One useful technique is to search for any series of letters, numbers, or punctuation inside of “quotes” or (parentheses). Just like how people “air quote” when speaking to emphasize a point or call it out in jest, info inside of quotes can be really useful. Often it’s a cultural euphemism, a slang word, a metaphor, or just a great quote.

Example: In mining some text for a cloud service client, my search turned up a euphemism for one of the downsides of shared hosting: “noisy neighbor” (which basically means the downside of sharing a hosted server with other people/companies). It’s a commonly known phrase in the industry, but it also led to a fun ad.

 A Parenthetical is a word, clause, or sentence inserted as an explanation or afterthought (those can also contain valuable info).
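Both patterns are one-liners in regex (the sample text is hypothetical):

```python
import re

# Hypothetical review text
text = ('One reviewer called shared hosting a "noisy neighbor" problem '
        '(which means sharing a server with other companies) and said '
        'the support team was "surprisingly human".')

# Grab everything inside "quotes" and (parentheses)
quotes = re.findall(r'"([^"]*)"', text)
parentheticals = re.findall(r'\(([^)]*)\)', text)
print(quotes)
print(parentheticals)
```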

LSI (Latent Semantic Indexing)

Latent Semantic Indexing is an algorithm search engines like Google have used to help determine the contents of a web page, so they know how to steer you to the page you want. For instance, if you type “Mustang” into the search box, do you want to go to a website about horses or the Ford Mustang? LSI essentially eats all the words on a web page, reduces the text down to core words using vector math, and categorizes web pages based on core concepts within the page. So the horse “Mustang” web page might have core concepts like “horses”, “feed”, and “barns”, while the Ford “Mustang” web page might have words like “engine”, “wheels”, and “fuel injection”. This is one of my favorite algorithms because it will take a large amount of text (more than you could reasonably read through), reduce it down to a specified number of clusters, and literally pitch them to you as “concepts”. (Is that awesome or what?) You can also run it again and again with different parameters and get pitched different “concepts”. The downside is that in reducing the text down, it has a tendency toward pitching vague concepts.

 (Which reminds me of an old Woody Allen joke, “I took a speed-reading course and read War and Peace in twenty minutes. It involves Russia.”)

 To see an example of some rough, spec concepts I did as a demonstration of LSI, click here.
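To make the “vector math” concrete, here's a toy sketch of the LSI idea using NumPy's SVD on a hand-made term-document matrix. The four “pages” are hypothetical, and a real pipeline would use tf-idf weighting and a library like gensim:

```python
import numpy as np

# Toy corpus: two "horse" pages and two "Ford" pages (hypothetical)
docs = [
    "horse feed barn saddle horse",
    "engine wheels fuel injection engine",
    "barn horse feed hay",
    "wheels engine fuel turbo engine",
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document count matrix
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# SVD is the vector math at the heart of LSI; each left-singular
# vector is a latent "concept" over the vocabulary
U, S, Vt = np.linalg.svd(A, full_matrices=False)

concepts = []
for k in range(2):
    top = np.argsort(-np.abs(U[:, k]))[:3]
    concepts.append({vocab[i] for i in top})
print(concepts)
```

With this corpus, one concept lands on the car words and the other on the horse words, which is exactly the “pitch me concepts” behavior described above.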


Topic Modeling with LDA - Latent Dirichlet Allocation

LDA (Latent Dirichlet Allocation) is an algorithm that's useful for discovering a mixture of topics within a set of documents. Keep in mind that a "document" can be a number of pages or a single comment from a customer review or focus group. LDA can distill a set number of topics from a large series of documents.

This can be useful for taking an overwhelming number of documents/comments/reviews, etc. and condensing them down to a series of aggregated topics you can easily get your head around.
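For illustration, here's a compact, from-scratch sketch of LDA's collapsed Gibbs sampling on four hypothetical one-line documents. Real work would use a library like gensim or scikit-learn; the point here is only to show the mechanics:

```python
import random
from collections import Counter

random.seed(7)

# Tiny hypothetical "documents" (single comments count as documents)
docs = [
    ["horse", "feed", "barn", "horse", "hay"],
    ["engine", "fuel", "wheels", "engine"],
    ["barn", "horse", "feed", "hay"],
    ["fuel", "engine", "wheels", "turbo"],
]
K, alpha, beta = 2, 0.1, 0.01          # topic count and smoothing priors
V = len({w for d in docs for w in d})  # vocabulary size

# Random initial topic assignment for every word occurrence
z = [[random.randrange(K) for _ in doc] for doc in docs]
doc_topic = [[0] * K for _ in docs]
topic_word = [Counter() for _ in range(K)]
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        doc_topic[d][k] += 1; topic_word[k][w] += 1; topic_total[k] += 1

# Collapsed Gibbs sampling: resample each word's topic from its
# conditional distribution given all the other assignments
for _ in range(300):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] -= 1; topic_word[k][w] -= 1; topic_total[k] -= 1
            weights = [(doc_topic[d][t] + alpha) *
                       (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                       for t in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            doc_topic[d][k] += 1; topic_word[k][w] += 1; topic_total[k] += 1

for t in range(K):
    print("topic", t, [w for w, _ in topic_word[t].most_common(3)])
```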

Word Embeddings as a form of Lateral Thinking

Computers don’t like words; they compute numbers. So before you can work with words, you have to convert them to some sort of number. One way to do that is to convert words into linear algebra vectors. Vectors have magnitude (size) and direction. By doing this, you can reduce the dimensionality, making the text easier to work with. Also, as linguist J.R. Firth famously said, “You shall know a word by the company it keeps.” So similar words hang out together. Which means Word Embedding algorithms (Word2Vec, GloVe) can pitch you similar words, much like a thesaurus, but often more interesting and provocative. Sometimes, words and thoughts that you would never have even considered – the very definition of Lateral Thinking.

(This is a very popular but involved approach. One I've only begun to experiment with but I believe it offers great promise.)
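Real embeddings come from algorithms like Word2Vec or GloVe, but the Firth intuition can be sketched from scratch: build co-occurrence vectors and compare them with cosine similarity (the token stream is hypothetical):

```python
import math
from collections import Counter, defaultdict

tokens = ("i love my koozie i love my mug i hate traffic "
          "my koozie is great my mug is great traffic is awful").split()

# Count words co-occurring within a +/-2 token window
window = 2
vectors = defaultdict(Counter)
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            vectors[w][tokens[j]] += 1

def cosine(a, b):
    """Cosine similarity between two words' co-occurrence vectors."""
    dot = sum(vectors[a][w] * vectors[b][w] for w in vectors[a])
    norm = lambda v: math.sqrt(sum(c * c for c in vectors[v].values()))
    return dot / (norm(a) * norm(b))

print(cosine("koozie", "mug"))      # high: they keep similar company
print(cosine("koozie", "traffic"))  # lower: different company
```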

Text Classification with Naive Bayes Classifier / Logistic Regression / SVM / Random Forest

Naive Bayes is one of the more useful machine learning algorithms I've found for classifying text. A Naive Bayes classifier gives every word in a corpus a "vote" as to which category it might belong in. For instance, early on it was used for classifying emails into either a "spam" category or a "ham" category. But it can be used for a number of classification tasks, including Sentiment Analysis. For instance, I've used Sentiment Analysis to analyze text from an online focus group transcription where two different concepts were presented. I was able to give my researcher client a Sentiment Score (72 for the winning campaign, 28 for the losing campaign). So they had some tangible way to measure how much more the winning campaign was preferred. (In addition to other insights I text mined.)

Logistic Regression, like Naive Bayes, is also a great algorithm for classifying text. One of the added benefits of Logistic Regression is that it will not only classify text into two or more categories (or "classes", hence the word "classifier") but will also give you a probability as to the odds that a target is in a specific category (as opposed to a simple binary "yes" or "no" answer).
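Here's a from-scratch sketch of the Naive Bayes "voting" idea on four hypothetical labeled comments (a real project would reach for a library implementation like scikit-learn's MultinomialNB):

```python
import math
from collections import Counter

# Hypothetical labeled training comments
train = [
    ("i love this campaign it is great", "positive"),
    ("what a great fun idea", "positive"),
    ("i hate this boring campaign", "negative"),
    ("this idea is awful and boring", "negative"),
]

# Count words per class: these counts are each word's "votes"
class_words = {"positive": Counter(), "negative": Counter()}
class_docs = Counter()
for text, label in train:
    class_docs[label] += 1
    class_words[label].update(text.split())

vocab = {w for counter in class_words.values() for w in counter}

def classify(text):
    scores = {}
    for label, words in class_words.items():
        # Log prior + summed log likelihoods with Laplace smoothing
        score = math.log(class_docs[label] / len(train))
        total = sum(words.values())
        for w in text.split():
            score += math.log((words[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("this is a great idea"))
print(classify("this is boring and awful"))
```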

Text Summarization & Tf-idf as Creative Fodder

Using Tf-idf scores and random selection, it's possible to have text pitch you a cluster of relevant, related words to brainstorm with.
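A sketch of that recipe: score words by tf-idf, keep the high scorers, and let random selection pitch you a cluster (the comments are hypothetical):

```python
import math
import random
from collections import Counter

random.seed(3)  # fixed seed just so the demo is reproducible

# Hypothetical customer comments as tokenized "documents"
docs = [d.split() for d in [
    "custom koozie with bold logo print",
    "soft koozie keeps drinks frosty all summer",
    "the mug cracked but support was quick",
]]
N = len(docs)

# Score every word by its best tf-idf across the documents
scores = {}
for doc in docs:
    counts = Counter(doc)
    for w, c in counts.items():
        df = sum(1 for d in docs if w in d)
        scores[w] = max(scores.get(w, 0), (c / len(doc)) * math.log(N / df))

# Pitch a random cluster of high scorers as brainstorming fodder
top = sorted(scores, key=scores.get, reverse=True)[:10]
cluster = random.sample(top, 5)
print(cluster)
```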