Recently, the Priceonomics data crawling team built Keywords, an application that extracts keywords from a text by looking at how often, and where, each word is used. Common words like “but” tend to be spread uniformly throughout a document, whereas more important keywords like “Quixote” tend to show up in clusters and other non-random patterns. An uneven, clustered pattern signals that a word is important.

The distribution of the words “Quixote” and “but” in the first 50,000 words of Don Quixote. Source: Carpena et al.
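
To make the clustering idea concrete, here is a minimal Python sketch. It is not the Keywords code. It is an illustration that assumes a simple spacing statistic: for each word, measure the gaps between consecutive occurrences, divide their standard deviation by their mean, and correct for how common the word is. The function name `clustering_scores`, the `min_count` cutoff, and the tokenizing regex are all assumptions made for this example.

```python
import re
from collections import defaultdict
from statistics import mean, pstdev


def clustering_scores(text, min_count=5):
    """Score each word by how unevenly its occurrences are spaced.

    Hypothetical helper for illustration; not the actual Keywords algorithm.
    """
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())

    # Record the position of every occurrence of every word.
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)

    total = len(words)
    scores = {}
    for word, where in positions.items():
        # Skip rare words: too few gaps to say anything about their spacing.
        if len(where) < min_count:
            continue
        p = len(where) / total  # share of the text taken up by this word
        if p >= 1:
            continue  # the text is a single repeated word; nothing to compare
        # Distances between consecutive occurrences of the word.
        gaps = [b - a for a, b in zip(where, where[1:])]
        # Relative spread of the gaps: 0 when evenly spaced, large when clustered.
        sigma = pstdev(gaps) / mean(gaps)
        # For randomly placed words the gaps are roughly geometric, so sigma
        # is about sqrt(1 - p); dividing by it puts "random" near 1.
        scores[word] = sigma / (1 - p) ** 0.5
    return scores
```

Under this sketch, a word spaced at perfectly regular intervals scores 0, a word scattered at random scores roughly 1, and a word that bunches up in a few passages scores well above 1. Sorting the scores in descending order gives a rough keyword ranking.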

Of course, after building Keywords, the first thing we analyzed was the text of E.L. James’ novel, Fifty Shades of Grey. We were very satisfied to learn that the most statistically significant keyword in the book was, by far, the word “dominant.” As in:

“You’re a sadist?”

“I’m a Dominant.” His eyes are a scorching gray, intense.

“What does that mean?” I whisper.

Next, we tested the tool on more trifling subjects, such as the entire history of the United States of America.

Every year, the President of the United States gives a speech reporting on the State of the Union. These speeches seemed like a good place to start analyzing American history. By running Keywords on each State of the Union address, we could trace which subjects were politically important throughout American history (at least, the official version of history, as reported by the President).

Below are the top 10 keywords used in the State of the Union addresses given by 41 Presidents (President Obama is the 43rd person to hold the office, and William Henry Harrison and James Garfield both died before they could deliver a State of the Union). Instead of showing all 93 speeches, we picked a representative address from each President. In each speech, we highlighted the terms we found most interesting.

Keywords in State of the Union Addresses, by President

This chart gives an interesting snapshot of each presidency: In 1902, Theodore Roosevelt talked about building a canal in Panama and busting monopolistic corporations:

We can do nothing of good in the way of regulating and supervising these corporations until we fix clearly in our minds that we are not attacking the corporations, but endeavoring to do away with any evil in them.

In 1945, 43 years later, his cousin Franklin talked about German and Japanese forces, and the struggle for lasting peace:

[W]herever men love freedom the hope and purpose of the people are for peace—a peace that is durable and secure.

The chart also gives insight into the perennial issues that remained important for decades but are now long gone from the national spotlight. Presidents used to talk about canals and railroads, western expansion, “vessels,” gold and silver, communists, and tribes. It’s interesting to wonder whether the more recent focus on healthcare, energy, and “terror” will look just as anachronistic centuries from now.

This post was written by Rohin Dhar and Rosie Cima. Data crawled by Elad Yarom. For updates on the API and Analysis Engine, join our developer email list.

