Extract Text and Calculate the Reading Level of Any Article Using the Analysis Engine

For those of you who follow our blog because you’re interested in data analysis, here is a quick update on the Priceonomics Analysis Engine, our tool for crawling and analyzing web data. For those who aren’t familiar with it, our engine allows you to analyze any web page and extract its number of social shares, its links, or its HTML. In short, it makes crawling web pages easier.

Now, the tool is even more useful: we just launched two new apps.

The first application is called Article, and it helps you extract text from a sea of HTML. It intelligently analyzes HTML, identifies the most likely article text, and then extracts that text and strips all HTML formatting. This could assist you in building an app like Readability, or Instapaper, or could help you extract just the text from a random page without writing a parser.

The end result is plain old human language text that you can use as a starting point for deeper analysis. Our Article app uses the python-readability tool (available here) to extract the meaningful text content from an HTML webpage.

To extract just the text, you can follow the API documentation of the Article extraction app, or you can just type URLs manually into our Analysis Engine.

What kind of analysis can you then do to this raw text? Well, anything really. The first raw text analysis application we’re launching today is called Reading Level and uses a variety of reading level indexes — the most well-known being the Flesch–Kincaid index — to calculate the Reading Level of a block of text. All of these indexes generally take a body of text and then calculate how many years of education someone would need to understand the text (an article is written at an 9th grade level, for example).

Here’s the API documentation for the Reading Level app; to test it out yourself, just type a URL into our Analysis Engine.

Finally, we made some updates to Fetch, the core app of the Priceonomics Analysis Engine that makes HTTP requests. A couple weeks ago we added the ability to specify custom headers with each Fetch request, as well as the ability to allow other HTTP request methods (previously only HTTP GET was supported). These features allow you to do things like crawl pages that require a login, or send data to other services.

More improvements are on the way, including a ‘glue’ app that pipes data from app to app, a ‘broadcast’ app that splits lists of inputs across multiple apps, and more tutorials to showcase usage. If you have any ideas for more apps that can analyze text or web pages, please let us know!

In the meantime, we’ll be using these new applications to publish some interesting articles on the blog. For example, the fact that, of all the writers on the blog, the Priceonomics CEO writes at the lowest grade level.

This post was written by Brendan Wood and Rohin Dhar. For updates on the API and Analysis Engine, join our developer email list.

Extract Text and Calculate the Reading Level of Any Article Using the Analysis Engine

Priceonomics

Published February 23, 2015 by Priceonomics

Request a Demo