Many readers of this blog might have no idea how we make money. The short answer is that we sell data to companies. We also write books that many of our awesome readers purchase. But our biggest source of revenue comes from companies that hire us to crawl data for them. Unlike other sites with a strong content focus, our philosophy is to make money by selling things, mostly to businesses.
Tech companies and hedge funds pay us between $2K and $10K per month to crawl web pages, structure the information, and then deliver it to them in analyzed form. This is a pretty significant amount of money because acquiring data is a burning problem for some companies.
Why is web crawling such a pain point? Well, unless it’s your primary focus, you’re constantly re-inventing the wheel: setting up backend infrastructure, figuring out how to browse a particular web page and encode it properly, figuring out how to queue and rate-limit crawl tasks, how to parse the information, and how to keep things running when they break a few months later because the website changed.
Even we were reinventing the wheel almost every time for our first few customers. But we figured if we did this enough times, not only would we get better at it, but we could build the product we wished existed for crawling and analyzing web data. And that’s exactly what we’ve been working on.
Today, we’re introducing the Priceonomics Analysis Engine, a tool that helps you crawl and analyze web data, whether at scale or just one page at a time. It’s still what you would call a “minimum viable product”, in skeletal form, but we wanted to release it to get feedback from developers and anyone else who might want to use it.
Fetch: An API for Web Crawling
The core of the Priceonomics Analysis Engine is its ability to fetch data with high reliability. This service is called Fetch, and it allows developers to fetch web pages at scale without having to implement an entire crawling backend. Developers simply point Fetch towards a URL and it handles the rest. What is “the rest”?
Fetch performs HTTP requests through our crawling backend. Our API comes with the built-in ability to route requests through several countries around the world, understand and obey (or disobey, at your discretion) robots.txt, set custom user agents, and normalize encoding to UTF-8 for text content. Fetch is also designed for maximum reliability, with built-in rate limiting, so you may find it solves a lot of the problems you’ve had on large-scale web crawling projects in the past.
Since this is a very early release, the tool still has limitations we plan to fix. For example, this version of Fetch only supports HTTP GET requests, there is no interface for custom headers in general, and cookies cannot be passed through the tool. As the Analysis Engine matures, we intend to expand on Fetch to make it an extremely flexible tool for accessing web data in a variety of ways.
Developers simply POST input data to the Fetch endpoint as JSON, and receive the output as soon as the request completes. For large workloads, requests can be made asynchronously and results collected later.
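For large asynchronous workloads, the general pattern is to submit a job and poll for its result later. The job-status schema below ({"status": ..., "result": ...}) is an assumption for illustration, not the documented API; a generic polling helper might look like this, with a simulated status endpoint so the sketch runs without network access:

```python
import time

def poll_until_complete(get_status, interval=2.0, max_tries=30):
    """Poll a job-status callable until it reports completion.

    `get_status` is any function returning a dict like
    {"status": "pending"} or {"status": "complete", "result": ...}.
    This schema is hypothetical -- see the docs for the real one.
    """
    for _ in range(max_tries):
        job = get_status()
        if job.get("status") == "complete":
            return job.get("result")
        time.sleep(interval)
    raise TimeoutError("job did not complete in time")

# Simulated status endpoint standing in for an async Fetch job:
_responses = iter([
    {"status": "pending"},
    {"status": "complete", "result": "<html>...</html>"},
])
result = poll_until_complete(lambda: next(_responses), interval=0.0)
print(result)
```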
The Analysis Engine
While the process of machines fetching web pages is surprisingly challenging, we would have zero customers if we gave them raw data dumps of the HTML of hundreds of millions of web pages. We analyze this raw data by writing applications to interpret the pages. Some of these applications are very specific (e.g., “parse the fields on this particular web page”), while others are general-purpose data science applications that could be used on any web page.
The engine is built around the concept of applications with specific, well-defined functionalities. An application works a lot like a function in any other programming language, except that applications are accessed as their own endpoint on the Analysis Engine API.
There are four analysis applications available at launch, and we will be continuously implementing more based on what developers need. The four applications are:
Social: Look up Facebook, Twitter, Pinterest, etc. shares for a given URL.
Extensions: Detect third-party extensions embedded within HTML, so you can analyze what powers a site.
Links: Extract all links and anchor text from an HTML document (useful for recursive crawling or building your own Google!).
Contacts: Extract clear-text contact information from text or HTML, including phone numbers and email addresses.
We’re also currently planning to build the following analysis applications:
Assemble: Take the output of hundreds of asynchronous calls and assemble it into a CSV document, or other data format.
Deliver: Fire off output as an email, deliver via SFTP, or create a download link.
Summarize: Automatically summarize a page based on machine learning and natural language processing.
Screenshot: Render a page as it appears on various browsers and devices.
Pricer: For a top ecommerce site, extract the price and name of the product at a given URL.
RankSEO: For a search engine query, report which sites rank first, second, third, and so on.
Dedupe: Don’t crawl the same page twice.
We’re open to suggestions for things you’d like to see. An image classifier, malware detector, cookie detector, an ad detector — if enough of you want it, we can build it! Or, if people are really interested, they can even build it themselves (eventually).
Have an idea for a useful application for analyzing web pages? Let us know!
How to use the API
To get started, let’s learn how to call the Fetch application to crawl web pages and route requests through any supported country. Our documentation includes a live console that hits our production API, so you don’t even need a programming language to play with the API and learn how to use it. If you’re so inclined, feel free to skip to the full documentation for more detail and examples for every application.
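As a sketch of what such a call looks like, the snippet below assembles the JSON body for a synchronous Fetch request. The endpoint URL and field names here are assumptions for illustration; check the documentation’s live console for the exact schema:

```python
import json

# Hypothetical endpoint and field names -- consult the docs for the
# real schema; this sketch only illustrates the shape of a Fetch call.
FETCH_URL = "https://api.priceonomics.com/v1/apps/fetch"

def build_fetch_request(url, country="CA", obey_robots=True,
                        user_agent="my-crawler/1.0"):
    """Assemble the JSON body for a synchronous Fetch call."""
    return {
        "url": url,                 # page to crawl
        "country": country,         # route the request through Canada
        "robots": obey_robots,      # honor robots.txt
        "user_agent": user_agent,   # custom User-Agent string
    }

payload = build_fetch_request("https://priceonomics.com")
print(json.dumps(payload, indent=2))

# To actually issue the call (requires the `requests` package and an API key):
# import requests
# resp = requests.post(FETCH_URL, json=payload,
#                      headers={"X-Api-Key": "YOUR_KEY"})
# html = resp.json()["content"]   # UTF-8-normalized HTML
```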
In short, a call to Fetch can ask for the HTML content of a URL as it would appear from Canada, obey robots.txt, and provide a custom user agent string. The call is synchronous: it waits for the request to complete and returns the results as soon as possible.
If you aren’t a user of Python, our live console allows you to generate code samples for a variety of languages. For even more samples and tools to make using the Analysis Engine easier, we’re publishing an open source repository on GitHub. If it’s missing something you want, fork it and contribute!
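For comparison, here is what a call to the Social application might look like. As above, the endpoint URL and response field names are assumptions; only the input shape (a lone "url" field) comes from the description below:

```python
import json

# Hypothetical endpoint; "url" is the only required input field.
SOCIAL_URL = "https://api.priceonomics.com/v1/apps/social"

payload = {"url": "https://priceonomics.com"}
print(json.dumps(payload))

# Synchronous call sketch (`requests` package and API key assumed):
# import requests
# shares = requests.post(SOCIAL_URL, json=payload,
#                        headers={"X-Api-Key": "YOUR_KEY"}).json()
# print(shares.get("facebook"), shares.get("twitter"))
```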
A call to Social is nearly identical to a call to Fetch, except that “url” is the only field we need to include in the input. Again, the call is synchronous, so the results are returned immediately.
The API is free for modest usage (currently 1,500 requests a day; sign up for a private API key), and will have a low price per request for larger crawls. Or you can just hire our data services group to implement the crawl for you. We haven’t yet built a full terms of service beyond the following: doing bad or illegal things with the engine is prohibited.
Right now, the analysis applications only support web pages and text, but our roadmap includes support for other unstructured data sources like PDFs and images, and we’re open to other suggestions.
Next in the development pipeline: Pipeline
The goal of the Analysis Engine is not only to make applications easy to use with web data, but also easy to use with each other. Early applications like Extensions, Links, and Contacts don’t make much sense without the Fetch app to provide the data, so we designed their interfaces such that output from Fetch is also valid input to Extensions, Links, and Contacts. Applications are intended to be chained together into data flow paths, forming larger analysis systems that deliver more value than any single application can individually.
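The chaining idea can be sketched with stand-in functions: the dict returned by a Fetch-like step is passed unchanged to a Links-like step. The field names ("content", "links") are assumptions, not the documented schema:

```python
import re

def fake_fetch(url):
    """Stand-in for the Fetch app: returns crawled HTML plus metadata."""
    return {"url": url, "content": '<a href="/about">About</a>'}

def fake_links(fetch_output):
    """Stand-in for the Links app: consumes Fetch output as-is."""
    hrefs = re.findall(r'href="([^"]+)"', fetch_output["content"])
    return {"url": fetch_output["url"], "links": hrefs}

# Output of one application is valid input to the next:
page = fake_fetch("https://priceonomics.com")
print(fake_links(page)["links"])  # ['/about']
```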
One of the biggest limitations of the API right now is that application calls have to be made individually, so the client has to rate-limit (some aspects of) their own usage, manage data flow, and generally babysit the entire process. We are working on a solution to this called Pipeline, which will be an application for defining data flow between applications and scheduling massive crawls with a single API call.
If you’d like to be updated when Pipeline launches, please join our developers email list.
The Vision: A Platform to Run Data Science Applications
We built the Analysis Engine as a platform for developers to write applications that analyze web pages and other unstructured data sources.
We hope the longer-term value of the Analysis Engine will be in the diversity of data analysis applications that run on the platform. We are committed to building many more applications ourselves to expand what can be done with the Analysis Engine, but we strongly believe that opening application development to the community is the most powerful way forward. We are still working out the best way to implement this, but it will involve developers making money if they create useful apps, so please stay tuned!
For now, the primary pain point the Analysis Engine solves is abstracting away the need for developers to set up a backend to crawl web pages.
The Priceonomics data business started off as a service to help companies crawl web pages and make sense of the data. What we’re releasing today is the product we built that service around. We’re eager to see how people use the Analysis Engine in practice, so please: go play with the engine, read the docs, ask us questions, use the engine to fetch data, and feel free to build things with it!
This post was written by Brendan Wood and Rohin Dhar. Thanks to Fred McDavid and Elad Yarom for their incredibly hard work on this.
For updates on the API and Analysis Engine, join our developer email list.