Kitsune Deepcrawl
A crawler is a program that automatically fetches the contents of a web page. litecrawl is about quickly accessing the contents of web pages via HTTP requests, understanding the assets used in the webpage and politely downloading particular assets. We also track the interlinking between the documents. This comes with a challenge: static yet dynamic sites.

Static Yet Dynamic Sites

[Image: Google Maps, when you disable JavaScript.]
It is 2017 and static websites are no longer just static. A typical static website uses only minimal JavaScript to make the page look nice. The real challenge is crawling websites that have a static front-end but make several requests to a backend server to populate the DOM with data. This pattern has become common with frameworks like React and AngularJS, which update the DOM as more data becomes available or as the user interacts with the page.
When we try to use litecrawl on such websites, we are often greeted with a blank or badly broken page, and the crawled pages contain next to no hyperlinks. To crawl these dynamic yet static websites we need to render them in a browser. Thankfully, some work already exists in this area, since dynamic content on the web keeps growing rapidly.

deepcrawl to the rescue

Building on the idea of pairing a real browser with a crawler, as in the Internet Archive's Brozzler project, we developed deepcrawl as a way to access content from websites that load it via JavaScript execution. Brozzler is aimed at recording interactions between servers and web browsers as they occur, more closely resembling how a human user would experience the web resources they want to archive [1]. Instead of following hyperlinks and downloading each webpage directly, deepcrawl opens them in a browser. Once the content has finished rendering in the browser, we convert the DOM to its corresponding HTML source by injecting JavaScript at runtime.

// Let’s get some source

document.head.outerHTML

document.body.outerHTML

These two lines of vanilla JavaScript give us the required HTML source. This works great for pages without iframes. For web pages with iframes, we need to crawl each iframe's source as well. We use a fork of Splash - github.com/scrapinghub/splash. Quoting their readme, "Splash is a javascript rendering service with an HTTP API. It's a lightweight browser implemented in Python 3 using Twisted and QT5" [2]. It returns the HTML sources of the DOM and of the iframes within the web page over an HTTP API. This lets us share core crawling logic between litecrawl and deepcrawl, and it also powers our screenshot service. Using a combination of the Brozzler and Splash styles of browser-based crawling, we are able to gather information about linked assets and their impact on the load time of the webpage. This information is crucial for filtering out non-existent asset links and for optimizing the assets that have a negative effect on page load time.
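
As a rough illustration, here is a minimal sketch of how fetching a rendered page and its iframe sources through Splash's HTTP API could look. The endpoint and parameters (render.json, html, iframes, childFrames) come from Splash's public documentation; the local URL and the helper name are assumptions for illustration, not our production code.

import requests

SPLASH_URL = "http://localhost:8050/render.json"  # assumed local Splash instance

def fetch_rendered_sources(page_url, wait=2.0):
    # Ask Splash to render the page, returning the main document's HTML
    # plus the HTML of every iframe it contains.
    response = requests.get(SPLASH_URL, params={
        "url": page_url,
        "wait": wait,   # give page scripts time to populate the DOM
        "html": 1,      # include the rendered HTML of the main document
        "iframes": 1,   # include child frame information as well
    })
    response.raise_for_status()
    result = response.json()

    sources = [result.get("html", "")]

    def walk(frames):
        for frame in frames:
            sources.append(frame.get("html", ""))
            walk(frame.get("childFrames", []))

    walk(result.get("childFrames", []))
    return sources
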
Both litecrawl and deepcrawl maintain a graph, called the ReferenceGraph, of all the assets referenced across all the documents, much like a social graph with unidirectional links, similar to followers on Twitter. This ReferenceGraph is crucial for syncing the filename change of any asset post-optimization in O(1).
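
To make the idea concrete, here is a hypothetical sketch of such a structure in Python; the class and method names are illustrative, not our actual implementation. Keeping a reverse index from each asset to the documents that reference it is what makes the post-optimization rename a constant-time lookup.

from collections import defaultdict

class ReferenceGraph:
    # Unidirectional links: documents point to the assets they reference,
    # with a reverse index from each asset back to its referencing documents.

    def __init__(self):
        self.assets_of = defaultdict(set)      # document URL -> asset URLs
        self.referenced_by = defaultdict(set)  # asset URL -> document URLs

    def add_reference(self, document_url, asset_url):
        self.assets_of[document_url].add(asset_url)
        self.referenced_by[asset_url].add(document_url)

    def rename_asset(self, old_url, new_url):
        # One dictionary lookup finds every document that must be updated
        # after the asset has been optimized and renamed.
        documents = self.referenced_by.pop(old_url, set())
        self.referenced_by[new_url] = documents
        for document_url in documents:
            self.assets_of[document_url].discard(old_url)
            self.assets_of[document_url].add(new_url)
        return documents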

looking forward

deepcrawl lets us trigger events on each network response, opening up a new dimension: studying the webpage's interaction with its server. Observing the network traffic between the static frontend and the dynamic backend helps us confirm that the page has loaded completely.
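
One hypothetical way to use those response events is a simple network-idle heuristic: track how many requests are in flight and consider the page loaded once nothing has been outstanding for a short quiet period. The hook names below are assumptions for illustration, not deepcrawl's actual API.

import time

class NetworkIdleWatcher:
    # Consider the page loaded once no request has been in flight
    # for quiet_seconds.

    def __init__(self, quiet_seconds=1.5):
        self.quiet_seconds = quiet_seconds
        self.in_flight = 0
        self.last_activity = time.monotonic()

    def on_request_started(self, url):            # hypothetical browser hook
        self.in_flight += 1
        self.last_activity = time.monotonic()

    def on_response_received(self, url, status):  # hypothetical browser hook
        self.in_flight = max(0, self.in_flight - 1)
        self.last_activity = time.monotonic()

    def page_loaded(self):
        idle_for = time.monotonic() - self.last_activity
        return self.in_flight == 0 and idle_for >= self.quiet_seconds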

many thanks

We are grateful to the Internet Archive for the Brozzler project and to ScrapingHub for the Splash project, and for keeping them openly accessible.