Crawling a website in 2019: the good, the bad & the ugly

As part of our ongoing SaaS development work, our marketing suite required that we crawl, analyse and store information about websites. This blog post covers the challenges we faced and how we overcame them.

The move to client-side / JavaScript websites

Website development is moving reasonably fast - the days of fully server-side-rendered websites are fast disappearing. This brings about a whole new set of challenges for us, for Google and everyone else in between.

A website built in React, Angular or VueJS often comprises only a single HTML container element - the JavaScript does the rest, pulling all the content into the DOM.

This means your traditional website crawler, as we've known it for the last 15+ years, just sees a blank page.

Introducing headless browsers

In come headless browsers - Chrome and Chromium. We, like Google and many others, have moved to literally emulate a browser without a face. Have you ever noticed that when you browse to a website, View Source often shows very different HTML from what the Inspect Element tool reveals?

View Source shows you the page prior to any JavaScript execution; the Inspect Element tool shows you the Document Object Model (DOM), which includes all the updates JavaScript has made.
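The difference is easy to demonstrate. Below is a minimal sketch in Python, using only the standard library: the same naive text extraction is run over what View Source returns for a typical single-page-app shell, and over what the DOM looks like once the JavaScript has run. The HTML snippets, class names and page content here are illustrative, not taken from any real site.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, roughly the way a traditional crawler might."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# What "View Source" returns for a typical SPA shell: one empty container.
VIEW_SOURCE = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

# What the DOM looks like after the JavaScript has run (as seen in Inspect Element).
RENDERED_DOM = '<html><body><div id="root"><h1>Our Products</h1><p>Widgets from £5</p></div></body></html>'

print(repr(extract_text(VIEW_SOURCE)))   # '' - nothing for a traditional crawler to index
print(extract_text(RENDERED_DOM))        # Our Products Widgets from £5
```

A crawler reading the raw response gets an empty string; only by looking at the rendered DOM does the page's actual content appear.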

Therefore it is essential for any modern-day crawler to look at the DOM, not the source as crawlers traditionally would. Google has been doing this, albeit slowly, for several years. In early 2019 they kicked it up a notch, announcing that they will persistently crawl using the latest version of the Chrome web browser.
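In practice, crawling the DOM means driving a headless browser and reading the page back only after the JavaScript has executed. The sketch below shows one common way to do this with Selenium and headless Chrome; it assumes Chrome and a matching chromedriver are installed, and the function and parameter names are our own, not from any particular library's crawling API.

```python
def fetch_rendered_html(url: str, timeout: float = 10.0) -> str:
    """Return the post-JavaScript DOM for `url` using headless Chrome.

    A minimal sketch: real crawlers would add waits for dynamic content,
    error handling, politeness delays and so on.
    """
    # Imported inside the function so the sketch stays self-contained.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a UI
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(timeout)
        driver.get(url)
        # page_source here reflects the live DOM, not the raw server response.
        return driver.page_source
    finally:
        driver.quit()
```

Calling `fetch_rendered_html("https://example.com")` returns HTML equivalent to what Inspect Element shows, which is exactly what a modern crawler needs to analyse.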