Crawling a website in 2019: the good, the bad & the ugly
As part of our ongoing SaaS development work, our marketing suite required that we crawl, analyse and store information about websites. This post covers the challenges we faced and how we overcame them.
Website development is moving fast, and the days of fully server-side rendered websites are disappearing. More and more sites build their content in the browser with JavaScript. This brings about a whole new set of challenges for us, for Google and for everyone else in between.
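This means the traditional website crawler, as we've known it for the last 15+ years, often just sees a blank page: it downloads the HTML the server sends but never runs the scripts that fill that page with content.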
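To make the "blank page" concrete, here is a minimal sketch of what a source-only crawler can extract from a client-side rendered page. The HTML shell, the `/static/app.js` path and the `visible_text` helper are made up for illustration; they are not taken from our crawler.

```python
from html.parser import HTMLParser

# Hypothetical single-page-app response: the server sends an empty shell and
# a script tag; the real content would be created later by JavaScript.
RAW_HTML = """<!doctype html>
<html>
  <head><title>Example shop</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/app.js"></script>
  </body>
</html>"""


class VisibleTextParser(HTMLParser):
    """Collects the text a static crawler could index from the page body."""

    # Text inside these tags is not body content a visitor would read.
    SKIPPED_TAGS = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text outside the skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the body text found in raw HTML, without running any scripts."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # The shell contains no body text, so there is nothing to index.
    print(repr(visible_text(RAW_HTML)))
```

The same helper returns real text for a server-rendered page, which is why this approach worked for so long and why it quietly fails on script-driven sites.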
Introducing headless browsers
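Enter headless browsers, such as headless Chrome and Chromium. We, like Google and many others, have moved to running a real browser without a face, that is, without a visible window. Have you ever noticed that, when you browse to a website, view-source often shows very different HTML from the Inspect Element tool? View-source shows the HTML the server sent, while Inspect Element shows the DOM after the page's scripts have run.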
It is therefore essential for any modern crawler to look at the rendered DOM and not just the source, as crawlers traditionally did. Google has been doing this, albeit slowly, for several years. In May 2019, they kicked it up a notch, announcing that Googlebot would render pages with an up-to-date version of Chromium and keep it current going forward.
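As a rough sketch of reading the DOM rather than the source, headless Chrome can print the serialized DOM of a page with its `--dump-dom` flag. The example below wraps that in Python. The binary name `chromium` is an assumption (it may be `google-chrome`, `chromium-browser` or a full path on your machine), the helper names are hypothetical, and this is not necessarily how our crawler drives the browser.

```python
import subprocess


def build_dump_dom_command(browser_binary, url):
    """Build the command line that asks headless Chrome to print the DOM."""
    # --headless runs the browser without a window;
    # --dump-dom prints the serialized DOM to stdout after the page loads.
    return [browser_binary, "--headless", "--dump-dom", url]


def fetch_rendered_dom(url, browser_binary="chromium", timeout_seconds=30):
    """Return the DOM as the browser sees it after scripts have run.

    Assumes a Chrome or Chromium binary is installed and on the PATH.
    """
    result = subprocess.run(
        build_dump_dom_command(browser_binary, url),
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
        check=True,  # raise if the browser exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Requires a local browser install and network access.
    html = fetch_rendered_dom("https://example.com")
    print(html[:200])
```

Content that a page loads late, for example after a delayed request, may still be missing from a one-shot dump like this. Crawlers that need finer control over when to read the DOM typically drive the browser through an automation library instead.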