Web crawling: Why we use it and the common challenges we face

16 May, 2022

By Nathan Trevivian


You may have heard of web crawling before, but how do crawlers work, and why do we need them?

Web crawling is behind so much of what we do here at CameraForensics. It enables us to source and index pages for images from the world wide web, as well as the deep and dark web. The result: digital evidence which forensics investigators can use for image analysis.   

With the ability to function at a massive scale, web crawling is an essential component that informs our work. It enables image forensics users to more accurately and rapidly source information that can help locate potential victims of abuse and identify those responsible. 

  

What does web crawling involve? 

Put simply, web crawling (a close relative of web scraping) involves deploying ‘crawlers’ that discover URLs and links on the open web, then move along these pathways, cataloguing and indexing the images that they encounter.

To begin with, web crawlers start with a list of known URLs and index these domains. As they progress, they follow links to other sites and begin the process again. Like a snowball effect, this technique allows crawlers to uncover and categorise a large amount of web data from a much smaller set of initial known URLs.
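The snowball effect described above can be sketched as a breadth-first traversal. This is a minimal illustration under stated assumptions, not our production crawler: the `crawl` function, the `get_links` callback and the `toy_web` dictionary are hypothetical names, and the "web" here is an in-memory dictionary so the example runs without any network access.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 1000,
) -> list[str]:
    """Visit the seed URLs first, then every page they link to, and so on."""
    frontier: deque[str] = deque()
    seen: set[str] = set()
    for seed in seeds:
        if seed not in seen:  # ignore duplicate seeds
            seen.add(seed)
            frontier.append(seed)

    visited: list[str] = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)  # a real crawler would index the page here
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical four-page web: one known URL leads to three more.
toy_web = {
    "https://a.example/": ["https://b.example/", "https://c.example/"],
    "https://b.example/": ["https://c.example/", "https://a.example/"],
    "https://c.example/": ["https://d.example/"],
    "https://d.example/": [],
}
order = crawl(["https://a.example/"], lambda url: toy_web.get(url, []))
```

Starting from a single seed, the crawl reaches all four pages. The `seen` set stops pages that link back to each other from being queued repeatedly, and `max_pages` caps the total work.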

Why do we crawl? 

A search engine is only as strong as the amount of information it holds and can retrieve on command. The CameraForensics platform is no different. By adding to the information we index, we can improve the experience of our users and help them to reach crucial intelligence used to safeguard victims.

By ethically deploying web crawlers on the open, deep, and occasionally the dark web, we can bolster our search engine with more relevant, and actionable, intelligence to drive digital image forensics efforts. 

The core advantages of deploying web crawlers rather than using a manual approach are the efficiency of indexing and the scale on which they can work. What would take us a day to index and collate by hand can be completed in just a few minutes with crawlers. As a result, they can operate quickly over huge numbers of pages, providing us with more information to supply to our users in a much shorter time period.

  


 

Web crawling and a moral approach 

With permissions from domains, and a focus on improving the capabilities of similarity hashing, we can make sure that our crawlers are as efficient as possible. With this focus in mind, we can optimise our user experience and supply relevant information for use in image forensics techniques.  

However, obtaining this permission is not always possible. This poses a significant moral dilemma - to crawl or not to crawl. At CameraForensics, we believe in choosing the right course of action, which is why we always try to crawl with the express permission of the sites concerned. While this can limit how efficient our processes can be, this moral approach is one that we’re proud to stand by.
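One common machine-readable signal of a site's crawling preferences is its robots.txt file. The sketch below is an illustration only, and it is an assumption on our part to show it here: robots.txt is not the same thing as the express permission described above, and the `is_allowed` helper, the `ROBOTS_TXT` sample and the `example-crawler` user agent are hypothetical. It uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from the site's /robots.txt path before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gallery_ok = is_allowed(ROBOTS_TXT, "example-crawler", "https://site.example/gallery/1.jpg")
private_ok = is_allowed(ROBOTS_TXT, "example-crawler", "https://site.example/private/1.jpg")
```

With these sample rules, the gallery URL is permitted and the URL under `/private/` is not.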

  

The challenges of web crawling 

Obtaining the permission of these domains isn’t the only issue that we face. 

One of the core challenges that we experience when deploying web crawlers is the unique nature of each domain. The general idea that one crawler can navigate any web page is a myth. Instead, with different pathways and links to consider, we have to work out how to crawl each site individually - which can cause unwanted complications.

Another significant challenge comes from domains that do not want to be crawled, or that are concerned about their safety.

These sites often take measures to prevent crawlers from interacting with them at all. This is commonly achieved through several different means, such as:

Crawler traps 

Crawler traps are specifically designed to prevent crawlers from indexing a site's content. One approach is to deploy endless redirects, which continuously send crawlers from link to link without letting them gain any intelligence.
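One way a crawler can defend itself against endless redirects is to cap the number of hops and remember where it has already been. The sketch below assumes a hypothetical `next_hop` callback that returns the redirect target for a URL, or `None` when the URL does not redirect; `resolve` and `RedirectTrap` are illustrative names, not part of any library.

```python
from typing import Callable, Optional


class RedirectTrap(Exception):
    """Raised when a redirect chain loops or never seems to end."""


def resolve(
    url: str,
    next_hop: Callable[[str], Optional[str]],
    max_hops: int = 10,
) -> str:
    """Follow redirects from url and return the final destination."""
    seen = {url}
    for _ in range(max_hops):
        target = next_hop(url)
        if target is None:  # no further redirect: we have arrived
            return url
        if target in seen:  # we have been here before: a loop
            raise RedirectTrap(f"redirect loop at {target}")
        seen.add(target)
        url = target
    raise RedirectTrap(f"gave up after {max_hops} redirects")


# Hypothetical redirect table: /old -> /newer -> /newest (which does not redirect).
redirects = {"/old": "/newer", "/newer": "/newest"}
final = resolve("/old", redirects.get)
```

The `seen` set catches loops that return to an earlier URL, while `max_hops` catches chains that generate a fresh URL on every hop and so never repeat.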

Another example is a bitbomb which, when crawled, explodes into a vast amount of data, overwhelming a crawler with more data than it can effectively index.
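A common defence against this kind of trap is to put a hard ceiling on how much data the crawler will accept from any single response. The sketch below assumes the bomb arrives as a small compressed payload that expands enormously (a decompression bomb), which is our reading of the term rather than a formal definition. `bounded_decompress` and `TooLarge` are hypothetical names; the `max_length` argument of zlib's decompression object is standard Python.

```python
import zlib


class TooLarge(Exception):
    """Raised when a payload expands beyond the configured limit."""


def bounded_decompress(data: bytes, limit: int) -> bytes:
    """Decompress zlib data, refusing to produce more than limit bytes."""
    decompressor = zlib.decompressobj()
    # Ask for at most one byte more than the limit: if we receive it,
    # the payload is too large and we can stop without expanding the rest.
    output = decompressor.decompress(data, limit + 1)
    if len(output) > limit:
        raise TooLarge(f"payload exceeds {limit} bytes when decompressed")
    return output


# A harmless payload, and a small payload that expands to 10 MB of zeros.
small = zlib.compress(b"hello")
bomb = zlib.compress(b"\0" * 10_000_000)
```

Calling `bounded_decompress(small, 1_000_000)` returns the original bytes, while the same call on `bomb` raises `TooLarge` after expanding only slightly more than the limit rather than the full 10 MB.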

Referrer-based URL entry 

Some sites, malicious or otherwise, may only allow access to specific areas, or may deny entry altogether, if they are reached through an unrelated referral path, such as a link on another domain. This can cause significant issues for a web crawler that is trying to navigate an entire domain, as it leaves some parts of the site ‘invisible’ unless they are reached via certain channels.
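Browsers tell a site where a visitor came from through the HTTP `Referer` header, and a crawler can do the same by recording the page on which each link was found. The sketch below only builds the request and does not send it; `build_request`, the `found_on` parameter and the `example-crawler/0.1` user agent are hypothetical, while `urllib.request.Request` is part of the Python standard library.

```python
from typing import Optional
from urllib.request import Request


def build_request(url: str, found_on: Optional[str] = None) -> Request:
    """Build a request that reports the page on which the link was found."""
    headers = {"User-Agent": "example-crawler/0.1"}  # hypothetical identifier
    if found_on is not None:
        # Same information a browser sends when a user clicks a link.
        headers["Referer"] = found_on
    return Request(url, headers=headers)


request = build_request(
    "https://site.example/gallery/page-2",
    found_on="https://site.example/gallery/",
)
```

Keeping the referring page alongside each queued URL means the crawler can present the same navigation path that a human visitor would, which is the path such sites expect.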

Sites hosted on the dark web 

While continuous improvements and R&D projects are exploring how best to deploy web crawlers on the dark web, crawling this evasive environment can be inefficient and unreliable. With a limited number of known URLs, it can be difficult to understand the scope of the dark web, and sites link to one another less often than on the open web.

How will web crawling capabilities develop?  

The advent of AI and classifiers within web crawling promises to bring a new range of capabilities and insights to forensics tools.  

New use cases are already being realised, such as the possibility of crawlers directing themselves for greater efficiency. This is now more achievable than ever as more domains take on templated structures, such as those provided by WordPress and Squarespace, which follow conventional, predictable pathways.

As crawlers continue to advance in sophistication and intelligence, we also expect greater functionality for crawling and indexing videos as well as imagery - supplying us with greater intelligence than ever before. 

Discover our commitment to ongoing R&D and innovation in web crawling tools here.

