An extensible web crawler that can watch a list of websites, discover links, extract content, and download files.
A while ago, I started working on several projects that required information from around the web. I decided to build a web crawler that could be easily adapted for each of them: NetWatch.
How it works
It first makes requests to the initial URLs defined in a "Crawl Rules" file. In that same file, you can add further rules that target URLs as loosely or tightly as you like and control how the crawl continues (a sketch of what such a rules file might look like follows this list):
- Should it find links from those pages and then crawl those?
- Should it extract text from the page?
- Should it save images or other files linked on the pages?
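To make that concrete, here's a minimal sketch of how rules like these could be expressed. The structure and field names (start_urls, url_pattern, follow_links, extract_text, download_files) are my own illustrative assumptions, not NetWatch's actual rules format.

```python
import re

# Hypothetical crawl rules mirroring the questions above. The field names
# and layout are assumptions for illustration, not NetWatch's real format.
CRAWL_RULES = {
    "start_urls": [
        "https://example.com/",  # initial URLs to request
    ],
    "rules": [
        {
            # Loose match: any page on the site
            "url_pattern": r"https://example\.com/.*",
            "follow_links": True,   # discover links on these pages and queue them
            "extract_text": True,   # extract the page text
        },
        {
            # Tight match: only download image/PDF files under /media/
            "url_pattern": r"https://example\.com/media/.+\.(jpg|png|pdf)$",
            "download_files": True,
        },
    ],
}


def matching_rules(url: str) -> list[dict]:
    """Return every rule whose pattern matches the given URL."""
    return [r for r in CRAWL_RULES["rules"] if re.match(r["url_pattern"], url)]


if __name__ == "__main__":
    # A tightly-scoped URL matches both the loose and the tight rule.
    print(matching_rules("https://example.com/media/photo.jpg"))
```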
Crawled pages then go through a series of post-processing modules that handle the parsing, storage, and transportation of responses.
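As a rough sketch of that pipeline idea (the module interface, class names, and Response fields here are assumptions for illustration, not NetWatch's actual code), each module takes a crawled response, transforms or stores it, and hands it to the next stage:

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """A crawled page passed through the pipeline. Fields are illustrative."""
    url: str
    body: bytes
    text: str = ""
    metadata: dict = field(default_factory=dict)


class PostProcessor:
    """Base class for a post-processing module; each stage transforms a Response."""
    def process(self, response: Response) -> Response:
        return response


class TextExtractor(PostProcessor):
    """Parsing stage: naive text extraction (a real module would use an HTML parser)."""
    def process(self, response: Response) -> Response:
        response.text = response.body.decode("utf-8", errors="replace")
        return response


class DiskWriter(PostProcessor):
    """Storage stage: writes the extracted text to a local file."""
    def process(self, response: Response) -> Response:
        name = response.url.rstrip("/").split("/")[-1] or "index"
        with open(f"{name}.txt", "w", encoding="utf-8") as f:
            f.write(response.text)
        return response


def run_pipeline(response: Response, modules: list[PostProcessor]) -> Response:
    """Pass a crawled response through each module in order."""
    for module in modules:
        response = module.process(response)
    return response
```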
I've used NetWatch for several projects: Terabase, Time To Nom, and Car Lookout. If you want to try it yourself, everything is available on GitHub.