NetWatch

An extensible web crawler that can watch a list of websites, discover links, extract content, and download files.

A while ago, I started working on several projects that required information from around the web. I decided to build a web crawler that could be easily adapted for each of them: NetWatch.

How it works

It first makes requests to the initial URLs defined in a "Crawl Rules" file. In the same file, you can add further rules that target URLs as loosely or tightly as you like, and specify how the application continues:

  • Should it find links from those pages and then crawl those?
  • Should it extract text from the page?
  • Should it save images or other files linked on the pages?
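The actual "Crawl Rules" format isn't shown here, but the choices above could be sketched as rules data like the following. All field names, URLs, and the matching helper are hypothetical, not NetWatch's real schema:

```python
import re

# Hypothetical sketch of crawl rules expressed as Python data;
# field names are assumptions, not NetWatch's actual format.
CRAWL_RULES = {
    "start_urls": ["https://example.com/"],
    "rules": [
        {
            # A regex lets a rule target URLs loosely or tightly.
            "pattern": r"https://example\.com/blog/.*",
            "follow_links": True,   # discover links on matching pages and crawl them
            "extract_text": True,   # pull the page's text content
            "save_files": False,    # skip images and other linked files
        },
        {
            "pattern": r"https://example\.com/gallery/.*",
            "follow_links": False,
            "extract_text": False,
            "save_files": True,     # download images linked on gallery pages
        },
    ],
}

def rule_for(url):
    """Return the first rule whose pattern fully matches the URL, or None."""
    for rule in CRAWL_RULES["rules"]:
        if re.fullmatch(rule["pattern"], url):
            return rule
    return None
```

With rules like these, the crawler can decide per URL whether to follow links, extract text, or download files, simply by looking up the first matching rule.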

Crawled pages go through a series of post-processing modules which handle the parsing, storage, and transportation of responses.
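A pipeline of that shape might look like the sketch below. The `Response` type, module names, and storage path are all hypothetical stand-ins, assumed only to illustrate responses flowing through modules in order:

```python
from dataclasses import dataclass, field

@dataclass
class Response:
    """Hypothetical crawled-page result; not NetWatch's actual type."""
    url: str
    body: str
    extracted: dict = field(default_factory=dict)

def parse_module(response: Response) -> Response:
    # Parsing stand-in: a real module would parse HTML here.
    response.extracted["word_count"] = len(response.body.split())
    return response

def storage_module(response: Response) -> Response:
    # Storage stand-in: record where the result would be persisted.
    response.extracted["stored_at"] = f"results/{abs(hash(response.url))}.json"
    return response

# Modules run in order; adding a new stage means appending to this list.
PIPELINE = [parse_module, storage_module]

def post_process(response: Response) -> Response:
    """Run a crawled response through each post-processing module in turn."""
    for module in PIPELINE:
        response = module(response)
    return response
```

A list of functions keeps the pipeline easy to extend: each project can swap in its own parsing or transport stage without touching the crawler itself.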

I've used NetWatch for several projects: Terabase, Time To Nom, and Car Lookout. If you want to try it yourself, everything is available on GitHub.