An extensible web crawler that can watch a list of websites, discover links, extract content, and download files.
A while ago, I started working on several projects that required information from around the web. I decided to build a web crawler that could be easily adapted for each of them: NetWatch.
How it works
It first makes requests to the initial URLs defined in a "Crawl Rules" file. In that same file, you can add further rules that target URLs as loosely or tightly as you like and control how the crawl continues (a sketch of what such a rules file might look like follows this list):
- Should it find links from those pages and then crawl those?
- Should it extract text from the page?
- Should it save images or other files linked on the pages?
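To make that concrete, here's a minimal sketch of how rules like these could be expressed. The structure and field names (start_urls, url_pattern, follow_links, extract_text, download_files) are my own illustrative assumptions, not NetWatch's actual rules format.

```python
import re

# Hypothetical crawl rules mirroring the questions above. The field names
# and layout are assumptions for illustration, not NetWatch's real format.
CRAWL_RULES = {
    "start_urls": [
        "https://example.com/",  # initial URLs to request
    ],
    "rules": [
        {
            # Loose match: any page on the site
            "url_pattern": r"https://example\.com/.*",
            "follow_links": True,   # discover links on these pages and queue them
            "extract_text": True,   # extract the page text
        },
        {
            # Tight match: only download image/PDF files under /media/
            "url_pattern": r"https://example\.com/media/.+\.(jpg|png|pdf)$",
            "download_files": True,
        },
    ],
}


def matching_rules(url: str) -> list[dict]:
    """Return every rule whose pattern matches the given URL."""
    return [r for r in CRAWL_RULES["rules"] if re.match(r["url_pattern"], url)]


if __name__ == "__main__":
    # A tightly-scoped URL matches both the loose and the tight rule.
    print(matching_rules("https://example.com/media/photo.jpg"))
```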
Crawled pages then go through a series of post-processing modules that handle the parsing, storage, and transportation of responses.
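As a rough sketch of that pipeline idea (the module interface, class names, and Response fields here are assumptions for illustration, not NetWatch's actual code), each module takes a crawled response, transforms or stores it, and hands it to the next stage:

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """A crawled page passed through the pipeline. Fields are illustrative."""
    url: str
    body: bytes
    text: str = ""
    metadata: dict = field(default_factory=dict)


class PostProcessor:
    """Base class for a post-processing module; each stage transforms a Response."""
    def process(self, response: Response) -> Response:
        return response


class TextExtractor(PostProcessor):
    """Parsing stage: naive text extraction (a real module would use an HTML parser)."""
    def process(self, response: Response) -> Response:
        response.text = response.body.decode("utf-8", errors="replace")
        return response


class DiskWriter(PostProcessor):
    """Storage stage: writes the extracted text to a local file."""
    def process(self, response: Response) -> Response:
        name = response.url.rstrip("/").split("/")[-1] or "index"
        with open(f"{name}.txt", "w", encoding="utf-8") as f:
            f.write(response.text)
        return response


def run_pipeline(response: Response, modules: list[PostProcessor]) -> Response:
    """Pass a crawled response through each module in order."""
    for module in modules:
        response = module.process(response)
    return response
```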
I've used NetWatch for several projects: Terabase, Time To Nom, and Car Lookout. If you want to try it yourself, everything is available on GitHub.