Heroshi is a web crawler.
The goal of the project is to build very fast, distributed web spider.
Current status:
Project is under heavy development, so expect big changes.
Heroshi source code is hosted on Github, so you may use either
go get/install:
go install github.com/temoto/heroshi/heroshi-worker
clone repository to hack around:
git clone git://github.com/temoto/heroshi.git
or download latest Heroshi master tarball.
Heroshi identifies itself with:
User-Agent: HeroshiBot/version (+http://temoto.github.com/heroshi/; temotor@gmail.com)
Heroshi worker doesn’t open more than 1 concurrent connection to each domain:port. This is a very low load to properly configured websites but the world is not perfect, and it may hurt legacy installations.
Heroshi was not meant to be a harm tool, it will not abuse your servers again and again continuously. Instead, it will wait for some time before visiting same pages again.
So far i believe i’m the only one who runs Heroshi, so if it loads your website too much, there is no need to ban User-agent/IP or something, just contact me, and i’ll set up as low limit for your website/domain/IP, as acceptable.
Heroshi obeys standard robots.txt rules. As implemented by Go robots.txt library.
To completely disallow Heroshi crawl your site, place the following lines into file, accessible as /robots.txt on your site:
User-agent: HeroshiBot
Disallow: /
Use this email (XMPP/Jabber too) for questions/demands/reports about Heroshi: temotor@gmail.com
Heroshi is made available under the terms of the open source MIT license.