Heroshi documentation

Heroshi is a web crawler.

The goal of the project is to build very fast, distributed web spider.

Current status:

  • low level HTTP client – alpha, usable for test runs
  • URL queue – not started

Project is under heavy development, so expect big changes.

Download

Heroshi source code is hosted on Github, so you may use either

  • go get/install:

    go install github.com/temoto/heroshi/heroshi-worker
  • clone repository to hack around:

    git clone git://github.com/temoto/heroshi.git
  • or download latest Heroshi master tarball.

Identity

Heroshi identifies itself with:

User-Agent: HeroshiBot/version (+http://temoto.github.com/heroshi/; temotor@gmail.com)

Load problems

Heroshi worker doesn’t open more than 1 concurrent connection to each domain:port. This is a very low load to properly configured websites but the world is not perfect, and it may hurt legacy installations.

Heroshi was not meant to be a harm tool, it will not abuse your servers again and again continuously. Instead, it will wait for some time before visiting same pages again.

So far i believe i’m the only one who runs Heroshi, so if it loads your website too much, there is no need to ban User-agent/IP or something, just contact me, and i’ll set up as low limit for your website/domain/IP, as acceptable.

Robots.txt support

Heroshi obeys standard robots.txt rules. As implemented by Go robots.txt library.

To completely disallow Heroshi crawl your site, place the following lines into file, accessible as /robots.txt on your site:

User-agent: HeroshiBot
Disallow: /

Contact information

Use this email (XMPP/Jabber too) for questions/demands/reports about Heroshi: temotor@gmail.com

License

Heroshi is made available under the terms of the open source MIT license.

Indices and tables

Table Of Contents

Next topic

Worker (TODO)

This Page