Heroshi documentation
=====================

Heroshi is a `web crawler `_. The goal of the project is to build a very fast, distributed web spider.

Current status:

* low-level HTTP client – alpha, usable for test runs
* URL queue – not started

The project is under heavy development, so expect big changes.

Download
--------

`Heroshi source code `_ is hosted on `Github `_, so you may either

* go get/install it::

      go install github.com/temoto/heroshi/heroshi-worker

* clone the repository to hack around::

      git clone git://github.com/temoto/heroshi.git

* or download the `latest Heroshi master tarball `_.

Identity
--------

Heroshi identifies itself with::

    User-Agent: HeroshiBot/version (+http://temoto.github.com/heroshi/; temotor@gmail.com)

Load problems
-------------

The Heroshi worker opens at most one concurrent connection to each domain:port. This is a very light load for properly configured websites, but the world is not perfect, and it may hurt legacy installations. Heroshi is not meant to be a harmful tool: it will not abuse your servers continuously. Instead, it waits for some time before visiting the same pages again.

So far I believe I am the only one who runs Heroshi, so if it puts too much load on your website, there is no need to ban its User-Agent or IP; just contact me, and I will set as low a limit for your website/domain/IP as you find acceptable.

Robots.txt support
------------------

Heroshi obeys `standard robots.txt rules `_, as implemented by the `Go robots.txt library `_. To completely disallow Heroshi from crawling your site, place the following lines into a file accessible as /robots.txt on your site::

    User-agent: HeroshiBot
    Disallow: /

Contact information
-------------------

Use this email (XMPP/Jabber too) for questions, demands, or reports about Heroshi: temotor@gmail.com

License
-------

Heroshi is made available under the terms of the open source `MIT license `_.

Contents
========

.. toctree::
   :maxdepth: 2

   worker
   storage

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
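
Example: checking robots.txt rules
==================================

The link target for the Go robots.txt library above is not spelled out on this page; assuming it is the ``robotstxt`` package at ``github.com/temoto/robotstxt``, a minimal sketch of how a crawler might test whether HeroshiBot may fetch a path could look like the following. The rules string and paths are illustrative only and are not Heroshi's actual fetch code::

    // robots_check.go: a minimal sketch of testing robots.txt rules for HeroshiBot.
    // Assumption: the Go robots.txt library referenced above is github.com/temoto/robotstxt.
    package main

    import (
        "fmt"

        "github.com/temoto/robotstxt"
    )

    func main() {
        // The same rules that the "Robots.txt support" section suggests
        // for completely disallowing HeroshiBot.
        rules := "User-agent: HeroshiBot\nDisallow: /\n"

        robots, err := robotstxt.FromString(rules)
        if err != nil {
            panic(err) // malformed robots.txt
        }

        // TestAgent reports whether the given agent may fetch the given path.
        fmt.Println(robots.TestAgent("/some/page.html", "HeroshiBot")) // false: disallowed
        fmt.Println(robots.TestAgent("/some/page.html", "OtherBot"))   // true: no rule applies
    }

With the ``Disallow: /`` rule in place, ``TestAgent`` returns false for any path requested as HeroshiBot, which is the effect described in the "Robots.txt support" section above.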