spiderfetch

Free Python-based web spider

spiderfetch Ranking & Summary

  • License: Freeware
  • Price: FREE
  • Publisher Name: Martin Matusiak
  • Operating Systems: Mac OS X
  • File Size: 37 KB

spiderfetch Description

spiderfetch is a free, modular web spider driven by recipes composed of regular expressions. It started out as a tool to spider all the links on a web page, but little by little became a full-blown web spider.

spiderfetch is now more of a suite of tools that can be used on their own, such as the spider and the fetcher. The whole suite is written in pure Python (previously Ruby) and requires no dependencies.

Here are some key features of "spiderfetch":

· Spiders the page for anything that looks like a url.
· Can filter urls by a regular expression. Keep in mind that this is regex syntax, as in Ruby or Python, not file globbing: use .* to match any run of characters (not * as in globbing), (true|false) for a choice, and so on.
· Downloads all the urls serially, or just prints them to the screen (with --dump) if you want to filter, sort, etc.
· Can use an existing index file (with --useindex). If there are relative links among the urls, they will need post-processing, because the path of the index page on the server is not known after it has been stored locally.
· Uses wget internally and relays its output as well. Supports http, https and ftp urls.
· Semantics consistent with for url in urls; do wget $url… : does not re-download completed files, resumes downloads, retries interrupted transfers.

Requirements:

· Python

Limitations:

· Not guaranteed to find every last url, although the matching is pretty lenient. If you can't match a certain url, you're still stuck with grep and sed.
· If you have to authenticate yourself in the browser to be able to download your media files, spiderfetch won't be able to download them (as with wget in general). However, all is not lost. If the urls are ftp, or the web server uses simple authentication, you can still post-process them to ftp://username:password@the.rest.of.the.url, and the same works for http.
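The spidering, regex filtering and relative-link post-processing described above can be illustrated with a small Python sketch. This is not spiderfetch's own code: the pattern URL_PATTERN, the helpers find_urls and filter_urls, and the example.com page are all assumptions made up for illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical, deliberately lenient pattern (not taken from spiderfetch):
# either the value of an href/src attribute, or a bare http/https/ftp url.
URL_PATTERN = re.compile(
    r"""(?:href|src)\s*=\s*["']([^"'\s>]+)["']|((?:https?|ftp)://[^\s"'<>]+)""",
    re.IGNORECASE,
)


def find_urls(text, base_url=None):
    """Return url-like strings found in text, in order, without duplicates.

    If base_url is given, relative links are resolved against it. This is
    the post-processing a locally stored index page would need, since the
    page no longer carries its original location.
    """
    found = []
    for attr_value, bare_url in URL_PATTERN.findall(text):
        url = attr_value or bare_url
        if base_url is not None:
            url = urljoin(base_url, url)
        if url not in found:
            found.append(url)
    return found


def filter_urls(urls, pattern):
    """Keep only the urls that match a regular expression (not a glob)."""
    rx = re.compile(pattern)
    return [url for url in urls if rx.search(url)]


# Made-up page content for the example.
page = """
<a href="/media/song1.mp3">one</a>
<a href="http://example.com/media/song2.ogg">two</a>
<img src="logo.png">
See also ftp://example.com/pub/song3.mp3 for a mirror.
"""

urls = find_urls(page, base_url="http://example.com/music/index.html")
wanted = filter_urls(urls, r".*\.(mp3|ogg)$")
for url in wanted:
    print(url)
```

Note that the filter uses .* and (mp3|ogg), not the glob form *.mp3, and that the relative links /media/song1.mp3 and logo.png only become downloadable once the index page's original url is supplied.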
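The credential workaround mentioned under Limitations, rewriting a url to the ftp://username:password@the.rest.of.the.url form, could be scripted roughly as follows. This is only a sketch using Python's standard urllib.parse module; the helper name add_credentials is hypothetical and not part of spiderfetch.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def add_credentials(url, username, password):
    """Return url rewritten as scheme://username:password@host/path.

    Hypothetical helper for post-processing dumped urls. Urls with a
    scheme other than http, https or ftp are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https", "ftp"):
        return url
    # Drop any credentials already embedded in the url.
    host = parts.netloc.rpartition("@")[2]
    # Percent-encode characters such as "@" or ":" so they cannot be
    # confused with the separators in the url itself.
    userinfo = quote(username, safe="") + ":" + quote(password, safe="")
    return urlunsplit(parts._replace(netloc=userinfo + "@" + host))


print(add_credentials("ftp://the.rest.of.the.url/file.mp3", "username", "password"))
```

The rewritten urls can then be handed to wget as usual. Keep in mind that credentials embedded this way are visible in shell history and process listings.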

