WWW::Spyder

WWW::Spyder is a Perl module that acts like a web spider.

WWW::Spyder Ranking & Summary

  • License: Perl Artistic License
  • Price: FREE
  • Publisher Name: Ashley Pond V.
  • Publisher web site: http://search.cpan.org/~ashley/WWW-Spyder-0.18/Spyder.pm

WWW::Spyder Description

WWW::Spyder is a Perl module that acts like a web spider. It returns plain text, HTML, and other information for each page crawled, and it can decide which pages to get and parse by comparing supplied terms against the text in links as well as page content.

METHODS

$spyder->new()

Constructs a new spyder object. Without at least the seed() set, or go_to_seed() turned on, the spyder isn't ready to crawl.

    $spyder = WWW::Spyder->new(shift || die "Gimme a URL!\n");
    # ...or...
    $spyder = WWW::Spyder->new( %options );

Options include: sleep_base (in seconds), exit_on (hash of methods and settings).

$spyder->seed($url)

Adds a URL (or URLs) to the top of the queues for crawling. If the spyder is constructed with a single scalar argument, that is considered the seed_url.

$spyder->bell()

This will print a bell ("\a") to STDERR on every successfully crawled page. It might seem annoying but it is an excellent way to know your spyder is behaving and working. A true value turns it on. Right now it can't be turned off.

$spyder->spyder_time()

Returns raw seconds since Spyder was created if given a boolean value, otherwise returns "D day(s) HH::MM:SS."

$spyder->terms()

The more terms, the more the spyder is going to grasp at. If you give a straight list of strings, they will be turned into very open regexes. E.g.: "king" would match "sulking" and "kinglet" but not "King." It is case sensitive right now. If you want more specific matching or different behavior, pass your own regexes instead of strings.

    $spyder->terms( qr/\bkings?\b/i, qr/\bqueens?\b/i );

terms() is only settable once right now, then it's a done deal.

$spyder->spyder_data()

A comma formatted number of kilobytes retrieved so far. Don't give it an argument. It's a set/get routine.

$spyder->slept()

Returns the total number of seconds the spyder has slept while running. Useful for getting accurate page/time counts (spyder performance) discounting the added courtesy naps.

$spyder->UA->...

The LWP::UserAgent object. You can reset its settings, I do believe, by calling methods on the UA. Here are the initialized values you might want to tweak (see LWP::UserAgent for more information):

    $spyder->UA->timeout(30);
    $spyder->UA->max_size(250_000);
    $spyder->UA->agent('Mozilla/5.0');

Changing the agent name can hurt your spyder because some servers won't return content unless it's requested by a "browser" they recognize.

You should probably add your email with from() as well.

    $spyder->UA->from('bluefintuna@fish.net');

$spyder->cookie_file()

Cookies live in $ENV{HOME}/spyderCookie by default, but you can set your own file if you prefer or want to save different cookie files for different spyders.

Requirements:
· Perl
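
To illustrate the matching behavior described under terms(), here is a small standalone sketch in plain Perl. It does not use WWW::Spyder at all. The helper open_regex() is hypothetical and rests on the assumption that a "very open regex" is an unanchored, case-sensitive substring match; the module's actual conversion may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Spyder): turn a plain string into
# an unanchored, case-sensitive pattern. This is an assumed reading of
# "very open regexes", not code taken from the module.
sub open_regex {
    my ($term) = @_;
    return qr/\Q$term\E/;
}

my $open   = open_regex('king');
my $strict = qr/\bkings?\b/i;    # the stricter pattern from the terms() text

# Compare how each pattern treats the words mentioned in the description.
for my $text ('sulking', 'kinglet', 'King', 'two kings') {
    printf "%-10s open=%-3s strict=%s\n",
        $text,
        ($text =~ $open   ? 'yes' : 'no'),
        ($text =~ $strict ? 'yes' : 'no');
}
```

The open pattern accepts any text containing the lowercase letters "king", which is why it grabs "sulking" and "kinglet" but skips "King". The word-boundary pattern with the /i flag does the opposite for those three words, which is the kind of tighter control you get by passing your own regexes.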

