Text::Scraper

Text::Scraper extracts structured data from (un)structured text.

Text::Scraper Ranking & Summary


  • License: Perl Artistic License
  • Price: FREE
  • Publisher Name: Chris McEwan
  • Publisher web site: http://search.cpan.org/~mcewan/Text-Embed-0.03/lib/Text/Embed.pm

Text::Scraper Description

SYNOPSIS

    use Text::Scraper;
    use LWP::Simple;
    use Data::Dumper;

    #
    # 1. Get our template and source text
    #
    my $tmpl = Text::Scraper->slurp(*DATA);
    my $src  = get('http://search.cpan.org/recent') || die $!;

    #
    # 2. Extract data from source
    #
    my $obj  = Text::Scraper->new(tmpl => $tmpl);
    my $data = $obj->scrape($src);

    #
    # 3. Do something really neat...(left as exercise)
    #
    print "Newest Submission: ", $data->{submissions}{name}, "\n\n";
    print "Scraper model:\n",    Dumper($obj),               "\n\n";
    print "Parsed model:\n",     Dumper($data),              "\n\n";

    __DATA__
    <div class=path><center><table><tr>
    <?tmpl stuff pre_nav ?>
    <td class=datecell><span><big><b> <?tmpl var date_string ?> </b></big></span></td>
    <?tmpl stuff post_nav ?>
    </tr></table></center></div>
    <ul>
    <?tmpl loop submissions ?>
     <li><a href="<?tmpl var link ?>"><?tmpl var name ?></a>
     <?tmpl if has_description ?>
     <small> -- <?tmpl var description ?></small>
     <?tmpl end has_description ?>
     </li>
    <?tmpl end submissions ?>
    </ul>

ABSTRACT

Text::Scraper provides a fully functional base class for quickly developing screen scrapers and other text extraction tools. Programmatically generated text, such as dynamic web pages, is trivially reverse engineered.

Using templates, the programmer is freed from staring at fragile, heavily escaped regular expressions, mapping capture groups to named variables, or wrestling with the DOM and badly formed HTML. In addition, extracted data can be hierarchical, which is beyond the capabilities of vanilla regular expressions.

Text::Scraper's functionality overlaps with two existing CPAN modules: Template::Extract and WWW::Scraper. Text::Scraper is much more lightweight than either and has a more general application domain than the latter. It has no dependencies on other frameworks, modules or design decisions. On average, Text::Scraper benchmarks around 250% faster than Template::Extract and uses significantly less memory.

Unlike both existing modules, Text::Scraper generalizes its functionality to allow the programmer to refine template capture groups beyond (.*?), fully redefine the template syntax, and introduce new template constructs bound to custom classes.

Requirements:

  · Perl
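
The description contrasts templates with hand-written regular expressions and mentions (.*?) as the default capture group. The general idea can be illustrated with a minimal sketch in plain Perl. This is not Text::Scraper's implementation: compile_template and extract are hypothetical helper names invented for this example, and the sketch handles only flat "var" tags, with no loop, if or stuff constructs and no hierarchy.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Scraper): turn a template that
# contains <?tmpl var NAME ?> tags into a regex and an ordered name list.
sub compile_template {
    my ($tmpl) = @_;
    my (@names, $pattern);
    $pattern = '';

    # split with a capture group yields: literal, name, literal, name, ...
    my @parts = split /<\?tmpl\s+var\s+(\w+)\s*\?>/, $tmpl;
    while (@parts) {
        $pattern .= quotemeta shift @parts;   # literal text, escaped
        if (@parts) {
            push @names, shift @parts;        # variable name
            $pattern .= '(.*?)';              # non-greedy capture group
        }
    }
    return (qr/$pattern/s, \@names);
}

# Hypothetical helper: apply the compiled template repeatedly and map each
# set of capture groups back onto the variable names.
sub extract {
    my ($tmpl, $text) = @_;
    my ($re, $names) = compile_template($tmpl);
    return [] unless @$names;

    my @flat = $text =~ /$re/g;               # all captures, all matches
    my @rows;
    while (my @caps = splice @flat, 0, scalar @$names) {
        my %row;
        @row{@$names} = @caps;
        push @rows, \%row;
    }
    return \@rows;
}

my $tmpl = '<li><a href="<?tmpl var link ?>"><?tmpl var name ?></a></li>';
my $text = '<ul><li><a href="/a">Foo-1.0</a></li>'
         . '<li><a href="/b">Bar-2.1</a></li></ul>';

my $rows = extract($tmpl, $text);
print "$_->{name} => $_->{link}\n" for @$rows;
```

The point of the sketch is that the template author never sees the escaped pattern or the capture-group numbering; both are derived from the template text.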

