Text::DeDuper

Near duplicates detection module

Text::DeDuper Ranking & Summary

  • License: Perl Artistic License
  • Price: FREE
  • Publisher Name: Jan Pomikalek
  • Publisher web site: http://search.cpan.org/~janpom/

Text::DeDuper Description

Text::DeDuper is a Perl module for detecting near-duplicate documents. It uses the resemblance measure proposed by Andrei Z. Broder et al. (http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html) to find similar documents based on their text.

Note of caution: The module only works correctly with languages in which text can be tokenised into words by detecting sequences of alphabetical characters. It may therefore give poor results for languages such as Chinese.

SYNOPSIS

    use Text::DeDuper;

    my $deduper = Text::DeDuper->new();
    $deduper->add_doc("doc1", $doc1text);
    $deduper->add_doc("doc2", $doc2text);

    my @similar_docs = $deduper->find_similar($doc3text);

    ...

    # delete near duplicates from an array of texts
    $deduper = Text::DeDuper->new();
    my $i = 0;
    my @no_near_duplicates;
    foreach my $text (@texts) {
        # skip any text that resembles one already kept
        next if $deduper->find_similar($text);

        $deduper->add_doc($i++, $text);
        push @no_near_duplicates, $text;
    }

Requirements:

  · Perl
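To make the resemblance measure more concrete, the sketch below computes it directly for two short texts. It is an illustration only, not the module's actual implementation: the helper names shingles and resemblance, the shingle width of 3 words, and the lowercasing step are all assumptions made for this example. Resemblance is taken here as the number of word sequences (shingles) the two texts share, divided by the number of distinct shingles found in either text.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::DeDuper): return the set of
# $w-word shingles of a text as a hash reference.
sub shingles {
    my ($text, $w) = @_;

    # Tokenise into lowercase runs of alphabetical characters.
    my @words = map { lc $_ } ($text =~ /([[:alpha:]]+)/g);

    my %set;
    for my $i (0 .. $#words - $w + 1) {
        $set{ join ' ', @words[$i .. $i + $w - 1] } = 1;
    }
    return \%set;
}

# Hypothetical helper: resemblance = |A intersect B| / |A union B|
# over the two shingle sets.
sub resemblance {
    my ($text_a, $text_b, $w) = @_;

    my $set_a = shingles($text_a, $w);
    my $set_b = shingles($text_b, $w);

    my $common = grep { exists $set_b->{$_} } keys %$set_a;
    my $union  = scalar(keys %$set_a) + scalar(keys %$set_b) - $common;

    # Texts too short to yield any shingle are treated as unrelated.
    return $union ? $common / $union : 0;
}

my $a = "The quick brown fox jumps over the lazy dog";
my $b = "The quick brown fox jumped over the lazy dog";

# 4 shared shingles out of 10 distinct ones
printf "%.2f\n", resemblance($a, $b, 3);   # prints 0.40
```

A single changed word affects every shingle that contains it, which is why the score drops to 0.40 here even though the texts differ in only one place. Longer documents are far less sensitive to an isolated edit.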

