Text::DeDuper: Near duplicates detection module
Text::DeDuper Ranking & Summary
- License:
- Perl Artistic License
- Price:
- FREE
- Publisher Name:
- Jan Pomikalek
- Publisher web site:
- http://search.cpan.org/~janpom/
Text::DeDuper Description
Text::DeDuper is a Perl module for near-duplicate detection. It uses the resemblance measure proposed by Andrei Z. Broder et al. (http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html) to detect similar (near-duplicate) documents based on their text.

Note of caution: The module only works correctly with languages whose texts can be tokenised into words by detecting sequences of alphabetical characters. It may therefore give poor results for languages such as Chinese.

SYNOPSIS

    use Text::DeDuper;

    $deduper = new Text::DeDuper();
    $deduper->add_doc("doc1", $doc1text);
    $deduper->add_doc("doc2", $doc2text);

    @similar_docs = $deduper->find_similar($doc3text);

    ...

    # delete near duplicates from an array of texts
    $deduper = new Text::DeDuper();
    foreach $text (@texts) {
        next if $deduper->find_similar($text);

        $deduper->add_doc($i++, $text);
        push @no_near_duplicates, $text;
    }

Requirements:
- Perl
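To make the resemblance measure mentioned in the description more concrete, here is a minimal sketch in plain Perl. It treats each document as a set of word n-grams (shingles) and computes the size of the intersection of the two sets divided by the size of their union. This is an illustration only, not the module's code: the function names `tokenize`, `shingles` and `resemblance` are hypothetical, and the default n-gram length of 3 is an assumption, since the listing does not say which settings Text::DeDuper uses internally.

```perl
use strict;
use warnings;

# Split text into lowercase words, i.e. sequences of alphabetical characters.
sub tokenize {
    my ($text) = @_;
    return map { lc $_ } $text =~ /([[:alpha:]]+)/g;
}

# Build the set of word n-grams (shingles) of a text, stored as hash keys.
sub shingles {
    my ($text, $n) = @_;
    my @words = tokenize($text);
    my %set;
    for my $i (0 .. $#words - $n + 1) {
        $set{ join ' ', @words[$i .. $i + $n - 1] } = 1;
    }
    return \%set;
}

# Resemblance = |A intersect B| / |A union B| over the two shingle sets.
# The default of 3 words per shingle is an assumption made for this sketch.
sub resemblance {
    my ($text_a, $text_b, $n) = @_;
    $n //= 3;
    my $set_a  = shingles($text_a, $n);
    my $set_b  = shingles($text_b, $n);
    my $shared = grep { exists $set_b->{$_} } keys %$set_a;
    my $union  = keys(%$set_a) + keys(%$set_b) - $shared;
    return $union ? $shared / $union : 0;
}
```

Two texts that differ only in their last word, such as "the quick brown fox jumps" and "the quick brown fox sleeps", share two of four distinct trigrams, so `resemblance` returns 0.5 for them. A near-duplicate detector would then compare such a score against a chosen threshold.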