Text::DeDuper: Near duplicates detection module
Text::DeDuper Ranking & Summary

License:
Perl Artistic License

Price:
FREE

Publisher Name:
Jan Pomikalek

Publisher web site:
http://search.cpan.org/~janpom/

Text::DeDuper Description

Text::DeDuper is a Perl module that uses the resemblance measure proposed by Andrei Z. Broder et al. (http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html) to detect similar (near-duplicate) documents based on their text.

Note of caution: The module only works correctly with languages whose texts can be tokenised into words by detecting sequences of alphabetical characters. It might therefore not give very good results for languages such as Chinese.

SYNOPSIS

    use Text::DeDuper;

    $deduper = new Text::DeDuper();
    $deduper->add_doc("doc1", $doc1text);
    $deduper->add_doc("doc2", $doc2text);

    @similar_docs = $deduper->find_similar($doc3text);

    ...

    # delete near duplicates from an array of texts
    $deduper = new Text::DeDuper();
    foreach $text (@texts) {
        next if $deduper->find_similar($text);

        $deduper->add_doc($i++, $text);
        push @no_near_duplicates, $text;
    }

Requirements:
· Perl
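
To illustrate the resemblance measure named in the description, here is a minimal standalone sketch in plain Perl. It is not the module's internal implementation: the function names (tokenize, shingles, resemblance) and the shingle-size parameter $w are hypothetical and chosen only for this example. The sketch assumes the usual formulation of resemblance as the size of the intersection of two documents' word-shingle sets divided by the size of their union, with words taken as runs of alphabetical characters, as in the note of caution.

```perl
use strict;
use warnings;

# Split a text into lowercase words (runs of alphabetical characters).
# Hypothetical helper, not part of Text::DeDuper's API.
sub tokenize {
    my ($text) = @_;
    my @words = $text =~ /([[:alpha:]]+)/g;
    return map { lc $_ } @words;
}

# Build the set of $w-word shingles (contiguous word sequences) of a text.
# Returns a hash reference whose keys are the shingles.
sub shingles {
    my ($text, $w) = @_;
    my @words = tokenize($text);
    my %set;
    for my $i (0 .. $#words - $w + 1) {
        $set{ join ' ', @words[$i .. $i + $w - 1] } = 1;
    }
    return \%set;
}

# Resemblance: |A intersect B| / |A union B| over the two shingle sets.
# Returns 0 when both sets are empty.
sub resemblance {
    my ($text_a, $text_b, $w) = @_;
    my ($sa, $sb) = (shingles($text_a, $w), shingles($text_b, $w));
    my $common = grep { exists $sb->{$_} } keys %$sa;
    my $union  = keys(%$sa) + keys(%$sb) - $common;
    return $union ? $common / $union : 0;
}

# Two five-word texts that differ only in the last word share
# 3 of 5 distinct two-word shingles.
printf "%.2f\n",
    resemblance("the quick brown fox jumps",
                "the quick brown fox sleeps", 2);    # prints 0.60
```

A near-duplicate detector could then treat two documents as similar when this value exceeds some chosen threshold; the threshold and shingle size that suit a given collection are assumptions to be tuned, not values taken from the module.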
