Text::Bloom

Text::Bloom computes the Bloom signature of a set of terms.

Text::Bloom Ranking & Summary

  • License: Perl Artistic License
  • Price: FREE
  • Publisher Name: Andrea Spinelli and Walter Vannini
  • Publisher web site: http://search.cpan.org/~aspinelli/Text-Document-1.07/Bloom.pod

Text::Bloom Description

Text::Bloom computes the Bloom signature of a set of terms.

SYNOPSIS

    my $b = Text::Bloom->new();
    $b->Compute( qw( foo bar baz ) );
    my $sig = $b->WriteToString();
    $b->WriteToFile( 'afile.sig' );
    my $b2 = Text::Bloom::NewFromFile( 'afile.sig' );
    my $b3 = Text::Bloom->new();
    $b3->Compute( qw( foo bar barbaz ) );
    my $sim = $b->Similarity( $b2 );
    my $b4 = Text::Bloom::NewFromString( $sig );

Text::Bloom applies the Bloom filtering technique to the statistical analysis of documents.

The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..p-1, where p is a prime, currently set to the greatest prime less than 2^32.

Each quantized value is mapped to d integers in the range 0..size-1, where size is an integer less than p, currently 2^17, using a family of hash functions computed by the HashV function.

Each hashed value is used as an index into a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0.

Collisions may cause the same bit to be set more than once, by different terms. It follows that, if the document contains n distinct terms, at most n * d bits of the resulting bit vector are set to 1.

The resulting bit string is a very compact representation of the presence or absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a pre-set dictionary of terms.

The signature may be used for:

  • testing whether a given set of terms is present in the document,
  • computing which fraction of terms are common to two documents.

The bit representation may be written to and read from a file. Text::Bloom prepends a header to the bit stream proper; moreover, whenever the package Compress::Zlib is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents.

The hash function is a crucial component of the filter; the reference implementation uses a radix representation of strings. Each term must therefore match the regular expression /+/.

There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method QuantizeV.

Requirements:

  • Perl
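
The quantize, hash and set-bit steps described above can be illustrated with a small stand-alone sketch. It does not use Text::Bloom and is not the module's code. The values of p and size come from the description; everything else is an assumption made for illustration: the number of hash functions d, the linear hash family standing in for HashV, the digit alphabet, the overlap measure (which is not necessarily what Similarity computes), and all subroutine names. A Perl built with 64-bit integers is assumed.

```perl
use strict;
use warnings;

# Values taken from the description above.
my $P    = 4294967291;    # greatest prime below 2**32
my $SIZE = 2**17;         # number of bits in the signature

# Assumptions made for this sketch only (not the module's actual values).
my $D      = 4;                                           # hash functions per term
my @COEFFS = ([31, 7], [131, 13], [1031, 17], [8191, 19]); # (multiplier, offset)
my $DIGITS = '0123456789abcdefghijklmnopqrstuvwxyz';      # assumed base-36 digits

# Read a term as a base-36 number, reduced modulo $P at every step.
sub quantize {
    my ($term) = @_;
    my $value = 0;
    for my $char (split //, lc $term) {
        my $digit = index($DIGITS, $char);
        die "term '$term' is not base-36\n" if $digit < 0;
        $value = ($value * 36 + $digit) % $P;
    }
    return $value;
}

# Map a quantized value to $D bit positions in 0 .. $SIZE-1.
sub hash_positions {
    my ($value) = @_;
    return map {
        my ($mul, $add) = @$_;
        (($mul * $value + $add) % $P) % $SIZE;
    } @COEFFS[0 .. $D - 1];
}

# Build a signature: a string of $SIZE bits, with a bit set for every
# position produced by every term.
sub signature {
    my @terms = @_;
    my $bits = "\0" x ($SIZE / 8);
    for my $term (@terms) {
        vec($bits, $_, 1) = 1 for hash_positions(quantize($term));
    }
    return $bits;
}

# Number of bits set to 1 in a signature.
sub count_bits { return unpack('%32b*', $_[0]) }

# True if every bit of every given term is set; false positives are possible.
sub may_contain {
    my ($bits, @terms) = @_;
    for my $term (@terms) {
        for my $pos (hash_positions(quantize($term))) {
            return 0 unless vec($bits, $pos, 1);
        }
    }
    return 1;
}

# Hypothetical overlap measure: bits set in both signatures divided by
# bits set in either one.
sub bit_overlap {
    my ($x, $y) = @_;
    my $union = count_bits($x | $y);
    return $union ? count_bits($x & $y) / $union : 0;
}

my $doc   = signature(qw( foo bar baz ));
my $other = signature(qw( foo bar barbaz ));
printf "bits set: %d (at most %d)\n", count_bits($doc), 3 * $D;
printf "overlap:  %.2f\n", bit_overlap($doc, $other);
print  "foo may be present\n" if may_contain($doc, 'foo');
```

Because the bit vector has a fixed length of size bits, two signatures can be compared with plain bitwise string operations, with no need to know which terms produced them. A membership test can only say "possibly present": a term that was never added may still find all of its bits set by other terms.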


Text::Bloom Related Software