Managing Gigabytes for Java

Managing Gigabytes for Java is a free full-text indexing system for large document collections written in Java.

Managing Gigabytes for Java Ranking & Summary

  • License: LGPL
  • Price: FREE
  • Publisher Name: Sebastiano Vigna
  • Publisher web site: http://archive4j.dsi.unimi.it/

Managing Gigabytes for Java Description

Managing Gigabytes for Java (MG4J) is a free full-text indexing system for large document collections written in Java. As a by-product, it offers several general-purpose optimised classes, including fast and compact mutable strings, bit-level I/O, fast unsynchronised buffered streams, and (possibly signed) minimal perfect hashing for very large string collections.

With release 1.1, MG4J becomes a highly customisable, high-performance, full-fledged text-indexing system providing state-of-the-art features (such as BM25 scoring) and new research algorithms.

Here are some key features of "Managing Gigabytes for Java":

- Powerful indexing. Support for document collections and factories makes it possible to analyse, index and query large document collections consistently, providing easy-to-understand snippets that highlight relevant passages in the retrieved documents.
- Efficiency. We do not provide meaningless data such as "we index x GiB per second" (with which configuration? which language? which data source?); we invite you to try it. MG4J can index the TREC GOV2 collection without effort (document factories are provided for this purpose) and scales to hundreds of millions of documents.
- Multi-index interval semantics. When you submit a query, MG4J returns, for each index, a list of intervals satisfying the query. This provides the basis for several high-precision scorers and for very efficient implementations of sophisticated operators. The intervals are built in linear time using new research algorithms.
- Expressive operators. MG4J goes far beyond the bag-of-words model, providing efficient implementations of phrase queries, proximity restrictions, ordered conjunction, and combined multiple-index queries. Each operator is represented internally by an abstract object, so you can easily plug in your favourite syntax.
- Virtual fields. MG4J supports virtual fields: fields containing text for a different, virtual document. The typical example is anchor text, which must be attributed to the target document.
- Flexibility. You can build much smaller indices by dropping term positions, or even term counts. It's up to you. Several different types of codes can be chosen to balance efficiency and index size. Documents coming from a collection can be renumbered (e.g., to match a static rank or to experiment with indexing techniques).
- Openness. The document collection/factory interfaces provide an easy way to present your own data representation to MG4J, making it a breeze to set up a web-based search engine that accesses your data directly. Every element along the path of query resolution (parsers, document-iterator builders, query engines, etc.) can be substituted with your own version.
- Distributed processing. Indices can be built for a collection split into several parts and combined later. Combination of indices allows non-contiguous indices, and even the same document can be split across different collections (e.g., when indexing anchor text).
- Multithreading. Indices can be queried and scored concurrently.
- Clustering. Indices can be clustered both lexically and documentally (possibly after a partitioning). The clustering system is completely open, and user-defined strategies decide how to combine documents from different sources. This architecture makes it possible, for instance, to load into RAM the part of an index that contains the terms appearing most frequently in user queries.

Requirements:

· fastutil
· JAL

What's New in This Release:

· WARNING: Massive revamp of the DocumentIteratorVisitor subsystem. Now such visitors can return data, much like a QueryIteratorBuildervisitor. The subsystem also has a special visit method for MultiTermIndexIterators. You'll have to adapt your previous implementations.
· WARNING: QueryParser instances are required to provide a parse(MutableString) method and two new escape methods that can be used to turn a string into a text token. This feature is fundamental for automatic query generation (thanks to Hugo Zaragoza for pointing out this problem).
· WARNING: To make a few things easier, we now have explicit document iterators representing true and false. Their construction requires a reference index (contrary to what happened with DocumentIterators.EMPTY_ITERATOR), so the getInstance() methods of most document iterators had to be updated, and DocumentIteratorVisitor instances need to implement two new visit() methods. The iterators are generated by the tokens #TRUE and #FALSE.
· WARNING: Indexing of virtual fields uses much less memory, but batches now have a different content: they represent actual positions in the final virtual document. The sizes in each batch represent the known size of a virtual document at the moment when the batch was written. With this change, Paste no longer requires more memory than Concatenate.
· WARNING: A new RemappingDocumentIterator class makes it possible to mix results from different indices with positional operators. Since there is a new Remap query node, all DocumentVisitors will have to be updated.
· WARNING: All deprecated classes have been removed.
· WARNING: The -B option of IndexBuilder is now aligned with Scan: it specifies the basename of a collection to be built at indexing time. It used to be the size of the Combine buffer.
· New classes for efficient document collection construction at indexing time. The architecture is now also very open: you can plug in your own builders.
· Completely restructured size handling for Combine and subclasses. Unless you use Golomb coding, you will not need to load sizes. This is true even of batches of virtual fields, as Paste now by default does not renumber positions, but rather expects them to be already renumbered. The old behaviour can be obtained via a flag.
· We moved to Jetty 6. Also, a few problems with Velocity not finding templates have been fixed.
· New, more intelligent memory handling that should be able to avoid out-of-memory errors completely. There is also a limit on the number of terms per batch that should help with garbage collection.
· Fixed a bug in collection creation: we used to provide the original factory, but this is wrong as we might not be indexing all fields. Now we generate a suitable factory that contains only the indexed fields.
· New important feature: high-performance indices may now have variable quanta depending on the list frequency and density. Indices now sport a .posnumbits file that records how many bits are used to store positions. It is used as a basic statistic to compute the correct quantum. You can ask for a percentage of the index to be used for skip towers, and the right quantum for each list will be computed for you. The process is quite empirical, so always look into the .stats files to check that you are actually using no more than the percentage requested. In general, old indices will have to be rebuilt before you can Combine them into an index with variable quanta, but for high-performance indices the tool ComputePosNumBitsPositions can be used to add the missing file.
· Memory mapping of indices now uses the new multiplexed approach implemented in ByteBufferInputStream. This means that we can map essentially every index into memory. Thanks to Valentin Tablan and Ian Roberts for suggesting this approach.
· Now we feature an implementation of the state-of-the-art BM25F ranking function.
· ZipDocumentCollection.getInstance() makes it possible to reliably load ZipDocumentCollection instances even if they are not in the current directory.
· New, nice UTF-8 mathematical symbols for conjunction, disjunction, TRUE and FALSE.
· Fixed a problem with too many connections being left open when using JdbcDocumentCollection.
· A new SUCCINCTSIZES URI key makes it possible to ask for sizes to be loaded into an Elias-Fano compressed list. This will slow down access by two orders of magnitude, but it can be very useful when pasting large indices, as pasting needs to load a large amount of size data.
· EmptyIndexIterator instances are no longer Index-based singletons. This change was necessary to make it possible to run ranking algorithms that need to set the weight or id even of empty iterators. This should cause no problems.
· All document iterators now have a settable weight. The weight can be expressed in standard syntax using braces. Note that weights per se have no meaning; it is up to the scorers to use them.
· Now the metadata-only option of Combine and its implementations generates the file of frequencies. This is very useful as it makes it possible to compute the term frequencies for the virtual documents obtained by concatenating all fields, which is necessary for the correct computation of BM25F.
· Fixed a bug in the grammar: queries such as "(a))" would have been parsed as "(a)" because of a missing check for EOF (thanks to Hugo Zaragoza for reporting this bug).
· The parser will now accept Unicode characters 0x2227 and 0x2228 (the standard mathematical symbols for conjunction and disjunction) for AND and OR, respectively.
· Following some testing on TREC GOV2, the defaults for MAXPREANCHOR and MAXPOSTANCHOR in HtmlDocumentFactory have been reduced to 8 and 4, respectively.
· Fixed an old bug in SemiExternalGammaList: readBits(0) was not called after numLongs estimation, leading to EOFExceptions.
· Document pointers can now be coded in unary.
· Fixed a bad bug in PartitionLexically: for high-performance indices, the positions of the last term were not being written.
· HttpFileServer has a settable port.
· New Scorer.getWeights() method to get weights.
· Fixed a bug in the TfIdf scorer that would have caused NaNs.
· Query accepts a newline-separated list of titles, besides the usual serialised object.
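
The description mentions BM25 scoring without saying how such a score is computed. The following is a minimal, self-contained sketch of the textbook BM25 formula for a single query term. It does not use or reproduce MG4J's own scorer classes: the class name Bm25Sketch, the method names, and the parameter values k1 = 1.2 and b = 0.75 are illustrative assumptions, and the smoothed IDF variant shown is one common choice among several.

```java
// Hypothetical illustration of textbook BM25; not MG4J's implementation.
final class Bm25Sketch {

    /** Smoothed inverse document frequency; the "1 +" keeps the value positive. */
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * BM25 contribution of one term to one document.
     *
     * @param count        occurrences of the term in the document
     * @param docLength    length of the document in words
     * @param avgDocLength average document length in the collection
     * @param numDocs      number of documents in the collection
     * @param docFreq      number of documents containing the term
     * @param k1           term-frequency saturation parameter (assumed value: 1.2)
     * @param b            length-normalisation parameter (assumed value: 0.75)
     */
    static double termScore(int count, int docLength, double avgDocLength,
                            long numDocs, long docFreq, double k1, double b) {
        // Longer-than-average documents get a larger normaliser, hence a lower score.
        double norm = k1 * (1.0 - b + b * docLength / avgDocLength);
        return idf(numDocs, docFreq) * (count * (k1 + 1.0)) / (count + norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;
        // A term found in 10 of 1000 documents, scored for documents of varying shape.
        System.out.println(termScore(1, 100, 100.0, 1000, 10, k1, b));
        System.out.println(termScore(3, 100, 100.0, 1000, 10, k1, b));
        System.out.println(termScore(3, 400, 100.0, 1000, 10, k1, b));
    }
}
```

The score of a whole query is usually the sum of these per-term contributions. BM25F, mentioned in the release notes, extends the idea to multiple fields; it is not shown here.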

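The feature list and release notes refer to several instantaneous codes (unary, Golomb, and others) that trade index size against speed. As a rough illustration of the idea, the sketch below encodes the gaps between sorted document pointers with unary and Elias gamma codes, writing bits as characters for readability. The class BitCodesSketch and its methods are hypothetical, they are not part of MG4J's bit-level I/O classes, and the bit conventions used (zeros terminated by a one for unary) are one common choice that may differ from MG4J's on-disk layout.

```java
// Hypothetical illustration of unary and Elias gamma codes; not MG4J's bit streams.
final class BitCodesSketch {

    /** Unary code of x >= 0: x zeros followed by a single one. */
    static String unary(int x) {
        if (x < 0) throw new IllegalArgumentException("x must be >= 0");
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < x; i++) bits.append('0');
        return bits.append('1').toString();
    }

    /** Elias gamma code of x >= 1: (length - 1) zeros, then x in binary. */
    static String gamma(int x) {
        if (x < 1) throw new IllegalArgumentException("x must be >= 1");
        String binary = Integer.toBinaryString(x); // always starts with '1'
        StringBuilder bits = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) bits.append('0');
        return bits.append(binary).toString();
    }

    /** Gamma-codes the gaps between strictly increasing, non-negative pointers. */
    static String encodeGaps(int[] sortedPointers) {
        StringBuilder bits = new StringBuilder();
        int previous = -1; // so that the first gap is pointer + 1, which is >= 1
        for (int pointer : sortedPointers) {
            bits.append(gamma(pointer - previous));
            previous = pointer;
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        int[] pointers = {3, 7, 8, 20}; // gaps: 4, 4, 1, 12
        String encoded = encodeGaps(pointers);
        System.out.println(encoded + " (" + encoded.length() + " bits)");
        System.out.println("unary(12) needs " + unary(12).length() + " bits");
    }
}
```

Unary is compact only for very small values, since its length grows linearly, while gamma grows logarithmically; this is the kind of trade-off behind choosing a code per index component.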
