marisa-trie

Python bindings to marisa-trie (unofficial)
Download

marisa-trie Ranking & Summary

Advertisement

  • Rating:
  • License:
  • MIT/X Consortium Lic...
  • Price:
  • FREE
  • Publisher Name:
  • Mikhail Korobov
  • Publisher web site:
  • http://bitbucket.org/kmike/

marisa-trie Tags


marisa-trie Description

marisa-trie provides static memory-efficient Trie structures for Python (2.x and 3.x). Uses marisa-trie C++ library.There are official SWIG-based Python bindings included in C++ library distribution; this package provides an alternative unofficial Cython-based pip-installable Python bindings.Installationpip install marisa-trieUsageThere are several Trie classes in this package:- marisa_trie.Trie - read-only trie-based data structure that maps unicode keys to auto-generated unique IDs and supports exact lookups, prefix lookups and searching for all prefixes of a given key;- marisa_trie.RecordTrie - read-only trie-based data structure that maps unicode keys to lists of data tuples. All tuples must be of the same format (the data is packed with using python struct module). RecordTrie supports exact and prefix lookups.- marisa_trie.BytesTrie - read-only Trie that maps unicode keys to lists of bytes objects. BytesTrie supports exact and prefix lookups.marisa_trie.TrieCreate a new trie:>>> import marisa_trie>>> trie = marisa_trie.Trie()Check if key is in trie:>>> u'key1' in trieTrue>>> u'key20' in trieFalseEach key is assigned an unique ID from 0 to (n - 1), where n is the number of keys; you can use this ID to store a value in a separate structure (e.g. python list):>>> trie.key_id(u'key2')1Key can be reconstructed from the ID:>>> trie.restore_key(1)u'key2'Find all prefixes of a given key:>>> trie.prefixes(u'key12')There is also a generator version of .prefixes method called .iter_prefixes.Find all keys from this trie that starts with a given prefix:>> trie.keys(u'key1')(iterator version .iterkeys(prefix) is also available).marisa_trie.RecordTrieCreate a new trie:>>> keys = >>> values = >>> fmt = "< HH" # a tuple with 2 short integers>>> trie = marisa_trie.RecordTrie(fmt, zip(keys, values))Trie initial data must be an iterable of tuples (unicode_key, data_tuple). Data tuples will be converted to bytes with struct.pack(fmt, *data_tuple).Take a look at http://docs.python.org/library/struct.html#format-strings for the format string specification.Duplicate keys are allowed.Check if key is in trie:>>> u'foo' in trieTrue>>> u'spam' in trieFalseGet a values list:>>> trie>>> trieFind all keys from this trie that starts with a given prefix:>> trie.keys(u'fo')Find all items from this trie that starts with a given prefix:>> trie.items(u'fo')NoteIterator version of .keys() and .items() are not implemented yet.PersistenceTrie objects supports saving/loading, pickling/unpickling and memory mapped I/O.Write trie to a stream:>>> with open('my_trie.marisa', 'w') as f:... trie.write(f)Save trie to a file:>>> trie.save('my_trie_copy.marisa')Read trie from stream:>>> trie2 = marisa_trie.Trie()>>> with open('my_trie.marisa', 'r') as f:... trie.read(f)Load trie from file:>>> trie2.load('my_trie.marisa')Trie objects are picklable:>>> import pickle>>> data = pickle.dumps(trie)>>> trie3 = pickle.loads(data)You may also build a trie using marisa-build command-line utility (provided by underlying C++ library; it should be downloaded and compiled separately) and then load the trie from the resulting file using .load() method.Memory mapped I/OIt is possible to use memory mapped file as data source:>>> trie = marisa_trie.RecordTrie(fmt)>>> trie.mmap('my_record_trie.marisa')This way the whole dictionary won't be loaded to memory; memory mapped I/O is an easy way to share dictionary data among processes.WarningMemory mapped trie might cause a lot of random disk accesses which considerably increase the search time.BenchmarksMy quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):- list(unicode words) : about 300M- BaseTrie from datrie library: about 70M- marisa_trie.RecordTrie : 11M- marisa_trie.Trie: 7MNoteLengths of words were stored as values in datrie.BaseTrie and marisa_trie.RecordTrie. RecordTrie compresses similar values and the key compression is better so it uses much less memory than datrie.BaseTrie.marisa_trie.Trie provides auto-assigned IDs. It is not possible to store arbitrary values in marisa_trie.Trie so it uses less memory than RecordTrie.Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):dict __getitem__ (hits): 4.090M ops/secTrie __getitem__ (hits): not supportedBytesTrie __getitem__ (hits): 0.469M ops/secRecordTrie __getitem__ (hits): 0.373M ops/secdict __contains__ (hits): 4.036M ops/secTrie __contains__ (hits): 0.910M ops/secBytesTrie __contains__ (hits): 0.573M ops/secRecordTrie __contains__ (hits): 0.591M ops/secdict __contains__ (misses): 3.346M ops/secTrie __contains__ (misses): 1.643M ops/secBytesTrie __contains__ (misses): 0.976M ops/secRecordTrie __contains__ (misses): 1.017M ops/secdict items(): 58.316 ops/secTrie items(): not supportedBytesTrie items(): 2.456 ops/secRecordTrie items(): 2.254 ops/secdict keys(): 211.194 ops/secTrie keys(): 3.341 ops/secBytesTrie keys(): 2.308 ops/secRecordTrie keys(): 2.184 ops/secTrie.prefixes (hits): 0.176M ops/secTrie.prefixes (mixed): 0.956M ops/secTrie.prefixes (misses): 1.035M ops/secTrie.iter_prefixes (hits): 0.170M ops/secTrie.iter_prefixes (mixed): 0.799M ops/secTrie.iter_prefixes (misses): 0.898M ops/secTrie.keys(prefix="xxx"), avg_len(res)==415: 0.825K ops/secTrie.keys(prefix="xxxxx"), avg_len(res)==17: 19.934K ops/secTrie.keys(prefix="xxxxxxxx"), avg_len(res)==3: 85.239K ops/secTrie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 136.476K ops/secTrie.keys(prefix="xxx"), NON_EXISTING: 1073.719K ops/secTries from marisa_trie uses less memory, tries from datrie.Trie are faster.Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.ContributingDevelopment happens at github and bitbucket: https://github.com/kmike/marisa-trie https://bitbucket.org/kmike/marisa-trieThe main issue tracker is at github: https://github.com/kmike/marisa-trie/issuesFeel free to submit ideas, bugs, pull requests (git or hg) or regular patches.If you found a bug in a C++ part please report it to the original bug tracker.Product's homepage


marisa-trie Related Software