Acora

Fast multi-keyword search engine for text strings
Acora Ranking & Summary

  • License: BSD License
  • Price: FREE
  • Publisher Name: Stefan Behnel
  • Publisher web site: http://behnel.de


Acora Description

Acora is 'fgrep' for Python, a fast multi-keyword text search engine.

Based on a set of keywords, it generates a search automaton (DFA) and runs it over string input, either unicode or bytes.

Acora is based on the Aho-Corasick algorithm and an NFA-to-DFA powerset construction.

Acora comes with both a pure Python implementation and a fast binary module written in Cython.

How do I use it?

Import the package:

    >>> from acora import AcoraBuilder

Collect some keywords:

    >>> builder = AcoraBuilder('ab', 'bc', 'de')
    >>> builder.add('a', 'b')

Generate the Acora search engine for the current keyword set:

    >>> ac = builder.build()

Search a string for all occurrences:

    >>> ac.findall('abc')
    [('a', 0), ('ab', 0), ('b', 1), ('bc', 1)]
    >>> ac.findall('abde')
    [('a', 0), ('ab', 0), ('b', 1), ('de', 2)]

Iterate over the search results as they come in:

    >>> for kw, pos in ac.finditer('abde'):
    ...     print("%2s[%d]" % (kw, pos))
     a[0]
    ab[0]
     b[1]
    de[2]

FAQs and recipes

1. How do I run a greedy search for the longest matching keywords?

    >>> builder = AcoraBuilder('a', 'ab', 'abc')
    >>> ac = builder.build()

    >>> for kw, pos in ac.finditer('abbabc'):
    ...     print(kw)
    a
    ab
    a
    ab
    abc

    >>> from itertools import groupby
    >>> from operator import itemgetter

    >>> def longest_match(matches):
    ...     # group matches by start position, keep the largest per group
    ...     for pos, match_set in groupby(matches, itemgetter(1)):
    ...         yield max(match_set)

    >>> for kw, pos in longest_match(ac.finditer('abbabc')):
    ...     print(kw)
    ab
    abc

2. How do I parse line-by-line, as fgrep does, but with arbitrary line endings?

    >>> def group_by_lines(s, *keywords):
    ...     # treat '\r' and '\n' as extra keywords that mark line ends
    ...     builder = AcoraBuilder('\r', '\n', *keywords)
    ...     ac = builder.build()
    ...
    ...     current_line_matches = []
    ...     last_ending = None
    ...
    ...     for kw, pos in ac.finditer(s):
    ...         if kw in '\r\n':
    ...             if last_ending == '\r' and kw == '\n':
    ...                 continue  # combined CRLF
    ...             yield tuple(current_line_matches)
    ...             del current_line_matches[:]
    ...             last_ending = kw
    ...         else:
    ...             last_ending = None
    ...             current_line_matches.append(kw)
    ...     yield tuple(current_line_matches)

    >>> kwds = ['ab', 'bc', 'de']
    >>> for matches in group_by_lines('a\r\r\nbc\r\ndede\n\nab', *kwds):
    ...     print(matches)
    ()
    ()
    ('bc',)
    ('de', 'de')
    ()
    ('ab',)

Here are some key features of "Acora":

· works with unicode strings and byte strings
· about 2-3x as fast as Python's regular expression engine for most input
· finds overlapping matches, i.e. all matches of all keywords
· support for case insensitive search (~10x as fast as 're')
· frees the GIL while searching
· additional (slow but short) pure Python implementation
· support for Python 2.5+ and 3.x
· support for searching in files
· permissive BSD license

Requirements:

· Python

What's New in This Release:

· minor speed-up in inner search engine loop
· some code cleanup
· built using Cython 0.12.1 (final)
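
To make the Aho-Corasick idea mentioned in the description more concrete, here is a minimal pure Python sketch of the textbook algorithm. It is not Acora's code: Acora is described as compiling a DFA through a powerset construction, whereas this sketch uses the simpler trie-plus-failure-links form. The names `build_automaton` and `findall` are hypothetical helpers chosen for illustration, and the sketch does not import or depend on the `acora` package.

```python
from collections import deque


def build_automaton(keywords):
    """Build a keyword trie with failure links (illustrative, not Acora's DFA)."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state for the longest proper suffix
    output = [[]]   # output[state] -> keywords that end in this state

    # Phase 1: insert every keyword into the trie.
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        if kw not in output[state]:
            output[state].append(kw)

    # Phase 2: breadth-first pass that sets failure links and
    # lets each state inherit the outputs of its failure state.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def findall(automaton, text):
    """Return (keyword, start_position) for every match, overlaps included."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for end, ch in enumerate(text):
        # Follow failure links until some state accepts this character.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for kw in output[state]:
            matches.append((kw, end - len(kw) + 1))
    return matches


automaton = build_automaton(['ab', 'bc', 'de', 'a', 'b'])
print(findall(automaton, 'abc'))
```

The input is scanned once, character by character, and all keywords are matched in that single pass, which is what makes this family of algorithms attractive for multi-keyword search. A DFA, as Acora is described as using, removes the failure-link loop by precomputing a transition for every state and character.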

