opencorpora-tools

opencorpora.org Python interface

opencorpora-tools Ranking & Summary


  • License: MIT/X Consortium Lic...
  • Price: FREE
  • Publisher Name: Mikhail Korobov
  • Publisher web site: http://bitbucket.org/kmike/


opencorpora-tools Description

opencorpora-tools is a module that provides a Python interface to http://opencorpora.org/

Installation

    pip install opencorpora-tools

If you have Python < 2.7, the argparse and ordereddict packages are also required:

    pip install argparse
    pip install ordereddict

Usage

Obtaining corpora

opencorpora-tools works with XML from http://opencorpora.org/. You can download and unpack the XML manually (from the 'Downloads' page) or use the provided command-line utility:

    opencorpora download

Run opencorpora download --help for more options.

Using corpora

Initialize:

    >>> import opencorpora
    >>> corpus = opencorpora.Corpora('annot.opcorpora.xml')

Get a list of documents:

    >>> catalog = corpus.catalog()
    >>> doc_id, doc_title = catalog
    >>> print doc_id
    1610
    >>> doc_title
    24105 Герман Греф советует россиянам «не суетиться» с валютой

Work with a document:

    >>> doc = corpus
    >>> print doc.title()
    24105 Герман Греф советует россиянам «не суетиться» с валютой
    >>> print doc.words()
    Сбербанка
    >>> doc.sents()
    < class 'opencorpora.Sentence' >: Герман Греф советует россиянам «не суетиться» с валютой
    >>> print doc.paras()
    Герман Греф советует россиянам «не суетиться» с валютой Президент Сбербанка уверен, что в ближайшее время на валютных рынках сохранится высокая волатильность и «шараханье».

The Corpora, Document, Paragraph and Sentence classes support the following methods where they make sense (e.g. a sentence doesn't have paragraphs):

- words() - returns a list of words and other tokens;
- sents() - returns a list of Sentence instances;
- paras() - returns a list of Paragraph instances;
- documents() - returns a list of Document instances (this is a memory hog!);
- tagged_words() - returns a list of (str, str);
- tagged_sents() - returns a list of (list of (str, str));
- tagged_paras() - returns a list of (list of (list of (str, str)));
- iterwords(), itersents(), iterparas(), iterdocuments(), iter_tagged_words(), iter_tagged_sents(), iter_tagged_paras() - return iterators over words, sentences, paragraphs or documents.

You can also iterate over Corpora, Document, Paragraph and Sentence instances (this yields documents, paragraphs, sentences and words, respectively), e.g.:

    >>> sent = doc.sents()
    >>> for word in sent:
    ...     print word
    ...
    Герман
    Греф
    советует
    россиянам
    «
    не
    суетиться
    »
    с
    валютой

The API is modelled after NLTK's CorpusReader API. It is not exactly the same, but it is very similar. For example, sents() in opencorpora-tools returns a list of Sentence instances, while sents() in NLTK returns a list of lists of strings. A Sentence instance quacks like a list of strings, however (it can be indexed, iterated, etc.), so the opencorpora.Corpora API may be seen as a superset of the NLTK CorpusReader API.
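
The "quacks like a list of strings" point in the NLTK comparison can be illustrated with a small sketch. It does not use opencorpora-tools at all: SketchSentence and count_tokens are hypothetical names invented for this illustration, and the real Sentence class may well be implemented differently. The sketch is written in Python 3 syntax and only shows why code written for NLTK-style nested lists could also accept sentence objects, and how a list-returning method might sit next to its iterator counterpart.

```python
from collections.abc import Sequence


class SketchSentence(Sequence):
    """Hypothetical stand-in for a sentence object (not the library's class)."""

    def __init__(self, tokens):
        self._tokens = list(tokens)

    # __getitem__ and __len__ are all Sequence needs; iteration,
    # membership tests and index() are derived from them.
    def __getitem__(self, index):
        return self._tokens[index]

    def __len__(self):
        return len(self._tokens)

    def words(self):
        # Eager variant: builds and returns a full list.
        return list(self._tokens)

    def iterwords(self):
        # Lazy variant: hands tokens out one at a time.
        return iter(self._tokens)


def count_tokens(sents):
    """Written against the NLTK-style shape: a list of lists of strings."""
    return sum(len(sent) for sent in sents)


tokens = ["Герман", "Греф", "советует", "россиянам"]
sent = SketchSentence(tokens)

# The same function accepts plain nested lists and sentence-like objects.
plain_total = count_tokens([tokens])
wrapped_total = count_tokens([sent])
```

Because SketchSentence only relies on the standard sequence protocol, count_tokens never needs to know which of the two shapes it was given; that is the sense in which a richer return type can remain compatible with code expecting lists of strings.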

