metadata_parser

A module to parse metadata out of documents

metadata_parser Ranking & Summary

  • License: MIT/X Consortium Lic...
  • Price: FREE
  • Publisher Name: Jonathan Vanasco
  • Publisher web site: http://search.cpan.org/~jvanasco/Authen-PluggableCaptcha-0.05/lib/Authen/PluggableCaptcha/Tutorial.pm

metadata_parser Description

metadata_parser is a Python module for pulling metadata out of web documents.

It requires BeautifulSoup, and was largely based on Erik River's opengraph module (https://github.com/erikriver/opengraph). I needed something more aggressive than Erik's module, so had to fork.

Installation

    pip install metadata_parser

Features

- it pulls as much metadata out of a document as possible
- you can set a 'strategy' for finding metadata (i.e., only accept opengraph or page attributes)

Notes

This requires BeautifulSoup 3 or 4. If it can import bs4 it does; otherwise it tries BeautifulSoup (3).

For speed, it will instantiate a BeautifulSoup parser with lxml, and fall back to 'none' (the internal pure python) if it can't load lxml.

The default 'strategy' is to look in this order: og,dc,meta,page

    og = OpenGraph
    dc = DublinCore
    meta = metadata
    page = page elements

You can specify a strategy as a comma-separated list of the above.

The only 2 page elements currently supported are:

    <title>VALUE</title> -> metadata
    <link rel="canonical" href="VALUE"> -> metadata

Usage

From a URL

    >>> import metadata_parser
    >>> page = metadata_parser.MetadataParser(url="http://www.cnn.com")
    >>> print(page.metadata)
    >>> print(page.get_field('title'))
    >>> print(page.get_field('title', strategy='og'))
    >>> print(page.get_field('title', strategy='page,og,dc'))

From HTML

    >>> HTML = """"""
    >>> page = metadata_parser.MetadataParser(html=HTML)
    >>> print(page.metadata)
    >>> print(page.get_field('title'))
    >>> print(page.get_field('title', strategy='og'))
    >>> print(page.get_field('title', strategy='page,og,dc'))
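The 'strategy' ordering described in the Notes can be illustrated with a small standalone sketch. This is not metadata_parser's actual implementation: the function name `lookup_field`, the nested-dict layout, and the sample values are all assumptions made up for illustration. It only shows the general idea of trying each source in the order given by a comma-separated strategy string and returning the first match.

```python
# Hypothetical sketch of a strategy-ordered lookup.
# Nothing here is taken from metadata_parser's source code.

DEFAULT_STRATEGY = "og,dc,meta,page"


def lookup_field(metadata, field, strategy=DEFAULT_STRATEGY):
    """Return the first value found for `field`, trying each source in order.

    `metadata` is assumed to be a dict of dicts keyed by source name
    ("og", "dc", "meta", "page"). Returns None if no source has the field.
    """
    for source in strategy.split(","):
        source = source.strip()
        value = metadata.get(source, {}).get(field)
        if value is not None:
            return value
    return None


# Made-up sample data, grouped by source.
sample = {
    "og": {"title": "OpenGraph Title"},
    "dc": {},
    "meta": {"description": "A meta description"},
    "page": {"title": "Page Title", "canonical": "http://example.com/"},
}

print(lookup_field(sample, "title"))                          # OpenGraph Title
print(lookup_field(sample, "title", strategy="page,og,dc"))   # Page Title
print(lookup_field(sample, "canonical"))                      # http://example.com/
```

With the default order, the OpenGraph title wins over the page title; putting 'page' first in the strategy reverses that preference, and a field found only in a later source is still returned.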

