metadata_parser: A module to parse metadata out of documents
metadata_parser Ranking & Summary
- License: MIT/X Consortium Lic...
- Price: FREE
- Publisher Name: Jonathan Vanasco
- Publisher web site: http://search.cpan.org/~jvanasco/Authen-PluggableCaptcha-0.05/lib/Authen/PluggableCaptcha/Tutorial.pm
metadata_parser Description
metadata_parser is a Python module for pulling metadata out of web documents. It requires BeautifulSoup, and was largely based on Erik River's opengraph module (https://github.com/erikriver/opengraph). I needed something more aggressive than Erik's module, so I had to fork.

Installation

pip install metadata_parser

Features

- It pulls as much metadata out of a document as possible.
- You can set a 'strategy' for finding metadata (i.e., only accept OpenGraph or page attributes).

Notes

This requires BeautifulSoup 3 or 4. If it can import bs4 it does; otherwise it tries BeautifulSoup (3).

For speed, it will instantiate a BeautifulSoup parser with lxml, and fall back to 'none' (the internal pure Python parser) if it can't load lxml.

The default 'strategy' is to look in this order: og,dc,meta,page

- og = OpenGraph
- dc = DublinCore
- meta = metadata
- page = page elements

You can specify a strategy as a comma-separated list of the above.

The only 2 page elements currently supported are:

- <title>VALUE</title> -> metadata
- <link rel="canonical" href="VALUE"> -> metadata

Usage

From a URL

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.cnn.com")
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))

From HTML

>>> HTML = """"""
>>> page = metadata_parser.MetadataParser(html=HTML)
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))
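The strategy ordering described in the notes can be illustrated with a small standalone sketch. This is not the metadata_parser source: the function name `lookup_field`, the constant `DEFAULT_STRATEGY`, and the shape of the `sample` dictionary are assumptions made for illustration only. It shows one plausible way a comma-separated strategy such as og,dc,meta,page could be walked in order, returning the first source that has a value for the requested field.

```python
# Hypothetical illustration of a strategy-ordered lookup.
# None of these names come from metadata_parser itself.

DEFAULT_STRATEGY = "og,dc,meta,page"  # order given in the notes above


def lookup_field(metadata, field, strategy=DEFAULT_STRATEGY):
    """Return the first non-empty value for `field`, checking each
    source named in the comma-separated `strategy` in order."""
    for source in strategy.split(","):
        source = source.strip()
        value = metadata.get(source, {}).get(field)
        if value:
            return value
    return None


# Assumed layout: one dictionary of fields per metadata source.
sample = {
    "og": {"title": "OpenGraph title"},
    "dc": {},
    "meta": {"description": "Meta description"},
    "page": {"title": "Page title", "canonical": "http://example.com/"},
}

print(lookup_field(sample, "title"))                         # default order
print(lookup_field(sample, "title", strategy="page,og,dc"))  # page first
print(lookup_field(sample, "description"))                   # falls through to meta
```

With the default order the OpenGraph value is found first; putting page at the front of the strategy makes the page element take precedence instead.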