PyKHTML

PyKHTML is a Python module for writing website scrapers/spiders.

PyKHTML Ranking & Summary

  • License: BSD License
  • Price: FREE
  • Publisher Name: Paul Giannaros
  • Publisher web site: http://paul.giannaros.org/pykhtml/

PyKHTML Description

PyKHTML is a Python module for writing website scrapers/spiders. Whereas traditional methods focus on writing the code to parse HTML and forms themselves, PyKHTML uses the excellent KHTML engine to do all the trudge work. It therefore handles webpages very well (even the severely crufty ones) and is pretty darn fast, since the engine is implemented in C++. As a bonus, the module handles JavaScript and cookies transparently.

How?

PyKHTML requires PyKDE 3 (and hence, in turn, PyQt 3 and the KDE libs). If you would like to run PyKHTML on servers without an X display, then Xvfb is required. Fortunately these requirements should come bundled with most modern Linux distributions, and support for Windows/Mac should appear in the next few months.

Show me some code

Okay. Here is an example (one of many examples included in the bundle) that scrapes the title and navigation from the PyKHTML home page, with excessive commenting to give you a feel of what programming with PyKHTML is like:

    import pykhtml

    PyKHTMLUrl = "http://paul.giannaros.org/pykhtml"

    def extractBitsFromPage(browser):
        # getElementsByTagName returns a generator, so we convert
        # to a list and access the first element
        title = list(browser.document.getElementsByTagName("title"))[0]
        print "Title:", title.text
        # Get the text of the navigation items
        navigation = []
        # First get the container of the list items...
        navigationElement = browser.document.getElementById("navigation")
        # ... and then loop over the li elements we find
        for listItem in navigationElement.getElementsByTagName("li"):
            # Inside the list item is an anchor
            anchor = listItem.children[0]
            # And the text inside the anchor is what we want
            navigation.append(anchor.text)
        print "Navigation:", " | ".join(navigation)
        # Stop here, we're done
        pykhtml.stopEventLoop()

    def main():
        browser = pykhtml.Browser()
        # the browser is passed as a parameter to extractBitsFromPage
        # when it is called (when the page has loaded)
        browser.load(PyKHTMLUrl, extractBitsFromPage)
        # kick things off
        pykhtml.startEventLoop()

    main()
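
For contrast, here is a rough sketch of the "traditional" approach the description mentions, where you write the parsing code yourself. It does not use PyKHTML at all: it relies only on html.parser from the Python 3 standard library, and the sample markup, the class name TitleAndNavParser, and the helper extract_title_and_navigation are all invented for illustration. It assumes the page has a title element and a list with id "navigation", as the PyKHTML example does.

```python
from html.parser import HTMLParser

# Hypothetical sample page; this markup is invented for illustration only.
SAMPLE_HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <ul id="navigation">
      <li><a href="/">Home</a></li>
      <li><a href="/docs">Docs</a></li>
    </ul>
  </body>
</html>
"""


class TitleAndNavParser(HTMLParser):
    """Collects the <title> text and the anchor text inside id="navigation"."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.navigation = []
        self._in_title = False
        self._in_anchor = False
        # Greater than zero while we are inside the navigation container;
        # counts how many open tags deep we currently are within it.
        self._nav_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        if self._nav_depth > 0:
            self._nav_depth += 1
            if tag == "a":
                # Start a new navigation entry for this anchor
                self._in_anchor = True
                self.navigation.append("")
        elif dict(attrs).get("id") == "navigation":
            self._nav_depth = 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if self._nav_depth > 0:
            self._nav_depth -= 1
            if tag == "a":
                self._in_anchor = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_anchor:
            self.navigation[-1] += data


def extract_title_and_navigation(html):
    """Return (title, [navigation item text, ...]) for an HTML string."""
    parser = TitleAndNavParser()
    parser.feed(html)
    return parser.title.strip(), [item.strip() for item in parser.navigation]


if __name__ == "__main__":
    title, navigation = extract_title_and_navigation(SAMPLE_HTML)
    print("Title:", title)
    print("Navigation:", " | ".join(navigation))
```

Even this small sketch has to track parser state by hand, and it is fragile: an unclosed tag such as a bare br inside the list would throw off the depth counter, and nothing here runs JavaScript or keeps cookies. Those are the chores the description says PyKHTML leaves to the KHTML engine.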
