PDFTextStream

A PDF text and metadata extraction library available for Java, Python, and .NET.

PDFTextStream Ranking & Summary


  • License: Other/Proprietary Li...
  • Price: USD 1900.00
  • Publisher Name: Snowtide Informatics Systems, Inc.
  • Publisher web site: http://snowtide.com/


PDFTextStream Description

PDFTextStream is a PDF text and metadata extraction library available for Java, Python, and .NET. It supports all versions of the PDF document specification (including v1.6, used by Acrobat 7), extraction of text encoded using double-byte character sets (including Chinese, Japanese, and Korean), decryption of 40-bit and 128-bit encrypted documents, and extraction of all document metadata provided by PDF documents (including form data, bookmarks, and annotations). Easy integration with Jakarta Lucene is included.

Requirements:

  • Apache Lucene

What's New in This Release:

  • Added an .isStruckThrough() method to com.snowtide.pdf.TextUnit, indicating whether a character has a strikethrough drawn through it.
  • Improved PDFTextStream's support for embedded character mappings.
  • Fixed the calculation of whitespace between words so that it properly accounts for whitespace that is explicitly encoded in the source PDF documents.
  • Improved PDFTextStream's handling of composite content encodings, which previously could fail, causing some ranges of PDF content to be ignored during extraction.
  • Fixed a bug in VisualOutputTarget where text from a single line would be split over multiple lines.
  • Improved vertical alignment of text extracted using VisualOutputTarget.
  • Improved VisualOutputTarget-produced extracts to eliminate spurious additional whitespace between closely adjacent words.
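The whitespace items in the release notes concern a problem common to PDF text extractors: a PDF may encode a space as an actual character, or it may only leave a horizontal gap between glyphs. The Java sketch below illustrates that general idea and is not PDFTextStream's implementation or API. The Glyph class, the joinLine method, and the gapThreshold parameter are hypothetical names invented for this illustration. A space is inferred from a gap only when neither neighbouring glyph is already an explicitly encoded space.

```java
import java.util.List;

// Illustrative sketch only: NOT PDFTextStream code. All names here are hypothetical.
final class WordSpacingSketch {

    // A positioned character: its value, left edge (x), and advance width.
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs that share a line, assumed to be sorted left to right.
    // A space is inferred when the gap between two glyphs exceeds gapThreshold,
    // unless one of the two glyphs is itself an explicitly encoded space.
    static String joinLine(List<Glyph> glyphs, float gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                boolean explicitSpaceNearby = prev.ch == ' ' || g.ch == ' ';
                if (gap > gapThreshold && !explicitSpaceNearby) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }
}
```

With this approach, a line whose words are separated only by positioning still gets a space, while a line that already carries a space character does not get a second one.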


PDFTextStream Related Software