PDFlib TET is a software product that reliably extracts text from any PDF document.
Tags: PDF
Developer: PDFlib
Current version: v5.4
Platforms and languages: ActiveX & COM | .NET | Java | C++/MFC | Other
PDFlib TET (Text and Image Extraction Toolkit) reliably extracts text, images and metadata from PDF documents. TET makes the text contents of a PDF available as Unicode strings, along with detailed color, glyph and font information and the position on the page. Raster images are extracted in common image formats. TET can optionally convert PDF documents to an XML-based format called TETML, which contains text and metadata as well as resource information. TET includes advanced content analysis algorithms for determining word boundaries, grouping text into columns, identifying table structures and removing redundant items such as shadow text.
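TET's patented shadow detection algorithm identifies and removes redundant instances of text so that no surplus text is extracted. Where other software extracts shadowed or artificially bolded text several times, TET correctly removes the redundant copies. An extra copy of a whole word would still produce a search engine hit, but when the text is duplicated character by character, searches no longer find it.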
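The idea of dropping duplicated glyphs can be illustrated with a small Python sketch. This is not TET's algorithm, which the text above does not describe in detail. It assumes a hypothetical `Glyph` record and a simple rule: a glyph is redundant if the same character was already kept within a small distance (`tolerance`, an assumed parameter).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    # Hypothetical glyph record: one character and its position on the page.
    char: str
    x: float
    y: float


def remove_shadow_duplicates(glyphs, tolerance=1.0):
    """Drop glyphs that repeat an already kept character at almost the same position."""
    kept = []
    for g in glyphs:
        is_duplicate = any(
            g.char == k.char
            and abs(g.x - k.x) <= tolerance
            and abs(g.y - k.y) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(g)
    return kept


def glyphs_to_text(glyphs):
    # Sort left to right so the text reads in visual order on a single line.
    return "".join(g.char for g in sorted(glyphs, key=lambda g: g.x))


# Each letter is painted twice with a small offset, as in a drop shadow.
page = [
    Glyph("P", 100.0, 500.0), Glyph("P", 100.5, 499.5),
    Glyph("D", 110.0, 500.0), Glyph("D", 110.5, 499.5),
    Glyph("F", 120.0, 500.0), Glyph("F", 120.5, 499.5),
]

naive_text = glyphs_to_text(page)
clean_text = glyphs_to_text(remove_shadow_duplicates(page))
print(naive_text)  # PPDDFF
print(clean_text)  # PDF
```

The naive result shows why character-by-character duplication defeats searching: the string "PDF" no longer occurs in the extracted text.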
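In many languages, accents and other diacritical marks are placed close to other characters to form combined characters. Some typesetting programs (most notably TeX) emit two separate characters, the base character and the accent, to create a combined character. For example, to create the character ä, the letter a is first placed on the page, and the dieresis character ¨ is then placed on top of it. TET detects this situation and recombines the two characters to form the appropriate combined character.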
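The recombination step itself can be sketched with Python's standard `unicodedata` module. The sketch assumes that the base character and its accent have already been paired (TET does this from the glyph positions, which is not shown here). The table `SPACING_TO_COMBINING` and the function `recombine` are illustrative names, not part of TET.

```python
import unicodedata

# Spacing accents that a typesetter may emit, mapped to their combining forms.
SPACING_TO_COMBINING = {
    "\u00A8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
    "\u00B4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u0060": "\u0300",  # GRAVE ACCENT -> COMBINING GRAVE ACCENT
}


def recombine(base, accent):
    """Merge a base letter and a separately placed accent into one character."""
    combining = SPACING_TO_COMBINING.get(accent, accent)
    # NFC normalization composes base + combining mark where a
    # precomposed character exists in Unicode.
    return unicodedata.normalize("NFC", base + combining)


a_dieresis = recombine("a", "\u00A8")
print(a_dieresis, len(a_dieresis))  # ä 1
```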
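TET's patented Unicode mapping algorithm implements a cascade that uses all available information to determine Unicode values. For many problematic documents, TET extracts proper Unicode text where other products deliver only unusable garbage.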
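A cascade of this kind can be pictured as a chain of fallbacks. The following Python sketch is an assumption about what such a chain could look like, not a description of TET's actual steps: it tries a ToUnicode-style table from the font, then a glyph name table, then glyph names of the form `uniXXXX`, and finally gives up with the Unicode replacement character. The function name and all parameters are hypothetical.

```python
def map_glyph_to_unicode(glyph_code, glyph_name, to_unicode_cmap, glyph_name_table):
    """Try each information source in order; return the first Unicode value found."""
    # 1. An explicit code-to-Unicode table embedded with the font.
    if glyph_code in to_unicode_cmap:
        return to_unicode_cmap[glyph_code]
    # 2. A table of well-known glyph names.
    if glyph_name in glyph_name_table:
        return glyph_name_table[glyph_name]
    # 3. Glyph names of the form "uniXXXX" carry the code point in hexadecimal.
    if glyph_name and glyph_name.startswith("uni") and len(glyph_name) == 7:
        try:
            return chr(int(glyph_name[3:], 16))
        except ValueError:
            pass
    # 4. Nothing matched: report the Unicode replacement character.
    return "\uFFFD"


cmap = {1: "A"}
names = {"adieresis": "\u00E4"}

print(map_glyph_to_unicode(1, None, cmap, names))         # from the code table
print(map_glyph_to_unicode(2, "adieresis", cmap, names))  # from the name table
print(map_glyph_to_unicode(3, "uni20AC", cmap, names))    # from the uniXXXX name
```

Ordering the sources from most to least reliable means a weaker heuristic is only consulted when a stronger source has nothing to say.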
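PDF does not encode logical text; it is merely a container for glyphs on the page. Text in the Arabic and Hebrew scripts runs from right to left. Because such text often contains left-to-right inserts (such as numbers or names in Western languages), it must be interpreted in both directions, hence the term "bidirectional". TET reorders the visual mix of right-to-left and left-to-right text to produce proper logical text output.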
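The difference between visual and logical order can be shown with a deliberately simplified Python sketch. It is not the Unicode Bidirectional Algorithm and not TET's implementation: it assumes a single right-to-left line, reverses it, and then restores the internal order of runs made of left-to-right letters and digits. Spaces and punctuation inside a left-to-right run (for example a two-word Western name) are not handled. The function name is hypothetical.

```python
import unicodedata

# Bidi classes treated as left-to-right: letters, European and Arabic digits.
LTR_CLASSES = {"L", "EN", "AN"}


def visual_to_logical_rtl(visual):
    """Convert one right-to-left line from visual (page) order to logical order."""
    chars = list(reversed(visual))  # read the line from its right edge
    result = []
    i = 0
    while i < len(chars):
        if unicodedata.bidirectional(chars[i]) in LTR_CLASSES:
            # Collect the whole left-to-right run and restore its own order.
            j = i
            while j < len(chars) and unicodedata.bidirectional(chars[j]) in LTR_CLASSES:
                j += 1
            result.extend(reversed(chars[i:j]))
            i = j
        else:
            result.append(chars[i])
            i += 1
    return "".join(result)


# Hebrew word "shalom" followed by a number, written here in logical order.
logical = "\u05E9\u05DC\u05D5\u05DD 123"
# The same line as it appears on the page from left to right.
visual = "123 \u05DD\u05D5\u05DC\u05E9"

recovered = visual_to_logical_rtl(visual)
print(recovered == logical)  # True
```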
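In many PDF documents, images are broken into small pieces by the software that generated the PDF. What looks like a single image on the page may actually consist of many small tiles. For example, Microsoft Office applications and TeX often produce heavily fragmented images consisting of hundreds or even thousands of small pieces, while Adobe InDesign usually splits images into fragments of varying size. TET detects fragmented images and merges them into a usable larger image. Fragmented images can be reasonably reused only after they have been merged.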
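The geometric part of merging fragments can be sketched in Python as follows. This is an illustration under simplifying assumptions, not TET's method: each fragment is represented only by its bounding box `(llx, lly, urx, ury)`, boxes that overlap or lie within an assumed `gap` of each other are merged into their common bounding box, and the pixel data is ignored. All function names are hypothetical.

```python
def touches(a, b, gap=0.5):
    """True if rectangles a and b overlap or lie within `gap` units of each other."""
    return (a[0] <= b[2] + gap and b[0] <= a[2] + gap
            and a[1] <= b[3] + gap and b[1] <= a[3] + gap)


def union(a, b):
    """Smallest rectangle that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_fragments(rects, gap=0.5):
    """Repeatedly merge touching rectangles until no pair touches any more."""
    merged = list(rects)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if touches(merged[i], merged[j], gap):
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


# One image cut into four horizontal strips, plus an unrelated second image.
fragments = [
    (0, 0, 100, 25), (0, 25, 100, 50), (0, 50, 100, 75), (0, 75, 100, 100),
    (200, 0, 250, 50),
]
images = merge_fragments(fragments)
print(images)  # [(0, 0, 100, 100), (200, 0, 250, 50)]
```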
TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. Using the integrated pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, interactive elements, etc.
Accepted PDF input
TET supports all relevant flavors of PDF input:
Since text in PDF is usually not encoded in Unicode, PDFlib TET normalizes the text in a PDF document to Unicode:
Content analysis and word detection
TET includes advanced content analysis algorithms:
Page Layout and Table Detection
The page content is analyzed to determine text columns. Tables are detected, including cells which span multiple columns. This improves the ordering of the extracted text. Table rows and the contents of each table cell can be identified.
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
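Restricting extraction to certain page areas amounts to a geometric filter on the glyph positions. The Python sketch below illustrates that idea on plain dictionaries. It does not use the TET API, and the names `inside` and `filter_glyphs` as well as the area coordinates are assumptions chosen for the example.

```python
def inside(glyph, area):
    """True if the glyph's reference point lies within area = (llx, lly, urx, ury)."""
    llx, lly, urx, ury = area
    return llx <= glyph["x"] <= urx and lly <= glyph["y"] <= ury


def filter_glyphs(glyphs, include=None, exclude=()):
    """Keep glyphs inside `include` (if given) and outside every area in `exclude`."""
    result = []
    for g in glyphs:
        if include is not None and not inside(g, include):
            continue
        if any(inside(g, area) for area in exclude):
            continue
        result.append(g)
    return result


# A page of 595 x 842 points with one glyph each in header, body and footer.
glyphs = [
    {"char": "H", "x": 50.0, "y": 800.0},
    {"char": "B", "x": 50.0, "y": 400.0},
    {"char": "F", "x": 50.0, "y": 30.0},
]
header = (0, 780, 595, 842)
footer = (0, 0, 595, 50)

body = filter_glyphs(glyphs, exclude=[header, footer])
print([g["char"] for g in body])  # ['B']
```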
Image Extraction
Images on PDF pages can be extracted as TIFF, JPEG, or JPEG 2000 files. Precise geometric information (position, size, and angles) is reported for each image. Fragmented images are combined into larger images to facilitate repurposing. Because no downsampling or color space conversion takes place, image fidelity is preserved and the extracted images keep the highest possible quality.
PDF Analysis
The TET library includes the pCOS interface for querying details about a PDF document, such as document info and XMP metadata, font lists, page size, and much more.
Configuration Options for Problematic PDFs
TET contains special handling and workarounds for various kinds of PDF where the text cannot be extracted correctly with other products. In addition, it includes various configuration features to improve processing of problem documents:
Unicode Postprocessing
TET supports various Unicode postprocessing steps which can be used to improve the extracted text:
Document Domains
PDF documents may contain text in other places than the page contents. While most applications will deal with the page contents only, in many situations other document domains may be relevant as well. TET extracts the text from all of the following document domains:
XMP Metadata
TET supports XMP metadata in several ways:
TETML represents PDF Contents as XML
TET optionally represents the PDF contents in an XML flavor called TETML. It contains a variety of PDF information in a form which can easily be processed with common XML tools. TETML contains the actual text plus optionally font and position information, resource details (fonts, images, colorspaces), and metadata.
TETML is governed by a corresponding XML schema to make sure that TET always creates consistent and reliable XML output. TETML can be processed with XSLT stylesheets, e.g. to apply certain filters or to convert TETML to other formats. Sample XSLT stylesheets for processing TETML are included in the TET distribution.
The following fragment shows TETML output with glyph details:
<Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
  <Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
  <Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
  <Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
  <Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
  <Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
  <Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
</Box>
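Because TETML is plain XML, a fragment like the one above can be read with any XML parser. The following Python sketch uses the standard library's `xml.etree.ElementTree` to rebuild the word from its glyphs and to check that the last glyph ends at the right edge of the box. It works on the fragment as shown, with a closing `</Box>` tag added; a complete TETML file may declare an XML namespace, in which case the element names in the queries would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# The fragment shown above, closed with </Box> so that it is well-formed.
fragment = """
<Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
  <Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
  <Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
  <Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
  <Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
  <Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
  <Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
</Box>
"""

box = ET.fromstring(fragment.strip())
glyphs = box.findall("Glyph")

# Concatenate the glyph contents to recover the word.
text = "".join(g.text for g in glyphs)

# The last glyph's x position plus its width should match the box's right edge.
last = glyphs[-1]
right_edge = float(last.get("x")) + float(last.get("width"))

print(text)                                # PDFlib
print(round(right_edge, 2), box.get("urx"))  # 161.14 161.14
```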
TET Connectors
TET connectors provide the necessary glue code to interface TET with other software. The following TET connectors make PDF text extraction functionality available for various software environments:
TET Cookbook
The TET Cookbook is a collection of programming examples which demonstrate the use of TET for various text and image extraction tasks. Several Cookbook samples show how to combine the TET and PDFlib+PDI products in order to enhance PDF documents, e.g. add bookmarks or links based on the text on the page.
Last updated: 2023-07-13 15:00:44 | Entered: 2006-01-18 11:46:00 | Editor: Hu Tao (胡涛)