Document Filters and Enterprise Search

Original | Industry News | Editor: Gong Xue (龚雪) | 2015-03-11 13:09:15 | 171 views

Overview: Many people have heard of enterprise search, but few know that document filters are what enterprise search is built on.

If you looked at a Microsoft Word file in binary format (as a search engine has to), you would find a file structure so complex that picking out the text is nearly impossible. MS Word documents include not only body text but also fields and often hidden metadata. Word files can also have a nested structure, embedding multiple layers of other documents within the Word file.

Delving through these levels of complexity requires software that encodes a deep understanding of file structure. That is the job of document filters.
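As a rough illustration of what a filter does, the sketch below pulls paragraph text out of a modern .docx package, which is a ZIP archive containing XML. It is a minimal Python sketch using only the standard library, not dtSearch's implementation. The helper names `extract_docx_text` and `make_sample_docx` are hypothetical, and the sample package is stripped down to a single part for demonstration. The sketch ignores fields, metadata, embedded objects, and the older binary .doc format, which is exactly the ground a production filter has to cover.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml in .docx packages.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(data: bytes) -> str:
    """Return paragraph text from the bytes of a .docx package (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs) -> bytes:
    """Build a stripped-down, in-memory .docx-like package (demonstration only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr("word/document.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    sample = make_sample_docx(["Hello", "World"])
    print(extract_docx_text(sample))
```

Even this toy version has to know the container format, the part name, and the XML namespace before it can return a single word of text.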

Document filters are not static. Every update that Microsoft makes to the MS Word format, for example, requires a matching adjustment to the document filters, which must still preserve backward compatibility with existing Word files.

One leading supplier of enterprise and developer text search software, dtSearch Corp., has spent over two decades building its own document filters. And the company continually upgrades its document filters to correspond with the release of new data formats.

In addition to Word, other MS Office file types that dtSearch supports include PowerPoint, Excel, Access, and OneNote. The document filters also support PDF, RTF, OpenOffice, HTML, XML, CSV, and many other file types, along with compression formats like RAR, ZIP, and GZIP/TAR. And the dtSearch document filters support recursively embedded versions of files, such as a Word file embedded in an Excel file contained in a ZIP attachment.
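Recursive embedding can be pictured as a walk over nested containers. The following Python sketch descends through GZIP and ZIP layers and yields every leaf item with its full path. It is an illustration under stated assumptions rather than how dtSearch works: `walk_container`, `make_zip`, and the `MAX_DEPTH` limit are hypothetical names, and format detection relies only on the well-known leading signature bytes. Because .docx and .xlsx files are themselves ZIP packages, the same walk would also descend into them.

```python
import gzip
import io
import zipfile

MAX_DEPTH = 8  # assumed safety limit against maliciously deep nesting


def walk_container(name, data, depth=0):
    """Yield (path, bytes) for each leaf item, descending into GZIP and ZIP layers."""
    if depth < MAX_DEPTH and data[:2] == b"\x1f\x8b":
        # GZIP stream: decompress and inspect the payload again.
        yield from walk_container(name + "!gunzip", gzip.decompress(data), depth + 1)
    elif depth < MAX_DEPTH and data[:4] == b"PK\x03\x04":
        # ZIP archive: recurse into every member that is not a directory.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                member_path = f"{name}/{info.filename}"
                yield from walk_container(member_path, archive.read(info), depth + 1)
    else:
        # Not a recognized container (or too deep): treat as a leaf document.
        yield name, data


def make_zip(members):
    """Build an in-memory ZIP from a {name: bytes} mapping (demonstration helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        for member_name, payload in members.items():
            archive.writestr(member_name, payload)
    return buf.getvalue()


if __name__ == "__main__":
    inner = make_zip({"memo.txt": b"quarterly report"})
    outer = make_zip({"attachments/inner.zip": inner, "readme.txt": b"hello"})
    for path, payload in walk_container("mail.zip", outer):
        print(path, len(payload))
```

In a real pipeline each leaf would then be handed to the filter for its own format instead of being returned as raw bytes.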

The dtSearch document filters can also support browser-compatible images in files, including recursively embedded files. The document filters further include Unicode support covering hundreds of international languages.

Document Filters: Not Just For Documents

With so much data now in email, the dtSearch document filters also support email formats like MS Outlook, Exchange, and Thunderbird. Support extends beyond the email body and metadata to cover multi-layered nested attachments, including recursively embedded images.
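To make the idea of body plus attachments concrete, here is a minimal Python sketch that parses a MIME message with the standard library `email` package and lists each leaf part with its content type and filename. It is not dtSearch code and does not read Outlook or Exchange stores; the function names `build_sample_message` and `list_leaf_parts`, the addresses, and the sample attachment are all made up for the demonstration.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def build_sample_message() -> bytes:
    """Create a small message with one fake PDF attachment (demonstration only)."""
    msg = EmailMessage()
    msg["Subject"] = "Quarterly numbers"
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"%PDF-1.4 demo payload",
        maintype="application",
        subtype="pdf",
        filename="report.pdf",
    )
    return msg.as_bytes()


def list_leaf_parts(raw: bytes):
    """Return (content_type, filename) for every non-container part of a message."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers only group other parts
        parts.append((part.get_content_type(), part.get_filename()))
    return parts


if __name__ == "__main__":
    for content_type, filename in list_leaf_parts(build_sample_message()):
        print(content_type, filename)
```

Each attachment found this way is itself a document, so its bytes would go back through format detection and filtering, which is where the recursion described above comes from.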

The dtSearch Engine APIs can also work with data held in SQL databases. While SQL itself is not a file format, a database can hold BLOB data consisting of embedded documents. The same integrated support for recursively embedded documents, metadata, images, and the like applies to this BLOB data.
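A BLOB column carries no file extension, so the content has to be identified from the bytes themselves. The Python sketch below stores a few fake documents in an in-memory SQLite table and classifies each BLOB by its leading signature bytes. It is an illustrative assumption, not the dtSearch Engine API: the `attachments` table, the function names, and the short labels are hypothetical, and only four widely documented signatures are checked.

```python
import sqlite3

# Leading "magic" bytes for a few common document formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also OOXML (.docx/.xlsx) and OpenDocument packages
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy binary MS Office files
    (b"{\\rtf", "rtf"),
]


def sniff_format(blob: bytes) -> str:
    """Guess a document format from its first bytes, or return 'unknown'."""
    for magic, label in SIGNATURES:
        if blob.startswith(magic):
            return label
    return "unknown"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding a few fake document BLOBs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE attachments (id INTEGER PRIMARY KEY, name TEXT, body BLOB)"
    )
    rows = [
        ("report.pdf", b"%PDF-1.4 demo"),
        ("memo.doc", b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1 demo"),
        ("notes.rtf", b"{\\rtf1 demo}"),
        ("sheet.xlsx", b"PK\x03\x04 demo"),
        ("raw.bin", b"\x00\x01\x02"),
    ]
    conn.executemany("INSERT INTO attachments (name, body) VALUES (?, ?)", rows)
    return conn


def classify_blobs(conn: sqlite3.Connection):
    """Return (name, detected_format) for every stored BLOB, in insertion order."""
    cursor = conn.execute("SELECT name, body FROM attachments ORDER BY id")
    return [(name, sniff_format(bytes(body))) for name, body in cursor]


if __name__ == "__main__":
    for name, detected in classify_blobs(build_demo_db()):
        print(name, detected)
```

Detecting by content rather than by stored name means a mislabeled or unnamed BLOB can still be routed to the right filter.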

Finally, the dtSearch Spider supports static and dynamic Web data (SharePoint, PHP, ASP.NET, CMS, etc.). This data can further consist of (or simply embed) document data such as HTML, PDF, XSL/XML, or even Office files, all of which require the document filters.

Beyond Document Filters: Hit-Highlighted Search

dtSearch enterprise and developer products can index more than a terabyte of data in a single index. A single index can span multiple file directories, emails and attachments, online data, and other databases. The products can create and search any number of indexes.

After indexing, the product line supports highly concurrent, multithreaded searching. Indexed search time is typically less than a second, even across terabytes of data. dtSearch products offer more than 25 search options.
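Indexed search is fast because the expensive work of reading documents happens once, at indexing time. The Python sketch below shows the general idea with a toy inverted index that maps each term to the set of documents containing it. It is a textbook illustration and makes no claim about dtSearch's index format or search options; `tokenize`, `build_index`, and `search_all` are hypothetical names, and the only query type shown is "all words must match".

```python
import re
from collections import defaultdict


def tokenize(text: str):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs: dict):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all(index, query: str) -> set:
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    docs = {
        "a.docx": "Document filters extract text from Word files",
        "b.pdf": "Enterprise search depends on document filters",
        "c.msg": "Meeting notes about the search budget",
    }
    index = build_index(docs)
    print(sorted(search_all(index, "document filters")))
    print(sorted(search_all(index, "search")))
```

A query touches only the posting sets for its own terms, so lookup cost depends on the query rather than on the total volume of indexed text.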

For federated searching, dtSearch products support integrated relevancy ranking across both online and offline repositories. Following a search, the document filters enable hit-highlighting of federated search content.
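Hit-highlighting depends on the filters because the hits have to be located in the extracted text of each document. As a simplified picture, the Python sketch below wraps whole-word, case-insensitive matches in marker strings. It is an assumption-laden illustration, not dtSearch's highlighter: the `highlight` function and its `before`/`after` parameters are hypothetical, it works on plain text only, and it does not map hits back to positions in the original file format.

```python
import re


def highlight(text: str, terms, before: str = "<b>", after: str = "</b>") -> str:
    """Wrap each whole-word, case-insensitive occurrence of a term in markers."""
    if not terms:
        return text
    # Try longer terms first so the alternation prefers the longest candidate.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(rf"\b({alternation})\b", re.IGNORECASE)
    return pattern.sub(lambda match: f"{before}{match.group(0)}{after}", text)


if __name__ == "__main__":
    sentence = "Document filters feed the search index."
    print(highlight(sentence, ["filters", "search"]))
```

The original casing of each hit is preserved because the replacement reuses the matched text rather than the query term.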

In the dtSearch Engine, API filters and objects provide an even wider range of advanced data classification options. SDKs include native 64-bit and 32-bit APIs for C++, Java, and .NET (through current versions).


Tags: text retrieval, search controls

Article source: 慧都控件网
