BYTE Development
Technical overview of the document indexing process in Microsoft SharePoint Portal Server
 


Both SharePoint Portal Server and the Indexing Service allow external content sources to be added to the workspace and crawled. Protocol handlers are software components of the Filter Daemon that implement the protocol for accessing a content source in its native format, which makes the content source available for crawling by the Search service. The figure to the left illustrates the protocol handler architecture and the data flow during the crawl process.

SharePoint Protocol Handler Architecture

Crawls are initiated within the Gatherer process for a SharePoint Portal Server workspace. The Gatherer receives a URL for content that must be crawled. The URL can be the start address for a content source, a link stored from a previous crawl, or a notification from a SharePoint Portal Server workspace. The Gatherer checks the URL against the crawl restrictions set for this workspace.
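The restriction check can be pictured with a small Python model. This is only an illustrative sketch: the `CrawlRestrictions` class, its fields, and the `gather` function are hypothetical names invented here, and the real Gatherer's rule format is not described in this overview.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlRestrictions:
    """Hypothetical per-workspace rules (assumed shape, not the product's format)."""
    allowed_hosts: set = field(default_factory=set)
    excluded_path_prefixes: tuple = ()

    def permits(self, url: str) -> bool:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        # Reject anything outside the hosts this workspace may crawl.
        if host not in self.allowed_hosts:
            return False
        # Reject paths that fall under an excluded prefix.
        return not any(parts.path.startswith(p) for p in self.excluded_path_prefixes)


def gather(urls, restrictions):
    """Return the URLs that pass the restriction check, in order, without duplicates."""
    seen = set()
    accepted = []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        if restrictions.permits(url):
            accepted.append(url)
    return accepted
```

In this model, start addresses, stored links, and notifications would all arrive through the same `urls` sequence, since the text treats them uniformly once they reach the Gatherer.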

When crawling of a content source starts, a crawler (or robot) thread in the Gatherer passes the crawling request to the Filter Daemon. The robot thread allocates a Filter object from a pool. When the Filter object is allocated, it is also associated with a Filter thread object. Each document being filtered corresponds to one Filter thread in the Filter Daemon. The Filter Daemon runs in a separate process from the Gatherer so that it can be terminated if it crashes. The Filter Daemon and the Gatherer communicate using pipes in shared memory.
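The pooling pattern described above (take a Filter object from a pool, bind it to one thread per document, return it when done) can be sketched in Python. This is a simplified model under stated assumptions: `Filter`, `FilterPool`, and `filter_document` are hypothetical names, everything runs inside one process, and the cross-process communication between the Gatherer and the Filter Daemon is not modelled.

```python
import queue
import threading


class Filter:
    """Hypothetical stand-in for a pooled Filter object."""

    def __init__(self, ident):
        self.ident = ident
        self.thread = None  # set while the Filter is allocated to a document

    def run(self, url, results):
        # Placeholder for real filtering work: just record who handled what.
        results.put((self.ident, url))


class FilterPool:
    """Fixed-size pool; each allocation pairs a Filter with one worker thread."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(Filter(i))

    def filter_document(self, url, results):
        flt = self._free.get()  # blocks until a Filter object is free

        def work():
            try:
                flt.run(url, results)
            finally:
                flt.thread = None
                self._free.put(flt)  # return the Filter to the pool

        worker = threading.Thread(target=work)
        flt.thread = worker  # associate the Filter with its thread
        worker.start()
        return worker
```

A bounded pool like this caps the number of documents filtered concurrently, which is one plausible reason to allocate Filter objects from a pool rather than create them per request.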

The Filter thread receives the URL for the content to be filtered, along with the time the content was last crawled. The Filter thread determines and invokes the appropriate protocol handler for the URL item. The protocol handler creates a UrlAccessor object that will control the filtering of this item.
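As a rough illustration of this dispatch step, the following Python sketch selects a handler by URL scheme and has it create an accessor object. Selecting by scheme is an assumption made for the example, and the classes shown (`ProtocolHandler`, `FileHandler`, `HttpHandler`, and this simplified `UrlAccessor`) are hypothetical models, not the actual COM interfaces.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class UrlAccessor:
    """Simplified model of the per-item object that controls filtering."""
    url: str
    last_crawled: Optional[datetime]
    handler_name: str


class ProtocolHandler:
    name = "base"

    def create_accessor(self, url, last_crawled):
        return UrlAccessor(url, last_crawled, self.name)


class FileHandler(ProtocolHandler):
    name = "file"


class HttpHandler(ProtocolHandler):
    name = "http"


# Hypothetical registry: one handler instance per URL scheme.
HANDLERS = {
    "file": FileHandler(),
    "http": HttpHandler(),
    "https": HttpHandler(),
}


def open_item(url, last_crawled=None):
    """Pick the handler for the URL's scheme and let it create the accessor."""
    scheme = urlsplit(url).scheme.lower()
    handler = HANDLERS.get(scheme)
    if handler is None:
        raise LookupError("no protocol handler registered for scheme %r" % scheme)
    return handler.create_accessor(url, last_crawled)
```

Passing the last-crawled time through to the accessor mirrors the text: the handler can use it to decide whether an item has changed since the previous crawl.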

Properties are extracted from documents by filters implemented for specific document types. Some value-type properties are obtained by other means, such as the property-storage interfaces. The implementer of a custom IFilter interface can interpret the contents of a document type in any number of ways; the description here represents "best practices" for an implementation.

The IFilter interface contains several methods that the Indexing Service uses when filtering a document. The following figure graphically represents an example document. The external value-type property DocTitle (obtained using methods of the IPropertySetStorage and IPropertyStorage interfaces) and the internal value-type property Book (obtained as a result of a custom IFilter implementation) describe the document as a whole. The text-type properties Contents and Chapter describe the content of the document. When processing this document, the IFilter implementation identifies and extracts these properties.
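The property breakdown of the example document can be modelled in a few lines of Python. This is a conceptual sketch only, not the COM IFilter interface: the `Chunk` type, the dictionary layout of `doc`, and both functions are hypothetical, and emitting chapter headings only as Chapter (not also as Contents) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Chunk:
    prop: str   # property name, e.g. "Contents"
    kind: str   # "text" or "value"
    data: str


def filter_document(doc: dict) -> Iterator[Chunk]:
    """Yield the properties a custom filter would extract from the document body."""
    # Internal value-type property derived by the filter itself.
    yield Chunk("Book", "value", doc["book"])
    for chapter in doc["chapters"]:
        # Text-type properties describing the content.
        yield Chunk("Chapter", "text", chapter["heading"])
        for paragraph in chapter["paragraphs"]:
            yield Chunk("Contents", "text", paragraph)


def external_properties(doc: dict) -> dict:
    """Properties read from property storage rather than produced by the filter."""
    return {"DocTitle": doc["title"]}
```

Keeping `external_properties` separate from `filter_document` reflects the distinction in the text: DocTitle is read from property storage, while Book, Chapter, and Contents come from the filter's own reading of the document.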

IFilter Doc Example


© 2005 BYTE Development; All Rights Reserved.