Configuration for Indexing

Newly added resources must be indexed before they appear in Content Search results. Indexing runs automatically in the background, so a newly added resource might not be searchable immediately. You can configure the indexing frequency and the initial startup delay in the <indexingConfiguration> element of the $GREG_HOME/repository/conf/registry.xml file, as shown below. The same element lists the indexers, each of which implements the Indexer interface for a given media type. You can register new indexers here as well.

<indexingConfiguration>
        <startingDelayInSeconds>300</startingDelayInSeconds>
        <indexingFrequencyInSeconds>60</indexingFrequencyInSeconds>
        <!-- location storing the time the indexing took place-->
        <lastAccessTimeLocation>/_system/local/repository/components/org.wso2.carbon.registry/indexing/lastaccesstime</lastAccessTimeLocation>
        <indexers>
            <indexer class="org.wso2.carbon.registry.indexing.indexer.MSExcelIndexer" mediaTypeRegEx="application/vnd.ms-excel"/>
            <indexer class="org.wso2.carbon.registry.indexing.indexer.MSPowerpointIndexer" mediaTypeRegEx="application/vnd.ms-powerpoint"/>
            <indexer class="org.wso2.carbon.registry.indexing.indexer.MSWordIndexer" mediaTypeRegEx="application/msword"/>
            <indexer class="org.wso2.carbon.registry.indexing.indexer.PDFIndexer" mediaTypeRegEx="application/pdf"/>
            <indexer class="org.wso2.carbon.registry.indexing.indexer.XMLIndexer" mediaTypeRegEx="application/xml"/>
            <indexer class="org.wso2.carbon.registry.indexing.indexer.XMLIndexer" mediaTypeRegEx="application/(.)+\+xml"/>
            <indexer class="org.wso2.carbon.registry.indexing.indexer.PlainTextIndexer" mediaTypeRegEx="text/(.)+"/>
            <indexer class="org.wso2.carbon.registry.indexing.indexer.PlainTextIndexer" mediaTypeRegEx="application/x-javascript"/>
        </indexers>
        <exclusions>
            <exclusion pathRegEx="/_system/config/repository/dashboards/gadgets/swfobject1-5/.*[.]html"/>
            <exclusion pathRegEx="/_system/local/repository/components/org[.]wso2[.]carbon[.]registry/mount/.*"/>
        </exclusions>
</indexingConfiguration>
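
The following dependency-free Java sketch illustrates how the mediaTypeRegEx and pathRegEx attributes in this configuration could be evaluated. It assumes that each expression is matched against the whole media type or path using java.util.regex, and that the first matching indexer wins. Neither assumption is confirmed by this page. The class IndexerResolutionSketch and its methods are hypothetical and are not part of the G-Reg API. Only a subset of the configured indexers is included.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: this class is hypothetical and not part of the G-Reg API.
class IndexerResolutionSketch {

    // mediaTypeRegEx -> indexer (simple class name), kept in configuration order.
    static final Map<String, String> INDEXERS = new LinkedHashMap<>();
    static {
        INDEXERS.put("application/vnd.ms-excel", "MSExcelIndexer");
        INDEXERS.put("application/pdf", "PDFIndexer");
        INDEXERS.put("application/xml", "XMLIndexer");
        INDEXERS.put("application/(.)+\\+xml", "XMLIndexer");
        INDEXERS.put("text/(.)+", "PlainTextIndexer");
    }

    // pathRegEx values copied from the <exclusions> element.
    static final List<String> EXCLUSIONS = Arrays.asList(
            "/_system/config/repository/dashboards/gadgets/swfobject1-5/.*[.]html",
            "/_system/local/repository/components/org[.]wso2[.]carbon[.]registry/mount/.*");

    // Assumption: the first pattern that matches the entire media type is used.
    static String resolveIndexer(String mediaType) {
        if (mediaType == null) {
            return null;
        }
        for (Map.Entry<String, String> entry : INDEXERS.entrySet()) {
            if (Pattern.matches(entry.getKey(), mediaType)) {
                return entry.getValue();
            }
        }
        return null; // no indexer configured for this media type
    }

    // Assumption: a path is skipped when any exclusion matches it entirely.
    static boolean isExcluded(String path) {
        for (String regex : EXCLUSIONS) {
            if (Pattern.matches(regex, path)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(resolveIndexer("application/wsdl+xml")); // XMLIndexer
        System.out.println(resolveIndexer("text/plain"));           // PlainTextIndexer
        System.out.println(resolveIndexer("image/png"));            // null
        System.out.println(isExcluded(
                "/_system/local/repository/components/org.wso2.carbon.registry/mount/a")); // true
    }
}
```

Under these assumptions, a resource whose media type matches no expression (image/png in the sketch) has no indexer to process its content, and resources under an excluded path are skipped regardless of media type.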

For indexing to work across multiple nodes, the nodes should share a common database for the config and governance sub-collections of the /_system collection. This is the default for JDBC mounts. For other types of mounts, if the database is not shared, the REG_LOG table must be replicated appropriately between the nodes for indexing to work.

Creating a custom indexer

WSO2 G-Reg uses Apache Solr to support the Registry search feature. To support Solr search, all Registry resources saved in the RDBMS are indexed by a periodic task. To index a resource, its content must first be extracted, and the extraction logic varies by resource type. Therefore, G-Reg provides an extension point, the Indexer interface, that lets you write your own custom logic to create the index document for each resource.

Follow the steps below to create a custom indexer.

Create an Apache Maven project, and add the following dependencies and WSO2 Maven repositories to the pom.xml file of your project:

<dependencies>
        <dependency>
            <groupId>org.wso2.carbon</groupId>
            <artifactId>org.wso2.carbon.registry.core</artifactId>
            <version>4.2.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.wso2.carbon</groupId>
            <artifactId>org.wso2.carbon.registry.indexing</artifactId>
            <version>4.2.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>wso2-nexus</id>
            <name>WSO2 internal Repository</name>
            <url>http://maven.wso2.org/nexus/content/groups/wso2-public/</url>
            <releases>
                <enabled>true</enabled>
                <updatePolicy>daily</updatePolicy>
                <checksumPolicy>ignore</checksumPolicy>
            </releases>
        </repository>

        <repository>
            <id>wso2.releases</id>
            <name>WSO2 internal Repository</name>
            <url>http://maven.wso2.org/nexus/content/repositories/releases/</url>
            <releases>
                <enabled>true</enabled>
                <updatePolicy>daily</updatePolicy>
                <checksumPolicy>ignore</checksumPolicy>
            </releases>
        </repository>
    </repositories>

Create a Java class that implements the Indexer interface (i.e., org.wso2.carbon.registry.indexing.indexer.Indexer) and overrides the getIndexedDocument() method. Include the logic for extracting the resource content in this method. For example:

package org.wso2.carbon.registry.indexer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.solr.common.SolrException;
import org.wso2.carbon.registry.core.exceptions.RegistryException;
import org.wso2.carbon.registry.core.utils.RegistryUtils;
import org.wso2.carbon.registry.indexing.AsyncIndexer;
import org.wso2.carbon.registry.indexing.IndexingConstants;
import org.wso2.carbon.registry.indexing.indexer.Indexer;
import org.wso2.carbon.registry.indexing.solr.IndexDocument;

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TextIndexer implements Indexer {
    private static final Log log = LogFactory.getLog(TextIndexer.class);

    @Override
    public IndexDocument getIndexedDocument(AsyncIndexer.File2Index fileData) throws SolrException, RegistryException {

        if (log.isDebugEnabled()) {
            log.debug("Registry Text Indexer is running");
        }

        return getPreProcessedDocument(fileData);
    }

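    // Builds the index document from the raw resource bytes and attaches the searchable fields.
    private IndexDocument getPreProcessedDocument(AsyncIndexer.File2Index fileData) throws RegistryException {
        // Decode the raw bytes of the resource into text.
        String contentAsString = RegistryUtils.decodeBytes(fileData.data);

        IndexDocument indexDocument = new IndexDocument(fileData.path, contentAsString,
                null);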
        Map<String, List<String>> attributes = new HashMap<String, List<String>>();
        if (fileData.mediaType != null) {
            attributes.put(IndexingConstants.FIELD_MEDIA_TYPE, Arrays.asList(fileData.mediaType.toLowerCase()));
        }
        if (fileData.lcState != null) {
            attributes.put(IndexingConstants.FIELD_LC_STATE, Arrays.asList(fileData.lcState.toLowerCase()));
        }
        if (fileData.lcName != null) {
            attributes.put(IndexingConstants.FIELD_LC_NAME, Arrays.asList(fileData.lcName.toLowerCase()));
        }
        if (fileData.path != null) {
            attributes.put("overview_name", Arrays.asList(RegistryUtils.getResourceName(fileData.path).toLowerCase()));
        }
        indexDocument.setFields(attributes);
        return indexDocument;
    }
}
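
To make the field handling in the example easier to follow, here is a dependency-free sketch of the same map-building logic that runs without the WSO2 libraries. The class IndexFieldsSketch, its methods, and the field-name strings ("mediaType", "lcState", "lcName") are placeholders standing in for the IndexingConstants.FIELD_* values, which this page does not list. The resourceName() helper only approximates RegistryUtils.getResourceName() by taking the last path segment.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: names and field keys here are placeholders, not G-Reg API.
class IndexFieldsSketch {

    // Placeholder keys; the real values come from IndexingConstants.
    static final String FIELD_MEDIA_TYPE = "mediaType";
    static final String FIELD_LC_STATE = "lcState";
    static final String FIELD_LC_NAME = "lcName";

    // Approximation: treat the last path segment as the resource name.
    static String resourceName(String path) {
        int idx = path.lastIndexOf('/');
        return idx >= 0 ? path.substring(idx + 1) : path;
    }

    // Mirrors the example: null attributes are skipped, values are lower-cased.
    static Map<String, List<String>> buildFields(String path, String mediaType,
                                                 String lcName, String lcState) {
        Map<String, List<String>> fields = new HashMap<>();
        if (mediaType != null) {
            fields.put(FIELD_MEDIA_TYPE, Arrays.asList(mediaType.toLowerCase()));
        }
        if (lcState != null) {
            fields.put(FIELD_LC_STATE, Arrays.asList(lcState.toLowerCase()));
        }
        if (lcName != null) {
            fields.put(FIELD_LC_NAME, Arrays.asList(lcName.toLowerCase()));
        }
        if (path != null) {
            fields.put("overview_name", Arrays.asList(resourceName(path).toLowerCase()));
        }
        return fields;
    }

    public static void main(String[] args) {
        // A resource with no lifecycle name attached.
        Map<String, List<String>> fields = buildFields(
                "/_system/governance/notes/ReadMe.TXT", "Text/Plain", null, "Development");
        System.out.println(fields.get("overview_name")); // [readme.txt]
        System.out.println(fields.containsKey(FIELD_LC_NAME)); // false
    }
}
```

Because every value is lower-cased before it is stored, searches against these fields are effectively case-insensitive only if the query side is normalized the same way.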

Build the project, and add the built JAR file to the <G-Reg_HOME>/repository/components/dropins/ directory.

Add the following configuration within the <indexers> element in the <G-Reg_HOME>/repository/conf/registry.xml file.

    <indexingConfiguration>
        ............
        <indexers>
            ..........
            <indexer class="org.wso2.carbon.registry.indexer.TextIndexer" mediaTypeRegEx="application/text\+plain" profiles="default,uddi-registry"/>
        </indexers>
        <exclusions>
            <exclusion pathRegEx="/_system/config/repository/dashboards/gadgets/swfobject1-5/.*[.]html"/>
            <exclusion pathRegEx="/_system/local/repository/components/org[.]wso2[.]carbon[.]registry/mount/.*"/>
        </exclusions>
    </indexingConfiguration>

Start the WSO2 G-Reg server. The custom indexer is applied whenever you add or update a resource whose media type matches the mediaTypeRegEx configured for the indexer in the <G-Reg_HOME>/repository/conf/registry.xml file.
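
Before restarting the server, it can help to confirm which media types a mediaTypeRegEx value actually covers. The sketch below is a small standalone check, assuming the expression is applied as a whole-string java.util.regex match (an assumption, not something this page states). MediaTypeRegexCheck and matching() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: a pre-deployment sanity check, not part of G-Reg.
class MediaTypeRegexCheck {

    // Returns the sample media types that the expression matches in full.
    static List<String> matching(String mediaTypeRegEx, List<String> mediaTypes) {
        Pattern pattern = Pattern.compile(mediaTypeRegEx);
        List<String> matched = new ArrayList<>();
        for (String mediaType : mediaTypes) {
            if (pattern.matcher(mediaType).matches()) {
                matched.add(mediaType);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "application/text+plain", "text/plain", "application/textplain");
        // The backslash is doubled only because this is a Java string literal.
        System.out.println(matching("application/text\\+plain", samples));
        // prints [application/text+plain]
    }
}
```

Under that assumption, the expression used above selects application/text+plain only. A resource stored as text/plain would not reach the custom indexer.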