This documentation is for WSO2 Application Server version 5.0.0.

JavaScript Scraping Assistant

The JavaScript Scraping Assistant tool provides a menu-driven user interface to create a scraper file, which can extract the contents of an HTML page as an XML document. This functionality is provided by the following feature in the WSO2 feature repository.

Name: WSO2 Carbon - Javascript Web Scraping Feature
Identifier: org.wso2.carbon.jsscraper.feature.group

The scraping assistant tool is bundled by default in the WSO2 Application Server and WSO2 Mashup Server products. If it is not included in your product distribution, you can add it by installing the above feature, following the instructions in the Feature Management section.

Follow the instructions below to invoke the tool.

1. Log on to the product's Management Console and select "Scraping Assistant" in the "Tools" menu.

2. The "Scraping Assistant" window opens with an empty <config> element.

3. The tool's menus can be used to write the XML configuration according to your requirements. For example,

3.1 Select the "Add HTTP request" menu item.

This inserts a line that retrieves the page available at a given URL:

<http url="url-to-fetch" method="post"/> 

Replace the "url-to-fetch" placeholder with the URL of the page you want to fetch.

3.2 Highlight the line of code you just inserted and then select the "Convert HTML to XML" menu item. This converts the HTML to XML by wrapping the code in the following tags:

<html-to-xml outputtype="pretty"></html-to-xml>
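
After this step, the line you highlighted sits inside the new element:

<html-to-xml outputtype="pretty">
    <http url="url-to-fetch" method="post"/>
</html-to-xml>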

3.3 Highlight the existing code segment and then select the "Convert to Variable" menu item. This wraps the code in the following tags:

<var-def name="variable_1"></var-def>  
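
The configuration fragment now looks like this:

<var-def name="variable_1">
    <html-to-xml outputtype="pretty">
        <http url="url-to-fetch" method="post"/>
    </html-to-xml>
</var-def>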

3.4 Optionally, replace the generated variable name with one that has some semantic value; renaming variable_1 to mashupSite, for example, gives the completed configuration used in the script in the next step.


3.5 This example scraper configuration can be used in a script as shown below:

function getString() {
    // Scraping configuration: fetch the page and convert its HTML to XML,
    // storing the result in a variable named "mashupSite".
    var config = <config>
                    <var-def name="mashupSite">
                        <html-to-xml outputtype="pretty">
                            <http url="http://wso2.org/projects/mashup" method="post"/>
                        </html-to-xml>
                    </var-def>
                 </config>;
    var scraper = new Scraper(config);
    // The scraped content appears as a property named after the var-def variable.
    var result = scraper.mashupSite;
    return result;
}

3.6 The code segment above will fetch all content from the URL. You can now modify your configuration to filter out the information you don't need from this URL, or use logic within your script itself to extract the bits you need.
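
For instance, one way to narrow the result is to wrap the HTML-to-XML conversion in an XPath step so that only the matching nodes are returned. The following is a minimal sketch, assuming WebHarvest's <xpath> processor is available and that the page exposes a <title> element:

function getTitle() {
    var config = <config>
                    <var-def name="pageTitle">
                        <xpath expression="//title">
                            <html-to-xml outputtype="pretty">
                                <http url="http://wso2.org/projects/mashup" method="get"/>
                            </html-to-xml>
                        </xpath>
                    </var-def>
                 </config>;
    var scraper = new Scraper(config);
    // The selected nodes come back as a string (see the caveats below).
    return scraper.pageTitle;
}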

Scraper Host Object

The Scraper object we created before allows data to be extracted from HTML pages and presented in XML format. It provides a bridge to data sources that don't have XML or Web service representations at present. The scraping component wraps WebHarvest: http://web-harvest.sourceforge.net/index.php.

There are a few caveats when using the screen scraping language from within the Scraper object and within E4X, as listed below:

  • The result of the scrape must be saved in a variable. The contents of the variable appear as a property on the Scraper object.
var config = <config>  
                <var-def name='response'>  
                    <html-to-xml>  
                        <http method='get' url='http://ww2.wso2.org/~builder/'/>  
                    </html-to-xml>  
                </var-def>  
             </config>;

 

  • Currently, the result comes back as a string. When the result represents XML, you have to parse it into XML and also ensure that you remove the XML declaration. The XML constructor does not parse documents, but only node lists, and rejects the declaration as an illegal processing instruction:
var scraper = new Scraper(config);  
var result = scraper.response;  
  
// strip off the XML declaration and parse as XML.  
var resultXML = new XML(result.substring(result.indexOf('?>') + 2));  
return resultXML;

 

  • The WebHarvest language <template> instruction allows variables to be referenced using the notation ${variable-name}. The curly brackets conflict with XML literals in E4X, where they trigger evaluation of the enclosed expression. To escape the curly brackets in E4X (so they are passed through to WebHarvest), use the character entity references &#x7B; and &#x7D; for '{' and '}' respectively, as in the sketch below.
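
As a minimal sketch (the variable name userName and the greeting text are purely illustrative), a <template> instruction written inside an E4X literal looks like this:

var config = <config>
                <var-def name="userName">WSO2</var-def>
                <var-def name="greeting">
                    <template>Hello, $&#x7B;userName&#x7D;!</template>
                </var-def>
             </config>;
// E4X resolves &#x7B; and &#x7D; to literal braces rather than evaluating an
// embedded expression, so WebHarvest receives ${userName} and substitutes its value.
var scraper = new Scraper(config);
var result = scraper.greeting;  // "Hello, WSO2!"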