Power Search 3
Ad-hoc website searching and scraping made easy.

Power Search Tutorial: Extract Data with a Wildcard Match Query

A "How-To" guide for extracting website data with Power Search

Introduction

This tutorial describes how to create a Power Search query that will capture the Product Name, Product Quantity, Product Code, and Product Description of all the items on the Power Search demo site: powersearchdemo.inspyder.com

Steps

  1. Open Internet Explorer and navigate to the demo site, http://powersearchdemo.inspyder.com.

  2. Navigate to a particular item webpage.

    step1.jpg
  3. Right click on the page and click "View Source" or "View Page Source". This will open up the underlying HTML of the page.

    step2.jpg
  4. Locate the code that displays the item details. (Look for the product name or other unique text that is found on this page that you wish to capture.)

    step3.jpg
  5. Select the HTML code that surrounds the text you wish to extract and copy it into Notepad. In this example, we've copied everything between the "table" tags.

  6. Query strings can get somewhat messy, so we start with a simple query string that captures only the name of the product.

    From the HTML source above we can see that the product name follows a unique pattern. If structured consistently throughout the website, it can be identified by the CSS class name "pName". Therefore, our query string must be long enough to include that unique identifier.

    Next, we replace the current product name, which changes from one product to another, with a Wildcard Match identifier. A Wildcard Match identifier stands in for the text that changes arbitrarily between an opening and a closing tag.

    In this example we change the text "<td>Animal A</td>" to "<td>#Name#</td>".

    Below is the corresponding query string which will capture the name of all the products:

     

    Notice that we've replaced the line break between the </td> and <td> tags with "*". Query strings can only be one line, so to match line breaks (and to keep the query strings short) we can replace all the line breaks and spaces with the wildcard character (*). The wildcard character matches any text but does not capture it. In our case, any text can exist between the closing </td> and the following opening <td>.

    A more detailed tutorial on Wildcard can be found on this page: Wildcard Tutorial
  7. Now we will expand the query string to capture both the product name and product code.

    The query string for capturing the product code is similar to the one for the product name. However, instead of the CSS class name "pName", the CSS class name "pCode" uniquely identifies the product code.

    We are not interested in any text between the closing </td> tag and the next opening <td> tag, so we add a Wildcard Character (*) there.


    Below is the corresponding query string which will capture the name and code of all the products:

     

  8. We can expand this to include the product name, code, quantity, and description:

     

  9. Open Inspyder Power Search and copy the query string that was created above into the Query text-box.

  10. Setting the Project Settings:

    1. Since we are using Wildcard Matches in our query, we need to select Wildcard Match as the Query Type.
    2. Since we are searching for HTML strings, we need to select the Include HTML check-box.
  11. Setting Query Options:

    1. Since the match result will span more than one line, we need to select the Match may Span Multiple Lines check-box.
    2. Since we are using the Wildcard Character, we need to select the Wildcard (*) Includes HTML check-box.
    step10.jpg
  12. Search Result:

    final.jpg
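
To make the matching rules from steps 6 and 7 concrete, here is a minimal Python sketch that imitates them with regular expressions. It is not how Power Search is implemented; it only illustrates the idea that a Wildcard Match identifier such as #Name# captures text, while the wildcard character (*) matches any text, including line breaks, without capturing it. The helper name wildcard_to_regex, the sample HTML layout, the query string, and the value "A-100" are all assumptions made up for this illustration. Only the class names "pName" and "pCode" and the product name "Animal A" come from the tutorial.

```python
import re


def wildcard_to_regex(query: str) -> re.Pattern:
    """Translate a wildcard-style query into a compiled regex (illustration only)."""
    parts = []
    # Split on #Identifier# placeholders and * wildcards, keeping the delimiters.
    for token in re.split(r"(#[A-Za-z_]\w*#|\*)", query):
        if not token:
            continue
        if token == "*":
            # Wildcard character: match any text, capture nothing.
            parts.append(r".*?")
        elif re.fullmatch(r"#[A-Za-z_]\w*#", token):
            # Wildcard Match identifier: capture the text under its name.
            parts.append(rf"(?P<{token[1:-1]}>.*?)")
        else:
            # Everything else is literal HTML that must match exactly.
            parts.append(re.escape(token))
    # DOTALL lets "." cross line breaks, like "Match may Span Multiple Lines".
    return re.compile("".join(parts), re.DOTALL)


# Hypothetical page fragment; the real demo site markup may differ.
html = """<table>
<tr><td class="pName">Name:</td>
<td>Animal A</td></tr>
<tr><td class="pCode">Code:</td>
<td>A-100</td></tr>
</table>"""

# Hypothetical one-line query: * replaces line breaks and uninteresting text.
query = (
    '<td class="pName">*</td>*<td>#Name#</td>*'
    '<td class="pCode">*</td>*<td>#Code#</td>'
)

match = wildcard_to_regex(query).search(html)
result = match.groupdict() if match else {}
print(result)
```

Under these assumptions the sketch pulls the product name and code out of the fragment in a single pass. The non-greedy patterns keep each wildcard from running past the next literal tag, which is why the query needs enough literal HTML (such as the class name) to anchor each captured value.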