Crawler

SearchStax Site Search includes a Crawler for indexing your website pages starting with a single root node. See Crawler Walkthrough for the complete procedure.

Crawler Limitations

The Crawler feature has the following limitations:

  • You can run only one crawl per day.
  • Each crawl supports up to 10,000 or 100,000 pages, depending on your contract.

Crawler Perspective

For perspective on the Crawler feature (and a video demonstration), see Data Ingestion for Site Search.

What Is the Crawler?

SearchStax Site Search's Crawler is an add-on connector that finds and indexes all pages on your website, making them searchable through a Search App.

The Crawler starts with a root URL and follows page links to all connected pages using the same corporate domain, subject to a configurable crawl-depth limitation.

Each Search App can have multiple crawlers, so you can index multiple websites into a single combined index.

After the initial crawl, scheduled and manual crawler runs refresh the index. The crawler can add newly discovered pages, update existing index entries, and delete entries for pages that are no longer available.

Configure the Crawler

First, Create a Search App!

You can't enable the Crawler until you've created a SearchStax Search App.

You'll find the Crawler under Site Search > App Settings > Data Management > Crawler in the Navigation Menu:

Left navigation menu showing Crawler under Site Search, App Settings, and Data Management.

This link opens the Crawler list, which is initially empty.

Crawler List

Each Search App can have one or more Crawlers, each indexing pages from one or more websites.

How Many Crawlers Can You Have?

Your account may have authorization to create several concurrent crawlers. This limit applies to the account, not to individual Site Search Apps. The progress bar shows how many crawlers your account uses and what your account limit is.

Progress bar shows 7 out of 200 available crawler slots used in the account.

From this list, you can monitor crawler status, create a new crawler, edit an existing crawler, launch an immediate crawl, or delete a crawler.

When you rerun a crawler, Site Search refreshes that crawler's index entries.

Crawler list interface showing a completed DemoDocCrawler with 508 indexed items and options to manage settings and history.

To start a crawl, check the crawler in the list and select the Crawl Now button.

To view a crawler's details, settings, and history, click the crawler in the list.

Settings Tab

Clicking on a crawler in the Crawler List takes you to the Crawler Details screen. Select the Settings tab.

Crawler Settings screen for a crawler named Docs with a validated start URL, crawl depth set to 0 Unlimited, and Crawl Now and Save Changes buttons.

Crawler Name

Each crawler in your SearchStax account must have a unique name. Names can be multi-word, mixed-case, and alphanumeric. Site Search ignores case when checking for duplicate names.

Create new crawler interface showing fields for crawler name, start URL, and crawl depth with URL validation.

Start URL

The crawler needs a starting or "seed" web page to anchor the crawling process. The crawl follows all outgoing links from that page recursively until it runs out of pages with the same DNS domain as the starting page. The crawler won't cross into other domains. If you want to include pages from another domain in the same index, create a second crawler. Your Search App can support more than one crawler, subject to your contract terms.

The Start URL can also point to a sitemap file, such as:

https://example.com/sitemap.xml

or a sitemap-index file:

https://example.com/sitemap_index.xml

When you use a sitemap URL, the crawler uses it for discovery. A URL listed only in a sitemap isn't automatically indexed. Inclusion and exclusion rules still apply.

Exclusion rules are applied without interrupting sitemap-based discovery.
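
As a rough illustration of sitemap-based discovery (not the Crawler's actual implementation), the sketch below reads the <loc> entries from a sitemap document and then applies a rule check before treating any URL as indexable. The sample XML, the discover function, and the is_allowed callback are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemap and sitemap-index files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real one would be fetched from the Start URL.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/login</loc></url>
</urlset>"""


def discover(sitemap_xml, is_allowed):
    """Return (discovered, indexable) URL lists for one sitemap document."""
    root = ET.fromstring(sitemap_xml)
    discovered = [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]
    # Being listed in the sitemap is not enough; the rules still decide.
    indexable = [url for url in discovered if is_allowed(url)]
    return discovered, indexable


# Hypothetical rule: skip any URL containing "/login".
discovered, indexable = discover(SAMPLE_SITEMAP, lambda url: "/login" not in url)
print(discovered)
print(indexable)
```

Both URLs are discovered, but only the first one passes the rule check.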

Crawl Depth

Crawl depth controls how many link levels the crawler follows from the Start URL. A crawl depth of 0 is unlimited.

If you use a sitemap URL as the Start URL, the crawler can discover the pages listed in the sitemap without increasing crawl depth beyond the default.
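
The following minimal sketch shows one way same-domain, depth-limited link following can work. It is an illustration only: the LINKS graph is a hypothetical stand-in for real pages, the host comparison is a simplification of "same DNS domain", and counting pages linked from the Start URL as depth 1 is an assumption.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages and their outgoing links.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/"],
}


def crawl(start_url, max_depth=0):
    """Breadth-first walk that stays on the start URL's host.

    max_depth=0 means unlimited, mirroring the Crawl Depth setting.
    Assumption: pages linked from the Start URL count as depth 1.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth and depth >= max_depth:
            continue  # depth limit reached; do not follow links from this page
        for link in LINKS.get(url, []):
            # Skip other hosts and pages that are already queued or visited.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


print(crawl("https://example.com/"))
print(crawl("https://example.com/", max_depth=1))
```

With unlimited depth the walk reaches all three example.com pages and never visits other.org. With max_depth=1 it stops after the Start URL and the page it links to directly.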

Schedule

When enabled, the crawler repeats its crawl daily at the indicated local time. Subsequent crawls add newly found pages, update existing pages, and can remove pages that are no longer available.

  1. Open the crawler and go to Settings.
  2. In Schedule, enable scheduled crawling.
  3. Set Incremental Crawl as needed.
  4. Save changes and confirm the next run time.

Scheduled crawls run once per day.

Incremental Crawl

When scheduling is enabled, the Incremental Crawl option is available in Schedule settings. 

When Incremental Crawl is enabled, the crawler checks for new or changed content and updates only the affected content during the scheduled run. Incremental Crawl can help reduce unnecessary crawler activity when your site changes regularly but usually doesn't need a full crawl every day.

For new scheduled crawlers, Incremental Crawl is enabled by default. 

With Incremental Crawl enabled, the crawler runs an incremental crawl each day and a full crawl once every 7 days.

Crawler Schedule settings showing the Incremental Crawl option enabled for a scheduled crawler.
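
The daily incremental and weekly full pattern can be pictured with a short sketch. How Site Search actually counts the 7 days isn't documented here, so measuring from the last full crawl, running a full crawl every day when Incremental Crawl is off, and the crawl_type helper itself are all assumptions.

```python
def crawl_type(days_since_last_full, incremental_enabled=True):
    """Pick the type of today's scheduled crawl.

    Assumption: the 7-day interval is measured from the last full crawl.
    """
    if not incremental_enabled:
        return "Full"
    return "Full" if days_since_last_full >= 7 else "Incremental"


# Simulate two weeks of daily scheduled runs, starting with a full crawl.
types = []
since_full = 7
for day in range(14):
    kind = crawl_type(since_full)
    types.append(kind)
    since_full = 1 if kind == "Full" else since_full + 1

print(types)
```

In this simulation, days 0 and 7 run as Full and every other day runs as Incremental.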

Manage Fields for Search Index

The crawler maps information about a web page to search index fields. Although the crawler has a default set of mappings, some customization is common. The Fields table lets you edit and refine your field mappings.

Search index configuration table showing field mappings with columns for field name, element type, page property, field type, and transformations.

For a discussion of the default field mappings, see Crawler Field Map.

You can edit an existing field from the Fields table. To remove a field, use the trash-can icon in the rightmost column.

To add a new field, click the Add Field button. This opens the field editing form. In some field-management areas, Site Search uses Field as the general label for crawler fields.

Add Field window in crawler settings showing Text selected in the Field Type list.

Notes on the field options:

  • Field Type is a dropdown list of search index schema field types: Boolean, Date, Float, Integer, String, Text, and Custom. This affects how the data is indexed and queried. For instance, a string field requires an exact whole-string match, but a text field matches individual words.
  • Site Search modifies the Field Name to show the field type and language. For instance, the field paragraph becomes paragraph-txts-en in the fields list.
  • When you create a new field, you can choose whether to Allow multiple values for this field. Use this when a document stores several values for the same field.
  • Meta Tag Name retrieves the content of a named meta tag in the web page. The default field list includes the description and keywords meta tags.
  • XPath uses an XPath expression to scrape the content of HTML tags in the page. For instance, //p//text() retrieves the text of all paragraph <p> elements.
  • CSS lets you input a CSS selector. The crawler retrieves the content of all HTML elements matching the selector. For instance, [class~="name"] matches any element whose class attribute contains "name" as a separate word in a space-separated list.
  • System Field offers a dropdown list of internal Site Search fields about a web page, such as id, title, url, and document_type. Most are predefined default fields.
  • The Apply Transformations option is available when you define string and text fields. This makes transformers available to normalize irregular field values during ingestion. See Crawler Transformations in the Help Center.
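
To make the Meta Tag Name, XPath, and CSS options more concrete, here is a small sketch that approximates each one with Python's standard library. It is not how the Crawler extracts content: the PAGE fragment is hypothetical and deliberately well-formed, and the has_class_word helper is an illustrative stand-in for the [class~="name"] selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for this example.
PAGE = """<html><head>
<meta name="description" content="Crawler overview"/>
</head><body>
<p class="lead name">First <b>bold</b> paragraph.</p>
<p class="rename">Second paragraph.</p>
</body></html>"""

root = ET.fromstring(PAGE)

# Meta Tag Name: the content attribute of the named meta tag.
description = root.find(".//meta[@name='description']").get("content")

# XPath //p//text(): approximated by joining all text under each <p> element.
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]


def has_class_word(element, word):
    """Mimic [class~="word"]: the word must be a whole entry in the class list."""
    return word in element.get("class", "").split()


named = ["".join(p.itertext()) for p in root.iter("p") if has_class_word(p, "name")]

print(description)
print(paragraphs)
print(named)
```

The second paragraph has the class "rename", which contains the letters "name" but not as a separate word, so only the first paragraph is selected.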

Facet Fields

The "text" field type doesn't work well with facet lists. Try the "string" field type instead.
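
One way to see the difference is to count facet buckets for the same values under each field type. This sketch assumes a text field is split into lowercase words at index time. The actual analysis depends on your search index schema, and the sample values are made up.

```python
from collections import Counter

# Hypothetical values of a "category" field across three documents.
values = ["Release Notes", "Release Notes", "User Guide"]

# String field: each whole value is one facet bucket.
string_facets = Counter(values)

# Text field (assumed tokenized and lowercased): buckets are individual words.
text_facets = Counter(word for value in values for word in value.lower().split())

print(string_facets)
print(text_facets)
```

The string field yields the two buckets a visitor would expect ("Release Notes" and "User Guide"), while the tokenized field yields four single-word buckets.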

Inclusions

The Crawler stays within your DNS domain, but it doesn't limit itself to the tree below the Start URL. Inclusion rules let you confine the crawler to pages whose URL matches a specific pattern. When you use Inclusion rules, the crawler crawls only URLs matching at least one rule.

Crawler Inclusions table filtered by rule pattern doc showing one matching inclusion rule.

Rule Pattern: Enter part or all of a URL (or regex pattern) as the basis of an inclusion rule. Site Search interprets it according to one of these contexts:

  • Beginning with: Includes any page with a URL starting with this string.
  • Contains: Includes any page containing the substring.
  • Ending with: Includes any page where the URL ends with this string.
  • Matching regex: Includes any page where the URL matches the regular expression.

Additional controls:

  • Plus (+) icon: Click to add the inclusion to your list of active inclusions.

Exclusions

After your initial crawl, you may find the Crawler needs a narrower scope. Exclusions are rules that keep the crawler out of specific branches of your domain.

Crawler Exclusions table filtered by rule pattern login showing one matching exclusion rule.

Exclude URLs: Enter part or all of a URL (or regex pattern) as the basis of an exclusion rule. Site Search interprets it according to one of these contexts:

  • Beginning with: Excludes any page with a URL starting with this string.
  • Contains: Excludes any page containing the substring.
  • Ending with: Excludes any page where the URL ends with this string.
  • Matching regex: Excludes any page where the URL matches the regular expression.

Additional controls:

  • Plus (+) icon: Click to add the exclusion to your list of active exclusions.

To delete an exclusion, check the box to the left of the exclusion and click the trash-can icon.

Inclusion/Exclusion Doesn't Work?

Inclusion and exclusion rule patterns are case-sensitive. You might need multiple rules to cover variations in capitalization.

When Inclusion/Exclusion Conflict

If inclusion rules contradict exclusion rules, the exclusion rules win.
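
The rule behavior described in the Inclusions and Exclusions sections can be sketched as a small decision function. This is a simplified model, not Site Search code: the context keys, the should_crawl helper, and the sample rules are hypothetical, and treating "Matching regex" as a match anywhere in the URL is an assumption.

```python
import re

# Each rule is (context, pattern); the contexts mirror the four options above.
MATCHERS = {
    "beginning_with": lambda url, p: url.startswith(p),
    "contains": lambda url, p: p in url,
    "ending_with": lambda url, p: url.endswith(p),
    # Assumption: the regex may match anywhere in the URL (re.search).
    "matching_regex": lambda url, p: re.search(p, url) is not None,
}


def matches_any(url, rules):
    """True if the URL satisfies at least one rule (comparisons are case-sensitive)."""
    return any(MATCHERS[context](url, pattern) for context, pattern in rules)


def should_crawl(url, inclusions, exclusions):
    """Exclusions win; otherwise the URL must match an inclusion if any exist."""
    if matches_any(url, exclusions):
        return False
    # With no inclusion rules, every non-excluded URL is eligible.
    return not inclusions or matches_any(url, inclusions)


# Hypothetical rule sets.
inclusions = [("contains", "/docs/")]
exclusions = [("ending_with", ".pdf"), ("matching_regex", r"/login(\?|$)")]

print(should_crawl("https://example.com/docs/intro", inclusions, exclusions))
print(should_crawl("https://example.com/docs/guide.pdf", inclusions, exclusions))
print(should_crawl("https://example.com/Docs/intro", inclusions, exclusions))
```

The second URL matches an inclusion and an exclusion, so the exclusion wins. The third URL fails because "/Docs/" differs in case from "/docs/".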

Don't forget to click the Save Changes button to save the changes you've made on this screen. After you save crawler changes, Site Search can direct you to the history cleanup path when you need to remove indexed items from selected crawl events.

Use This Crawler in Other Apps

You can configure one crawler as a primary crawler and send its output to additional associated Apps, including Sandbox Apps or other compatible Apps in your account.

  • Primary vs associated Apps: Crawler settings are managed only in the primary App. Associated Apps receive crawl output from that primary crawler.
  • Selection constraints: You can associate Apps only when schema and version compatibility checks pass.
  • Limit checks: You can associate Apps up to the crawler's maximum App count. When you reach that limit, you can't add another App until one is removed.
  • Removing an associated App: If an App is unselected and the crawler settings are saved, results of the next crawl aren't pushed to that App. Data from previous crawls remains in the App.

History Tab

The History tab shows summary statistics of crawler runs.

The History tab now includes a Type column so you can see whether each crawl was Full or Incremental. One-time crawl events and older history entries appear as Full.

  1. Open the crawler and go to History.
  2. Check the Type column for each run.
  3. Confirm expected behavior from Full versus Incremental values.

Crawler History tab showing the Type column with Full and Incremental crawl entries.

To remove indexed items from selected crawl events, select the crawls in History and use the delete action. You can't select runs with a status of Crawling, Queued, Deleted, or Deleting. After you confirm the deletion, the selected runs move to Deleting and then to Deleted when cleanup finishes.
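
The selection rule and status changes above can be summarized in a few lines. This is only a model of the described behavior: the can_select and next_status helpers are hypothetical, and "Completed" is used as an example of a selectable status.

```python
# Statuses that can't be selected for deletion, as listed above.
UNSELECTABLE = {"Crawling", "Queued", "Deleted", "Deleting"}


def can_select(status):
    """True if a crawl run with this status can be selected for deletion."""
    return status not in UNSELECTABLE


def next_status(status):
    """Status after one step: confirming deletion, then cleanup finishing."""
    if status == "Deleting":
        return "Deleted"  # cleanup has finished
    if can_select(status):
        return "Deleting"  # deletion was confirmed
    return status  # nothing to delete for this run


print(can_select("Completed"), can_select("Queued"))
print(next_status("Completed"), next_status("Deleting"))
```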

Not all discovered links crawl successfully, usually because they point to unsupported file types. Compare the Items Indexed and URL Crawled columns to see how successful the crawl was.

File Size Limits

The Crawler enforces these file size limits:

  • HTML files: The Crawler ignores and doesn't download HTML files over 1 MB in size. For files within that limit, it indexes only the first 100 KB of the file's content field.
  • Rich Text Documents: The Crawler ignores and doesn't download RTF files over 1 GB in size. For files within that limit, it indexes only the first 100 KB of the file content.
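
As a worked example of these limits, the sketch below models the two-step check: skip files over the download limit, then keep only the first 100 KB of content. The indexed_content helper is hypothetical, and both the binary units (1 KB = 1024 bytes) and truncation by bytes rather than characters are assumptions.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# Limits as stated above; binary units are an assumption.
DOWNLOAD_LIMITS = {"html": 1 * MB, "rtf": 1 * GB}
INDEXED_CONTENT_LIMIT = 100 * KB


def indexed_content(file_type, size_bytes, content):
    """Return the bytes that would be indexed, or None if the file is skipped."""
    if size_bytes > DOWNLOAD_LIMITS[file_type]:
        return None  # over the size limit: ignored and not downloaded
    return content[:INDEXED_CONTENT_LIMIT]


page = b"x" * (200 * KB)
print(indexed_content("html", 2 * MB, b""))            # skipped
print(len(indexed_content("html", len(page), page)))   # truncated length
```

A 200 KB HTML page is downloaded, but only its first 100 KB (102,400 bytes under the binary-unit assumption) would be indexed.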