Ingesting Data with a Crawler

You'll configure and run the Site Search web crawler to index a public website. You'll start with a simple crawl and then learn where to monitor results.

By the end, you'll be able to:

  • Add start URLs for your crawl
  • Apply basic include and exclude rules
  • Run and monitor your first crawl

Note: Crawlers are only available in certain subscriptions. Daily crawl frequency and maximum pages per crawl depend on subscription limits.

Prerequisites

Before you start, you'll need:

  • A SearchStax Site Search account
  • A Search App created and ready for data ingestion
  • A public website URL that doesn't require authentication
  • A basic understanding of your site's structure (sitemap, sections, or domains)
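If you're unsure whether a URL is reachable without authentication, a quick check outside SearchStax can help. The following is a minimal sketch using only the Python standard library, not part of the product; the helper names `describe_status` and `check_public` are hypothetical. It treats 401 and 403 responses as a sign that the page is not public. Because `urllib` follows redirects, a site that redirects to a login page can still report success, so treat the result as a hint only.

```python
import urllib.error
import urllib.request


def describe_status(status):
    """Map an HTTP status code to a rough public/non-public label."""
    if status in (401, 403):
        return "requires authentication or is forbidden"
    if 200 <= status < 300:
        return "publicly reachable"
    return "unexpected status {}".format(status)


def check_public(url, timeout=10):
    """Send a HEAD request and describe the response status.

    Network errors (DNS failure, timeout) are not caught here and
    will raise urllib.error.URLError.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return describe_status(response.status)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as HTTPError; reuse the same mapping.
        return describe_status(err.code)
```

To try it, call `check_public("https://example.com/")` with your own start URL. Some servers reject HEAD requests, in which case a regular GET in a browser's private window is a simpler check.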

1. Create a Crawler

  1. In SearchStax, open your Search App.
  2. Go to Site Search > App Settings > Data Management > Crawler.
  3. Click Create a Crawler and enter a unique Crawler name.
  4. Enter the Start URL for the page where the crawl should begin. You can also enter a sitemap.xml or sitemap_index.xml URL.
  5. Click Save changes.

Confirm the new crawler appears in the Crawler List.

SearchStax Site Search Crawler list in App Settings.

Tip: For your first run, enter your sitemap.xml as the Start URL to discover pages quickly without deep crawling. Sitemap discovery doesn't automatically index URLs that only appear in the sitemap, and inclusion and exclusion rules still apply.
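Before you point the crawler at a sitemap, it can help to see which URLs the sitemap actually lists. The sketch below is illustrative only and does not reflect how the SearchStax crawler reads sitemaps. It parses a sitemap document that follows the sitemaps.org protocol and reports whether it is a plain sitemap (`urlset`) or a sitemap index (`sitemapindex`). The function name `list_sitemap_locations` and the sample content are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_locations(xml_text):
    """Return (kind, locations) for a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    # The root tag looks like '{namespace}urlset' or '{namespace}sitemapindex'.
    kind = root.tag.rsplit("}", 1)[-1]
    locations = [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return kind, locations


# Hypothetical sitemap content used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/first-post</loc></url>
</urlset>"""

kind, urls = list_sitemap_locations(SAMPLE)
print(kind, len(urls))
```

For a sitemap index, the returned locations are the URLs of child sitemaps rather than pages, so you would repeat the call for each one.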

2. Set Crawl Depth and Schedule

  1. In Settings, find Crawl depth and choose a value that matches your scope. Keep the default, which allows an unlimited crawl, unless you need to limit how far the crawler reaches.
  2. Under Schedule, enable the daily schedule and choose a Target Start Time.
  3. Click Save changes.

SearchStax Site Search Crawler Settings tab showing Crawl depth set to 2 and Schedule enabled for daily runs.
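To build intuition for what a depth limit does, the following sketch walks a small, made-up link graph breadth-first and stops following links at a given depth. It assumes that depth counts link hops from the start URL, which may not match exactly how the SearchStax crawler defines depth. The function name `urls_within_depth` and the `SITE` graph are hypothetical.

```python
from collections import deque


def urls_within_depth(links, start_url, max_depth):
    """Breadth-first walk over a link graph, stopping at max_depth hops.

    links maps each URL to the URLs it links to. Pass max_depth=None
    for an unlimited walk.
    """
    depth_of = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # Do not follow links from pages at the depth limit.
        for target in links.get(url, []):
            if target not in depth_of:
                depth_of[target] = depth_of[url] + 1
                queue.append(target)
    return depth_of


# Hypothetical site structure used only for illustration.
SITE = {
    "https://example.com/": [
        "https://example.com/blog/",
        "https://example.com/about",
    ],
    "https://example.com/blog/": ["https://example.com/blog/post-1"],
    "https://example.com/blog/post-1": [
        "https://example.com/blog/post-1/comments",
    ],
}

reachable = urls_within_depth(SITE, "https://example.com/", 2)
for url, depth in reachable.items():
    print(depth, url)
```

With a limit of 2, the comments page (three hops from the start URL) is never reached, which is the kind of trade-off to weigh when choosing a depth for a large site.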

3. Define Inclusion and Exclusion Rules

  1. In Settings, open Inclusions.
  2. Add a rule that targets a section of your site (for example, Beginning with https://example.com/blog/). Click + to add it.
  3. Open Exclusions and add a rule to omit unwanted paths (for example, Contains /admin/). Click + to add it.
  4. Click Save changes.

SearchStax Site Search Crawler Settings tab showing an include rule for the /blog path.

Note: Inclusion and exclusion URL patterns are case-sensitive. If rules conflict, exclusions take precedence.
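The interaction between the two rule types can be sketched in a few lines. This is an illustration of the behavior described in the note (case-sensitive matching, exclusions win on conflict), not the crawler's actual implementation. The rule kinds `begins_with` and `contains` mirror the Beginning with and Contains options above, the function names are hypothetical, and treating an empty inclusion list as "everything is in scope" is an assumption.

```python
def matches(url, rule):
    """Return True if the URL satisfies a (kind, value) rule, case-sensitively."""
    kind, value = rule
    if kind == "begins_with":
        return url.startswith(value)
    if kind == "contains":
        return value in url
    raise ValueError("unknown rule kind: {}".format(kind))


def should_crawl(url, inclusions, exclusions):
    """Decide whether a URL is in scope; exclusions take precedence."""
    if any(matches(url, rule) for rule in exclusions):
        return False
    if not inclusions:
        return True  # Assumption: with no inclusion rules, everything is in scope.
    return any(matches(url, rule) for rule in inclusions)


inclusions = [("begins_with", "https://example.com/blog/")]
exclusions = [("contains", "/admin/")]

for url in [
    "https://example.com/blog/first-post",
    "https://example.com/blog/admin/settings",
    "https://example.com/Blog/first-post",
]:
    print(url, should_crawl(url, inclusions, exclusions))
```

The second URL matches both rules and is skipped because the exclusion wins. The third is skipped because `/Blog/` differs in case from the `/blog/` pattern.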

4. Run Your First Crawl

  1. Return to the Crawler list.
  2. Select your crawler and click Crawl Now.

After the crawl completes, proceed to Preview to validate the results.

SearchStax Site Search Crawler list.

5. Verify Indexing with Search Preview

  1. In the top navigation bar, click Preview.
  2. Run a broad search, such as *, or search for a known page title.

Confirm documents from your site appear in Preview results. See Previewing your first search in SearchStax for more information.

6. View Crawl History

  1. Open your crawler and go to the History tab.
  2. Review summary statistics for recent runs, including items indexed and URLs crawled.

Confirm the most recent run shows expected counts and status.

SearchStax Site Search Crawler History tab with run summary details.

Note: Because daily crawl frequency and maximum pages per crawl depend on your plan, large sites may require scoped rules or multiple crawlers.

What's Next?

Now that you've run your first crawl, preview your search in SearchStax to confirm that data ingestion was successful.
