Crawl Depth

Crawl depth controls how far the SearchStax Site Search Web Crawler can follow links from the Start URL. It helps you control crawler scope, especially when you want to crawl only pages that are close to the starting point. For broader crawler setup guidance, see Crawler.

Crawl depth is based on links between pages, not the number of slashes in a URL path.

How Crawl Depth Works

The crawler starts at the Start URL and follows links from that page. Crawl depth defines how many link steps, or “clicks,” the crawler can follow away from the Start URL.

  • Depth 0: Unlimited. The crawler keeps following links as far as it can, subject to crawler limits and your crawler settings.
  • Depth 1: The crawler can crawl the Start URL and pages linked directly from the Start URL.
  • Depth 2: The crawler can also crawl pages linked from those directly linked pages.
  • Depth 3 and higher: The crawler can follow links up to that many steps from the Start URL.

You can set crawl depth from 0 through 10.
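
The depth rules above can be illustrated with a small breadth-first traversal. This is only a sketch of the general idea, not SearchStax's implementation: the link graph, the page paths, and the `crawl` function are hypothetical, and a depth of 0 is treated as unlimited to match the list above.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "/": ["/products", "/about"],
    "/products": ["/products/widget"],
    "/about": [],
    "/products/widget": ["/products/widget/specs"],
    "/products/widget/specs": [],
}


def crawl(start, max_depth, links=LINKS):
    """Return {page: clicks from start}, honoring max_depth (0 = unlimited)."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Do not follow links from pages already at the depth limit.
        if max_depth != 0 and depths[page] >= max_depth:
            continue
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


for depth in (1, 2, 0):
    print(depth, sorted(crawl("/", depth)))
```

In this toy graph, depth 1 reaches only the Start URL and the two pages it links to, depth 2 adds `/products/widget`, and depth 0 reaches all five pages.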

Crawl Depth Isn’t URL Path Depth

Crawl depth doesn’t count the number of folders, path segments, or slashes in a URL.

For example, this URL may look deep because it has several path segments:

https://www.example.com/level-one/level-two/level-three

However, if that page is linked directly from the Start URL, it is only one click away. With crawl depth set to 1, the crawler can discover it.

The reverse can also be true. A short URL may require several clicks to reach:

https://www.example.com/news

If the crawler reaches that page only by following a link from the Start URL to a second page, then to a third page, and then to the News page, the News page is three clicks away. The crawler reaches it only when crawl depth is 3 or higher, or 0 (unlimited), even though the URL looks short.
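
To make the contrast concrete, the sketch below compares the number of path segments in a URL with its click distance from the Start URL. The link graph, the `/about` and `/press` pages, and both helper functions are hypothetical; only the two example URLs come from the text above.

```python
from collections import deque
from urllib.parse import urlparse

START = "https://www.example.com/"
DEEP_PATH = "https://www.example.com/level-one/level-two/level-three"
NEWS = "https://www.example.com/news"

# Hypothetical link graph: which pages each page links to.
LINKS = {
    START: [DEEP_PATH, "https://www.example.com/about"],
    "https://www.example.com/about": ["https://www.example.com/press"],
    "https://www.example.com/press": [NEWS],
}


def path_segments(url):
    """Count the non-empty path segments in a URL."""
    return len([s for s in urlparse(url).path.split("/") if s])


def click_depth(start, target, links):
    """Return the fewest link steps from start to target, or None."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == target:
            return depths[page]
        for nxt in links.get(page, []):
            if nxt not in depths:
                depths[nxt] = depths[page] + 1
                queue.append(nxt)
    return None


for url in (DEEP_PATH, NEWS):
    print(url, path_segments(url), click_depth(START, url, LINKS))
```

In this graph the long URL has three path segments but is one click away, while the short News URL has one path segment but is three clicks away.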

Default Crawl Depth Values

SearchStax sets the default crawl depth based on the type of Start URL.

Start URL Type | Default Crawl Depth | What It Means
HTML page | 0 | The crawler can follow links without a depth limit, subject to crawler limits and your crawler settings.
sitemap.xml | 1 | The crawler can discover and crawl the URLs listed in the sitemap.
Sitemap index | 2 | The crawler can read the sitemap index, then read the individual sitemap files it lists.

Why Sitemap Defaults Are Different

A sitemap is usually a curated list of URLs that should be discoverable by search engines and crawlers. Because the sitemap already lists the intended pages, the crawler usually doesn’t need to continue following links from every page listed in the sitemap.

When you use a sitemap URL, the crawler uses the sitemap for discovery. The crawler discovers and crawls the URLs listed in the sitemap, but a URL listed in a sitemap isn’t automatically indexed. Crawler limits, inclusion rules, exclusion rules, page availability, file type support, and indexing rules still apply.

For a sitemap.xml Start URL, the default depth is 1. The crawler can read the sitemap and crawl the URLs listed in it. At this default depth, the crawler uses the sitemap entries for discovery and doesn’t continue following additional links found on those listed pages.

For a sitemap index, the default depth is 2. The crawler needs one step to read the sitemap index and another step to read the individual sitemap files listed in that index.

For an HTML page, the default depth is 0, or unlimited. HTML pages don’t provide a complete structured list of content. The crawler discovers content by following links.
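
The defaults described in this section can be summarized as a small lookup. This is an illustrative sketch only: the type keys (`html`, `sitemap`, `sitemap_index`) and the `resolve_crawl_depth` helper are hypothetical names, not part of any SearchStax API. The 0 through 10 range check reflects the allowed values stated earlier.

```python
# Default crawl depth per Start URL type (keys are hypothetical labels).
DEFAULT_CRAWL_DEPTH = {
    "html": 0,           # unlimited: discover content by following links
    "sitemap": 1,        # crawl the URLs listed in the sitemap
    "sitemap_index": 2,  # read the index, then the sitemaps it lists
}


def resolve_crawl_depth(start_url_type, override=None):
    """Return the override if given and valid, else the type's default."""
    if override is None:
        return DEFAULT_CRAWL_DEPTH[start_url_type]
    if not 0 <= override <= 10:
        raise ValueError("crawl depth must be from 0 through 10")
    return override


print(resolve_crawl_depth("html"))
print(resolve_crawl_depth("sitemap_index"))
print(resolve_crawl_depth("sitemap", override=3))
```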

When to Lower Crawl Depth

Lower crawl depth when you want to keep the crawl close to the Start URL or prevent the crawler from following large sections of linked content.

Lower crawl depth can help when:

  • You only want to crawl landing pages or top-level sections.
  • Your site has many automatically generated pages.
  • Your site links to large archives, calendars, search result pages, or filtered listing pages.
  • You want a smaller, more predictable crawl before expanding crawler scope.

After lowering crawl depth, review crawler results to confirm that the crawler still reaches the pages you want indexed.

When to Use Inclusion and Exclusion Rules

Crawl depth is useful for controlling how far the crawler can travel from the Start URL, but it isn’t the best tool for every use case.

Use inclusion and exclusion rules when you need to allow or block specific URL patterns, sections, or page types. For example, you might use exclusion rules to prevent the crawler from crawling URLs that contain /search, /calendar, or /print. For more details, see Crawler Exclusions.

Use crawl depth and crawler rules together when needed:

  • Use crawl depth to control how many link steps the crawler can follow.
  • Use inclusion rules to limit the crawler to specific allowed URL patterns.
  • Use exclusion rules to block known unwanted URL patterns.

Be careful with broad exclusion rules. A rule that looks simple can block more pages than expected, especially if the same string appears in many URLs.
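
The sketch below shows how a broad rule can over-match. It assumes a rule excludes any URL that contains the rule's string; that matching behavior, the sample URLs, and the `excluded` helper are assumptions for illustration, so check Crawler Exclusions for the rule syntax your crawler actually supports.

```python
# Hypothetical sample of discovered URLs.
URLS = [
    "https://www.example.com/search?q=widgets",
    "https://www.example.com/research/papers",
    "https://www.example.com/calendar/2024",
    "https://www.example.com/products",
]


def excluded(url, patterns):
    """Assumed behavior: a URL is excluded if it contains any pattern."""
    return any(pattern in url for pattern in patterns)


# "search" also appears inside "research", so the broad rule over-matches.
blocked_broad = [u for u in URLS if excluded(u, ["search"])]
# "/search" matches only the path segment that starts with "search".
blocked_narrow = [u for u in URLS if excluded(u, ["/search"])]

print(blocked_broad)
print(blocked_narrow)
```

Under this assumption, the rule `search` blocks both the search results page and the `/research/papers` page, while `/search` blocks only the search results page.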

Best Practices

  • Use a sitemap Start URL when you have a reliable sitemap that lists the pages you want indexed.
  • Use an HTML Start URL when you want the crawler to discover pages by following links from a page.
  • Choose a limited crawl depth when you want to test or constrain crawler scope.
  • Use inclusion and exclusion rules for specific URL patterns instead of relying only on crawl depth.
  • Review crawler history and indexed results after changing crawl depth or crawler rules.