The Crawler add-on for the SearchStax Site Search solution explores a website beginning at a start URL. From there it follows the links embedded in each page rather than walking the site's hierarchical structure.
Crawler Constraints
Three limits constrain the crawl:
- The Crawler won't travel outside the DNS domain specified in the start URL. For example, if the start URL is "https://my.company.com/bios/", the Crawler confines itself to pages within "my.company.com".
- You can set the "crawl depth" of the run. The Crawler confines itself to pages that are no more than N links away from the start URL.
- The Crawler has configurable Exclusions. These rules skip any page whose URL contains a specified substring. For example, a rule can exclude every page that has the string "/internal/" in its URL.
Exclusion rules also apply to URLs discovered through a sitemap, but they don't interrupt sitemap-based discovery itself.
Note: Exclusion rules are case-sensitive, so "/internal/" won't exclude "/Internal/".
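The following is a minimal sketch of how the domain and exclusion constraints could be reasoned about, not the Crawler's actual implementation. The helpers url_host, is_excluded, and in_scope are hypothetical names, and the sketch assumes that "same domain" means an exact host match. Crawl depth is not modeled, because it depends on the link path taken rather than on the URL alone.

```shell
# Hypothetical helpers that mimic two of the constraints; not SearchStax code.

# Print the host part of a URL (scheme and path stripped; any port stays attached).
url_host() {
  rest="${1#*://}"
  printf '%s\n' "${rest%%/*}"
}

# Succeed when the URL ($1) contains the rule ($2) as a case-sensitive substring.
is_excluded() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeed when the URL is on the start URL's host and matches no exclusion rule.
in_scope() {
  [ "$(url_host "$1")" = "$(url_host "$START_URL")" ] || return 1
  if is_excluded "$1" "$EXCLUSION"; then
    return 1
  fi
  return 0
}

START_URL="https://my.company.com/bios/"
EXCLUSION="/internal/"

for url in \
  "https://my.company.com/bios/jane/" \
  "https://my.company.com/internal/payroll/" \
  "https://my.company.com/Internal/payroll/" \
  "https://other.company.com/bios/"; do
  if in_scope "$url"; then
    echo "crawl: $url"
  else
    echo "skip:  $url"
  fi
done
```

In this sketch the "/Internal/" URL is still crawled, because the substring comparison is case-sensitive, while the "other.company.com" URL is skipped for being on a different host.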
Viewing Crawled URLs
Exclusions are easy to configure, but it isn't immediately obvious which branches of the site the Crawler has included. Here's one way to view the URLs of crawled pages. If your Site Search App uses security tokens:
curl -H "Authorization: Token <read-only token>" "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select?q=url:*&wt=json&indent=true&fl=url&rows=10&start=1"
If your Site Search App uses Basic Auth credentials:
curl -u "<read-only user>:<read-only password>" "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select?q=url:*&wt=json&indent=true&fl=url&rows=10&start=1"
Run the /select query that matches your authentication method in a Linux terminal window to get a list of URLs from the Site Search index, similar to this:
"response":{"numFound":368,"start":1,"numFoundExact":true,"docs":[
{
"url":"https://www.searchstax.com/docs/"},
{
"url":"https://www.searchstax.com/docs/searchstax-cloud-filing-a-support-request/"},
{
"url":"https://www.searchstax.com/docs/integration-overview/"},
{
"url":"https://www.searchstax.com/docs/searchstax-cloud-docs-home/"},
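Adjust the &rows parameter (the number of results per request) and the &start parameter (the zero-based offset of the first result returned) to view different portions of the list. Because the offset is zero-based, start=1 skips the first document in the index; use start=0 to begin at the top.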
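To walk the whole list rather than one slice, the two parameters can be stepped together. The sketch below assumes standard Solr-style paging on the /select endpoint. The helpers page_url and extract_urls are hypothetical, the endpoint is the placeholder used in the examples above, and the page size of 100 is an arbitrary choice. With 368 documents, as in the sample response, four requests (start=0, 100, 200, 300) would cover the index.

```shell
# Placeholder endpoint copied from the examples above; substitute your own.
BASE="https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select"
ROWS=100   # assumed page size, chosen only for illustration

# Hypothetical helper: print the /select URL for a zero-based page number.
page_url() {
  start=$(( $1 * ROWS ))
  printf '%s?q=url:*&wt=json&indent=true&fl=url&rows=%s&start=%s\n' \
    "$BASE" "$ROWS" "$start"
}

# Hypothetical helper: print one URL per line from a response on stdin.
# Assumes each "url" field sits on its own line, as in the sample output.
extract_urls() {
  sed -n 's/.*"url":"\([^"]*\)".*/\1/p'
}

# Show the requests for the first three pages.
for page in 0 1 2; do
  page_url "$page"
done

# Against a live index, each page could be fetched and reduced like this:
#   curl -s -H "Authorization: Token <read-only token>" "$(page_url 0)" | extract_urls
```

The sed expression is a convenience for quick inspection only. If jq is installed, piping the response through jq -r '.response.docs[].url' is a more robust way to pull out the same field.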