Crawler behavior

Understand how same-domain crawler discovery becomes accepted scan seeds and why outside-scope, excluded or over-limit URLs are skipped.

Audience: customers setting scope and technical reviewers.

Feature availability

Product, package, provider and deployment boundaries for this page.

Available from: Current documentation
Deployment modes: cloud

Product screenshots

Current customer-safe screenshots are generated from the application so examples do not drift from the product.

Screenshot: the WebRiskOps scan scope selection and scope acceptance controls.

Before crawler discovery

Crawler behavior is decided before a scan run starts. Discovery can find many same-domain URLs, but the live audit should receive only URLs inside the customer-approved public boundary. Use this page once [Manual URLs and path rules](/docs/projects/manual-urls-and-path-rules) and [Accepted scan scope](/docs/projects/accepted-scan-scope) are clear. Until the customer has accepted scope, the crawler should not treat any discovered URL as approved scan input.

Review crawler inputs

Follow the path `Projects → Project detail → Scan scope selection → Scope accepted → /scans/{scanRun}`.

  1. Open /projects and choose the project with discovered URL groups. Result: Scan scope selection shows candidate groups, selected pages, manual URLs and plan limits before they become crawler seeds.
  2. Confirm selected groups are public and same-domain. Result: the crawler receives only URLs that match the accepted project domain and customer authorization.
  3. Review Manual URLs, Include paths and Exclude paths. Result: explicit seeds can be added while private, admin, logout or unrelated paths stay out of scope.
  4. Check Page budget and Crawl depth. Result: the scanner knows when to stop following links even if more public pages are discovered.
  5. Start the live audit only after Scope accepted. Result: crawler seeds come from the accepted scan scope rather than every URL discovery found.
  6. Open /scans/{scanRun} and review skipped URLs. Result: each skipped URL shows a reason, such as outside scope, excluded path, duplicate URL, crawl depth or plan limit.
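
The steps above can be sketched as a small seed-building function. This is an illustration only, not the product's implementation: `ScanScope`, `build_seeds` and every field name are hypothetical, and exclude paths are treated as simple path prefixes, which is a simplifying assumption.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class ScanScope:
    """Hypothetical stand-in for an accepted scan scope (not the product schema)."""
    domain: str
    accepted: bool = False
    selected_urls: list[str] = field(default_factory=list)
    manual_urls: list[str] = field(default_factory=list)
    exclude_paths: list[str] = field(default_factory=list)


def build_seeds(scope: ScanScope) -> list[str]:
    """Return crawler seeds, or nothing if the customer has not accepted scope."""
    if not scope.accepted:
        return []
    seeds: list[str] = []
    for url in scope.selected_urls + scope.manual_urls:
        parts = urlsplit(url)
        if parts.hostname != scope.domain:
            continue  # not on the accepted project domain
        path = parts.path or "/"
        if any(path.startswith(prefix) for prefix in scope.exclude_paths):
            continue  # removed by an exclude path rule
        if url not in seeds:
            seeds.append(url)  # drop duplicates, keep first-seen order
    return seeds
```

In this sketch, discovered URLs that were never selected or manually added do not reach `build_seeds` at all, and an unaccepted scope always yields an empty seed list.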

How crawler limits are applied

The crawler follows links only inside the accepted boundary and plan limits.

  • Same-domain public pages can be followed when they match accepted URL groups, manual URLs or include paths.
  • External domains, private paths, admin areas, logout paths and excluded paths should stay out of crawler input.
  • Page budget limits how many pages can be scanned for the selected plan.
  • Crawl depth limits how far the worker can follow links away from an accepted seed.
  • Rate limit and timeout settings protect the customer site and keep evidence collection predictable.
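
One way to picture how page budget and crawl depth interact is a breadth-first walk over links. The sketch below is an assumption-laden illustration, not the worker's actual algorithm: `LINKS` is a hypothetical in-memory link graph standing in for fetched pages, `in_scope` and `crawl` are invented names, path rules are treated as prefixes, and rate limit and timeout handling are left out.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: page URL -> links found on that page.
LINKS = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/admin",
        "https://other.example.net/",
    ],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}


def in_scope(url, domain, include_paths, exclude_paths):
    """Same-domain, matched by an include path, and not matched by an exclude path."""
    parts = urlsplit(url)
    if parts.hostname != domain:
        return False
    path = parts.path or "/"
    if any(path.startswith(p) for p in exclude_paths):
        return False
    return any(path.startswith(p) for p in include_paths)


def crawl(seeds, domain, include_paths, exclude_paths, page_budget, crawl_depth):
    """Visit pages breadth-first until the page budget or crawl depth stops the walk."""
    visited = []
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue and len(visited) < page_budget:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= crawl_depth:
            continue  # do not follow links past the depth limit
        for link in LINKS.get(url, []):
            if link not in seen and in_scope(link, domain, include_paths, exclude_paths):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With a crawl depth of 2, the walk from `https://example.com/` reaches `/a` and `/b` but never `/c`; lowering the page budget to 2 stops it after `/a` even though more in-scope pages exist.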

Skipped crawler states

Skipped URLs are not silent failures. Each one should show why the crawler did not scan it.

  • Outside accepted scope means the URL was not selected, manually added or included by path rules.
  • Excluded path means a configured rule intentionally removed that URL from crawler input.
  • Crawl limit reached means the worker stopped after page budget or crawl depth was exhausted.
  • Redirect outside scope means the final URL left the accepted customer domain or selected path boundary.
  • Unsupported scope means the target is private, login-only, unsafe or unrelated and should not be retried as a crawler issue.
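
As a rough sketch, the first four states above could be assigned by a single function. Everything here is hypothetical: the `SkipReason` and `Limits` names, the order of the checks and the prefix-based path matching are assumptions rather than documented product behavior. Unsupported scope is not modeled because it depends on a judgment about the target (private, login-only, unsafe or unrelated) instead of a URL rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class SkipReason(Enum):
    OUTSIDE_ACCEPTED_SCOPE = "Outside accepted scope"
    EXCLUDED_PATH = "Excluded path"
    CRAWL_LIMIT_REACHED = "Crawl limit reached"
    REDIRECT_OUTSIDE_SCOPE = "Redirect outside scope"


@dataclass(frozen=True)
class Limits:
    """Hypothetical container for the accepted boundary and plan limits."""
    domain: str
    include_paths: tuple[str, ...]
    exclude_paths: tuple[str, ...]
    page_budget: int
    crawl_depth: int


def _path_matches(url: str, domain: str, prefixes: tuple[str, ...]) -> bool:
    """True when the URL is on the domain and its path starts with any prefix."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return parts.hostname == domain and any(path.startswith(p) for p in prefixes)


def skip_reason(url, final_url, depth, pages_scanned, limits) -> Optional[SkipReason]:
    """Return why a URL is skipped, or None if it can be scanned."""
    if _path_matches(url, limits.domain, limits.exclude_paths):
        return SkipReason.EXCLUDED_PATH
    if not _path_matches(url, limits.domain, limits.include_paths):
        return SkipReason.OUTSIDE_ACCEPTED_SCOPE
    if pages_scanned >= limits.page_budget or depth > limits.crawl_depth:
        return SkipReason.CRAWL_LIMIT_REACHED
    # The requested URL was allowed; check where the redirect chain ended.
    if (not _path_matches(final_url, limits.domain, limits.include_paths)
            or _path_matches(final_url, limits.domain, limits.exclude_paths)):
        return SkipReason.REDIRECT_OUTSIDE_SCOPE
    return None
```

Checking the exclude rule first is a design choice in this sketch: it lets an intentionally excluded same-domain URL report the more specific reason instead of the generic outside-scope one.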

Continue to browser rendering

When crawler seeds are correct but a page still fails or looks different in evidence, continue to [Browser rendering](/docs/projects/browser-rendering). That page explains what the scanner tries to observe after the crawler has chosen an allowed URL. If the scan run skips many URLs, use [Failure and skipped-page meanings](/docs/projects/failure-and-skipped-page-meanings) before changing scope or retrying the live audit.
