Crawler behavior

Understand how same-domain discovery becomes scan input and why outside-boundary, excluded or over-limit URLs are skipped.

Audience: customers reviewing project pages and technical reviewers

Feature availability

Product, package, provider and deployment boundaries for this page.

Available from: Current documentation
Deployment modes: cloud

Product screenshots

Current customer-safe screenshots are generated from the application so examples do not drift from the product.

Screenshot: WebRiskOps Project Pages view with discovered public pages, scan groups and discovery-ready state (Project Pages review step).

Before crawler discovery

Crawler behavior is shaped before a scan run starts. Discovery can find many same-domain URLs, but the scan should receive only URLs inside the reviewed public project boundary. Read this page once [Project page boundaries](/docs/projects/manual-urls-and-path-rules) and [Project Pages review](/docs/projects/accepted-scan-scope) are clear. If Project Pages are empty or unsupported, the crawler should not treat discovered URLs as approved scan input.

Review crawler inputs

Follow the path `Projects → Project detail → Project Pages → Actions → /scans/{scanRun}`.

Open /projects and choose the project with discovered URL groups. Result: Project Pages shows candidate public pages and plan limits before they become crawler seeds.
Confirm selected groups are public and same-domain. Result: the crawler receives only URLs that match the project domain and customer authorization.
Review include and exclude path rules when they are visible. Result: private, admin, logout or unrelated paths stay out of scan input.
Check Page budget and Crawl depth. Result: the scanner knows when to stop following links even if more public pages are discovered.
Start the scan only after Project Pages are ready. Result: crawler seeds come from the project boundary rather than every URL discovery found.
Open /scans/{scanRun} and review skipped URLs. Result: skipped reasons explain outside scope, excluded path, duplicate URL, crawl depth or plan limit decisions.
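
The review steps above can be pictured as a filter that turns discovered URLs into crawler seeds. The following is a minimal sketch, not the product's implementation: the function name `select_seeds`, the prefix-based path rules, and the duplicate key (host, path and query, ignoring fragments) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def select_seeds(discovered, project_host, include_prefixes, exclude_prefixes):
    """Split discovered URLs into crawler seeds and skipped URLs.

    Hypothetical helper: path rules are modeled as plain path prefixes,
    which is a simplification (for example, "/admin" also matches
    "/administrator").
    Returns (seeds, skipped), where skipped is a list of (url, reason).
    """
    seeds, skipped, seen = [], [], set()
    for url in discovered:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        path = parts.path or "/"
        if host != project_host.lower():
            # Not the project domain, so it never becomes scan input.
            skipped.append((url, "outside scope"))
        elif any(path.startswith(p) for p in exclude_prefixes):
            # Exclude rules win over include rules in this sketch.
            skipped.append((url, "excluded path"))
        elif not any(path.startswith(p) for p in include_prefixes):
            skipped.append((url, "outside scope"))
        elif (host, path, parts.query) in seen:
            # Assumed duplicate rule: same host, path and query string.
            skipped.append((url, "duplicate URL"))
        else:
            seen.add((host, path, parts.query))
            seeds.append(url)
    return seeds, skipped


if __name__ == "__main__":
    seeds, skipped = select_seeds(
        ["https://example.com/", "https://example.com/logout"],
        project_host="example.com",
        include_prefixes=["/"],
        exclude_prefixes=["/admin", "/logout"],
    )
    print(seeds)
    print(skipped)
```

Keeping the reason next to each skipped URL mirrors what the scan run page is described as showing: every URL that did not become a seed has an explanation attached.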

How crawler limits are applied

The crawler follows links only inside the project boundary and plan limits.

Same-domain public pages can be followed when they match project page groups or include paths.
External domains, private paths, admin areas, logout paths and excluded paths should stay out of crawler input.
Page budget limits how many pages can be scanned for the selected plan.
Crawl depth limits how far the worker can follow links away from an accepted seed.
Rate limit and timeout settings protect the customer site and keep evidence collection predictable.
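
To make the interaction between Page budget and Crawl depth concrete, here is a small sketch of a breadth-first walk over an in-memory link graph. It is an illustration under assumptions, not the worker's actual algorithm: `crawl`, the `links` dictionary and the `allowed` callback are hypothetical names, and rate limits and timeouts are left out.

```python
from collections import deque


def crawl(seeds, links, page_budget, max_depth, allowed):
    """Breadth-first walk that stops at a page budget and a crawl depth.

    links:   dict mapping a URL to the URLs it links to (stands in for fetching).
    allowed: callable(url) -> bool, the project boundary check (assumed).
    Returns (visited, skipped), where skipped is a list of (url, reason).
    """
    visited, skipped = [], []
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)  # seeds sit at depth 0
    while queue:
        url, depth = queue.popleft()
        if len(visited) >= page_budget:
            # Budget exhausted: remaining queued pages are reported, not dropped.
            skipped.append((url, "crawl limit reached"))
            continue
        visited.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue  # each URL is evaluated once
            seen.add(target)
            if not allowed(target):
                skipped.append((target, "outside project boundary"))
            elif depth + 1 > max_depth:
                # Too many link hops away from an accepted seed.
                skipped.append((target, "crawl limit reached"))
            else:
                queue.append((target, depth + 1))
    return visited, skipped


if __name__ == "__main__":
    graph = {"https://example.com/": ["https://example.com/a"]}
    print(crawl(["https://example.com/"], graph, page_budget=5, max_depth=1,
                allowed=lambda u: u.startswith("https://example.com/")))
```

Because the walk is breadth-first, a URL is first seen at its shallowest distance from a seed, so the depth check is not affected by the order in which links appear on a page.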

Skipped crawler states

Skipped URLs are not silent failures. Each one should show the reason the crawler did not scan it.

Outside project boundary means the URL was not part of Project Pages or included by path rules.
Excluded path means a configured rule intentionally removed that URL from crawler input.
Crawl limit reached means the worker stopped after page budget or crawl depth was exhausted.
Redirect outside boundary means the final URL left the customer domain or selected path boundary.
Unsupported scope means the target is private, login-only, unsafe or unrelated and should not be retried as a crawler issue.
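
The skipped states listed above can be modeled as a fixed set of reasons. The sketch below is illustrative only: the `SkipReason` enum, `in_boundary` and `check_redirect` are hypothetical names, and the boundary is simplified to one host plus one path prefix. It shows how a redirect can be classified by comparing the requested URL with the final URL.

```python
from enum import Enum
from urllib.parse import urlsplit


class SkipReason(Enum):
    # Labels taken from the skipped states described on this page.
    OUTSIDE_PROJECT_BOUNDARY = "Outside project boundary"
    EXCLUDED_PATH = "Excluded path"
    CRAWL_LIMIT_REACHED = "Crawl limit reached"
    REDIRECT_OUTSIDE_BOUNDARY = "Redirect outside boundary"
    UNSUPPORTED_SCOPE = "Unsupported scope"


def in_boundary(url, project_host, path_prefix="/"):
    """Assumed boundary check: same host and a path under the given prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return host == project_host.lower() and path.startswith(path_prefix)


def check_redirect(requested_url, final_url, project_host, path_prefix="/"):
    """Return a SkipReason, or None when both URLs stay inside the boundary."""
    if not in_boundary(requested_url, project_host, path_prefix):
        return SkipReason.OUTSIDE_PROJECT_BOUNDARY
    if not in_boundary(final_url, project_host, path_prefix):
        # The request started inside the boundary but landed outside it.
        return SkipReason.REDIRECT_OUTSIDE_BOUNDARY
    return None


if __name__ == "__main__":
    reason = check_redirect(
        "https://example.com/docs/start",
        "https://login.example.net/sso",
        project_host="example.com",
        path_prefix="/docs",
    )
    print(reason.value if reason else "inside boundary")
```

Using a closed set of reasons keeps every skipped URL attributable to one named state, which makes it easier to tell boundary decisions apart from limits before retrying a scan.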

Continue to browser rendering

When crawler seeds are correct but a page still fails or looks different in evidence, continue to [Browser rendering](/docs/projects/browser-rendering). That page explains what the scanner tries to observe after the crawler has chosen an allowed URL. If the scan run skips many URLs, use [Failure and skipped-page meanings](/docs/projects/failure-and-skipped-page-meanings) before changing Project Pages or retrying the scan.
