Browsertrix Host
Hosted by Webrecorder
What change would you like to see?
The combination of soft 404s (the server responding with 200 when the page isn't found) and relative links results in endless 'fake' pages being crawled.
Without a 404, the crawler doesn't know when to stop.
I am wondering if we could do something about it? It would save space and time.
Here is an example of a soft 404: https://budgets.dk/_ignition/horizon/api/jobs/_ignition/salt-edge/get-consent
If it
One idea I have (but I lack the skill to implement it) is this approach:
As one of the first steps when crawling a domain (right after the seeds, maybe?), a request is made to a fictitious URL with a random name (could be seed + browsertrix_soft404check_[random string]).
If the response code for this URL is 200, continue as usual, but treat every file from this domain with the same file size/content as the bogus response as a 404 (except the first one, because the front page is often used as the redirect target for soft 404s).
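To make the idea concrete, here is a rough Python sketch of the probe and the comparison. It is only an illustration under assumptions, not how Browsertrix works: `fetch` is a hypothetical callable returning `(status, body)`, the function names are made up, "same filesize/content" is approximated by length plus a SHA-256 digest, and "except the first one" is read as "never flag the seed itself".

```python
import hashlib
import secrets
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical fetch signature (an assumption): URL -> (status code, body bytes).
Fetch = Callable[[str], Tuple[int, bytes]]
Fingerprint = Tuple[int, str]


def fingerprint(body: bytes) -> Fingerprint:
    """Body size plus SHA-256 digest, standing in for 'same filesize/content'."""
    return len(body), hashlib.sha256(body).hexdigest()


def probe_soft404(seed: str, fetch: Fetch) -> Optional[Fingerprint]:
    """Request a URL that should not exist.

    Returns the fingerprint of the reply if the server answered 200
    (a soft 404), or None if the server reports missing pages properly.
    """
    probe_url = urljoin(seed, "browsertrix_soft404check_" + secrets.token_hex(8))
    status, body = fetch(probe_url)
    return fingerprint(body) if status == 200 else None


def is_soft404(url: str, seed: str, status: int, body: bytes,
               soft_fp: Optional[Fingerprint]) -> bool:
    """True if a 200 response looks identical to the bogus-URL response."""
    if soft_fp is None or status != 200:
        return False
    if url == seed:
        # Assumption: the seed is exempt, since the front page is often
        # what a soft 404 serves or redirects to.
        return False
    return fingerprint(body) == soft_fp
```

An exact-match fingerprint like this would miss soft 404 pages that embed anything changing per request (timestamps, tokens), so a real implementation would probably need a fuzzier comparison.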
We do not want to crawl https://example.com/products/products/products/products/products/products/
because of a soft 404 combined with a relative "Products" link.
And we cannot just reject the pattern with a regex, since /img/img/img/ is a real thing some people do.
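For anyone wondering how these URLs come about, here is a small Python illustration of a relative link being resolved again and again against the page it was found on. The `href` value `products/` and the helper name are assumptions made up for the example; `urljoin` is the standard library's URL resolver.

```python
from typing import List
from urllib.parse import urljoin


def follow_relative(start: str, href: str, depth: int) -> List[str]:
    """Resolve the same relative href against each resulting page, `depth` times."""
    urls = [start]
    for _ in range(depth):
        # Each soft 404 page returns 200 and contains the same relative
        # link, so the crawler keeps queueing a deeper URL.
        urls.append(urljoin(urls[-1], href))
    return urls


for url in follow_relative("https://example.com/products/", "products/", 3):
    print(url)
```

Every step yields a new, unique URL, so deduplication by URL does not stop the crawl; only the response content shows that the pages are the same.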
Additional details
No response