[Change]: handling of soft404

### Browsertrix Host

Hosted by Webrecorder

### What change would you like to see?

The combination of soft404 (responding with 200 when the page isn't found) and relative links results in endless 'fake' pages being crawled.
Without a real 404 status, the crawler has no way of knowing when to stop.

Could we do something about it? It would save both storage space and crawl time.

Here is an example of a soft404: https://budgets.dk/_ignition/horizon/api/jobs/_ignition/salt-edge/get-consent

One idea I have (but lack the skill to implement) is this approach:

As one of the first things done when crawling a domain (right after the seeds, maybe?), a request is made to a fictitious URL with a random name (could be seed + browsertrix_soft404check_[random string]).

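If the response code from this URL is anything other than 200, the site reports missing pages properly and the crawl continues as usual. If it is 200, the crawl still continues, but every file from this domain with the same file size/content as the bogus file is handled as a 404 (except the first one, because the front page is often used as the redirect target for soft404s).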
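The idea above could be sketched roughly as follows. This is only an illustration in Python, not Browsertrix code: `Soft404Detector`, `make_probe_url`, `fingerprint` and the injected `fetch` callable are all hypothetical names, the content comparison is an exact SHA-256 match, and "except the first one" is read here as "the first URL whose content matches the bogus page is kept".

```python
import hashlib
import secrets
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical fetcher type: takes a URL, returns (status_code, body_bytes).
Fetcher = Callable[[str], Tuple[int, bytes]]


def make_probe_url(seed: str) -> str:
    """Build a URL next to the seed that almost certainly does not exist."""
    token = secrets.token_hex(8)
    return urljoin(seed, f"browsertrix_soft404check_{token}")


def fingerprint(body: bytes) -> str:
    """Hash the body so that pages can be compared by content."""
    return hashlib.sha256(body).hexdigest()


class Soft404Detector:
    def __init__(self, seed: str, fetch: Fetcher) -> None:
        self.bogus_fingerprint: Optional[str] = None
        self.exempt_url: Optional[str] = None
        status, body = fetch(make_probe_url(seed))
        # Only a 200 answer to a bogus URL indicates soft404 behaviour.
        if status == 200:
            self.bogus_fingerprint = fingerprint(body)

    def is_soft_404(self, url: str, status: int, body: bytes) -> bool:
        """Return True if this response should be handled as a 404."""
        if self.bogus_fingerprint is None or status != 200:
            return False
        if fingerprint(body) != self.bogus_fingerprint:
            return False
        # Assumption: the first matching URL (often the front page) is kept.
        if self.exempt_url is None:
            self.exempt_url = url
        return url != self.exempt_url
```

An exact hash only works when the "not found" page is byte-for-byte identical on every request. A site that embeds a timestamp or the requested path in the page would need a looser comparison, such as the file size mentioned above or a similarity measure.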

We do not want to crawl https://example.com/products/products/products/products/products/products/
because of soft404 combined with `<a href="products/">Products</a>`.
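To show why this happens, here is a small Python illustration of standard relative URL resolution, using `urllib.parse.urljoin`. The helper `follow_relative_link` is a made-up name for this example. Each time the same relative `href` is resolved against the page it was found on, the path grows by one segment, and a server that answers 200 for every path never breaks the chain.

```python
from typing import List
from urllib.parse import urljoin


def follow_relative_link(start: str, href: str, hops: int) -> List[str]:
    """Resolve the same relative href repeatedly, as a crawler would."""
    urls = [start]
    for _ in range(hops):
        # Resolve href against the most recently discovered page.
        urls.append(urljoin(urls[-1], href))
    return urls


chain = follow_relative_link("https://example.com/", "products/", 3)
for url in chain:
    print(url)
# https://example.com/
# https://example.com/products/
# https://example.com/products/products/
# https://example.com/products/products/products/
```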

And we cannot simply reject the repeating pattern with a regex, since /img/img/img/ is a real path structure that some sites use.

### Additional details

_No response_

[Change]: handling of soft404 #3375
