[Change]: handling of soft404 #3375

@thsm-kb

Browsertrix Host

Hosted by Webrecorder

What change would you like to see?

The combination of soft404 (responding with 200 when the page isn't found) and relative links results in endless 'fake' pages being crawled.
Without a 404, the crawler doesn't know when to stop.

I am wondering if we could do something about it? It would save space and time.

Here is an example of soft404: https://budgets.dk/_ignition/horizon/api/jobs/_ignition/salt-edge/get-consent
If it

One idea I have (but I lack the skill to implement) is this approach:

As one of the first steps when crawling a domain (right after the seeds, maybe?), a request is made to a made-up URL with a random name (could be seed + browsertrix_soft404check_[random string]).

If the response code from this URL is 200, the crawl continues as usual, but every page from this domain with the same file size/content as the bogus page is handled as a 404 (except the first one, because the front page is often used as the redirect target for soft404).
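To make the idea above concrete, here is a rough Python sketch of the probe-and-compare logic. It is not Browsertrix code: `make_probe_url`, `fingerprint`, and `Soft404Detector` are hypothetical names, and comparing by exact size plus SHA-256 hash is an assumption that would miss soft404 pages containing dynamic content such as timestamps.

```python
import hashlib
import secrets
from urllib.parse import urljoin


def make_probe_url(seed_url: str) -> str:
    """Build a URL next to the seed that almost certainly does not exist."""
    token = secrets.token_hex(8)  # random suffix, hypothetical naming scheme
    return urljoin(seed_url, f"browsertrix_soft404check_{token}")


def fingerprint(body: bytes) -> tuple[int, str]:
    """Size plus content hash; an exact-match comparison (assumption)."""
    return len(body), hashlib.sha256(body).hexdigest()


class Soft404Detector:
    """Remembers the bogus-URL response and flags later identical pages."""

    def __init__(self) -> None:
        self._probe_fp = None
        self._first_match_seen = False

    def record_probe(self, status: int, body: bytes) -> None:
        # Only a 200 on a made-up URL indicates soft404 behaviour.
        if status == 200:
            self._probe_fp = fingerprint(body)

    def is_soft404(self, status: int, body: bytes) -> bool:
        if self._probe_fp is None or status != 200:
            return False
        if fingerprint(body) != self._probe_fp:
            return False
        if not self._first_match_seen:
            # Keep the first matching page: it may be the real front page.
            self._first_match_seen = True
            return False
        return True
```

The caller would fetch `make_probe_url(seed)` once per domain, pass the result to `record_probe`, and then skip link extraction for any page where `is_soft404` returns `True`.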

We do not want to crawl https://example.com/products/products/products/products/products/products/
because of soft404 combined with a relative "Products" link.

And we cannot just reject the pattern with a regex, since /img/img/img/ is a real thing some people do.
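For illustration, this small Python sketch shows how the loop arises. It assumes the site returns the same page (containing a relative link `products/`) for every path; `follow_relative_link` is a hypothetical helper, not part of any crawler.

```python
from urllib.parse import urljoin


def follow_relative_link(start_url: str, href: str, hops: int) -> list[str]:
    """Resolve the same relative href against each newly found URL."""
    urls = [start_url]
    for _ in range(hops):
        # Each page "exists" (soft404 returns 200), so the link is followed again.
        urls.append(urljoin(urls[-1], href))
    return urls


for url in follow_relative_link("https://example.com/products/", "products/", 3):
    print(url)
```

Every hop appends another `products/` segment, and since each response is a 200, nothing tells the crawler that the new URL is not a real page.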

Additional details

No response

Metadata

Assignees

No one assigned

    Labels

    enhancement (Requests a change to a feature)
    idea (Idea for a feature in consideration)

    Projects

    Status
    Triage

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests