• Rose@slrpnk.net
    14 hours ago

    The problem isn’t that the data is already public.

    The problem is that the AI crawlers want to check on it every 5 minutes, even if you try to tell all crawlers that the file is updated daily, or that the file hasn’t been updated in a month.

    AI crawlers don’t care about robots.txt or other helpful hints about what’s worth crawling or not, and hints on when it’s good time to crawl again.
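For reference, the "helpful hints" being ignored look roughly like this. This is an illustrative sketch, not any real site's config: `Crawl-delay` is a de-facto extension (interpreted as seconds by the crawlers that honor it at all), and the freshness hints live in the sitemap, but all of it is purely advisory:

```text
# robots.txt — advisory only; a crawler is free to ignore every line
User-agent: *
Crawl-delay: 86400          # de-facto extension: please visit at most once a day
Sitemap: https://example.org/sitemap.xml

# sitemap.xml entries can carry per-page freshness hints:
#   <url>
#     <loc>https://example.org/data.csv</loc>
#     <lastmod>2025-01-01</lastmod>
#     <changefreq>daily</changefreq>
#   </url>
```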

    • interdimensionalmeme@lemmy.ml
      14 hours ago

Yeah, but there would be no scrapers if the robots file just pointed to a dump file.

Then the scraper could just spot-check a few dozen random pages to verify the dump is actually up to date and complete, and then they’d know they don’t need to waste any time there and can move on.
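The spot-check idea can be sketched in a few lines. This is a minimal illustration with made-up data standing in for real HTTP fetches; the function names and the dump format (path → content hash) are assumptions, not any existing standard:

```python
import hashlib
import random

def digest(text: str) -> str:
    """Content fingerprint used to compare dump entries against live pages."""
    return hashlib.sha256(text.encode()).hexdigest()

# Pretend this mapping came from the site's published dump file.
dump = {
    "/page/1": digest("hello"),
    "/page/2": digest("world"),
    "/page/3": digest("lemmy"),
}

# Pretend these are the live pages (a real scraper would fetch them over HTTP).
live = {
    "/page/1": "hello",
    "/page/2": "world",
    "/page/3": "lemmy",
}

def dump_is_fresh(dump: dict, fetch_live, sample_size: int = 2) -> bool:
    """Spot-check a random sample of pages; if they all match the dump,
    assume the dump is current and skip crawling the whole site."""
    sample = random.sample(sorted(dump), k=min(sample_size, len(dump)))
    return all(dump[path] == digest(fetch_live(path)) for path in sample)

print(dump_is_fresh(dump, live.__getitem__))  # True → no crawl needed
```

A handful of requests instead of millions, which is the whole argument being made here.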

      • Leon@pawb.social
        13 hours ago

Given that they already ignore robots.txt, I don’t think we can assume any sort of good manners on their part. These AI crawlers are like locusts, scouring and eating everything in their path.

        • interdimensionalmeme@lemmy.ml
          12 hours ago

Crawlers are expensive and annoying to run, not to mention unreliable, and they produce low-quality data. If there really were a site dump available, I don’t see why it would make sense to crawl the website, except to spot-check that the dump is actually complete. This used to be standard, and it came with open API access for all, before the Silicon Valley royals put the screws on everyone.

          • Leon@pawb.social
            4 hours ago

            I wish I was still capable of the same belief in the goodness of others.

          • Mr. Satan@lemmy.zip
            10 hours ago

Dunno, I feel you’re giving way too much credit to these companies.
They have the resources. Why bother with a more proper solution when a single crawler works on all the sites they want?

Is there even a standard for providing site dumps? If not, every site could require a custom software solution to use its dump. And I can guarantee you no one will bother implementing any dump-checking logic.

            If you have contrary examples I’d love to see some references or sources.

            • interdimensionalmeme@lemmy.ml
              10 hours ago

The internet came together to define the robots.txt standard; it could just as easily agree on a standard API for database dumps. But it chose war in the 2023 API wars, and now we’re going to see all the small websites die while Facebook gets even more powerful.

              • Mr. Satan@lemmy.zip
                2 hours ago

Well, there you have it. Although I still find it weird that it’s somehow “the internet” that’s supposed to solve a problem fully caused by AI companies and their web crawlers.
If a crawler keeps spamming and breaking a site, I see it as nothing short of a DoS attack.

Not to mention that robots.txt is completely voluntary and, as far as I know, mostly ignored by these companies. So what makes you think any of them are acting in good faith?

                To me that is the core issue and why your position feels so outlandish. It’s like having a bully at school that constantly takes your lunch and your solution being: “Just bring them a lunch as well, maybe they’ll stop.”

          • Zink@programming.dev
            10 hours ago

            My guess is that sociopathic “leaders” are burning their resources (funding and people) as fast as possible in the hopes that even a 1% advantage might be the thing that makes them the next billionaire rather than just another asshole nobody.

            Spoiler for you bros: It will never be enough.