For one month beginning on October 5, I ran an experiment: Every day, I asked ChatGPT 5 (more precisely, its “Extended Thinking” version) to find an error in “Today’s featured article”. In 28 of these 31 featured articles (90%), ChatGPT identified what I considered a valid error, often several. I have so far corrected 35 such errors.

  • helpImTrappedOnline@lemmy.world
    18 up · edited · 5 hours ago

    The first edit undid vandalism that had persisted for 5 years: someone had changed the number of floors a building had from 67 to 70.

    A friendly reminder to only use Wikipedia as a summary/reference aggregate for serious research.

    This is a cool tool for checking these sorts of things: run everything through the LLM to flag errors and go after them like a whack-a-mole game instead of a hidden object game.

    • acosmichippo@lemmy.world
      3 up · 1 hour ago

      90% errors isn’t accurate. It’s not that 90% of all facts in wikipedia are wrong. 90% of the featured articles contained at least one error, so the articles were still mostly correct.
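The gap between the two rates can be sketched with a little arithmetic. The 28 and 31 come from the post; the claims-per-article and errors-per-article figures are purely hypothetical assumptions, chosen only to illustrate the difference.

```python
# Figures from the post: 28 of 31 featured articles had at least one valid error.
articles_checked = 31
articles_with_error = 28

# Share of ARTICLES containing at least one error (the "90%" in the post).
article_rate = articles_with_error / articles_checked

# Hypothetical assumptions, NOT from the post: how many checkable claims an
# article contains, and how many errors each affected article had.
claims_per_article = 200
errors_per_affected_article = 2

total_claims = articles_checked * claims_per_article
wrong_claims = articles_with_error * errors_per_affected_article

# Share of individual CLAIMS that are wrong under those assumptions.
claim_rate = wrong_claims / total_claims

print(f"articles with at least one error: {article_rate:.1%}")
print(f"claims that are wrong:            {claim_rate:.2%}")
```

Under these made-up numbers, roughly nine in ten articles contain an error while fewer than one in a hundred claims is wrong, which is the distinction being drawn here.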

    • Ace@feddit.uk
      50 up, 3 down · edited · 5 hours ago

      If you read the post it’s actually quite a good method. Having an LLM flag potential errors and then reviewing them manually as a human is actually quite productive.

      I’ve done exactly that on a project that relies on user-submitted content; moderating submissions at even a moderate scale is hard, but having an LLM look through them for me is easy. I can then check through anything it flags and moderate manually. Neither the accuracy nor the precision is perfect, but both are high enough to be useful, so it’s a low-effort way to find a decent number of the things you’re looking for. In my case I was looking for abusive submissions from untrusted users; in the OP author’s case they were looking for errors. I’m quite sure this method would never find all errors, and as per the article the “errors” it flags aren’t always correct either. But the reward-to-effort ratio is high on a task that would otherwise be infeasible.
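A minimal sketch of that flag-then-review loop, under stated assumptions: `looks_suspicious` is a hypothetical stand-in for the LLM call (here just a keyword check), `human_confirms` stands in for the manual review step, and none of the names come from the original project.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    item_id: str
    reason: str


def flag_items(items, looks_suspicious):
    """Stage 1: cheap automated pass over every item.

    `looks_suspicious` is a placeholder for the LLM; it returns a reason
    string for anything worth a look, or None otherwise.
    """
    flags = []
    for item_id, text in items.items():
        reason = looks_suspicious(text)
        if reason is not None:
            flags.append(Flag(item_id, reason))
    return flags


def review(flags, human_confirms):
    """Stage 2: a human checks every flag; nothing is acted on unreviewed."""
    return [flag for flag in flags if human_confirms(flag)]


def precision(confirmed, flags):
    """Fraction of flags the reviewer agreed with."""
    return len(confirmed) / len(flags) if flags else 0.0


# Toy data: the keyword stub over-flags, as a real model sometimes would.
submissions = {
    "a": "nice post, thanks",
    "b": "you are an idiot",
    "c": "a great idiot-proof guide",
}


def keyword_stub(text):
    return "contains 'idiot'" if "idiot" in text else None


flags = flag_items(submissions, keyword_stub)
confirmed = review(flags, lambda flag: flag.item_id == "b")
print([f.item_id for f in flags], [f.item_id for f in confirmed])
print(f"precision: {precision(confirmed, flags):.0%}")
```

The reviewer only reads the flagged items rather than all submissions, which is where the saved effort comes from; anything the first stage misses stays unseen, matching the caveat that this never finds everything.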

    • Treczoks@lemmy.world
      3 up · 2 hours ago

      Yep. Let it flag potential problems, and have humans react to it, e.g. by reviewing and correcting things manually. AI can do a lot of things quickly and efficiently, but it must be supervised like a toddler.

    • ordnance_qf_17_pounder@reddthat.com
      11 up, 2 down · 4 hours ago

      “AI” summed up. 95% of the time it’s pointless bullshit being shoehorned into absolutely everything. 5% of the time it can be useful.

    • anamethatisnt@sopuli.xyz
      66 up, 1 down · 8 hours ago

      I find that an extremely simplified way of judging whether a use of an LLM is good or not is whether its output is used as a finished product. Here the human uses it to identify possible errors and then verifies the LLM output before acting, and the use of AI isn’t mentioned at all for the corrections.

      The only danger I see is that errors the LLM didn’t find will continue to go undiscovered, but they probably would be undiscovered without the use of the LLM too.

      • shiroininja@lemmy.world
        5 up, 1 down · edited · 4 hours ago

        Or it falsely flags something as an error and the human has so much faith in the system that they assume it must be correct, and either wastes time looking for a solution or bends reality to “correct” it in a human form of hallucinating bs. Especially dangerous if saying there is an error supports the individual’s personal beliefs.

        Edit:

        I’ll call it “AI-induced confirmation bias”, cousin to AI-induced psychosis.

      • porcoesphino@mander.xyz
        16 up · edited · 8 hours ago

        I think the first part you wrote is a bit hard to parse but I think this is related:

        I think the problematic part of most genAI use cases is validation at the end. If you’re doing something that has a large amount of exploration but a small amount of validation, like this, then it’s useful.

        A friend was using it to learn the Linux command line. That can be framed as having a single command at the end that you copy, paste and validate. It isn’t perfect, because the explanation could still be off and wouldn’t be validated, but I think it’s still a better use case than most.

        If you’re asking for the grand unifying theory of gravity then:

        • validation isn’t built into the task (so you’re unlikely to do it with time).
        • validation could be as time intensive as the task (so there is no efficiency gain if you validate).
        • it’s beyond your ability to validate, so if it says nice things about you then a subset of people will decide the tool is amazing.
        • anamethatisnt@sopuli.xyz
          8 up · 5 hours ago

          Yeah, my morning brain was trying to say that when it is used as a tool by someone that can validate the output and act upon it then it’s often good. When it is used by someone who can’t, or won’t, validate the output and simply uses it as the finished product then it usually isn’t any good.

          Regarding your friend learning to use the terminal, I’d still recommend validating the output before using it. If it’s asking genAI about flags for ls then sure, no big deal, but if a genAI ends up switching around sda and sdb in your dd command, resulting in a wiped drive, you’ve only got yourself to blame for not checking the manual.
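As a rough illustration of that kind of check, here is a sketch that reads the `if=` (input) and `of=` (output) operands out of a suggested dd command and compares them with the devices you actually intend to use. The helper names and the device paths are hypothetical; it only inspects the command text, never runs it, and is no substitute for reading the manual.

```python
import shlex


def dd_operands(command):
    """Parse a dd command line into its operand=value pairs."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "dd":
        raise ValueError("not a dd command")
    return dict(tok.split("=", 1) for tok in tokens[1:] if "=" in tok)


def check_dd(command, source, target):
    """Return a list of mismatches between the command and what you intend."""
    operands = dd_operands(command)
    problems = []
    if operands.get("if") != source:
        problems.append(f"input is {operands.get('if')!r}, expected {source!r}")
    if operands.get("of") != target:
        problems.append(f"output is {operands.get('of')!r}, expected {target!r}")
    return problems


# Hypothetical example: you mean to copy sda onto sdb.
intended = ("/dev/sda", "/dev/sdb")
good = check_dd("dd if=/dev/sda of=/dev/sdb bs=4M", *intended)
swapped = check_dd("dd if=/dev/sdb of=/dev/sda bs=4M", *intended)
print(good)
print(swapped)
```

The first command yields no problems; the swapped one reports both operands as wrong, which is exactly the mistake that would overwrite the source drive.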

    • passepartout@feddit.org
      4 up · 7 hours ago

      Yes and no. I have enjoyed reading through this approach, but it seems like a slippery slope from this to “vibe knowledge” where LLMs are used for actually trying to add / infer information.

  • GeneralEmergency@lemmy.world
    1 up, 19 down · 2 hours ago

    No surprise.

    Wikipedia ain’t the bastion of facts that lemmites make it out to be.

    It’s a mess of personal fiefdoms run by people with way too much time on their hands and an ego to match.