Evaluating ChatGPT’s Knowledge Based On Year of Source Data

I’ve been talking to myself in JavaScript about Google’s terrible AI results and why it’s so difficult to have AI turn scraped web content into useful search results.

I made a thing that runs a Mojeek search and restricts the results to a specific year via URL pattern matching and result filtering. It then retrieves and bundles the filtered result pages and sends them to ChatGPT in a prompt asking it to answer a user-specified question. The prompt also tells it to limit its answer to the information given and, if that information can’t answer the question, to say so.
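Here is a minimal sketch of the two middle steps, the year filter and the prompt assembly. It is not the actual code behind the tool. The function names, the shape of the result objects ({ url } and { url, text }), the regular expression, and the prompt wording are all assumptions made up for illustration, and the search and page-fetching steps are left out entirely.

```javascript
// Hypothetical helpers: none of these names come from the original tool.

// Pull a four-digit year out of a URL path such as /2016/08/30/story.html
// or /2023-10-03-some-story. Returns null when no path segment starts
// with something that looks like a year.
function extractYearFromUrl(url) {
  let pathname;
  try {
    pathname = new URL(url).pathname;
  } catch (err) {
    return null; // not a parseable absolute URL
  }
  const match = pathname.match(/\/((?:19|20)\d{2})(?=\/|-|$)/);
  return match ? Number(match[1]) : null;
}

// Keep only the search results whose URL carries the requested year.
// Assumes each result is an object with a `url` string property.
function filterResultsByYear(results, year) {
  return results.filter((result) => extractYearFromUrl(result.url) === year);
}

// Bundle the retrieved pages into one prompt that tells the model to
// stay within the supplied text. Assumes each page is { url, text }.
function buildPrompt(question, pages) {
  const sources = pages
    .map((page, i) => `[Source ${i + 1}] ${page.url}\n${page.text}`)
    .join("\n\n");
  return [
    "Answer the question using only the sources below.",
    "If the sources do not contain the answer, say that you have no data to answer it.",
    "",
    `Question: ${question}`,
    "",
    sources,
  ].join("\n");
}
```

URL matching like this only catches sites that put the date in the path, so plenty of pages from the target year will slip past it. That is probably why a second round of result filtering is needed.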

It’s interesting to give it questions like “how to treat a snake bite” and note the little differences.

In 2015: wash wound with soapy water.
In 2024: don’t wash! We need that venom residue!
In 2008: Apply a bandage or restricting band.
In 2021: Nothing to impede blood flow!

Even more interesting is to look at public figures and see what is considered important to note about them. For example, here’s the answer to the question “Who is Matt Gaetz?” from 2016 source:

Matt Gaetz is a Republican politician who ran for the U.S. House of Representatives 1st District seat in Florida, which was being vacated by U.S. Rep. Jeff Miller. He is noted for his significant fundraising efforts, amassing nearly $1 million in campaign contributions, far outpacing his competitors. Gaetz, who is a state representative, chose to skip a debate to fulfill a prior commitment of engaging directly with voters. His father, State Sen. Don Gaetz, was a candidate for the University of West Florida presidency. Gaetz’s campaign faced scrutiny over contributions from super PACs and allegations of coordination, but he maintained that his campaign adhered to the rules. He emphasized his local support, claiming a large number of donors from District 1. Gaetz ultimately won the election for the U.S. House 1st District seat.

Now here’s an answer to the same question from 2023 source:

Matt Gaetz is a Florida representative and a member of Congress. He has been involved in several high-profile controversies and legal issues. Recently, the Department of Justice concluded an investigation into allegations of sex trafficking and obstruction of justice against him, deciding not to bring any charges. Gaetz was implicated by his former friend and associate, Joel Greenberg, who pleaded guilty to multiple felonies, including underage sex trafficking. Despite these allegations, Gaetz maintained that he never paid for sex and that the women involved were ex-girlfriends. Additionally, Gaetz played a significant role in the recent political maneuvering that led to the removal of Kevin McCarthy as Speaker of the House, highlighting his influence and contentious nature within the political sphere.

If you search for Matt Gaetz now, the content will be weighted toward recency and your overview will look more like 2023 than 2016. The thing is, a lot of that 2016 information is relevant. The fact that Gaetz is a good fundraiser might explain why the GOP was so tolerant of his attempts to oust Kevin McCarthy, himself an excellent fundraiser. With all of Florida’s recent higher education controversy, it’s significant that his father was under consideration for the U of West Florida presidency. (He didn’t get it.) Given today’s political atmosphere, even the fact that some considered Gaetz’ Congressional campaign to be sketchy is meaningful. As more information piles up, however, you’re less likely to get small details, even salient ones.

The idea that a search engine’s AI can be taught to pick the right answer assumes that there’s such a thing as one right answer. But so often there’s not. Instead it seems to me that what one finds in a Web search is a set of facts around a topic, to which one can apply the lenses of authority, time, expertise, or circumstance. Those are human contexts and critical to understanding and using the information we find. Unfortunately, the non-transparency of AI means there’s no place in the process of AI-powered Web search to apply that framing in a meaningful way.

This gives me an idea.
