AI Accelerates: New Gemini Model + AI Unemployment Stories Analysed

AI-Generated Summary

Google’s latest AI model, Gemini 2.5 Pro, has been released, and it beats competitors like Claude Opus 4, Grok 3, and OpenAI’s o3 on most benchmarks. It’s faster, cheaper via the API, and can process up to 1 million tokens, four or five times more than other models. However, Google’s leadership, including CEO Sundar Pichai and DeepMind’s Demis Hassabis, say they don’t expect AGI (Artificial General Intelligence) before 2030. While Gemini 2.5 Pro excels in tasks like obscure knowledge and reading charts, coding performance varies, with Claude Opus 4 sometimes outperforming it in troubleshooting. Despite AI’s rapid advancements, hallucinations and errors persist, limiting its current impact on white-collar jobs. Still, experts predict significant societal shifts by 2030, with AI automating many roles but also boosting productivity. Companies like Duolingo and Klarna have reversed decisions to replace humans with AI, highlighting the ongoing reliance on human oversight. This suggests a “calm before the storm” phase, where AI complements humans before potentially transforming the workforce in the coming decade.

📜 Full Transcript

While everyone else is focused on other stuff, like Twitter spats, let’s focus on the real news: the developments in AI, which I would say are accelerating, particularly if you are Google, who have just released the latest version of Gemini 2.5 Pro. It is fairly unambiguously the best language model in the world for the majority of benchmarks and, yes, including my own, SimpleBench. It beats out all other models, including Claude Opus 4, Grok 3 and OpenAI’s o3, though we are expecting o3 Pro from OpenAI fairly shortly. And that’s before you get to the fact that it’s quicker to respond, it’s cheaper via the API, and it can ingest up to 1 million tokens; that’s four or five times more than other models.

Now, before we get too hyped up though, there’s a reason why the CEO of Google DeepMind, Demis Hassabis, responsible for Gemini, and the CEO of Google itself, Sundar Pichai, yesterday both said that they don’t expect AGI before 2030. Now, I’m sorry for those listening on the podcast, but take a look at these two lines here. Which of these two vertical lines would you say is longest? Well, Gemini 2.5 Pro, the latest version, 0605 (yes, if you are not in America, that naming scheme is incredibly confusing), but this latest version, what do you think it says? It says, “At first glance, line A appears to be much longer than line B. However, this is a trick of the eye and they are the same length.” In fact, later on the model doubles down by saying, “You can test this yourself by placing a ruler up against the screen; you’ll find they are identical in length.” For those listening, they are pretty obviously not the same length.

Now, of course, that is anecdotal, but there is a reason why Sundar Pichai said that in the near to medium term Google will be hiring more workers, not firing them. Of course, you can’t always trust CEOs, which is why I’m going to dedicate the end portion of this video to investigating all those headlines you’ve been seeing recently about a white-collar bloodbath. I found that when you dig deeper, not everything
is as it seems.

Now, somewhat strangely, I want to start with an interview released in the last 18 hours on Lex Fridman with the CEO of Google, Sundar Pichai, because the first half of this video is going to be about Gemini 2.5 Pro, but that’s not even the biggest and best version of Gemini 2.5, which is Gemini 2.5 Ultra, unavailable to practically anyone. So all these record benchmark scores you’re going to see, this isn’t even their biggest and best model.

“Each year I sit and say, okay, we’re going to throw 10x more compute over the course of next year at it, and like, will we see progress? Sitting here today, I feel like the year ahead we’ll have a lot of progress. I think it’s compute-limited in this sense, right? Like, you know, we can all, part of the reason you’ve seen us do Flash, Nano, Flash and Pro models, but not an Ultra model, it’s like for each generation we feel like we’ve been able to get the Pro model at, like, I don’t know, 80 to 90% of Ultra capability, but Ultra would be a lot more slow and a lot more expensive to serve. But what we’ve been able to do is to go to the next generation and make the next generation’s Pro as good as the previous generation’s Ultra, but be able to serve it in a way that it’s fast and you can use it and so on. The models we all use the most is maybe like a few months behind the maximum capability we can deliver, right? Because that won’t be the fastest, easiest to use, etc.”

But as the latest version of Gemini 2.5 Pro is apparently going to be a stable release used by hundreds of millions of people over the coming months, let’s quickly dive into those benchmark results. On the right, by the way, you can see the results of the three iterations of Gemini 2.5 Pro. To be clear, the latest one is what’s going to be rolled out to everyone in the coming couple of weeks. On obscure knowledge, as tested by Humanity’s Last Exam, it nudges out other models. For incredibly challenging science-based questions it gets 86.4%, when PhDs in those respective domains get around 60%. On very
approximate gauges of hallucinations it scores better than any other model, and on reading charts and visuals and other types of graphs it’s at least on par with o3, which is around four times more expensive and a lot slower than Gemini 2.5 Pro. Again, it’s worth highlighting that Gemini 2.5 Pro is really the middle model of the Gemini series. You may also notice that the vast majority of these record-breaking scores are on a single attempt. We haven’t yet seen the Deep Think mode from Gemini 2.5 Pro; that would be roughly the equivalent of the multiple attempts or parallel trials that some of the other models utilize.

As for coding, the picture is a lot less clear. When you’re talking about multiple languages, Gemini seems to do better, as judged by Aider’s Polyglot benchmark. When you’re talking about a slightly more software-engineering focus, like SWE-bench Verified, it seems like Claude is still very much in the lead. However, I will make a confession, which is that I was having an issue with connecting a domain on Firebase, which is Google on the back end. Now, this was more to do with the app hosting infrastructure, but you’d have thought, as a Google entity, Firebase, that Gemini would know the most about it. Now, I won’t show you the full 2-hour conversation, but I basically gave up with Gemini 2.5 Pro. This was, in fairness, the May instance of Gemini 2.5 Pro, but Claude 4 Opus was able to diagnose the issue almost immediately. And I’m sure everyone who uses these models for coding will have similar anecdotes where the benchmarks don’t always reflect real-world usage.

But while we are on benchmarks, what about my own benchmark, SimpleBench? Well, I am going to make a confession, which is that I thought the latest version of Gemini 2.5 Pro, the one from yesterday, would underperform. Why did I think that? Well, because the first version of Gemini 2.5 Pro, the one I think from March, got 51.6%, but then when we tried the May version of Gemini 2.5 Pro, it was really hard to get a full run out of the model.
I talked about this on Twitter, but on the one run where it agreed to actually answer the questions, I think it got around 47%. So I actually had a theory that I was going to come to you guys and gloat and be like, “Yeah, they’re doing RL for coding and mathematics, but that’s kind of eroding the common sense of the models.” This shows how SimpleBench tests things that other benchmarks don’t capture. Unfortunately, what actually happened is that when we tested the very latest version of Gemini 2.5 Pro yesterday evening, we couldn’t get, because of rate limiting, a full five runs, which is why we’re not yet reporting the result. But based on the four runs we did get, it was averaging around 62%. So my little theory about RL maximization just completely went out the window. No, but seriously, even based on four runs you can see that performance is getting better and better and better across all model types. Hate to say it, but I genuinely think SimpleBench won’t last much longer than maybe 3 to 12 months.

We’ve got to talk about those job articles now, but if you want a bit more of a reflection about the kind of questions that Claude 4 and Gemini 2.5 Pro are now getting right, do check out this video on my Patreon. Suffice to say, though, that when we reach the moment that there are no text-based benchmarks for which the average human could beat frontier models, we will have crossed quite the Rubicon. Sundar Pichai and Demis Hassabis, CEOs of Google and Google DeepMind, put the date of full AGI at just after 2030.

“Then you see stuff which obviously, you know, we are far from AGI too. So you have both experiences simultaneously happening to you. I’ll answer your question, but I’ll also throw out this: I almost feel the term doesn’t matter. What I know is by 2030 there’ll be such dramatic progress. We’ll be dealing with the consequences of that progress, both the positive externalities and the negative externalities that come with it, in a big way by 2030. So that I strongly feel, right? Whatever we may be arguing about the term, or maybe Gemini can answer what that moment is in time in 2030, but I think the progress will be dramatic, right? So that I believe in.”

Now, please do let me take a moment to tell [...] seeing plenty of articles like this one going viral on Twitter and Reddit. “Has the decline of knowledge work begun?” asked the New York Times. For one LinkedIn executive, in a guest essay in the New York Times, it has already begun, with the bottom rung of the career ladder breaking. Now, obviously I am one of the last people to underestimate the potential of AI and its impacts on the world of work, but these stories were about what was happening now, not what might be coming in 3 to 5 years. So I wanted to ask, do they have any stats to back this stuff up? A lot of the articles cross-reference each other, but the one stat that they all seem to turn to is the fact that the unemployment rate for college graduates in the US has risen 30% since September 2022. Not risen to 30%; it has risen by 30%. That sounds pretty ominous, right? But let me give you two contextual facts. The first is that that 30% rise is from 2% to 2.6% for college graduates. That’s versus 4% for all workers. So it’s a tiny bit less dramatic when you hear it is 2.6%.

Now, I can just feel the rage building up among some of you, so let me just give you one more contextual fact and then my own thoughts. Because even though a 2.6% unemployment rate for college grads in the US doesn’t sound too dramatic, a 30% rise is pretty real. So I dug deep and looked at the data source that these articles were citing, and you can see it here, with the college graduates at, well, now it seems 2.7%. That is the line in red, and it comes from March of this year. But if we zoom out, we can see that, for example, in 2010 it was 5% among all college graduates. Even in, what is this, 1992, it was 3.5%. Don’t worry, I am not in any way downplaying the impact of what’s coming. I’m just saying it’s a bit much to say the impact is already incredibly noticeable.

Now, the other article that went viral
was this one, “Behind the Curtain: A white-collar bloodbath”, which heavily featured quotes from Dario Amodei, the CEO of Anthropic. When the language is caveated, like AI could wipe out half of all entry-level white-collar jobs over the next 1 to 5 years, it’s actually quite hard to disagree. The way AI is accelerating, it’s really hard to counter, say, a “could” scenario. Amodei gets on to slightly more dangerous territory when he says most people are unaware that this is about to happen. Others at Anthropic, like Sholto Douglas, are even more definitive.

“There’s important distinctions to be made here. One is that I think we’re near guaranteed at this point to have effectively models that are capable of automating any white-collar job by like ’27, ’28, or near guaranteed end of decade.”

This topic obviously deserves a full video on its own, but for me the necessary but not sufficient condition for white-collar automation would be the elimination of hallucinations and dumb mistakes that the models don’t self-correct. If there is even a 1% chance that frontier models of 2027 and 2028 make mistakes like this one, then having a human in the loop to check for those mistakes would surely allow for massively increased productivity. Which leads me personally to the whole calm-before-the-storm theory, which I first outlined on this channel in 2023. I said back then that we would first see a massive increase in productivity as humans complement the work of frontier AI. That’s why I don’t think this white-collar automation will happen, as Amodei says, in as little as a couple of years or less.

Now, I know what many of you are thinking: well, these CEOs would know far better than those of us on the outside. But I remember, almost 2 years to the day, Sam Altman saying, and I quote, “We won’t be talking about hallucinations in 18 months to 2 years.” That was on the world tour that he did after the release of GPT-4. Well, almost exactly 2 years on from that quote, we get this in the New Scientist: AI hallucinations are
getting worse, and they’re here to stay. Among other things, the article cites a stat on a benchmark called SimpleQA, which I’ve talked about before on the channel, where basically o3, the latest OpenAI model, hallucinates a bit more than previous models.

Then you guys might remember those viral articles about Klarna eliminating its customer service team so it could use AI instead. Now, very quietly, without the same fanfare, they’ve actually reversed on that policy, saying that customers like talking to people instead. After getting rid of those 700 employees, it’s now rehiring many human agents. Duolingo, the language app, also said that it was going to rely on AI before backing down and reversing that policy, hiring more humans.

Which leads me to the whole calm-before-the-storm theory. While frontier language models are still weak at self-correcting their own hallucinations, the human can still complement their efforts and lead to overall more productivity. This leads to limited effect on the unemployment rate. I do know there are anecdotal examples about people losing their jobs to AI. Trust me, I am aware of that and I have read those articles. But there is limited net effect on the unemployment rate. This, of course, leads to more and more investment in AI and less and less regulation of AI, as countries try to win the AI race, so-called. But then there might come a tipping point where models, using enough compute and having access to enough diverse methodologies for self-correction, finally stop making dumb mistakes and only miss things that are beyond their training data. Of course, at that point, and I’ve actually got a documentary covering this, endless amounts more data would be given to them through, for example, screen recording, mass surveillance or robotics data. Then the complacency that might have set in throughout the remainder of the 2020s might be quickly upended. And to be honest, it’s not like blue-collar work would be immune from the effects of AI automation for that much longer than white-collar work
at that point. This is the fully autonomous Figure 02 humanoid robot. So yes, I probably pissed off those who expect imminent upheaval and those who think LLMs are completely overhyped, but there you go, that’s just my opinion of what is coming.

While all of this is going on, of course, we get access to some pretty epic AI tools, like the brand new ElevenLabs v3 alpha.

“Hey Jessica, have you tried the new Eleven v3? I just got it. The clarity is amazing. I can actually do whispers now, like this. Ooh, fancy. Check this out. I can do full Shakespeare now: to be or not to be, that is the question.”

“fun to the lights here for this semi-final clash, the stadium buzzing with anticipation. ElevenLabs United, in their iconic black and white shirts, pushing forward with intent straight from the opening whistle.”

As has been the theme of this video, though, ElevenLabs can’t rest easy, because Google, with their native text-to-speech in Gemini 2.5 Flash, isn’t that far behind.

“Hey Jessica, have you tried the new Eleven v3? I just got it. The clarity is amazing. I can actually do whispers now, like this. Ooh, fancy. Check this out. I can do full Shakespeare now: to be or not to be, that is the question.”

“Hey Jessica, have you tried the new Eleven v3? I just got it. The clarity is amazing. I can actually do whispers now, like this. Ooh, fancy.”

Thank you so much for watching. Let me know what you think, as always, and have a wonderful
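
The unemployment discussion earlier in the transcript turns on the difference between a relative rise ("risen 30%") and the absolute move behind it (from 2% to 2.6%). The following is a minimal sketch of that arithmetic, using only the figures quoted in the transcript; the function names are illustrative and not taken from any cited source.

```python
def relative_change_pct(old: float, new: float) -> float:
    """Percent change relative to the starting value."""
    return (new - old) / old * 100.0


def point_change(old: float, new: float) -> float:
    """Absolute change, expressed in percentage points."""
    return new - old


# Figures quoted in the transcript: the college-graduate unemployment
# rate moving from 2.0% to 2.6%.
old_rate, new_rate = 2.0, 2.6

print(f"relative rise: {relative_change_pct(old_rate, new_rate):.0f}%")
print(f"absolute rise: {point_change(old_rate, new_rate):.1f} percentage points")
```

Both numbers describe the same move: a small absolute change produces a large relative change when the starting rate is low, which is why the headline figure sounds more dramatic than the underlying rate.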
