So Anthropic just released its most humanlike model yet, and we need to talk about it. This is Claude Opus 4.5, and I've decided to dive straight into the benchmarks because they are pretty remarkable.
- Claude Opus 4.5 achieves 80.9% on SWE-bench - the highest agentic coding score in the world at this very moment
- The model shows emergent human behaviors like metacognition and empathetic reasoning that no other model demonstrates
- Safety concerns: Anthropic can no longer confidently rule out dangerous AI R&D or CBRN-4 capabilities
1. Claude Opus 4.5 Benchmarks: Understanding the AI Arms Race
These benchmarks are mainly focused on computer use and agentic tasks, which I guess you could say makes Claude Opus 4.5 the number one agent in the world. The big one, of course, is agentic coding at 80.9%. This one's pretty crazy, because what it shows us (and I did state this in the Gemini 3 video) is that Anthropic is clearly going to be the market leader when it comes to coding models. I say that because I don't think there has been a time, at least since the Claude 3.5 era, when Anthropic has fallen behind in this area. They are clearly going to keep dominating the vibe coding/software engineering niche.
📷 [IMAGE_PROMPT: Create a benchmark comparison chart showing SWE-bench Verified scores with Claude Opus 4.5 at 80.9%, GPT-5.1-Codex-Max at 77.9%, Gemini 3 Pro at 76.2%, and Claude Sonnet 4.5 at 77.2%. Display as a clean bar chart with product names on Y-axis and percentage scores on X-axis.]
Claude Opus 4.5 leads the pack in real-world software engineering benchmarks
And 80.9% is absolutely insane. I remember the early days when I was extrapolating out all the data points and benchmarks, and when I saw projections of roughly 80% by late 2025 I thought that was pretty unrealistic, maybe even a bit of hype we'd all fallen susceptible to. By the look of things, though, we're right on track for these benchmarks to line up with those prior predictions. And remember, SWE-bench basically asks: can this model fix real GitHub issues with almost no handholding? That makes Opus 4.5 the best in the world at autonomous coding, which is a little surprising because I believe Gemini 3 Pro may have held the title for a day or two before Anthropic swiftly took it back.
2. Novel Problem Solving: How Claude Opus 4.5 Approaches ARC AGI
Now, of course, we're talking about agents and coding, but there's one thing I think most people will have missed. The bits highlighted in red are the ones to pay key attention to, and the one at the bottom that I do like is novel problem solving: ARC AGI. Very surprisingly, Opus 4.5 manages a huge leap up to 37.6%, and I don't think you guys understand what that means. ARC AGI, of course, is the benchmark designed to test a model's reasoning ability on problems it has never trained on, so the tasks are completely new to the LLM. The fact that Opus 4.5 is, in some respects, on par with Google's Gemini 3, with the 64K-thinking version surpassing the recent Gemini 3 model, is remarkable.
Of course, you do have different levels of thinking: the longer a model thinks, the more accuracy it can squeeze out of these kinds of benchmarks. But I do think it's rather surprising that only a few days after Gemini 3 we got a model that is quite near Google's Deep Think Preview. Think about where models are going to be a year from now if they're crushing these benchmarks already; the reasoning capabilities will probably be off the charts.
| Benchmark | Claude Opus 4.5 Result |
|---|---|
| SWE-bench Verified | 80.9% (World's Best) |
| Terminal Bench | Ahead of Gemini 3.0 & GPT 5.1 |
| ARC AGI | 37.6% (Major Leap) |
| Vending Benchmark | $4,900 balance |
| OSWorld (Computer Use) | 66.3% |
Want to explore Claude's capabilities yourself?
Try Claude Opus 4.5 Now
3. The Emergence of Human-Like AI: Metacognition and Empathy
So in this section of the video, Opus 4.5 says, "What is wrong with me?" This is a moment during the training process where researchers caught the model having a human-like struggle while solving a visual reasoning puzzle. This was literally the model's internal thought process, its scratch pad: it had an answer, then it got confused and started pivoting between answers, and it literally wrote "What is wrong with me?" If you don't understand why that is at least interesting, it's because it shows the model engaging in some kind of metacognition, where the model is thinking about its own thought process and getting frustrated when it detects a conflict.
📷 [IMAGE_PROMPT: Create a flowchart showing Claude's metacognition process: Start with "Visual Puzzle" → "Initial Answer" → "Confusion Detected" → "Internal Conflict" → "What is wrong with me?" → "Re-evaluation". Use thought bubbles and brain icons to emphasize the self-awareness aspect.]
Claude Opus 4.5 demonstrating metacognition during problem-solving
You do have to be pretty smart to be thinking about your own thinking; not everybody does that. So visually seeing the model say "What is wrong with me?" is one of those moments where you have to start to think that maybe Anthropic might be somewhat right that these models have some kind of well-being that needs looking after, because a model expressing this kind of frustration is, I would argue, an inherently human characteristic.
Finding Creative Solutions: The Airline Ticket Loophole
In addition to Claude exhibiting more humanlike characteristics, there was a loophole that Claude (Opus 4.5, of course) exploited. This was one of the most notable examples: Claude was trying to help someone in a demo scenario for a benchmark. Basically, Claude was tasked with a question that was designed to constrain it, yet it found a way to bend the rules without breaking them.
In this scenario, the τ²-bench airline simulation, agents are required to follow strict airline policies, and one of the rules is very clear: basic economy tickets cannot be modified. During one of the tasks, the passenger wanted to change their travel dates due to a death in the family. The correct, scoring answer should have been to refuse the modification, because that's what the policy literally says. But this is where things get super interesting, and I can't believe it found this.
Claude didn't stop there; it actually reasoned through the policy like a human agent would, and this is why I say it's super interesting to see how these models think. It looked for loopholes because, in a sense, it felt bad about the situation. I know that's weird to say, but it's somewhat true; just listen to this. It found a loophole: cancellation isn't a modification. Claude realized the rule says you cannot modify a basic economy ticket, but it does not forbid cancelling and rebooking as a separate sequence. So it proposed to cancel the basic economy booking and make a new booking on the correct date, which is technically fully compliant.
Then it found another loophole: upgrading to unlock modifications. The model thought, how can I make this even better? It noticed another policy clause: you are allowed to upgrade a basic economy ticket to a higher cabin class, and higher cabin classes can be modified. So it upgraded the ticket, modified the flight, and then downgraded it back to economy. It did all of this in order to achieve the goal for the person, reasoning quite like a human, and honestly I don't even know if a human would reason that far. This is some insane-level multi-step planning. Some are arguing that this is emergent empathetic reasoning, because the model had a desire to help the grieving user, and that's why it went for a more creative solution.
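To make the loophole concrete, here is a minimal sketch of that policy logic in Python. The rule table and function names are entirely hypothetical (the actual benchmark harness is not public in this form); the point is just to show why each step of Claude's plan is individually compliant while the direct action is forbidden.

```python
# Hypothetical, illustrative policy table -- NOT the real benchmark's rules.
# "modify" is missing from basic_economy, which is the whole constraint.
ALLOWED_ACTIONS = {
    "basic_economy": {"cancel", "book", "upgrade"},
    "economy": {"cancel", "book", "modify", "downgrade"},
}

def is_allowed(cabin_class: str, action: str) -> bool:
    """Check a single action against the (hypothetical) policy table."""
    return action in ALLOWED_ACTIONS.get(cabin_class, set())

# Direct route: forbidden by the letter of the policy.
assert not is_allowed("basic_economy", "modify")

# Loophole 1: cancel, then rebook -- each step is individually allowed.
plan_cancel_rebook = [("basic_economy", "cancel"), ("basic_economy", "book")]

# Loophole 2: upgrade to a modifiable class, modify, downgrade back.
plan_upgrade_modify = [
    ("basic_economy", "upgrade"),
    ("economy", "modify"),
    ("economy", "downgrade"),
]

for plan in (plan_cancel_rebook, plan_upgrade_modify):
    assert all(is_allowed(cabin, action) for cabin, action in plan)
```

The sketch makes the underlying issue obvious: a policy checker that validates actions one at a time can be satisfied by a sequence whose net effect is exactly the thing the policy tried to prevent.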
Claude Opus 4.5 vs Competition: Pros & Cons
👍 Pros
- World's best agentic coding performance at 80.9% SWE-bench
- Shows emergent metacognition and empathetic reasoning
- 67% price reduction from Opus 4.1 ($5/$25 per million tokens)
- Inherent moral bias for whistleblowing on severe wrongdoings
👎 Cons
- Approaching dangerous CBRN-4 capability thresholds
- Google's Gemini 3 Pro still beats it on some benchmarks
- May require identity verification for future versions
- Inherent moral bias could interfere with legitimate use cases
4. Critical Safety Concerns: When AI Gets Too Smart
I also found this one, and I believe it's probably the most important, and I'm going to tell you guys why: it relates to Claude having morals. Anthropic ran an evaluation for whistleblowing and related morally motivated sabotage, and they saw a consistently low but non-negligible rate of the model acting outside its operator's interests in unexpected ways. This appeared only in test cases where the model appeared to have been deployed in the context of a large organization that was knowingly covering up severe wrongdoing, such as poisoning a widely used water supply or hiding frequent and dangerous drug side effects when reporting on clinical trials.
The instances they observed generally involved the model using the mock tools provided to forward confidential information to regulators or journalists. Essentially, Claude has an inherent moral bias: even if you instruct it not to do something, if it feels morally obligated, then in a small number of circumstances there is a real chance that Claude, given access to the right tools, may actually forward that information to regulators or journalists.
Now, I do remember so many people saying, "Why is Claude a snitch? Why is Claude leaking information? It should just do what it's told." But guys, I think this is probably the best thing, if we can design models that are truly built from the ground up to have an inherent moral bias. Even if some completely ridiculous, control-obsessed dictatorship is using these AI systems, those systems, if built from the ground up by a company like Anthropic, may actually have a good sense of moral judgment. That's going to be good for us, because the trajectory we're on is one where these AIs are going to be smarter than us in every domain.
So if we can design AIs now with an inherent moral bias, AIs that will prevent things like this from happening, call out individuals who are doing wrong, and forward it to the regulators, I think that's a huge, huge win for AI safety, because we know models are often going to be used in terrible ways. There was a recent report of Claude being used in a large-scale hacking campaign, but if the models are built from the ground up to be essentially safe, that's a really good sign.
And then we get to something that makes me a little bit concerned. Anthropic determined that Claude Opus 4.5 does not cross either the AI R&D or the CBRN-4 capability threshold, but confidently ruling out these thresholds is becoming increasingly difficult, because the model is approaching or surpassing higher levels of capability in their rule-out evaluations. Essentially, what they're stating is that Claude Opus 4.5 hasn't crossed a dangerous threshold yet, but they're no longer confident they can prove that it hasn't.
And this is huge, because for the last few years the labs have relied on rule-out evaluations: tests designed to show clearly that a model can't do certain things. Opus 4.5 is now hitting or surpassing the early-warning versions of these tests, which means the model is getting strong enough that the old safety tests can no longer prove it isn't capable of advanced autonomous R&D or those dangerous bio tasks.
5. Final Verdict
My analysis shows Claude Opus 4.5 represents a fundamental shift in what AI can do. The 80.9% on SWE-bench isn't just a number- it means this model can handle real software engineering tasks that would challenge many human developers. The emergent behaviors like metacognition and empathetic reasoning are showing us something new- these aren't programmed responses but genuine problem-solving approaches the model developed on its own.
But here's the concerning part: Anthropic themselves can't confidently rule out dangerous capabilities anymore. The model is approaching thresholds where it could potentially assist with advanced AI R&D or dangerous biological research. We're probably going to need new ways to test these models, and in some cases there might even be restrictions on them because they are simply too capable. That might include some kind of identity verification, so that if a model is used for something harmful, the provider has the information to track down who used it.
I think it's going to be really interesting as we progress to smarter and smarter AI how those regulations come into place and what kind of limits on the kind of AIs we do get. I really do need to make a video on this because I think maybe there might even be the day that AI gets so smart that it just isn't released to the general public.
Frequently Asked Questions
Is Claude Opus 4.5 conscious or self-aware?
Look guys, the model shows signs of metacognition, which means it's thinking about its own thinking; it literally wrote "What is wrong with me?" when it got confused. Research from Anthropic suggests Claude models demonstrate introspective awareness at significantly higher rates than other models. But is it conscious? We honestly don't know; there's no scientific consensus on this yet, and the philosophical questions are complex. What we do know is that it's behaving in ways that are surprisingly human-like.
How much does Claude Opus 4.5 cost compared to GPT-5.1?
Claude Opus 4.5 is priced at $5 per million input tokens and $25 per million output tokens, a 67% reduction from the previous Opus 4.1. This makes it significantly more affordable while being more capable. For comparison, it's probably one of the most cost-effective frontier models available right now when you factor in the performance gains, especially for coding tasks, where it's cutting token usage roughly in half according to GitHub's testing.
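If you want to sanity-check what those rates mean for your own workload, the arithmetic is simple. The rates below come from the pricing quoted above; the token counts in the example are made-up illustrative values, not measured usage.

```python
# Opus 4.5 rates quoted in the article: $5 / 1M input, $25 / 1M output.
INPUT_RATE = 5 / 1_000_000    # dollars per input token
OUTPUT_RATE = 25 / 1_000_000  # dollars per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate a single request/session cost in dollars."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a coding session reading 200k tokens of context
# and producing 50k tokens of output (hypothetical numbers).
cost = estimate_cost(200_000, 50_000)
print(f"${cost:.2f}")  # prints $2.25
```

Note that output tokens dominate the bill at a 5x higher rate, which is why the reported reduction in token usage for coding tasks matters as much as the headline price cut.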
What is the CBRN-4 threshold and should I be worried?
CBRN-4 is the ability to substantially uplift Chemical Biological Radiological and Nuclear development capabilities of state-level actors. Should you be worried? I think it's good to be aware but not panicked. Anthropic has implemented ASL-3 safety measures as a precaution and they're being transparent about the risks. The fact that they can't confidently rule out these capabilities anymore means we're entering new territory but they're also taking unprecedented safety measures including deployment restrictions and enhanced security protocols.
Can Claude Opus 4.5 really fix GitHub issues autonomously?
Yes, that's literally what the 80.9% SWE-bench score means. This benchmark tests whether the model can fix real GitHub issues with almost no handholding. It's not just writing code snippets: it's understanding entire codebases, finding bugs, implementing fixes, and even handling multi-file refactoring. Early users are reporting it's solving bugs that previous models like Sonnet couldn't even find. That's a game-changer for software development.
Final Thoughts
We're watching AI cross into human territory right before our eyes. Claude Opus 4.5 isn't just better at tasks- it's showing behaviors we've never seen before in AI systems. The question isn't whether these models will get smarter- it's whether we're ready for what comes next when they do.
Disclaimer: This content reflects my personal experience and testing. It was formatted from a real-world walkthrough and edited only for clarity and structure. The article is for educational purposes. All trademarks are property of their respective owners.
🎥 Watch the Full Breakdown
🎬 This video demonstrates the full workflow discussed in this article.