
I Was Skeptical About AI Agents. Here's What Finally Changed My Mind.


There is a certain kind of person I respect more than most when it comes to AI. Not the hype merchant who tells you AI will replace everyone. Not the doomer who says it is all smoke and mirrors. The person I respect is the careful sceptic who says: "I tried it, it didn't work, I kept watching, and here is what I found."

Written by Derek Chua, digital marketing consultant and founder of Magnified Technologies. Derek runs multi-agent AI systems in production and writes about practical AI adoption for business owners.

Key Takeaway: The difference between frustrating AI agent results and genuinely useful ones is almost never the model. It is the absence of a clear operating brief. Write down what "good" looks like, and AI agents start behaving like disciplined colleagues.

Max Woolf is one of those careful sceptics.

Woolf is a data scientist with serious coding chops. Last year, he wrote a post explaining why he did not actually use generative AI that often, despite the hype. His case was reasonable: agents were unpredictable, expensive, and the results were mediocre compared to the noise around them. He wasn't dismissing the technology. He was being honest.

Then November 2025 happened. And he changed his mind.


The Moment Things Clicked

Woolf had been watching AI models closely. When Anthropic released Claude Opus 4.5, he tested it properly. Not a quick demo. He gave it a real project: scrape YouTube video metadata using an API that is notoriously poorly documented, store everything in a database, write clean and robust code.

It worked first try.

Not a rough prototype. Clean, production-quality code that followed every constraint he had set. Better than scripts he had written himself years earlier.

He kept going. Bigger projects. More ambitious prompts. Port a machine learning library from Python to Rust. Build tools that would normally take months. The model kept delivering.

His conclusion, written plainly: "It is impossible to publicly say Opus 4.5 (and the models that came after it) are an order of magnitude better than coding LLMs released just months before, without sounding like an AI hype booster. But it is the counterintuitive truth."


Why Most People Are Still Getting Bad Results

Here is the part of Woolf's story that most articles will skip over. His results were not just about the model improving. He had also figured out something important.

He calls it AGENTS.md. The idea is simple: before you set an AI agent loose on a task, you give it a detailed set of rules. How to format code. What libraries to use. What to avoid. What counts as a good result. The most important rules are written in all caps for emphasis.

Think of it as a standing brief for your AI assistant. Not a prompt for a single task. A set of standards that shapes everything the agent does.
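To make this concrete, here is a minimal sketch of what an AGENTS.md file might contain. The specific rules below are illustrative inventions, not Woolf's actual file; yours should reflect your own stack and standards:

```markdown
# AGENTS.md

## Code style
- Write Python 3.12+ with full type hints.
- NEVER leave comments that simply restate the code.
- NEVER add emoji to code, commit messages, or output.

## Libraries
- Prefer the project's existing dependencies; ask before adding new ones.
- Use the standard library where it is sufficient.

## Definition of done
- Code runs end-to-end without manual fixes.
- Every constraint above is satisfied before presenting a result.
```

The agent reads this file automatically at the start of every session, so the rules apply to every task without being repeated in each prompt.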

Once Woolf added his AGENTS.md, the quality jump was significant. Without it, the agent would drift, leave redundant comments, use tools he didn't want, add unnecessary emoji. With it, the agent behaved like a disciplined colleague who understood his working style.


This Is Not Just a Coding Story

At Magnified, I have been running multi-agent marketing systems in production for months. The parallel to what Woolf describes is exact.

The difference between an AI agent that delivers and one that frustrates you almost always comes down to one thing: how well you have defined what "good" looks like. Not in the moment, with a long prompt. As a standing document that the agent reads every time.

My CMO agent has a brief. My SEO agent has writing guidelines. They know the house style, the rules, what gets rejected automatically. When I first set them up without that scaffolding, the output was inconsistent. Once I wrote down the standards clearly, consistency improved overnight.

What Woolf calls AGENTS.md, I think of as an operating manual for your AI team. And just like a human team, agents perform better when expectations are written down, not assumed.


The November Inflection Point Is Real

One thing Woolf captures that deserves more attention: late 2025 was a genuine step change, not just incremental improvement.

I saw it too. The gap between what these models could do in mid-2025 versus November onwards is not subtle. The quality of reasoning, the ability to follow complex instructions, the consistency of output: all of it moved.

If you tried AI agents earlier and gave up, that experience may not represent what is available now. The tools changed. The results changed. Some scepticism deserves an update.


What This Means for Your Business

You don't need to be a data scientist to take something practical from this.

The lesson is not "AI agents are now perfect." They are not. Woolf still reviews everything manually before committing. He still has to catch errors. The agent still gets things wrong when the spec is vague.

The lesson is: the gap between frustrating results and genuinely useful results is often not the model. It is the absence of a clear operating brief.

If you are testing AI tools in your business and getting inconsistent output, ask yourself: have you written down what good looks like? Not just for this task, but as a standing standard?

That document, whatever you want to call it, is probably the leverage point you are missing.


Frequently Asked Questions

What is an AI agent, and how is it different from a chatbot? A chatbot responds to individual questions. An AI agent is given a goal and figures out the steps to complete it, often taking multiple actions in sequence without being prompted at each step. Agents can write code, browse the web, organise files, and coordinate with other agents, making them far more capable than simple Q&A tools.

Why did AI agents get so much better in late 2025? The models released from mid-2025 onwards, particularly Claude Opus 4.5 and similar frontier models, showed a step-change in their ability to follow complex instructions and maintain consistency across long tasks. It wasn't one single improvement but a combination of better reasoning, longer context, and more reliable instruction-following. The practical gap between November 2025 models and those from earlier in the year is significant enough that your previous experience may no longer represent what is possible.

How do I write a good operating brief for an AI agent? Start with three things: what the agent's job is (not just the task, but the ongoing role), what good output looks like (format, tone, specific standards), and what to avoid (common errors, style violations, tools you don't want used). Keep it as a standing document and update it when you notice consistent problems. It doesn't need to be long. A clear one-page brief outperforms a vague ten-page one.
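Applied to a marketing context, a brief built on those three parts might look like the sketch below. Everything in it is a made-up illustration of the structure, not an actual production brief:

```markdown
# SEO Writing Agent — Operating Brief

## Role
Draft blog articles for small-business readers. This is an ongoing
role, not a one-off task.

## What good looks like
- 800 to 1,200 words, short paragraphs, plain English.
- British spelling throughout.
- One clear takeaway per article, stated early.

## What to avoid
- NEVER use hype language ("revolutionary", "game-changing").
- No keyword stuffing: mention the target phrase at most three times.
- Do not invent statistics, quotes, or case studies.
```

One page, concrete standards, explicit rejection criteria. Update it whenever the same problem shows up twice.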

Should my business invest in AI agents now, or wait? If you have a clearly defined, repeatable task with a measurable output, you have enough to experiment now. The case for waiting gets weaker as the models improve. The cost of running a small test is low. The cost of falling behind while competitors build the operating knowledge is higher. Start with one use case, write a brief, and run it for a month before deciding whether to scale.


Inspired by Max Woolf's detailed account of converting from AI agent sceptic to practitioner, highlighted by Simon Willison.