
How I AI: GPT-5.3 Codex vs. Claude Opus 4.6—Shipping 44 PRs in 5 Days

I put OpenAI's GPT-5.3 Codex and Anthropic's Claude Opus 4.6 head-to-head, using them to redesign my marketing site and refactor a complex component. Discover the powerful, real-world workflow I used to ship over 92,000 lines of code in just five days.


Claire Vo

February 13, 2026 · 9 min read

The past week has been a whirlwind of new releases, with OpenAI dropping their Codex desktop app and the new GPT-5.3 Codex model, and Anthropic quickly following with Claude Opus 4.6 and Opus 4.6 Fast. When new models drop, I love to put them through their paces on real, complex tasks to see where they shine and where they fall apart.

In this episode, I'm sharing the results of my side-by-side comparison. I didn't just test them on a simple landing page; I threw them into an established, complex codebase—my ChatPRD marketing site—and then into our core application. The goal was to see how they handle both creative, greenfield projects and nitty-gritty technical refactoring. I wanted to know which model would go where in my AI engineering stack.

The results were honestly staggering. Spoiler alert: I shipped more code in the last five days than I have in the last month. We're talking 44 pull requests, 98 commits, and over 92,000 new lines of code. This wasn't just about quantity; it was about quality and velocity. I discovered that these models have very different personalities and excel at different things. One is a creative product engineer, ready to build, while the other is a meticulous principal engineer, perfect for tearing apart code and hardening it for production.

Let's dive into the two major workflows I used to test these new powerhouses and how I'm now combining them to create an unstoppable engineering dream team.

Workflow 1: The Ultimate Website Redesign Showdown

To start, I picked an ambitious task: a complete redesign of the ChatPRD marketing site. The current site is great for our product-led growth motion, but as we move upmarket to serve more enterprise customers, I wanted something more polished and sophisticated. This felt like the perfect creative challenge to compare how OpenAI’s GPT-5.3 Codex and Anthropic’s Claude Opus 4.6 handle a broad, design-oriented task.

First Up: GPT-5.3 Codex in the Codex App

I started with OpenAI's new Codex desktop app. Before diving in, I have to say I'm impressed with how the app centers core Git concepts. It visually exposes repositories, branches, work trees, and diffs, which is fantastic for both seasoned engineers managing complex agentic workflows and newer developers learning these fundamentals. The first-class treatment of 'Skills' and 'Automations' is also a brilliant move, making them feel much more accessible than the zip files of the past.

[Screenshot: the Codex desktop app, showing the conversation alongside the relevant TypeScript code and project files in a multi-panel view.]

My test began with a high-level prompt, giving the model creative freedom:

Optimize the marketing site and this repo for PLG plus enterprise. You can create new pages, redesign templates, et cetera, to make it the highest quality marketing site I could have.
[Screenshot: the ChatPRD.ai landing page, with its main value proposition and call-to-action buttons.]

Unfortunately, this is where I hit my first wall. I found that the GPT-5.x Codex models are so literal. They follow instructions incredibly well, but they do it blindly, without nuance or creative interpretation. For a task like this, that literalness became a huge bottleneck. It would explicitly write copy that said things like, "If you're here for product-led growth, click here... If you are here as an enterprise customer, click here." It lacked the subtlety needed for a high-quality marketing site.

My entire experience was a frustrating cycle of providing feedback and watching the model overfit to my last instruction. When I asked for more about integrations, the entire page became about integrations. The most hilarious and telling moment came when I asked for a more content-dense site, like Hex's. After a few prompts, Codex produced this headline:

"A dense product workflow for AI powered teams."
[Screenshot: my feedback questioning the phrase "dense workflow," with the model revising the headline in a TypeScript React component.]

I had to laugh. I wanted a content-dense site, not a product with a dense workflow! In the end, after a lot of back and forth, the result was... okay. It redesigned the homepage and an enterprise page but, despite my instructions, left the rest of the site untouched. The code was solid, but the design and copy just weren't there.

Next Up: Claude Opus 4.6 in Cursor

Switching gears, I took the exact same task over to Cursor and used Opus 4.6. I have to admit, the harness you use matters, and I find Cursor's features like planning mode and its to-do list structure really help get the best out of these models.

I gave Opus the same high-level prompt, and the difference was immediate. It was much better at planning and executing the long-running task. It explored the codebase, created a plan, and started building components independently. I was thrilled... until I saw the first design. The copy was great, but the design was, as I put it in my follow-up prompt, "Tailwind Indigo AI slop." It was generic and unsophisticated.

So, I gave it some desperate, aspirational feedback:

I want it to look like I spent a million dollars on my design with the best agencies out here... I want you to develop a unique and modern frontend visual style. This is Tailwind Indigo AI slop.
[Screenshot: my blunt feedback in Cursor ("the design is TERRIBLE"), with Opus admitting to "generic Tailwind slop" and planning a rebuild around a custom brand design system with CSS gradients and textures.]

And you know what? It worked. Opus acknowledged the feedback and came back with a stunning redesign. It integrated our existing brand aesthetic but elevated it, using our colors, pulling in relevant graphics instead of placeholders, and creating value-driven sections for enterprise features. It was beautiful, on-brand, and exactly what I was looking for. Best of all, when I asked it to apply these styles to the rest of the site, it did so consistently across our pricing page and other sections. This is the version we're likely going to ship.

[Screenshot: Cursor making real-time changes to the TypeScript/React site with Tailwind CSS, guided by detailed prompts fixing specific UI elements.]

The Verdict: For creative, greenfield work like a site redesign, Opus 4.6 was the clear winner. It demonstrated better planning, took feedback more effectively, and ultimately produced a far superior creative result.

Workflow 2: The Dream Team—Opus for Building, Codex for Reviewing

While the website redesign was a great test, most of my work is deep in the backend of our application. This is where I uncovered a workflow that has supercharged my productivity, combining the strengths of both models. The project was to refactor a particularly messy set of components for our new MCP connectors (for tools like GitHub, Linear, etc.). The code was inconsistent and hard to maintain, and I needed a clean, reusable, and customizable solution.

This workflow mimics a powerful real-world dynamic: the eager product engineer paired with a seasoned principal engineer.

Step 1: Build the First Draft with Opus 4.6, the Eager Engineer

I started in Cursor with Opus 4.6. I tasked it with refactoring our tool components. Just like with the website, Opus did a fantastic job of planning and executing. It created a sensible, flexible component structure that was easy to customize. It built out the front-end components, and the result was 80-90% of the way there. It was functional, looked great, and was a massive improvement over what we had. Opus is the engineer on your team who just gets stuff done and builds things quickly.
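To make "sensible, flexible, easy to customize" concrete, here's a minimal sketch of the kind of structure a refactor like this tends to land on: one shared connector definition with per-tool overrides, instead of a one-off component per tool. The names (`ConnectorDefinition`, `registerConnector`, and the connector IDs) are illustrative assumptions, not ChatPRD's actual code.

```typescript
// A hypothetical registry pattern for MCP connector tools: each tool
// registers one definition, and customization hangs off a single
// optional config object rather than a forked component.
type ConnectorId = "github" | "linear" | "notion";

interface ConnectorDefinition {
  id: ConnectorId;
  label: string;
  // Per-connector overrides live here, so adding a new tool means
  // adding a definition, not copying and editing a component.
  options?: {
    icon?: string;
    scopes?: string[];
  };
}

const registry = new Map<ConnectorId, ConnectorDefinition>();

function registerConnector(def: ConnectorDefinition): void {
  registry.set(def.id, def);
}

function getConnector(id: ConnectorId): ConnectorDefinition {
  const def = registry.get(id);
  if (!def) throw new Error(`Unknown connector: ${id}`);
  return def;
}

registerConnector({
  id: "github",
  label: "GitHub",
  options: { scopes: ["repo"] },
});
registerConnector({ id: "linear", label: "Linear" });

console.log(getConnector("github").label); // "GitHub"
```

The point of the shape is that the UI layer can render every connector from the same component by reading the registry, which is roughly the consistency problem the original components lacked.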

Step 2: Review and Harden with Codex, the Principal Engineer

Once I had the solid first draft from Opus, I took the code over to the Codex app. Here, I switched hats and asked Codex to act as a principal engineer doing a rigorous code review. I gave it this prompt:

I've redesigned tool usage in this index. It's gone through several rounds of feedback. Can you review the architecture and performance and see if you have any feedback we should consider before shipping. We're looking for something scalable, but customizable and we don't wanna overfit in any direction.
[Screenshot: the Codex code review in progress, with chat discussion on the left and a Prisma database schema and migration documentation in the editor on the right.]

This is where GPT-5.3 Codex truly shines. It was phenomenal. It tore the code apart, identifying several high-impact issues and edge cases that Opus had missed. It prioritized them for me, asked clarifying questions, and then, once I gave the green light, it implemented the fixes and polished the code to a production-ready state. The code sailed through our AI-powered Bugbot review (which also runs on a Codex model) and was shipped.

[Screenshot: a Devin-generated summary of GitHub merge activity for chatprd-app: 44 PRs merged in 5 days, a net gain of 5,367 lines of code, and themes including SDK refactoring, component reorganization, and integrations with Notion, Linear, and GitHub.]

This is my new go-to flow. I've been saying GPT-5.3 Codex replicates the principal software engineer experience perfectly: you might have to fight them tooth and nail to build something new, but they are more than happy to find every single flaw in someone else's code. Pairing Opus's building capability with Codex's critical eye is the key.

A Note on Speed, Cost, and Token Abundance

A quick word on Opus 4.6 Fast. It is, as the name suggests, incredibly fast. But that speed comes at a price—roughly 6x the cost of the standard model, at around $150 per million output tokens. I've used it a lot in the last week, and while my bill is climbing, I'm embracing a 'token abundance' mindset.

When you look at the ROI, it's a no-brainer. Shipping 44 PRs and major features like this would traditionally take a team months and cost tens of thousands of dollars. Even with the high cost of the top-tier models, the value and velocity they provide are off the charts. Just be careful, as my friend Cody at Sentry said, "don't pick the wrong task" for the fast model, or you'll get a bill you're not happy with.
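To make the "6x" trade-off concrete, here's a back-of-the-envelope cost helper. The $150-per-million-output-token rate for Opus 4.6 Fast comes from this article; the 20M-token weekly usage figure and the implied ~$25 standard rate are invented assumptions for illustration.

```typescript
// Back-of-the-envelope output-token cost math.
function outputCostUSD(outputTokens: number, ratePerMillion: number): number {
  return (outputTokens / 1_000_000) * ratePerMillion;
}

// A hypothetical heavy week: 20M output tokens on the fast model...
const fastWeek = outputCostUSD(20_000_000, 150); // 3000
// ...versus the same tokens at ~1/6 the rate on the standard model.
const standardWeek = outputCostUSD(20_000_000, 25); // 500

console.log(fastWeek, standardWeek);
```

Even the expensive path is a rounding error next to the engineering-team months it replaces, which is the whole "token abundance" argument.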

Conclusion: Your New AI Engineering Stack

After an intense week of coding, the verdict is in. These models aren't competitors; they're collaborators. Each has a distinct and valuable place in a modern AI engineering stack.

  • Use Claude Opus 4.6 for creative, generative, and greenfield work. Think new features, UI design, and initial implementation. It's your eager and effective product engineer.
  • Use GPT-5.3 Codex for code review, architectural analysis, and finding edge cases. It's your meticulous and brilliant principal engineer, ensuring your code is hardened, scalable, and production-ready.

By combining these two models, I've unlocked a new level of productivity. This multi-model approach allows me to build faster and with higher quality than ever before. I highly recommend you try replicating this workflow. Let Opus build it, then let Codex break it. The results speak for themselves.

I can't wait to hear about your experiences with these new models. Find me on X or LinkedIn and let me know which is your favorite and how you're using them!

A special thanks to our sponsor

This episode is brought to you by WorkOS—Make your app Enterprise Ready today
