This week, I had Hamel Husain on the show, an AI consultant and educator who demystified the process of debugging errors in your AI product and writing effective evaluations (evals). He also showed us how he runs his entire business using sophisticated AI workflows.
Building AI products is a new frontier for many of us, especially product managers. The technical intricacies, the non-deterministic nature of large language models, and the sheer breadth of data make ensuring high quality, consistency, and reliability a truly challenging problem.
What I love about what Hamel shared with us is his systematic approach to quality, moving beyond mere “vibe checks” to implement data-driven processes that yield real, measurable improvements. He showed us that while the landscape is new, the fundamentals, like looking at data, are the same, just with an AI twist.
Hamel dove into two distinct, powerful workflows: first, a systematic method for identifying and fixing errors in AI products; second, a peek into his personal AI-powered operations, revealing how he leverages Claude and Gemini within a GitHub monorepo to automate and streamline his entire business.
Workflow 1: Systematic Error Analysis for AI Products
When building AI products, a common challenge is that AI fails in weird, often non-obvious ways. You fix one prompt, and you're not sure if you're breaking something else or genuinely improving the system as a whole. Hamel's first workflow tackles this head-on with a structured approach to error analysis that helps teams identify, categorize, and prioritize AI failures.
Step 1: Log and Examine Real User Traces
The fundamental starting point, according to Hamel, is to look at your data. For AI products, this means examining “traces”—the full, multi-turn conversations and interactions your AI system has with real users. These traces capture not just user prompts and AI responses, but also internal events like tool calls, retrieval augmented generation (RAG) lookups, and system prompts. This is where you see how users actually interact with your AI, often with vague inputs or typos, which is crucial for understanding real-world performance.
- Tools: Platforms like Braintrust or Arize are designed for logging and visualizing these AI traces. You can also build your own logging infrastructure (a minimal sketch of what a logged trace might look like follows this list).
- Process: Collect real user interactions from your deployed system. If you're just starting, you can generate synthetic data, but Hamel emphasizes that real user data reveals the true distribution of inputs.
- Example: Hamel demonstrated this with Nurture Boss, an AI assistant for property managers. He showed a trace where a user asked, "Hello there, what's up to four month rent?"—an ambiguous query that highlights how real users deviate from ideal test cases.
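To make "trace" concrete, here is a minimal sketch of what home-grown trace logging might look like if you are not using a platform like Braintrust or Arize. The schema (trace_id, event kinds, file name) is purely illustrative, not any vendor's format or Nurture Boss's actual system:

```python
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("traces.jsonl")  # hypothetical local log file

def new_trace(channel: str) -> dict:
    """Start a trace: one record per full conversation, not per message."""
    return {"trace_id": str(uuid.uuid4()), "channel": channel, "events": []}

def log_event(trace: dict, kind: str, payload: dict) -> None:
    """Append any event: user_message, system_prompt, tool_call, rag_lookup, assistant_message."""
    trace["events"].append({"ts": time.time(), "kind": kind, **payload})

def flush(trace: dict) -> None:
    """Write the finished conversation as one JSONL line for later review."""
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(trace) + "\n")

# Example: the kind of real-world trace Hamel showed (vague wording and all)
trace = new_trace(channel="sms")
log_event(trace, "system_prompt", {"text": "You are a leasing assistant for ..."})
log_event(trace, "user_message", {"text": "Hello there, what's up to four month rent?"})
log_event(trace, "rag_lookup", {"query": "four month rent", "docs_returned": 3})
log_event(trace, "assistant_message", {"text": "Our 4-month lease starts at ..."})
flush(trace)
```

The point is simply that every turn and internal event lands in one reviewable record per conversation.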

Step 2: Perform Manual Error Analysis
This is the surprisingly effective “low-tech” part. Instead of immediately looking for automated solutions, you manually review a sample of traces and document what went wrong.
This process, known as “open coding” or journaling, involves reading through conversations and making one-sentence notes on every error you find. The key is to stop at the most upstream error in the sequence of events, as this is typically the causal root of downstream problems.
- Process: Randomly sample about 100 traces (a quick sampling sketch follows this list). For each trace, read until you hit a snag—an incorrect, ambiguous, or high-friction part of the experience. Write a concise note about the error.
- Insight: Focusing on the most upstream error is a heuristic to simplify the process and get fast results. Fixing early intent clarification or tool call issues often resolves many downstream issues.
- Example Note: For the "what's up to four month rent?" query, Hamel's note was: "Should have asked follow up questions about the question, what's up with four month rent? Because it's unclear user intent."
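If your traces live in a JSONL file like the earlier sketch, the sampling half of this step is a few lines of Python; the note-writing itself stays human. File and field names below are carried over from that hypothetical schema:

```python
import csv
import json
import random
from pathlib import Path

random.seed(0)  # reproducible sample, so teammates review the same traces

traces = [json.loads(line) for line in Path("traces.jsonl").open()]
sample = random.sample(traces, k=min(100, len(traces)))

# One row per sampled trace; the "note" column gets filled in by a human during open coding.
with open("review_queue.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["trace_id", "first_user_message", "note"])
    writer.writeheader()
    for t in sample:
        first_user = next((e["text"] for e in t["events"] if e["kind"] == "user_message"), "")
        writer.writerow({"trace_id": t["trace_id"], "first_user_message": first_user, "note": ""})
```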
Step 3: Create a Custom Annotation System
To make the manual review faster and more efficient, Hamel recommends building a custom annotation system. This could be a simple internal app or a customized view within an observability platform. The goal is to remove friction, allowing human annotators (often product managers or subject matter experts) to quickly categorize and label issues.
- Tools: While platforms like Braintrust and Phoenix offer annotation features, a custom app can be tailored to your specific needs, channels (text message, email, chatbot), and metadata.
- Benefits: Streamlines the review process, keeps the output human-readable, and is easy to "vibe code" quickly so annotators can fly through the data (a bare-bones sketch follows below).
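As a flavor of how bare-bones such an app can be, here is a sketch of a tiny Streamlit annotation view over the hypothetical traces.jsonl and review_queue.csv files from the earlier sketches. This is not Hamel's or Nurture Boss's tooling, just one low-friction way to page through traces and capture notes:

```python
# Run with: streamlit run annotate.py  (requires a recent Streamlit version for st.rerun)
import json
from pathlib import Path

import pandas as pd
import streamlit as st

traces = {t["trace_id"]: t for t in map(json.loads, Path("traces.jsonl").open())}
queue = pd.read_csv("review_queue.csv")
queue["note"] = queue["note"].fillna("")

if "idx" not in st.session_state:
    st.session_state["idx"] = 0
idx = st.session_state["idx"]

row = queue.iloc[idx]
st.caption(f"Trace {idx + 1} of {len(queue)}")

# Show the whole conversation, tool calls and all, so the reviewer has full context.
for event in traces[row["trace_id"]]["events"]:
    st.markdown(f"**{event['kind']}**: {event.get('text', '')}")

note = st.text_area("One-sentence note on the most upstream error (leave blank if no error)")
if st.button("Save and next"):
    queue.loc[idx, "note"] = note
    queue.to_csv("review_queue.csv", index=False)
    st.session_state["idx"] = min(idx + 1, len(queue) - 1)
    st.rerun()
```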

Step 4: Categorize and Prioritize Errors by Frequency Counting
Once you have a collection of notes, the next step is to categorize them. You can use an LLM like ChatGPT to help bucket notes into common themes, though some back-and-forth might be needed to refine categories. The ultimate goal is simple: count the frequency of each error category. This frequency count provides a clear, prioritized list of problems to address.
- Process: Aggregate all your notes. Use an LLM or manual review to group similar notes into error categories (e.g., "transfer and handoff issues," "tour scheduling issues," "incorrect information"). Count how many times each category appears (a few lines of code, sketched after this list).
- Outcome: This gives you a data-driven roadmap for product improvements. For Nurture Boss, this revealed common problems like AI not handing off to a human correctly or repeatedly scheduling tours instead of rescheduling them.
- Key Insight: "Counting is powerful." This simple metric provides objective confidence in what to work on, moving past paralysis and guesswork.
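Once the notes are bucketed, the "counting" really is just counting. Here is a sketch assuming a category column has been added to the hypothetical review queue from the earlier steps (by an LLM pass or by hand):

```python
from collections import Counter

import pandas as pd

# Assumes review_queue.csv now has a "category" column; file and column names are hypothetical.
notes = pd.read_csv("review_queue.csv")
labeled = notes.dropna(subset=["category"])

counts = Counter(labeled["category"])
for category, n in counts.most_common():
    print(f"{n:4d}  {category}")

# Illustrative output (numbers invented):
#   22  transfer and handoff issues
#   17  tour scheduling issues
#   11  incorrect information
```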

Step 5: Write Targeted Evaluations (Evals)
With prioritized error categories, you can now write specific evaluations to test for these issues at scale. Evals fall into two main types (both are sketched after the list below):
- Code-based Evals: For objective, deterministic checks. If you know the exact right answer or can check for specific patterns (e.g., user IDs not appearing in responses), you can write unit tests. An excellent example is ensuring sensitive information (like UIDs from system prompts) doesn't leak into user-facing outputs.
- LLM Judges: For subjective problems that require nuanced understanding. If an error like a "transfer handoff issue" is more ambiguous, an LLM can act as a judge. However, it's critical to set these up correctly:
  - Binary Outcomes: LLM judges should output binary (yes/no, pass/fail) results for specific problems, not arbitrary scores (like a "helpfulness score" of 4.2 vs. 4.7, which is meaningless).
  - Validation: You must hand-label some data and compare the LLM judge's output to human labels. This measures the "agreement" and builds trust in your automated evaluations. Without this, you risk showing "good" eval scores while users experience a "broken" product, eroding trust.
  - Context: The research paper "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences" emphasizes that humans are bad at writing specifications until they react to an LLM's output. The error analysis process helps externalize those needs, refining your LLM judge prompts.
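Here is a minimal sketch of both flavors: a deterministic check that internal UIDs never leak into user-facing responses, and a simple agreement calculation between an LLM judge's binary labels and human labels. The UID pattern and the label data are invented for illustration; substitute your own formats and annotation data:

```python
import re

# --- Code-based eval: deterministic check that internal UIDs never reach the user. ---
# The UID format is hypothetical; use whatever your system prompt actually embeds.
UID_PATTERN = re.compile(r"\buser_[0-9a-f]{8}\b")

def no_uid_leak(response_text: str) -> bool:
    """Pass/fail: True if the user-facing response contains no internal UID."""
    return UID_PATTERN.search(response_text) is None

assert no_uid_leak("Your tour is confirmed for Tuesday at 2pm.")
assert not no_uid_leak("Confirmed for user_3f9a12bc, Tuesday at 2pm.")

# --- LLM judge validation: compare binary judge labels to human labels on the same traces. ---
# Labels are illustrative; in practice both lists come from your annotated data.
human_labels = [True, True, False, True, False, False, True, False]  # True = handoff handled correctly
judge_labels = [True, True, False, False, False, False, True, True]

agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
print(f"Judge/human agreement: {agreement:.0%}")  # only trust the judge once this is high enough
```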

Step 6: Iterate and Improve with Prompt Engineering or Fine-Tuning
Once you have reliable evals deployed, you can continuously monitor performance and identify where errors persist. The improvements could involve simple prompt engineering (e.g., adding today's date to a system prompt so the AI understands "tomorrow"), or in more advanced cases, fine-tuning your models with specific "difficult examples" identified during error analysis. Retrieval issues (in RAG systems) are often an Achilles' heel and a common area for improvement.
- Techniques: Experiment with prompt structures, add more examples to prompts, or even fine-tune models with data derived from your identified errors. As I learned with ChatPRD, even two incorrect words in a monster system prompt can significantly degrade tool calling quality.
- Advanced Analytics: For agent-based systems with multiple handoffs, you can use analytical tools like transition matrices to pinpoint where errors are most likely to occur between agent steps (e.g., from the generate-SQL step to the execute-SQL step); a toy example follows below.
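The mechanics behind a transition matrix can be as simple as grouping (previous step, next step) pairs and computing a failure rate per transition. A toy sketch with hypothetical step names and invented data:

```python
import pandas as pd

# Each row: one step executed by the agent pipeline, the step that preceded it,
# and whether this step failed. Step names and values are purely illustrative.
steps = pd.DataFrame(
    [
        {"from_step": "plan_query",   "to_step": "generate_sql", "failed": False},
        {"from_step": "generate_sql", "to_step": "execute_sql",  "failed": True},
        {"from_step": "generate_sql", "to_step": "execute_sql",  "failed": False},
        {"from_step": "execute_sql",  "to_step": "summarize",    "failed": False},
        {"from_step": "generate_sql", "to_step": "execute_sql",  "failed": True},
    ]
)

# Failure rate for each transition: where in the pipeline do things most often break?
failure_rates = steps.groupby(["from_step", "to_step"])["failed"].mean().unstack()
print(failure_rates)
```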

Workflow 2: Hamel Husain's AI-Powered Business Operations
Beyond product quality, Hamel runs his entire consulting and education business using AI as a co-pilot. His approach prioritizes efficiency, context management, and staying provider-agnostic, all managed through a single monorepo.
Step 1: Centralized "Claude Projects" for Every Business Function
Hamel uses Claude (and previously Claude's "projects" feature) to create dedicated, context-rich environments for different aspects of his business. Each "project" is essentially a detailed instruction set, often accompanied by examples, that helps Claude perform specific tasks.
Examples: He has projects for copywriting, a legal assistant, consulting proposals, course content generation, and creating "Lightning Lessons" (lead magnets).
Consulting Proposals Workflow
When a client requests a proposal, Hamel feeds the call transcript into his "Consulting Proposals" project. This project contains context about his skills (e.g., "partner of Palantir's, expert generative AI"), instructions (e.g., "get to the point, writing short sentences"), and numerous examples. Claude then generates a near-ready proposal that requires only about a minute of editing.
Course Content Workflow
For his Maven course on evals, Hamel has a Claude project loaded with the entire course book, an extensive FAQ, transcripts, and Discord messages. This project helps him create standalone, interesting FAQs and other educational content, guided by a prompt that emphasizes concise, filler-free writing.

Step 2: Custom Software for Content Transformation with Gemini
Hamel has developed custom software to automate content creation, particularly transforming video content into accessible, readable formats. This leverages the power of multimodal models like Gemini.
- Workflow: He takes a YouTube video and uses his software to create an annotated presentation. The system pulls the video transcript and, if the video has slides, screenshots each slide and generates a summary underneath of what was said while it was on screen. This lets him consume a one-hour presentation in minutes.
- Tools: Gemini models are particularly brilliant for video ingestion, pulling the transcript, video, and slides all at once to produce comprehensive, structured summaries (a rough sketch of this kind of call follows this list).
- Application: This is invaluable for Hamel's educational work, helping him distribute notes and make complex content digestible for his students.
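Hamel's tool is custom software he built, so the details differ, but to give a flavor of what Gemini video ingestion can look like, here is a rough sketch using Google's google-generativeai Python SDK. The model name, prompt, and file-handling flow are my assumptions, not his implementation; check the current SDK docs before relying on it:

```python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # in practice, load the key from your environment

# Upload the talk recording; video files must finish server-side processing before use.
video = genai.upload_file("talk.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")  # model choice is an assumption; pick what's current
prompt = (
    "For each slide shown in this talk, give the approximate timestamp, the slide title, "
    "and a short summary of what the speaker says while it is on screen."
)
response = model.generate_content([video, prompt])
print(response.text)
```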

Step 3: The GitHub Monorepo: The "Second Brain" for AI Workflows
The most fascinating aspect of Hamel's setup is his GitHub monorepo. This private repository serves as his central "second brain," housing all his data sources, notes, articles, personal writings, and, crucially, his collection of prompts and tools. This approach allows him to provide his AI co-pilots (like Claude Code or Cursor) with a unified, comprehensive context for everything he does.
- Structure: The monorepo contains everything from his blog and the YouTube transcription project to copywriting instructions and proposals. Everything is interrelated.
- AI Access: He points his AI tools at this repo, providing a set of "Claude rules" within the repo itself. These rules instruct the AI on where to find specific information or context for different writing or development tasks (e.g., "if you need to write, look here").
- Benefits: This prevents vendor lock-in, ensures all context is available to the AI, and creates a highly organized, prompt-driven system for managing complex information and generating content. It's an engineer's dream for managing data and prompts in a way that truly scales personal productivity.

Conclusion
This episode was a masterclass in how to approach AI product development and personal productivity with rigor and intentionality. We learned that the path to higher quality AI products is all about systematic data analysis, diligent error identification, and thoughtful evals. Hamel's pragmatic advice to "do the hard work" of looking at real data, annotating errors, and validating your LLM judges is truly empowering for any team building with AI.
His personal workflows also offered a glimpse into a highly efficient, AI-powered future for business operations. Hamel showed us how to build a flexible, powerful system that reduces toil and scales expertise.
Whether you're a product manager debugging an AI chatbot or an entrepreneur looking to automate your daily tasks, Hamel's insights provide actionable strategies to move your AI initiatives from good to great. I highly encourage you to explore his website and his Maven course to dive deeper into these invaluable techniques.
Sponsor Thanks
Brought to you by
GoFundMe Giving Funds—One Account. Zero Hassle.
Persona—Trusted identity verification for any use case
Episode Links