Dominic Plouffe (CTO)

Big data + agents. Less hype, more systems.

Author: Dominic Plouffe

  • AI Financial Modeling Battle: Shortcut vs Claude vs ChatGPT Performance Data (2026 Analysis)

    AI Financial Modeling Battle: Shortcut vs Claude vs ChatGPT Performance Data (2026 Analysis)

    It is tempting to turn this into a horse race. Shortcut beat Claude. Claude beat Copilot. Copilot beat ChatGPT. Done.

    That framing is wrong.

    The better question is the one finance teams actually care about at 11:40 PM: which system gives you outputs you can audit, reuse, and trust when the model still has to go to a VP before midnight? That is a harsher standard than “which model scored highest on a benchmark.” It is also the only standard that matters if the file has to survive review.

    Wall Street Prep’s 2026 testing makes the gap plain. Shortcut led with a 5.9/10 score, followed by Claude at 5.5, Microsoft Copilot at 4.4, and ChatGPT at 2.5. Human analysts still scored much higher: 6.4 for lower-tier analysts, 7.9 for mid-tier, and 9.4 for top analysts. So yes, AI can help. No, it is not “replace the analyst” help. Not remotely.

    And the gap is not academic. A 5.9/10 tool might help draft a debt schedule, clean assumptions, or explain a variance. That does not mean you should let it build an acquisition model end to end without review. A 47.78% portfolio return sounds flashy too, but portfolio selection is not the same job as building an audit-ready three-statement model. People keep jamming those together, and the result is bad buying decisions.

    🥇 The 2026 ranking is real, but the scores need context

    On the narrow question of AI financial modeling performance, Wall Street Prep’s test gives us the cleanest public comparison. Shortcut came first at 5.9/10. Claude was close behind at 5.5. Microsoft Copilot landed at 4.4. ChatGPT trailed at 2.5.

    Useful result. Incomplete result.

    What does a 5.9 actually buy you at 10:53 PM on a live deal? Usually better structure, cleaner handling of finance-specific prompts, and fewer obvious logic breaks than the other models in that test. It does not tell you whether the system can survive a real workflow with messy lender files, management assumptions that change twice in one evening, and a reviewer who wants to trace every output cell back to a formula chain.

    This is where raw rankings get annoying. A benchmark score compresses several failure modes into one number. Finance teams do not experience failure as an average. They experience it when the model hardcodes a number where a formula should be, hides a circularity, builds a revenue schedule that looks polished but does not tie to units and price, or spits out an output tab nobody wants to defend in an IC meeting.

    That last one happens more than people admit.

    Shortcut’s lead matters, but the more useful question is why. The stronger tools tend to stay closer to spreadsheet logic, preserve structure, and produce work another analyst can check without playing detective for 40 minutes. That is why specialized finance products keep pulling ahead of general chat models in real production work.

    Why specialized tools keep beating general models in production work

    A generic model can sound smart. That is not the same as being useful in Excel.

    Specialized financial AI tools are winning the contest that actually matters: can they generate formulas, preserve model logic, and leave an audit trail? That is where products such as Apers AI and Microsoft Copilot for Finance have an edge. The research brief points to Apers AI’s formula-first architecture for real estate private equity and to Copilot’s native Microsoft 365 integration, including the ability to work directly with Excel and run Python inside Excel cells.

    Mechanism matters here. If a tool gives you a static NOI value, you cannot do much with it. If it gives you an Excel formula tied to lease-up assumptions, rent growth, concessions, and operating expenses, now you have something a team can inspect, change, and reuse next quarter. One is a chat answer. The other is a model component.

    Picture a real estate associate updating a 27-tab underwriting model for a 312-unit multifamily deal. The useful system does not reply, “Projected Year 2 NOI is $4.8 million.” It writes the rent roll logic, links concessions to the assumptions tab, flags the vacancy step-up, and leaves the formula visible. If the VP asks why Year 3 margin expands by 140 basis points, the associate can answer inside Excel. If the model just hands back a number in a chat window, that conversation ends right there.
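
    To make the distinction concrete, here is a minimal sketch using openpyxl. The cell layout, driver names, and numbers are made up for illustration; the only point is the difference between writing a hardcoded number into a cell and writing a formula that stays tied to its drivers.

    ```python
    # Minimal sketch: a chat-style answer (hardcoded number) versus a model
    # component (a visible, auditable formula). Cell addresses and driver
    # names are hypothetical, not taken from any specific tool.
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.title = "NOI"

    # Driver cells an analyst can inspect and change
    ws["A1"], ws["B1"] = "Gross potential rent", 6_500_000
    ws["A2"], ws["B2"] = "Vacancy rate", 0.07
    ws["A3"], ws["B3"] = "Concessions", 180_000
    ws["A4"], ws["B4"] = "Operating expenses", 1_450_000

    # The chat-answer version: a static number nobody can trace
    ws["A6"], ws["B6"] = "Year 2 NOI (hardcoded)", 4_800_000

    # The model-component version: a formula that stays tied to the drivers
    ws["A7"] = "Year 2 NOI (formula)"
    ws["B7"] = "=B1*(1-B2)-B3-B4"

    wb.save("noi_sketch.xlsx")
    ```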

    Copilot’s advantage is not that it is magically smarter than every other model. It sits next to the workflow people already use. For a finance team living in Excel, Outlook, Teams, and Power BI, that adjacency cuts friction. You can pull data, test formulas, draft commentary, and keep work inside a governed environment instead of copying numbers back and forth between tabs and chat windows.

    I would take a slightly weaker model with transparent formulas over a more eloquent model with hidden logic every time. Black-box numbers are useless in institutional finance. Worse than useless, honestly. They create work for the next person.

    Claude’s portfolio outperformance is interesting. It is not a free pass.

    One of the flashier data points in this debate comes from a public 473-day trading comparison. Claude posted a 47.78% return since inception, ahead of Gemini at 33.08%, ChatGPT at 16.70%, and the VTI benchmark at 20.04%. That puts Claude ahead of the market by 27.74 percentage points and ahead of ChatGPT by 31.08.

    Interesting? Yes.

    Enough to crown a winner? No.

    This is where finance readers should get suspicious. A single 473-day experiment is not the same thing as repeatable institutional performance. You would want the trade frequency, rebalance rules, prompt design, risk limits, turnover, drawdown profile, and whether the result survives a different market regime. You would also want to know whether this was a process edge or a favorable stretch with exposures that happened to work.

    I’ve seen this move before in vendor decks: take one eye-catching return series, blur the operating details, and imply the model “understands markets.” Maybe it does. Maybe it got a nice run in one window. Those are not the same claim.

    Still, I would not throw the result out. Claude has shown stronger reasoning in several finance-related tasks, and the trading outcome lines up with that broader pattern. In the research brief, Claude also shows up as stronger on complex financial logic, while ChatGPT performs better in some visualization-heavy workflows. Different jobs, different strengths.

    The useful read is narrower than the marketing version. Claude appears better suited to reasoning-heavy financial tasks than ChatGPT in at least some real-world scenarios, and that may carry into portfolio construction or thesis evaluation. But portfolio selection, forecasting, and model-building are still different jobs. A model that picks ETFs decently is not automatically good at debt sculpting. Obvious, yes. Somehow still controversial.

    Graph-of-Thought works. The reason it is not everywhere is cost and skill.

    Most prompt advice in finance is fluff. “Be specific.” “Give context.” Fine. That is baseline hygiene, not a method.

    The more interesting shift is from one-shot prompting to structured reasoning patterns. Research on prompt engineering techniques in finance reports that Graph-of-Thought methods improve accuracy by 15-25% in complex financial reasoning tasks and reduce hallucinations by 25-30% compared with baseline approaches.

    Here is the version that matters in an actual workflow.

    Chain-of-Thought asks the model to reason through steps in sequence. Graph-of-Thought goes further. It lets the model evaluate multiple branches, dependencies, and paths before settling on an answer. In finance, that matters when assumptions interact: revenue growth affects working capital, which affects financing needs, which affects interest expense, which loops back into cash flow.

    A normal prompt often hides errors inside a smooth answer. Graph-of-Thought tends to expose branching logic. It is not magic. It just makes the breakpoints easier to inspect.

    Suppose you want AI help on a mid-market acquisition model for a 14-location dental practice. A weak prompt asks for “a five-year forecast.” A Graph-of-Thought workflow breaks the task into branches: patient volume by site, payer mix, hygienist staffing ratios, capex by location, debt terms, downside case, and covenant sensitivity. The model works across those branches instead of pretending one linear answer can carry the whole thing.
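
    For the curious, here is a minimal sketch of what "working across branches" can look like in code: sub-questions as nodes in a dependency graph, evaluated in order so downstream prompts can see upstream answers. The branch names follow the dental-practice example above, and call_model is a hypothetical stand-in for whichever model API a team actually uses.

    ```python
    # Graph-of-Thought-style sketch: sub-questions as nodes with explicit
    # dependencies, evaluated in dependency order so each downstream prompt
    # sees the upstream answers it needs. Branch names are illustrative.
    from graphlib import TopologicalSorter

    def call_model(prompt: str) -> str:
        # Hypothetical stand-in: swap for whatever model API your team uses.
        return f"[placeholder answer, prompt was {len(prompt)} characters]"

    # Each node maps to the set of branches it depends on.
    graph = {
        "patient_volume_by_site": set(),
        "payer_mix": set(),
        "staffing_ratios": {"patient_volume_by_site"},
        "capex_by_location": {"patient_volume_by_site"},
        "revenue_forecast": {"patient_volume_by_site", "payer_mix"},
        "debt_and_covenants": {"revenue_forecast", "capex_by_location"},
        "downside_case": {"revenue_forecast", "debt_and_covenants"},
    }

    answers: dict[str, str] = {}
    for node in TopologicalSorter(graph).static_order():
        context = "\n".join(f"{dep}: {answers[dep]}" for dep in graph[node])
        prompt = (
            "You are building one branch of an acquisition model.\n"
            f"Upstream results:\n{context or '(none)'}\n"
            f"Now reason through: {node}. Show assumptions explicitly."
        )
        answers[node] = call_model(prompt)
    ```

    The structure is the point: every branch leaves a visible prompt and a visible answer, which is exactly what a reviewer needs when the output has to be inspected rather than trusted.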

    That extra structure is why the accuracy gains show up. It is also why adoption is still low. According to the same finance prompt engineering review, only 10-15% of institutions have explored Graph-of-Thought, while 60-65% are experimenting with Chain-of-Thought. The cited cost is around $0.12 per query. That sounds trivial until an 11-person diligence team runs dozens of scenarios, exceptions, and review loops across a week. Then it is not trivial.

    There is also a skill problem. Most teams do not know how to design these workflows cleanly, and bad Graph-of-Thought prompts can be worse than plain Chain-of-Thought. More branches do not automatically mean better logic. Sometimes they just mean a more expensive mess.

    Use Graph-of-Thought when the answer has multiple dependencies, the cost of being wrong is high, and a human reviewer needs to inspect the path. Do not waste it on simple variance commentary or first-pass memo drafting.

    The best prompt pattern is boring on purpose: Context Sandwich

    The highest-performing prompt pattern in this research is not clever. It is structured.

    The “Context Sandwich” framework combines role, task, and constraint. In practice, that means telling the model who it is acting as, what exact job it needs to do, and what boundaries it must respect. Finance people are already good at this. If you can write a month-end checklist or a clean SOP, you can write a decent prompt.

    A weak prompt says: “Build a forecast for this company.”

    A stronger prompt says: “Act as a corporate FP&A analyst. Build a 12-month revenue and EBITDA forecast for a regional HVAC distributor using the attached monthly sales history, preserve seasonality from the last 24 months, show assumptions separately, and output Excel-ready formulas only. Do not hardcode totals. Flag any missing driver data before forecasting.”

    That is not fancy prompt engineering. It is operational clarity.
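
    If you want the pattern as a reusable template rather than a one-off prompt, a small helper is enough. This is a sketch, not a recommended house standard; the example values simply restate the HVAC prompt above.

    ```python
    # Minimal Context Sandwich builder: role on top, task in the middle,
    # constraints on the bottom, with optional context, format, and examples.
    def context_sandwich(role, task, context="", fmt="", examples="", constraints=""):
        parts = [
            f"Role: {role}",
            f"Task: {task}",
            f"Context: {context}" if context else "",
            f"Output format: {fmt}" if fmt else "",
            f"Examples: {examples}" if examples else "",
            f"Constraints: {constraints}" if constraints else "",
        ]
        return "\n".join(p for p in parts if p)

    prompt = context_sandwich(
        role="Corporate FP&A analyst",
        task="Build a 12-month revenue and EBITDA forecast for a regional HVAC distributor",
        context="Use the attached monthly sales history; preserve 24 months of seasonality",
        fmt="Excel-ready formulas only, assumptions shown separately",
        constraints="Do not hardcode totals; flag missing driver data before forecasting",
    )
    print(prompt)
    ```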

    The research brief also identifies six useful prompt elements: role assignment, task statement, context, format requirements, examples, and constraints. That lines up with why Chain-of-Thought has spread further than Graph-of-Thought. It is easier to teach, easier to review, and easier to fit into normal finance work. Again, 60-65% of financial institutions are experimenting with Chain-of-Thought techniques.

    One aside: a lot of “prompt engineering” content online is just people renaming common sense and sending an invoice. Context Sandwich is useful for a less glamorous reason. It maps neatly to how finance teams already document work, review work, and blame each other when work goes sideways.

    Adoption is not failing because the models are bad. It is failing because the systems are messy.

    This is the part vendors usually skip.

    Across organizations, 88% use AI in at least one business function, but only 7% have fully scaled it enterprise-wide. Another 23% are scaling agentic AI systems. That gap tells you what is really happening: experimentation is easy; operational deployment is hard.

    In finance, the blockers are familiar and boring. According to research on agentic AI in financial services, 48% of organizations cite governance concerns, 30% flag privacy issues, and 20% say their data is not ready for AI. None of that gets fixed by switching from ChatGPT to Claude or from Claude to Shortcut.

    If your source data is split across five spreadsheets, two ERP exports, and one “final_v7_reallyfinal.xlsx” file on someone’s desktop, the model is not the main problem. Your process is.

    That is why the McKinsey 3:1 rule cited here matters so much: high-performing firms invest three times more in process redesign than in software itself. I think that is the most useful statistic in this whole discussion. It explains why one team gets a faster close and cleaner forecast pack while another gets a pilot, a security review, and six months of drift.

    A mid-market finance team that redesigns its close checklist, standardizes data definitions, adds review gates, and limits AI to scoped tasks will usually get more value than a team that buys the fanciest model and hopes it sorts out the chaos.

    Boring answer. Correct answer.

    What “moving beyond experimentation” actually looks like

    By late 2025, 44% of finance teams had moved beyond experimentation into core-function deployment. That sounds aggressive until you look at the tasks involved.

    Usually it is not “AI runs finance.” It is drafting first-pass variance explanations from monthly actuals, cleaning and categorizing GL exports, building scenario summaries for management review, checking formula consistency across large workbooks, and preparing forecast narratives from already-reviewed numbers.

    Those are real gains. They save time. They also fit the current state of the tools.

    Take a common case: a 9-person FP&A team at a manufacturing company needs to update a 13-tab forecast workbook every month. Copilot inside Excel can help identify broken links, draft commentary, and run Python-based checks on outliers. Claude can help reason through a margin bridge or pressure-test assumptions. A specialized finance tool can generate formula-based forecast components. But a controller or senior analyst still has to sign off on the final output.

    That is the pattern I keep seeing. AI gets you from blank page to rough draft fast. Humans still do the last 20%, which is where most of the risk lives.

    And that last 20% is not cosmetic. It is where someone notices that the working capital assumption still reflects Q2 seasonality, that the debt schedule forgot the amendment fee, or that management’s “temporary margin pressure” has now lasted three quarters. Models miss that stuff all the time. Humans do too, obviously. Just less cheerfully.

    Agentic finance is coming slowly, not magically

    The AI agent market is growing fast. Market estimates put it at $5.43 billion in 2024, rising to $7.92 billion in 2025, with a projected 45.82% CAGR from 2025 to 2034. And one Gartner-linked projection says 40% of finance departments will deploy autonomous agents by 2027.

    Maybe. With conditions attached.

    I would be careful with the phrase “autonomous agents,” especially in finance. The useful version is not a robot CFO. It is a bounded system that can perform judgment-based tasks under human oversight, with approval rules, exception handling, and logs. Remove those controls and you do not have automation. You have audit problems.

    The same caution applies to the projection that 60-80% of Tier-1 banks will deploy Graph-of-Thought systems by 2030, cutting M&A due diligence time by 30-40%. Possible, sure. But only if trust infrastructure gets built first: data lineage, prompt versioning, review layers, exception queues, and human override design.

    Without that, “agentic workflow” is just a nice phrase on a slide.

    A practical deployment map for 2026

    If you are a mid-market analyst, finance manager, or Excel-heavy operator, you do not need a winner. You need a decision lens.

    Use general models for exploratory work. Claude, ChatGPT, and similar tools are useful when the task is open-ended and the cost of being wrong is low at the draft stage. Good examples: brainstorming drivers, summarizing filings, drafting management questions, sketching scenario logic, or turning a messy data export into a first-pass narrative.

    Claude looks stronger for reasoning-heavy finance tasks, supported both by benchmark comparisons and by the more tentative 473-day trading experiment. ChatGPT still has value, especially where its analysis and visualization tooling fits the job. But neither should be treated as a final-answer machine.

    Use specialized tools for auditable production work. If the deliverable needs to live in Excel, survive review, and get reused next quarter, specialized tools have the edge. Formula-first architecture, native spreadsheet integration, and enterprise controls matter more than conversational polish. This is where tools such as Apers AI and Microsoft Copilot for Finance make more sense than a generic chatbot.

    And if the task involves multi-step dependencies, use structured prompting. Context Sandwich for normal work. Chain-of-Thought for calculations with several steps. Graph-of-Thought when the stakes justify the extra cost and setup.

    Keep humans on final judgment. This is not ceremonial sign-off. It is the control layer.

    Human analysts are still better at spotting bad assumptions, understanding business context, challenging management narratives, and deciding when a clean-looking output is actually nonsense. The Wall Street Prep scores already tell us that. Even the best AI tool still trails a low-tier human analyst.

    So the practical stack in 2026 looks like this: AI for speed, specialized systems for structure, humans for judgment.

    If you are evaluating tools right now, do not ask which model “wins finance.” Ask three narrower questions instead: Can we audit the output? Can we reuse it in our actual workflow? Can we trust it under deadline with a reviewer breathing down our neck?

    That is a much less glamorous way to buy software. It is also the one that keeps you from explaining a hallucinated debt covenant to a VP at 11:58 PM.

  • Enterprise AI Dashboard Integration in 2026: Claude vs ChatGPT vs Gemini for Business Intelligence

    Enterprise AI Dashboard Integration in 2026: Claude vs ChatGPT vs Gemini for Business Intelligence

    Most enterprise AI dashboard projects fail in a very plain way: the team asks the model to make a dashboard, when the real work is moving data through a reporting process. The useful part starts before the chart appears and keeps going after it shows up. Someone has to pull numbers from Power BI, Excel, Google Workspace, or an internal warehouse, check the metric definitions, write the commentary, and route exceptions to the right person. If the AI only makes the chart prettier, you have built a nicer-looking stall.

    The market has shifted enough that BI teams can’t keep treating the big models as interchangeable. According to 2026 market share data, ChatGPT fell from 86.7% to 64.5% in one year while Gemini rose from 5.7% to 21.5%. Claude is still smaller overall, but 14% quarterly growth is the kind of number that usually means a tool has moved from novelty to habit. That is the part enterprise teams should pay attention to.

    So the real question is not which chatbot sounds smartest in a demo. It is which model can sit inside a reporting workflow without turning everyone into prompt janitors. The answer is different for Claude, ChatGPT, and Gemini. Sometimes one model is enough. More often, the clean setup is a mix. The “pick one and crown it king” approach is how teams burn a quarter and still end up with a spreadsheet full of manual fixes.

    The market shift is not cosmetic

    Gemini’s jump from 5.7% to 21.5% in a year is not a tiny wobble in the data. It changes the buying conversation. ChatGPT still has the biggest footprint, but the idea that it will automatically be the default for every enterprise analytics workflow is weaker than it was a year ago, according to the same market share data. Claude’s growth is smaller in absolute terms, but 14% quarterly user growth is not what you see from a tool sitting on the sidelines.

    That matters in BI because BI is not a consumer use case. A sales director does not care whether a model won a benchmark by three points. They care whether it can read a messy Excel export, understand the metric definitions in a semantic model, and produce a summary finance will not shred in five minutes. The bar is annoying. That is the job.

    There is also a stack effect that gets missed in the hype. If your company already lives in Microsoft 365, ChatGPT and Copilot fit naturally into the day. If the data lives in Google Workspace and BigQuery, Gemini is a much less awkward fit. If the workflow depends on long context, detailed instructions, and careful tool use, Claude starts looking unusually practical. Choice changes the rest of the workflow. It is not just a model decision. It is a plumbing decision.

    I’ve watched teams spend weeks arguing about model quality and then discover the actual blocker was access to the semantic layer. That is a classic enterprise move. Very expensive, very familiar.

    ChatGPT, Gemini, and Claude behave differently in BI work

    ChatGPT still has the broadest reach. That is the honest starting point. It has the biggest mindshare, the deepest third-party ecosystem, and the easiest adoption story for leadership. If your analysts already use it for SQL drafts, meeting notes, and quick explanations, that familiarity lowers friction. People already know how to ask it for something. They do not need a two-hour onboarding session just to get a summary.

    But ChatGPT’s strength is breadth, not depth in BI. A Pandas AI review found that it tends to return technical instructions rather than a functional dashboard. That is still useful — a good set of instructions can save a developer time — but it is not the same thing as producing something the business can open and use. There is a difference between “here is the code” and “here is the report your VP can read before the 9:00 meeting.”

    Gemini’s appeal is different. It sits close to Google Workspace, Google Cloud, Sheets, Docs, BigQuery, and Looker. For teams already inside that stack, the model reduces the number of handoffs. Fewer exports. Fewer copy-pastes. Fewer chances for someone to paste live data into the wrong tab, which happens often enough that nobody should pretend otherwise. Gemini’s share rising to 21.5% suggests a lot of teams are testing that convenience and deciding to keep it.

    Claude is the one that keeps surprising BI teams. It has a 200,000-token context window, which means it can keep long policy docs, data dictionaries, and prompt chains in view without losing the thread halfway through. That matters when the model has to read metric definitions, compare source tables, and explain a variance in the same pass. Anthropic also uses constitutional AI principles, which is a fancy label for a practical goal: keep the model more predictable in business settings, according to enterprise model comparisons.

    Claude also has real enterprise traction. Anthropic reports 300,000+ business customers and an estimated $14 billion annualized revenue run-rate. Those numbers do not prove it is the best model for every team. They do show that it is being used in companies where the output has to work on Tuesday, not just impress someone in a demo on Friday.

    Direct dashboard generation still falls apart in practice

    The cleanest way to say this is simple: none of these models reliably presses the “build me a production dashboard” button. According to Pandas AI’s BI review, ChatGPT usually returns technical instructions rather than a finished dashboard. Claude can produce more interactive artifacts, including multiple visualization tabs, but it still needs developer implementation. That is useful, but it is not the same as shipping a working BI asset.

    The reason is not mysterious. A useful dashboard needs live or scheduled data connections, permission handling, refresh logic, metric definitions, error handling, and a way to stop one bad prompt from wrecking month-end reporting. The model can help draft the scaffolding. It cannot magically invent your data governance model. If it could, half of BI consulting would disappear overnight, and frankly some of those slide decks deserve to.

    Here is the failure mode I keep seeing: someone asks Claude or ChatGPT for a dashboard, gets a decent mockup, and then realizes the mockup is not connected to anything. The KPI is copied from a sample file. The chart does not refresh. The business user assumes the number is live because the interface looks polished. That is how trust gets broken — not with a dramatic failure, just a quiet mismatch between what the screen suggests and what the data actually is.

    So the smarter use of these tools is usually support work around the dashboard, not the dashboard itself. Use them to write SQL, generate DAX, document measures, draft release notes, summarize variance drivers, or explain why one region’s margin moved 120 basis points. Those are the jobs where AI can save time without pretending to replace the reporting stack. A model that helps an analyst finish a variance narrative in 12 minutes instead of 45 is already useful. It does not need to cosplay as a BI architect.

    And yes, the demo with the shiny chart looks great in the boardroom. Then someone asks where the number came from, and the room gets very quiet. That silence is usually the giveaway.

    Claude fits the messy work

    Claude’s long context window is not a brochure feature. It matters when the task is ugly. A BI team can feed it a data dictionary, a list of approved metrics, a few pages of policy, and a sample board deck, and it can keep all of that in play while drafting explanations or flagging inconsistencies, according to integration guidance. That matters when the work is not a single query but a chain of steps that all need to line up.

    Claude’s tool use is another reason it stands out. In a real BI workflow, the model often needs to do more than answer questions. It may retrieve a table schema, inspect a semantic model, draft a query, compare output to a policy rule, and then write a summary for a manager. Claude is built for that kind of structured interaction. It is not magic. It is just less annoying to wire into a workflow that already has moving parts.

    The market data lines up with that use case. Claude is growing at 14% quarterly, and the reported base of 300,000+ business customers suggests the growth is coming from work that has to repeat. That matters more than enthusiasm. One clever prompt is a demo. A workflow that runs every week without somebody hovering over it is a system.

    Claude Sonnet 4 also makes sense on cost. Research comparing model pricing found that it delivers 98% of Opus quality at a fraction of the cost. For BI teams that run a lot of ad hoc analysis, that difference shows up quickly. If the model is used for every report draft, support ticket, and anomaly note, the premium tier makes itself visible on the bill fast. Procurement has a way of finding it eventually.

    One finance team I worked with had 9 analysts and a month-end close that always dragged into the next week. They did not ask Claude to build a dashboard. They used it to read the close checklist, compare the latest variance table to prior month notes, and draft explanations for the CFO review. The analysts still checked the numbers. The model just handled the first pass. That cut the narrative draft from most of a morning to about 20 minutes. Not glamorous. Very useful.

    Gemini makes sense when the stack is already Google-native

    Gemini’s rise makes the most sense in companies that already run on Google Workspace. Sheets, Docs, Drive, BigQuery, and Looker sit close together. When the model and the data are already in the same neighborhood, the workflow gets simpler. That is not a philosophical advantage. It is a practical one. Fewer bridges means fewer things to maintain.

    The market momentum is real. Gemini’s share jumped from 5.7% to 21.5% in one year, which is the kind of move that usually means people are not just curious — they are sticking with it. In enterprise software, that second part matters. A lot of tools get trial usage. Far fewer get approved for actual work.

    Gemini also looks attractive on cost. Pricing comparisons identify Gemini 2.5 Flash as the most cost-effective API option for large-scale deployments. That matters when the model is running thousands of lightweight BI tasks: campaign summaries, alert explanations, recurring report commentary, and data triage. If the task is repetitive and well defined, a cheaper model is often the better choice. Paying premium rates for simple work is one of those habits that makes finance people stare at the ceiling.

    The tradeoff is depth. Gemini is strong when the problem is “keep the Google stack moving.” It is less obviously the best choice when the task is “read 300 pages of internal definitions, remember every exception, and keep the output aligned with the finance team’s naming rules.” For that kind of work, Claude usually feels steadier. Gemini is the efficient one. Claude is the one you trust with the ugly documents.

    A marketing ops team at a 120-person SaaS company gives a good example. Their campaign data lived in Sheets and BigQuery, and the monthly reporting cycle was mostly copy-paste work with a few ugly manual checks. They moved the summary step into Gemini, kept the data in Google-native tools, and cut the time spent on weekly performance commentary from roughly 3 hours to 40 minutes. Nothing mystical happened. The model just sat where the work already was.

    ChatGPT still wins on reach, not BI depth

    ChatGPT remains the default for a lot of teams for a simple reason: everyone already knows it. It has the broadest mindshare and the most familiar interface. That matters when you need adoption quickly. A tool nobody opens is not a tool. It is a logo attached to a budget line.

    In BI work, though, ChatGPT often behaves more like a very smart assistant than a reporting engine. The Pandas AI review found that it tends to give technical instructions instead of producing a functional dashboard. That does not make it useless. It just means the output is usually a step in the process, not the finish line. There is a difference between “here is the code you need” and “here is the report your manager can use.”

    It is still strong for SQL drafting, data explanation, meeting prep, narrative summaries, and quick analysis. It also helps when a workflow includes analysts, virtual assistants, and ops managers who are not all equally technical. People already know how to ask ChatGPT questions, and that lowers the training burden. Sometimes boring wins. Familiar tools get used.

    But if the goal is a production BI workflow, ChatGPT is usually the first stop, not the architecture. Good for thinking. Less good as the place where the entire reporting system lives.

    The best enterprise wins come from workflow automation

    The strongest enterprise examples point in the same direction: AI works best when it is embedded in a workflow. TELUS is the clearest case. The company scaled Claude across 57,000 employees, built 13,000+ AI-powered tools, and saved over 500,000 staff hours. That is not a dashboard story. It is an operations story.

    Bridgewater Associates used Claude Opus 4 for investment research and reported 50–70% time-to-insight reduction. Zapier deployed more than 800 internal Claude-driven agents and reached 89% employee adoption. Same pattern. The value came from weaving the model into daily work, not from creating one more screen full of charts nobody wants to babysit.

    A mid-market FP&A team shows the pattern well. Instead of asking AI to “build a dashboard,” they set up an agent that watches the monthly close folder, checks whether the latest revenue file arrived, compares it with the prior month, flags anomalies above 8%, and drafts a plain-English note for the CFO. The dashboard still exists. The AI just handles the tedious part around it. That is the bit that survives contact with the real world.
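
    That tedious part is also the most automatable part. Here is a rough sketch of the watcher in Python, assuming the close folder holds simple CSV exports; the file names, column names, and threshold wiring are illustrative, not a product feature.

    ```python
    # Sketch of the "tedious part around the dashboard": check that the latest
    # revenue file arrived, compare it with prior month, flag line items that
    # moved more than 8%, and draft a plain-English note for review.
    from pathlib import Path
    import pandas as pd

    CLOSE_DIR = Path("close/2026-01")   # assumed folder layout
    THRESHOLD = 0.08

    current_file = CLOSE_DIR / "revenue_current.csv"
    prior_file = CLOSE_DIR / "revenue_prior.csv"

    if not current_file.exists():
        print("Revenue file has not arrived yet; notify the owner.")
    else:
        cur = pd.read_csv(current_file).set_index("account")
        prior = pd.read_csv(prior_file).set_index("account")
        change = (cur["amount"] - prior["amount"]) / prior["amount"]
        flags = change[change.abs() > THRESHOLD]
        note_lines = [f"{acct}: {pct:+.1%} vs prior month" for acct, pct in flags.items()]
        draft_note = "Items above the 8% threshold:\n" + "\n".join(note_lines)
        print(draft_note)  # hand this draft to the model or the reviewer
    ```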

    A dashboard shows the problem. A workflow changes what happens next. If a rep’s pipeline report is late, the dashboard can display that fact. An agent can ping the owner, open a ticket, and attach the missing file. That is the difference between reporting and action. One sits there. The other does something.

    That distinction is easy to miss if you spend too much time staring at mockups. The mockup looks finished. The workflow is where the actual value lives.

    Adoption is high. Readiness is not.

    The adoption numbers are not the bottleneck anymore. According to NVIDIA’s 2026 State of AI report, 64% of organizations are actively using AI, and companies with 1,000+ employees show 76% adoption. The problem is what happens after the pilot. Only 42% feel strategically prepared for production deployment. That is a polite way of saying most teams have not figured out how to run the thing safely at scale.

    Deloitte’s enterprise AI research says only 20% of companies have mature governance for autonomous AI agents. That is the real choke point. Not model quality. Governance. Who can use the data, which outputs are allowed, how decisions are logged, what happens when the model is wrong, and how you prove compliance six months later when someone asks for the audit trail.

    This is where BI projects stall. A team can get a proof of concept running in a week. Getting it into a controlled environment with role-based permissions, logging, review steps, and escalation paths takes much longer. That is not flashy work. It is the work that decides whether the project survives security review and the first uncomfortable question from legal.

    The teams that get past the pilot-to-production gap do one thing early: they define the operating model. Who requests a new agent? Who approves it? Where does it run? What data can it see? How is it monitored? If those answers are fuzzy, the dashboard will stay in demo mode forever. It will also keep generating meetings, which is a special kind of enterprise failure.

    One healthcare analytics team I spoke with had the model working on day six and still spent the next eight weeks on governance. Data residency, audit logs, approval flow, retention rules, exception handling. None of that was exciting. All of it mattered. The model did not fail. The process around it did the heavy lifting.

    Power BI changed the integration discussion

    The November 2025 Power BI update introduced MCP integration for direct Claude connectivity to semantic models. That matters more than it sounds. Semantic models are where business definitions live. They define revenue, margin, churn, active customer, and all the other terms people argue about in meetings. If an AI model can connect to that layer, it can work with the same definitions your BI team already trusts.

    MCP, or Model Context Protocol, is a structured way for AI systems to talk to tools and data sources. In plain English: it reduces custom glue code. That is good news for BI teams that already have enough glue code. The less custom plumbing you need, the less likely your “smart” dashboard turns into a maintenance headache six weeks later.

    This is one reason multi-agent systems are gaining ground. A single model can answer questions, but a production BI workflow often needs separate roles: one agent to retrieve data, one to validate it, one to summarize it, and one to enforce policy. That sounds fancy, but it is really just division of labor. Humans do this all the time. Software can too, if you design it that way.
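
    A minimal sketch of that division of labor, with each role as a plain function so the structure is visible. The names and checks are illustrative only; any of the steps could be backed by a model, a query, or ordinary code.

    ```python
    # Division-of-labor sketch: retrieve, validate, summarize, enforce policy,
    # chained into one pipeline. The structure is the point, not the logic.
    def retrieve(metric: str) -> dict:
        return {"metric": metric, "value": 1_250_000, "source": "semantic_model"}

    def validate(record: dict) -> dict:
        assert record["source"] == "semantic_model", "untrusted source"
        return record

    def summarize(record: dict) -> str:
        return f"{record['metric']} came in at {record['value']:,}."

    def enforce_policy(text: str) -> str:
        banned = ["guaranteed", "will definitely"]
        assert not any(word in text.lower() for word in banned), "policy violation"
        return text

    report = enforce_policy(summarize(validate(retrieve("Q1 recurring revenue"))))
    print(report)
    ```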

    For Microsoft-heavy teams, the new route into Power BI’s semantic layer is the interesting part. It narrows the gap between the AI layer and the reporting layer. That does not remove the need for engineering. It just removes one of the more annoying parts of the build.

    And yes, the build still needs a human who understands the model. No protocol fixes a sloppy metric definition. It just makes the sloppiness easier to expose.

    Cost matters more than people admit

    Model choice gets expensive fast when BI usage scales. A few ad hoc questions a day are cheap. A few thousand report summaries, anomaly checks, and support automations are not. Research comparing model pricing found that costs can differ by as much as 20x across platforms and model tiers. That is enough spread to make a finance lead ask for a second meeting and a spreadsheet.

    Claude Sonnet 4 is attractive here because it delivers 98% of Opus quality at much lower cost. For many BI tasks, that is the sweet spot. You do not need the most expensive model to summarize a variance report or classify a support ticket. You need something accurate enough, fast enough, and cheap enough to run all day without making the budget look silly.

    Gemini 2.5 Flash is the other strong cost play. Pricing comparisons identify it as the most cost-effective API option for large-scale deployments. That makes it appealing for high-volume workloads where the task is repetitive and the structure is clear. If you are processing a lot of lightweight BI queries, the cheaper model can be the smarter one. Paying premium rates for simple work is just waste with a nicer interface.

    ChatGPT usually sits at the broad, expensive end of the practical spectrum for enterprise BI. That is not a criticism. Broad capability has value. But if the job is repetitive and well scoped, paying for generality you do not need is inefficient. The model can be great and still be the wrong tool for the bill you are trying to keep under control.

    That tradeoff is easy to ignore in a pilot. It gets a lot harder to ignore when the usage report lands in someone's inbox and the line item starts looking suspiciously like what happens when a software subscription and a small electrical bill have a bad weekend together.

    What a sane enterprise BI architecture looks like in 2026

    The best setup is usually not “pick one model and hope.” It is a layered system. One model retrieves data. Another reasons over it. A third writes the summary or user-facing output. The BI platform stays the source of truth. The AI layer sits on top of it, not in place of it.

    For a mid-market team, that might look like this: Power BI or Looker holds the semantic model; Claude handles long-context analysis and policy-aware interpretation; Gemini handles cheaper high-volume tasks if the company is already in Google Cloud; ChatGPT stays available for broad user-facing assistance and quick drafting. That is not overengineering. That is matching the tool to the job.
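
    A rough sketch of that kind of routing, assuming two simple signals: context size and daily volume. The thresholds and model labels are illustrative assumptions, not rules baked into any platform.

    ```python
    # Routing sketch matching the layered setup described above. The BI platform
    # stays the source of truth; tasks go to different models based on shape.
    def route_task(context_tokens: int, daily_volume: int) -> str:
        if context_tokens > 100_000:
            return "claude"        # long-context analysis, policy-aware reads
        if daily_volume > 1_000:
            return "gemini-flash"  # cheap, repetitive, well-scoped work
        return "chatgpt"           # broad user-facing drafting and Q&A

    for name, tokens, volume in [
        ("month-end variance pack", 180_000, 12),
        ("alert explanations", 2_000, 5_000),
        ("meeting prep drafts", 4_000, 30),
    ]:
        print(f"{name}: {route_task(tokens, volume)}")
    ```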

    Security belongs in the architecture, not as an extra task at the end. Secure deployment patterns often use environments like Amazon Bedrock with VPC isolation, which keeps sensitive data inside controlled boundaries. If your BI work touches payroll, customer data, or regulated financials, that matters immediately. A clever dashboard that leaks data is not clever. It is a problem with nicer fonts.

    The practical rule is simple: use AI where the work is repetitive, text-heavy, or exception-driven. Do not use it as a replacement for your metric layer. The metric layer should be boring. Boring is good. Boring means consistent. Consistent means someone can trust the number in a meeting without asking three follow-up questions and opening a second spreadsheet.

    That last part sounds obvious until you sit through the meeting where three people bring three versions of the same KPI and all of them are “almost right.” Boring definitions would save a lot of oxygen.

    Three scenarios that show how this plays out

    A regional sales team using ChatGPT for report prep. The team exports weekly pipeline data from Power BI into Excel, asks ChatGPT to draft a summary, and uses that draft in a Monday meeting. That works well for narrative support. It does not solve data quality, refresh timing, or forecast logic. Useful? Yes. A system? Not really.

    A finance team using Claude for month-end close. The team feeds Claude a long close checklist, accounting policies, and the latest variance tables. With a 200,000-token context window, Claude can keep all of that in view while drafting explanations and flagging unusual movements. The output still needs review, but the time savings are real. That is the kind of task Claude handles especially well.

    A marketing operations team using Gemini inside Google Workspace. The team keeps campaign data in Sheets and BigQuery, then uses Gemini to summarize performance, draft updates, and surface anomalies. Because the stack is already Google-native, the integration friction stays low. The model is not doing anything magical. It is just sitting where the work already happens, which is often the difference between adoption and abandonment.

    These examples are not interchangeable. That is the point. A team that lives in Microsoft tools has a different path from a team built on Google Workspace, and a finance group with long policy documents has different needs from a marketing ops team pushing weekly summaries. The model should fit the workflow, not the other way around.

    Once you see that, the “best model” conversation gets a lot less dramatic. Which is helpful. Drama is expensive, and BI already has enough of it.

    What I would do if I had to choose today

    If the company is Microsoft-heavy, I would start with ChatGPT for broad adoption and Claude for the jobs that need long context, careful reasoning, or tool use. That combination covers a lot of BI reality without forcing everyone into one model that is only sort of right. If the stack is mostly Google Workspace and BigQuery, Gemini deserves a serious look, especially if cost is tight.

    If the goal is production BI, I would not start with the dashboard. I would start with one repetitive job that already hurts: monthly variance summaries, data validation, report commentary, or exception routing. Build the AI around that job. Do not build around a vague promise of “AI-powered insights.” Vague promises are how teams end up with a demo and no deployment.

    And if someone says the model can just “generate the dashboard,” ask what happens when the source data changes, the permissions shift, and the CFO wants an audit trail. That question usually ends the meeting in a useful way.

    The teams getting real value in 2026 are not chasing the flashiest chatbot. They are turning AI into a controlled part of the reporting workflow, with clear ownership, clear guardrails, and enough plumbing to keep the thing honest. That is less exciting than a demo. It also works on a Tuesday afternoon, which is usually where enterprise software earns its keep.

  • Real-World ROI Battle: How Claude, ChatGPT, and Gemini Stack Up for Mid-Market Sales Analytics in 2026

    Real-World ROI Battle: How Claude, ChatGPT, and Gemini Stack Up for Mid-Market Sales Analytics in 2026

    Most mid-market teams do not have an AI problem. They have an ROI problem. Adoption is already high: 89% of revenue organizations now use AI-powered tools, up from 34% in 2023. But only 42% actually hit their ROI targets. That gap matters more than model hype, benchmark scores, or launch-day demos. If the tool does not change pipeline, response time, or analyst workload in a measurable way, it is just another subscription.

    In 2026, the question is not whether Claude, ChatGPT, or Gemini can help. All three can. The real question is which one pays back fastest for sales analytics work: lead scoring, account research, pipeline reviews, follow-up drafting, forecasting notes, and customer insight summaries. The answer is not the same for every team. ChatGPT still has the biggest footprint, Gemini is growing fast inside Google-heavy shops, and Claude is showing the strongest value when the work gets messy, long, and analytical.

    The market is also getting more expensive to ignore. The global conversational AI market is now around $10.32-$11.45 billion and still growing at a projected 23.15% CAGR through 2031. That growth is not coming from novelty. It is coming from teams trying to cut response times, reduce manual research, and get more useful signals out of the data they already have. The tools are only useful when they fit the workflow.

    What ROI Actually Looks Like in Sales Analytics

    For sales teams, ROI usually shows up in a few places: faster lead qualification, better meeting prep, cleaner pipeline notes, more relevant outreach, and less time spent digging through documents or call transcripts. The strongest gains are not abstract. They show up in hours saved, meetings booked, and deals recovered.

    One useful benchmark comes from chatbot operations. When AI chatbots resolve 44.8% of conversations autonomously, each deflected interaction saves about $6.75-$7.50 compared with a fully human-handled conversation. That is a direct cost reduction. It also frees people to handle exceptions, escalations, and higher-value accounts. In customer service, companies report 30-45% productivity gains from AI-powered tools, which is a meaningful range when your team is already stretched thin.
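
    The arithmetic is worth writing down, because the deflection savings compound with volume. A quick sketch, assuming a hypothetical 10,000 conversations a month and using the cited resolution rate and per-conversation savings:

    ```python
    # Back-of-envelope math for the deflection numbers above. Monthly volume is
    # an assumption; the 44.8% rate and $6.75-$7.50 savings are as cited.
    monthly_conversations = 10_000          # assumption for illustration
    deflection_rate = 0.448
    savings_per_deflection = (6.75, 7.50)

    deflected = monthly_conversations * deflection_rate
    low, high = (deflected * s for s in savings_per_deflection)
    print(f"Deflected: {deflected:,.0f} conversations/month")
    print(f"Estimated savings: ${low:,.0f} to ${high:,.0f} per month")
    ```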

    Sales-specific use cases show even sharper effects. AI predictive lead scoring reaches 89% accuracy, compared with 60-68% for traditional models. Conversational AI can cut response time from 38 hours to 30 seconds and lift meeting bookings by 15%. Hyper-personalized outreach can raise email reply rates by 3.2x and demo conversions by 47%. Those are not vanity metrics. They are the difference between a pipeline that stalls and a pipeline that moves.

    ChatGPT Still Wins on Reach, but Its Lead Is Smaller

    ChatGPT remains the default choice for a lot of teams, and the usage numbers explain why. It has 800-900 million weekly active users and processes more than 2 billion prompts a day. It also has deep enterprise penetration: over 80% of Fortune 500 companies integrated ChatGPT within nine months of launch. If you need a tool that employees already know how to open, ChatGPT is still the easiest place to start.

    But market dominance is no longer the same as market control. ChatGPT’s share has fallen to 64.5%, down from 86.7% in early 2025. That is a real drop, not a rounding error. At the same time, Gemini has climbed quickly, and Claude has carved out a premium enterprise lane. The market is moving from “one tool for everything” to “pick the right model for the task.”

    For mid-market sales analytics, ChatGPT is strongest when the work is broad and repetitive. It is good for first-pass summaries, account research drafts, follow-up emails, and turning rough notes into cleaner language. Enterprise users report 30-90% time reductions on audits, research, and report writing. That is useful when your team spends too much time formatting information that already exists.

    The limits show up when the task needs long context, deep comparison, or careful synthesis across many documents. ChatGPT can do that work, but it is not always the cleanest fit when the source material is large and the analysis has to stay consistent across a long thread, a long contract set, or a multi-quarter pipeline review. That is where Claude starts to separate itself.

    Why Claude Is the Best Fit for Deep Sales Analysis

    Claude’s enterprise story is not about user count. It is about workload fit. Claude has only 19 million users, far fewer than ChatGPT, but its customers are spending more. Enterprise customers spending over $100,000 annually grew 7x in the past year, and there are now over 500 customers spending more than $1 million annually. That kind of spend usually does not happen on a novelty tool. It happens when a tool saves real labor or improves high-value decisions.

    Claude’s technical advantage is most visible in long-context and analytical work. Its Enterprise edition supports 500,000+ token context windows, which matters when you want to analyze long call transcripts, full account histories, multi-document RFPs, or a quarter’s worth of pipeline notes without chopping the source material into tiny pieces. It also scores 65.4% on Terminal-Bench 2.0, a benchmark tied to coding and knowledge-work tasks. For sales analytics, that usually translates into better performance on structured reasoning, comparison, and document-heavy workflows.

    The enterprise case studies are strong. TELUS saved over 500,000 staff hours across 57,000 employees using Claude-powered automation, and the work produced $90 million-plus in measurable business benefit. Kärcher reported a 90% reduction in document drafting time. Those are exactly the kinds of gains mid-market teams want when they are buried in account summaries, proposal drafts, and internal reporting.

    Claude is not the cheapest option. Its enterprise pricing is premium, with Opus at $15/$75 per million tokens for input and output. But premium pricing only hurts ROI if the model does not save more than it costs. For teams doing heavy analytical work, long-form synthesis, or large-document review, Claude’s output quality can justify the bill faster than a cheaper model that forces more human correction.

    Mini case: the sales ops analyst drowning in account notes

    Imagine a sales ops analyst who has to prep a weekly pipeline review for 40 enterprise accounts. Each account has call notes, CRM updates, email threads, and a few open action items. In ChatGPT, the analyst can summarize each account, but the process may require more manual chunking and cross-checking. In Claude, the analyst can load a much larger set of source material at once, ask for risks by account, and get a cleaner synthesis across the full history. If that saves two hours a week, the annual gain is easy to see. If it saves five hours, the tool pays for itself quickly.
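
    The back-of-envelope math is simple enough to keep in a notebook. Assuming a loaded analyst cost of $60 an hour, which is an illustration rather than a benchmark:

    ```python
    # The "two hours vs five hours" point as arithmetic. The hourly cost and
    # working weeks are assumptions; swap in your own numbers.
    hourly_cost = 60          # assumed fully loaded cost of a sales ops analyst
    weeks_per_year = 48       # allowing for holidays and PTO

    for hours_saved in (2, 5):
        annual_value = hours_saved * weeks_per_year * hourly_cost
        print(f"{hours_saved} hours/week saved ~ ${annual_value:,.0f}/year")
    ```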

    Why Gemini Is Rising Fast in Google-Centric Teams

    Gemini’s rise is not just about model quality. It is about distribution and price. Gemini’s market share has surged from 5.4% to 18.2-21.5%, a 370% year-over-year increase. It now reaches 750 million users, helped by its position inside Google Search, Android, Chrome, and Google Workspace. If your team already lives in Gmail, Sheets, Docs, and Drive, that integration matters more than another point or two on a benchmark.

    Gemini also has a cost advantage. Its API pricing is among the lowest in the category at $2/$12 per million tokens. For teams running a lot of lightweight tasks, that matters. If you are summarizing meeting notes, classifying inbound leads, or pulling first-pass insights from documents, low token cost can keep experimentation affordable. The platform also supports 1M-token context windows, which makes it viable for large document sets and long research workflows.

    Gemini 3.1 Pro also posts strong reasoning results, including a 94.3% GPQA score. That tells you the model is not just cheap and well distributed. It is also capable on pure reasoning tasks. For sales analytics, that can help with territory planning, account segmentation, and multi-source research where the work is more about interpretation than content generation.

    The main advantage for mid-market teams is simple: if your workflows already run through Google Workspace, Gemini reduces friction. You do not need to train people on a new environment, and you do not need to rebuild as much of the existing process. A rep can draft an email in Gmail, an analyst can work from Sheets, and a manager can review notes in Docs without moving between as many tools.

    Mini case: the Google-heavy revenue team

    A 60-person revenue team uses Google Workspace for nearly everything. Forecast notes live in Sheets, deal summaries live in Docs, and follow-ups are drafted in Gmail. Gemini fits that stack better than a standalone chatbot. The team can summarize meeting notes, draft account updates, and pull customer themes without changing where the work happens. If the alternative is asking people to copy and paste between systems all day, the integration advantage is real ROI.

    The Best Model Depends on the Job, Not the Brand

    The strongest evidence points to a split strategy, not a single winner. Claude is better for deep analysis, ChatGPT is better for broad productivity, and Gemini is the best fit for Google-native workflows and lower-cost scaling. That is also where the ROI usually improves. Companies that force one model to do everything often spend more time correcting outputs than using them.

    Multi-model setups are showing up more often in successful deployments. In practice, that means Claude handles the long, messy analysis; ChatGPT handles general drafting, summarization, and team-wide productivity; and Gemini handles cost-sensitive tasks inside Google-heavy workflows. Organizations using AI agents that orchestrate multiple models often outperform single-chatbot implementations, with well-designed systems reaching 40-60% automation rates regardless of the underlying model choice.

    That approach also fits the economics. If a task needs deep reasoning and long context, Claude may be worth the premium. If the task is broad and routine, ChatGPT’s familiar interface and enterprise controls make it efficient. If the task is high-volume and embedded in Google Workspace, Gemini’s lower token cost and native integration can win on total cost of ownership.

    The wrong question is “Which model is best?” The better question is “Which model is best for this step in the workflow?” A sales analytics process has multiple steps: ingest data, clean it, summarize it, score it, draft action items, and route it to the right person. Different models can help at different stages.

    Where the Money Is: Sales Analytics Use Cases That Actually Pay Back

    Lead scoring is one of the clearest places to start. Traditional models often miss nuance, especially when a rep’s notes, intent signals, and account history are spread across systems. AI predictive lead scoring reaches 89% accuracy, compared with 60-68% for traditional approaches. That gap matters when your team is deciding which leads get a call today and which ones sit for another week.

    Response speed is another obvious win. Conversational AI can cut response time from 38 hours to 30 seconds, which helps explain the 15% increase in meeting bookings. In sales, speed is not a nice-to-have. Fast replies keep prospects engaged while the intent is still warm.

    Personalization is where many teams leave money on the table. AI-driven outreach can increase email reply rates by 3.2x and demo conversions by 47%. For a mid-market team sending hundreds or thousands of emails a month, even a small lift in reply rate changes the economics of the entire funnel. The point is not to send more email. It is to make the email more relevant without adding hours of manual research.
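
    A quick sketch of that funnel math, using the cited 3.2x reply lift and 47% demo-conversion lift; the send volume and baseline rates are assumptions you should replace with your own numbers:

    ```python
    # Funnel arithmetic: what the cited lifts mean at mid-market volume.
    # Send volume and baseline rates below are assumptions for illustration.
    emails_per_month = 2_000
    baseline_reply_rate = 0.02           # assumed 2% baseline reply rate
    baseline_demo_rate = 0.15            # assumed 15% of replies book a demo

    replies_before = emails_per_month * baseline_reply_rate
    replies_after = replies_before * 3.2
    demos_before = replies_before * baseline_demo_rate
    demos_after = replies_after * baseline_demo_rate * 1.47

    print(f"Replies: {replies_before:.0f} -> {replies_after:.0f} per month")
    print(f"Demos booked: {demos_before:.0f} -> {demos_after:.0f} per month")
    ```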

    Revenue intelligence platforms add another layer. They can identify at-risk deals 45 days earlier and recover 28% of stalled pipeline. That is useful for managers who need to know where deals are slipping before the quarter is already lost. It is also useful for reps who need a short list of accounts that deserve attention now.

    Mini case: the manager trying to save the quarter

    A regional sales manager sees pipeline coverage looking fine on paper, but a few large deals have gone quiet. A revenue intelligence workflow flags those deals 45 days earlier than the old process would have. The manager checks the notes, sees that two opportunities have no next step, and pushes the rep to re-engage the buyer. If one of those deals closes, the AI did not “make the sale,” but it did surface the risk in time to matter. That is the kind of ROI leaders can defend.

    What the Cost Structure Means for Mid-Market Budgets

    Pricing matters more than most AI vendors want to admit. Claude Enterprise is expensive on a per-token basis, but it is also built for heavier work. ChatGPT Enterprise is simpler to budget for at $60 per user per month with unlimited access. Gemini offers the lowest API cost at $2/$12 per million tokens, plus a $19.99 monthly Google One AI Pro tier for lighter use.

    The right way to think about cost is not “Which tool is cheapest?” It is “Which tool creates the least total work?” A cheaper model that produces weaker analysis can cost more if analysts spend extra time checking and rewriting outputs. A more expensive model can be cheaper overall if it reduces manual review, speeds up deliverables, and improves the quality of decisions.

    There is also a hard comparison to keep in mind. Fully human resolution in chat workflows can cost $8-$15 per interaction, while AI-assisted resolution can drop that to $0.50-$2.00. That spread is large enough to justify experimentation even before you count the labor freed up for higher-value work. For mid-market teams with lean headcount, that matters more than a flashy feature list.

    Budgeting should follow usage patterns. If 80% of your work is routine drafting and internal summaries, ChatGPT or Gemini may give you the best return. If 20% of your work is complex account analysis that influences major revenue decisions, Claude may be the better investment even at a higher unit cost. Most teams need to stop asking for one platform to win on every axis.
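    A quick back-of-the-envelope model shows how that 80/20 split plays out. Every number below (task volumes, tokens per task, per-million-token prices) is a placeholder assumption to illustrate the math, not a quote from any vendor.

    ```python
    # Back-of-the-envelope blended cost for one month. All volumes and
    # per-million-token prices are placeholder assumptions, not quotes.
    routine_tasks   = 8_000   # drafting, summaries (the routine 80%)
    complex_tasks   = 2_000   # deep account analysis (the complex 20%)
    tokens_per_task = 3_000   # rough average, input plus output

    cheap_price_per_m   = 4.0    # assumed $/1M tokens for a low-cost model
    premium_price_per_m = 30.0   # assumed $/1M tokens for a premium model

    routine_cost = routine_tasks * tokens_per_task / 1e6 * cheap_price_per_m
    complex_cost = complex_tasks * tokens_per_task / 1e6 * premium_price_per_m

    print(f"Routine tier:  ${routine_cost:,.2f}")
    print(f"Complex tier:  ${complex_cost:,.2f}")
    print(f"Blended total: ${routine_cost + complex_cost:,.2f}")
    ```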

    Why So Many AI Projects Miss ROI Targets

    The failure rate is not about the models alone. It is usually about process. Many teams buy a tool, give access to a few people, and expect the savings to appear. They do not define the workflow, the quality bar, the handoff points, or the metric they want to move. That is how you end up with broad adoption and weak return.

    The clearest signal is the gap between use and value. Again, 89% of revenue organizations use AI, but only 42% hit ROI targets. That means most companies have access, but many do not have implementation discipline. The teams that win usually do three things well: they pick the right use case, they define success before rollout, and they measure the time or revenue impact after launch.

    Another reason projects stall is model mismatch. If you use a general-purpose tool for a long-document analysis job, output quality drops and review time rises. If you use an expensive premium model for a simple drafting task, costs climb without much added value. If you use a cheap model where the cost of error is high, the hidden rework can erase savings. The model has to match the task.

    That is why the best teams treat AI like a workflow layer, not a magic button. They use Claude where the analysis is deep, ChatGPT where the work is broad, and Gemini where the environment is already Google-first. They also keep humans in the loop for exceptions, approvals, and customer-facing decisions that need judgment.

    A Practical Way to Choose Between Claude, ChatGPT, and Gemini

    If you are trying to decide where to start, use the work itself as the filter.

    • Choose ChatGPT if your team needs the broadest adoption, the easiest onboarding, and strong general productivity for research, drafting, and summarization. It still has 800-900 million weekly active users and strong enterprise penetration.
    • Choose Claude if your workflow depends on long documents, deep analysis, and higher-confidence synthesis across many inputs. Its 500,000+ token context window and enterprise usage growth suggest it is built for serious knowledge work.
    • Choose Gemini if your team lives in Google Workspace and wants low-cost, native integration with Docs, Sheets, Gmail, and Chrome. Its $2/$12 per million token pricing makes it attractive for scale.

    If you can only test one use case first, start with a workflow that already has a measurable bottleneck. Good candidates are lead scoring, account research, call-note summarization, pipeline risk summaries, and first-draft outbound emails. These are the kinds of tasks where time savings and quality improvements are easy to see.

    Do not start with a vague “AI strategy.” Start with a spreadsheet, a queue, or a reporting task that already eats hours every week. Then measure the before and after. If the new workflow saves 10 hours a week, improves meeting bookings, or reduces stalled pipeline, you have a real business case. If it does not, the problem is usually the use case, not the model.

    The Bottom Line for Mid-Market Sales Teams

    The 2026 AI market is no longer a one-horse race. ChatGPT still has the broadest reach and the easiest adoption path. Claude is proving that premium analytical quality can pay off in enterprise workflows. Gemini is growing fast by sitting inside the tools many teams already use and undercutting rivals on price.

    For mid-market sales analytics, the strongest ROI usually comes from matching the model to the job, not from betting everything on one platform. Use ChatGPT for broad productivity, Claude for deep analysis, and Gemini for low-cost, Google-native workflows. The teams that do that well are the ones turning AI from a demo into a measurable part of the revenue process.

    If you are still evaluating tools, the best next step is simple: pick one workflow, define one metric, and run one controlled test. The market is already moving. The only question is whether your process is moving with it.

  • Claude Code Reviews: 50% Productivity Gains vs. Rising Quality Control Concerns in Enterprise Development

    Claude Code Reviews: 50% Productivity Gains vs. Rising Quality Control Concerns in Enterprise Development

    Claude Code is doing two things at once. It is making developers faster, and it is making some teams nervous about what speed is doing to code quality. Anthropic says its own employees now use Claude in 59% of their work, up from 28% a year earlier, and they report productivity gains rising from 20% to 50% in that same period. At the same time, independent research across 211 million changed lines of code shows refactoring falling sharply while cloned code is rising. Both things can be true.

    For mid-market teams, that tension matters more than the hype cycle. If you manage analysts, VAs, BI power users, or engineers who work inside real delivery constraints, the question is not whether AI can write code. The question is whether it helps the team ship better work, faster, without turning the codebase into a pile of repeated patterns and missed edge cases.

    Why Claude Code is not just another autocomplete tool

    Claude Code is built around a different idea than old-school static analysis or inline autocomplete. Instead of scanning for known patterns and simple syntax issues, it uses agentic review: multiple specialized agents look at the same pull request from different angles, including logic, security, regression risk, and performance. That matters for codebases where a change in one file can break behavior three layers away.
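    A rough sketch of that idea is below. The reviewer roles, instructions, and the `ask_model` helper are assumptions for illustration; this is not Anthropic's implementation, just the shape of a multi-angle review.

    ```python
    # Illustrative multi-angle review: several specialized "reviewers" inspect
    # the same diff and their findings are merged. ask_model() is an assumed helper.
    REVIEWERS = {
        "logic":       "Check the diff for logic errors and broken invariants.",
        "security":    "Check the diff for injection, authz, and secret-handling issues.",
        "regression":  "Check whether callers elsewhere in the codebase could break.",
        "performance": "Check for obvious hot-path or query regressions.",
    }

    def ask_model(instructions: str, diff: str) -> list[str]:
        """Placeholder for a real model call returning a list of findings."""
        raise NotImplementedError

    def review_pull_request(diff: str) -> dict[str, list[str]]:
        return {role: ask_model(instructions, diff)
                for role, instructions in REVIEWERS.items()}
    ```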

    Anthropic says its code review system keeps false positives under 1% in internal testing, and the share of meaningful pull request reviews increased from 16% to 54% after deployment. That is a big jump, but the more important detail is what kind of work it handles. Claude Code supports context windows up to 1 million tokens, which means it can hold far more of a codebase in view than a typical autocomplete tool. It also shows 80.8% SWE-bench accuracy with agent teams, which is one reason teams use it for multi-file refactors instead of just line-by-line suggestions.

    A side-by-side diagram showing traditional static analysis scanning one file versus Claude Code reviewing multiple files and dependencies across a codebase

    The practical difference is simple. A static analyzer tells you, “This function may be unsafe.” Claude Code can tell you, “This change looks safe in isolation, but it breaks the calling pattern used in three other files, and the test coverage does not touch that path.” That is not magic. It is broader context plus reasoning over relationships, which is exactly where many real bugs live.

    The productivity story is real, and the numbers are hard to ignore

    Anthropic’s internal survey data is the clearest signal that this is not a niche experiment anymore. In twelve months, Claude usage inside the company climbed from 28% of daily work to 59%, while reported productivity gains rose from 20% to 50%. That does not mean every task got twice as fast. It means the tool became part of normal work, and the people using it felt a material change in output.

    External deployments point in the same direction. Enterprise case studies report 55% to 80% productivity gains in refactoring tasks, and some organizations have saved more than 500,000 staff hours through Claude-powered workflows. TELUS, for example, deployed AI across 57,000 employees and built over 13,000 AI-powered tools on top of it. That is not a pilot. That is process change at scale.

    There is also a less obvious productivity gain: Claude is used for work that teams would not have done manually. Anthropic says 27% of Claude-assisted work falls into that category, including scaling projects and exploratory tasks that would not have been cost-effective otherwise. In plain terms, the tool does not just speed up existing work. It expands the set of work a team can afford to attempt.

    That matters for analysts and operators. If a team can generate a new reporting workflow, test a data cleanup approach, or refactor a brittle script without spending a full day on it, they will try more things. Some of those things will be useful. Some will not. But the opportunity cost drops, and that changes behavior.

    Mini case study: the refactor that would have sat in the backlog

    A finance team has a reporting script that runs monthly, but it is slow and fragile. Normally, the team would keep patching it because a full rewrite would take too long. With Claude Code, an engineer can ask for a staged refactor: isolate the data access layer, rewrite the transformation logic, and generate tests for the most failure-prone paths. If the review agent catches a logic mismatch before merge, the team saves a painful production incident later. The gain is not just speed. It is making a higher-quality refactor feasible inside a normal sprint.

    Why Claude Code review looks different from older quality tools

    Traditional code quality tools are good at rules. They can flag missing semicolons, insecure calls, or obvious style violations. They are weaker at understanding intent. Claude’s advantage is semantic understanding: it can inspect a pull request in context and reason about what the code is trying to do, not just whether it matches a rule.

    That is why the review system scales with PR complexity. A trivial change gets a light pass. A larger change gets deeper analysis. Anthropic says the average review takes about 20 minutes, and the system is tuned to keep false positives below 1% in internal testing. In practical terms, that means developers are less likely to ignore the output the way they often ignore noisy linting or over-eager security scanners.

    The “meaningful PR review rate” jump from 16% to 54% is the more useful metric. It suggests the system is not just producing comments. It is producing comments people actually use. That is the difference between a dashboard and a workflow tool.

    Still, it is important not to confuse internal performance with universal performance. A system can look excellent in a controlled environment and then struggle in a different codebase, with different coding conventions, different risk tolerance, and different data quality. That gap is where most enterprise AI projects get messy.

    The quality-control problem is not theoretical

    The strongest challenge to AI-assisted development comes from longitudinal data, not anecdotes. GitClear analyzed 211 million changed lines of code and found that refactored code fell from 25% in 2021 to less than 10% in 2024. Over the same period, cloned code rose from 8.3% to 12.3%. That is a bad sign if you care about maintainability.

    The pattern suggests teams may be leaning more on copy-paste behavior and less on thoughtful refactoring. That can happen when AI makes it easy to generate working code quickly. The first version gets written. The second version gets copied from the first. The third version gets copied again. Over time, the codebase becomes harder to change because repeated logic drifts apart.

    This is where vendor-sponsored studies and independent studies diverge. GitHub reported that Copilot users had a 56% greater likelihood of passing unit tests and 13.6% fewer code errors. But critics pointed out that the study used basic CRUD applications, which are heavily represented in training data and are not the hardest test of code quality. GitClear’s dataset is broader, longitudinal, and drawn from major tech and enterprise repositories. On balance, the independent data is the stronger warning signal.

    The right conclusion is not “AI makes code worse.” The right conclusion is narrower: AI can improve short-term output while quietly worsening code structure if teams do not enforce refactoring discipline. That is a process problem, not a model problem.

    Mini case study: the team that shipped faster and inherited more mess

    A product team uses Claude to generate feature branches faster. Sprint velocity goes up. Then six months later, the same team spends more time fixing duplicated logic, inconsistent validation rules, and edge cases that were implemented three different ways. The initial productivity gain was real. So was the maintenance debt. If nobody tracks refactoring quality, the team can end up with faster delivery and slower long-term execution.

    Security review is useful, but not enough to trust blindly

    Claude Code review can catch real issues, and that matters in production systems. But independent testing shows a more cautious picture. Checkmarx Zero found that in production-grade scans, Claude identified eight vulnerabilities but only two were true positives. That is a much rougher result than Anthropic’s internal claim of keeping incorrect findings below 1%.

    The discrepancy is not surprising. Internal testing usually reflects the company’s own code patterns and the way the tool was tuned. Independent security research tends to use messier, more adversarial, and more diverse environments. Real-world accuracy likely sits between those two numbers and depends on how the tool is deployed, what kind of codebase it sees, and who reviews the output.

    For enterprise teams, the lesson is straightforward: treat Claude as an extra reviewer, not the final reviewer. Use it to surface likely issues faster. Then let a human decide whether the issue is real, relevant, and worth fixing right now. That is especially important for authentication flows, payment logic, data access, and anything that can create compliance exposure.

    There is a temptation to measure success by how many issues the AI finds. That metric is too crude. A better question is whether the AI helps your reviewers spend more time on the issues that matter and less time on obvious noise.

    Adoption is strong, but the pricing makes the use case matter

    Claude is not the cheapest option on the market. Pro plans sit at $20 per month, compared with $10 per month for GitHub Copilot. Claude Code review features average $15 to $25 per review, depending on complexity. That pricing makes sense only if the work being reviewed has enough value to justify it.

    For a team that ships small, repetitive CRUD changes, that cost can feel high. For a team that maintains a large codebase, handles risky refactors, or spends hours on review bottlenecks, it may be cheap. The economics depend on the value of the engineer’s time and the cost of a missed bug. A single avoided production incident can pay for a lot of reviews.

    The market seems to agree that the premium is acceptable for high-value work. Claude Code reached an estimated $2.5 billion run-rate by early 2026, and Anthropic’s revenue reportedly grew from $1 billion to $14 billion by February 2026. Those are not numbers you get from hobby adoption. They point to real enterprise demand.

    For mid-market buyers, the key is segmentation. Do not ask, “Should we buy Claude?” Ask, “Which parts of our workflow are expensive enough, risky enough, or repetitive enough to justify a premium review layer?” That is a much better procurement question.

    Mini case study: when premium pricing is still the cheaper option

    An operations team runs a customer-facing dashboard that depends on several intertwined SQL transformations and Python scripts. Each release takes two reviewers, and the team still misses edge cases. A $20 monthly seat or a $15 to $25 review fee sounds expensive until you compare it with the cost of delayed releases, broken reports, and manual rework. In that setting, Claude is not a nice-to-have. It is a way to reduce review bottlenecks where human attention is already scarce.

    What the 27% “new work” number means for managers

    One of the most interesting findings in Anthropic’s internal data is that 27% of Claude-assisted work would not have happened manually. That includes scaling projects and exploratory work. This is where managers need to pay attention, because the productivity story changes depending on how you measure it.

    If you only measure tasks completed faster, Claude looks like a time-saver. If you measure the additional work teams can now attempt, Claude looks like a capacity expander. Those are different outcomes. A team may not finish the same backlog faster, but it may do more useful work overall because the tool makes certain tasks economically viable.

    That is good news only if the extra work is actually valuable. Otherwise, teams can end up generating more output without improving business results. A manager should ask whether the new work is tied to revenue, risk reduction, customer experience, or internal efficiency. If it is not, the extra throughput may just create noise.

    This is also where individual productivity metrics can mislead. A developer who finishes a task 50% faster is useful. A team that ships the wrong thing 50% faster is not. Organizational outcomes still matter more than personal speed.

    Why measurement is harder than the marketing makes it sound

    AI vendors love clean metrics. Faster completion. Fewer errors. Higher test pass rates. Those numbers are real in controlled settings, but they do not always translate into better delivery outcomes. The problem is measurement scope.

    Individual productivity is easy to observe. Team delivery quality is harder. You can count completed tasks, but you also need to count rework, escaped defects, maintenance burden, review time, and how often the codebase gets harder to change. That broader picture is where the GitClear findings matter most. A codebase that accumulates clones and loses refactoring discipline may look productive in the short term and brittle in the long term.

    GitHub’s internal study showing 56% better unit-test pass rates is useful, but it is not enough to settle the question. Unit tests are one slice of quality. They do not fully capture architectural fit, maintainability, or the amount of future cleanup a team will need.

    For that reason, the best measurement framework is a balanced one. Track cycle time, review time, escaped defects, refactor frequency, duplicate code growth, and post-release rework. If Claude improves all of those, keep expanding it. If it only improves task throughput while the codebase gets messier, the apparent gain is fake.
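    One lightweight way to keep that balance visible is a small per-release scorecard. The fields below are illustrative; populate them from whatever your version control and ticketing systems already expose.

    ```python
    # A per-release scorecard: track delivery health, not just throughput.
    # Field names are illustrative; feed them from your VCS and ticket system.
    from dataclasses import dataclass

    @dataclass
    class ReleaseHealth:
        cycle_time_days: float           # idea to production
        review_hours: float              # total human review time
        escaped_defects: int             # bugs found after release
        refactor_commits: int            # commits that restructure existing code
        duplicate_lines_added: int       # cloned code introduced this cycle
        post_release_rework_hours: float

        def looks_healthy(self, previous: "ReleaseHealth") -> bool:
            # Throughput gains only count if structure is not degrading.
            return (self.cycle_time_days <= previous.cycle_time_days
                    and self.duplicate_lines_added <= previous.duplicate_lines_added
                    and self.escaped_defects <= previous.escaped_defects)
    ```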

    Where Claude Code fits best today

    Claude Code is strongest in messy, multi-step work where context matters. That includes refactoring, dependency tracing, security review, and changes that touch several files at once. It is also useful when a team needs to explore an idea quickly before deciding whether to build it for real.

    It is weaker when the task is simple, repetitive, and easy to verify by rule. In those cases, cheaper tools may be enough. GitHub Copilot still has an advantage in broad accessibility and inline autocomplete, especially for teams that want low-friction adoption. Claude is the better fit when reasoning depth matters more than immediate convenience.

    That distinction should guide rollout. Use Claude on high-value pull requests, architecture-sensitive changes, and complex refactors. Do not make it the default gate for every trivial edit. The more focused the deployment, the easier it is to measure whether it is actually helping.

    Teams that succeed with Claude usually do three things well: they define where AI is allowed to act, they keep humans in the final decision loop, and they measure code health instead of only measuring speed. That combination is what turns an impressive tool into a reliable part of the workflow.

    The practical takeaway for mid-market teams

    If you run a data-heavy team, the best way to think about Claude Code is not as a replacement for reviewers. Think of it as a very fast first-pass analyst for code. It can scan more context than a person can hold in working memory, surface likely issues, and accelerate refactors that would otherwise stall. The upside is real: 50% reported productivity gains, 500,000+ staff hours saved, and strong performance on complex tasks such as 80.8% SWE-bench accuracy.

    But the warning signs are real too. Independent data across 211 million changed lines shows more cloning and less refactoring. Security testing shows that some production scans produce far more false leads than vendor claims suggest. Those are not reasons to avoid the tool. They are reasons to deploy it carefully.

    If you are deciding whether to adopt Claude Code, start with one workflow where the cost of a mistake is high and the value of a faster review is obvious. Measure the result over a few release cycles. If it improves speed without hurting maintainability, expand it. If the code gets faster to write but harder to live with, pull back and tighten the oversight.

    The teams that win with AI-assisted development will not be the ones that automate the most. They will be the ones that know exactly where automation helps, where it misleads, and where a human still needs to say, “No, this needs a real review.”

  • The State of LLMs in Business Today: What’s Working, What’s Not, and What Comes Next

    The State of LLMs in Business Today: What’s Working, What’s Not, and What Comes Next

    Most businesses are no longer asking whether large language models, or LLMs, are useful. They are asking where they fit, what they can be trusted to do, and how much human oversight they still need. That shift matters. The early excitement was about what these tools might do. The current conversation is about what they actually do well in daily work.

    For analysts, virtual assistants, and BI-heavy teams, the answer is practical. LLMs are already helping people draft emails, summarize long documents, search internal knowledge, answer routine customer questions, and speed up repetitive workflows. They are also creating new problems: wrong answers that sound confident, unclear ownership, privacy concerns, and hidden costs that show up after the pilot phase.

    The teams getting value from LLMs are not treating them as magic. They are using them for specific tasks where speed matters, the output can be checked, and the risk of a mistake is manageable. The teams struggling with them are often trying to use them as replacements for judgment, process, or clean data. That does not work.

    Where LLMs Are Already Useful

    The strongest use cases are the ones that reduce time spent on first drafts and routine reading. LLMs are good at turning rough input into something structured enough to work with. That is why drafting is one of the first places they show up. A manager can paste notes from a meeting and get a clean follow-up email. An analyst can turn bullet points into a report summary. A virtual assistant can create a polished response from a few key facts.

    Summarization is another clear win. Many teams deal with long documents, call transcripts, policy updates, tickets, or research notes. An LLM can cut that down to the parts that matter most. It will not always choose the right details, but it can save a lot of reading time when the goal is to get oriented fast.

    Search is changing too. Traditional search works best when you know the right keyword. LLM-based search is better when the question is messy. A user can ask, “What is our refund policy for annual plans in Europe?” and get an answer that pulls from several documents instead of a list of links. For internal knowledge bases, that is a real improvement.

    Customer support is one of the most visible business use cases. LLMs can handle common questions, explain basic steps, and route cases to the right team. They are especially useful when the support volume is high and the questions repeat. The model does not need to solve every issue. It just needs to reduce queue time and handle the easy cases cleanly.

    Workflow automation is where the value starts to compound. An LLM can read an incoming message, classify the request, draft a reply, extract the needed fields, and send the task to the right system. In practice, that means less copy-paste work and fewer manual handoffs. The model is not replacing the workflow. It is handling the parts that are repetitive and text-heavy.
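    A minimal sketch of that pipeline is below, assuming a generic `complete` helper that stands in for whichever LLM client a team actually uses; the categories, fields, and queue names are placeholders.

    ```python
    # Illustrative text-heavy workflow: classify, extract, draft, then route.
    # complete() stands in for whichever LLM client the team uses.
    import json

    def complete(prompt: str) -> str:
        """Placeholder for a real LLM call returning plain text."""
        raise NotImplementedError

    def handle_incoming(message: str) -> dict:
        category = complete(
            "Classify this request as exactly one of: billing, support, sales.\n" + message
        ).strip().lower()

        fields = json.loads(complete(
            "Return customer_name, account_id, and request_summary as JSON.\n" + message
        ))

        draft = complete(
            f"Draft a short, polite reply to this {category} request:\n" + message
        )

        # Routing stays a plain lookup so it is easy to audit and change.
        queues = {"billing": "finance-queue", "support": "helpdesk", "sales": "crm"}
        return {"queue": queues[category], "fields": fields, "draft_reply": draft}
    ```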

    A simple workflow diagram showing an LLM sitting between email, documents, support tickets, and internal systems, with arrows for drafting, summarizing, searching, and routing tasks.

    What LLMs Still Do Poorly

    The biggest issue is accuracy. LLMs can produce answers that sound complete and are still wrong. That is not a small flaw. In business settings, a wrong policy answer, a bad calculation, or a false summary can create real work for the team that has to clean it up.

    They also struggle with consistency. Ask the same question twice and you may get slightly different answers. Change the wording a little and the output can shift more than you expect. For tasks that require strict repeatability, that is a problem. A model can help prepare the work, but it should not be the only system of record.

    Control is another gap. Most business processes need clear rules. They need to know which source is trusted, what happens when information is missing, and who approves the final output. LLMs are flexible, which is useful, but that flexibility makes them harder to govern than a fixed rules engine or a standard report.

    They also have trouble with narrow context when the surrounding information is important. A model may understand a policy in general terms and still miss a clause that changes the answer for a specific customer segment, region, or contract type. In other words, it can sound right while skipping the detail that matters most.

    Finally, they are not naturally good at accountability. If a dashboard is wrong, someone can trace the data source. If an LLM gives a poor answer, the path from input to output is often less transparent. That makes review, logging, and source grounding more important than they are in many other tools.

    The Main Adoption Patterns

    Most companies are adopting LLMs in one of three ways. The first is the standalone chat tool. This is the version many people know best: a public interface where a user types a prompt and gets a response. It is fast to try and useful for individual productivity. People use it to draft text, brainstorm ideas, rewrite content, and summarize material.

    Standalone chat is also the easiest place to create shadow usage. Employees start using public tools for work content without a clear policy on what can be pasted in. That creates privacy and compliance risk if sensitive client data, financial data, or internal plans enter a system the company does not control.

    The second pattern is embedded AI inside existing software. This is now showing up in CRM systems, help desks, productivity suites, document tools, and analytics platforms. The advantage is simple: the AI is already where the work happens. A support agent can draft a reply inside the ticketing system. A spreadsheet user can ask for a formula suggestion without leaving the file. A BI user can ask a question in plain English and get a chart or summary.

    Embedded AI tends to be easier to adopt than a separate tool because it fits into a known workflow. It also reduces the friction of training people on a new interface. The downside is that the model is only as useful as the product’s integration and guardrails. If the embedded feature cannot access the right data or cannot explain where its answer came from, the convenience disappears fast.

    The third pattern is the custom internal copilot. This is a company-built assistant connected to internal documents, systems, and permissions. It may answer HR questions, help sales teams find product information, or support analysts by pulling from internal reports and notes. These copilots are attractive because they can be tailored to the business and limited to approved data sources.

    They are also harder to build well. A useful internal copilot needs clean permissions, good retrieval from internal content, clear boundaries, and ongoing maintenance. If the knowledge base is outdated or badly organized, the copilot will surface bad answers faster than a human search process would. The tool does not fix messy information. It exposes it.

    Privacy, Governance, and Human Review Are Not Optional

    For business use, data privacy is one of the first questions to settle. Teams need to know what data can be sent to a model, where that data is stored, and whether it is used for training. That is not a legal footnote. It affects whether the tool can be used on customer records, contracts, internal financials, or employee data.

    Governance matters just as much. Someone has to decide which use cases are approved, which models are allowed, who can configure them, and how outputs are reviewed. Without those rules, adoption becomes inconsistent. One team uses a consumer chatbot. Another uses an embedded feature with different controls. A third builds its own workflow with no logging at all. That is not a strategy.

    Cost is easy to underestimate. LLM usage can look cheap in a demo and become expensive at scale. A few test prompts are not the real bill. Real costs appear when hundreds of users run daily queries, when documents are large, when output must be reprocessed, or when the company adds retrieval, logging, and security controls around the model.

    Prompt quality also matters, but not in the mystical way people sometimes describe it. Good prompts are simply clear instructions. They specify the task, the audience, the format, the source material, and the limits. A vague prompt gives a vague answer. A precise prompt reduces the amount of cleanup needed after the model responds.
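    A minimal template that covers those five elements, with placeholder values, could look like this sketch:

    ```python
    # A plain prompt template covering task, audience, format, sources, and limits.
    # The filled-in values are placeholders.
    PROMPT_TEMPLATE = """\
    Task: {task}
    Audience: {audience}
    Format: {fmt}
    Use only this source material:
    {source}
    Limits: {limits}
    """

    prompt = PROMPT_TEMPLATE.format(
        task="Summarize the attached policy update.",
        audience="Customer support agents with no legal background.",
        fmt="Five plain-language bullets, under 120 words total.",
        source="<paste the policy text here>",
        limits="If something is not in the source, say it is not covered.",
    )
    ```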

    Human review is still required for many business tasks. That does not mean the model is useless. It means the model should draft, classify, extract, or summarize, and a person should verify the result when the stakes are high. In practice, the best systems use LLMs to reduce effort, not to remove accountability.

    How to Judge a Use Case Without Getting Distracted by the Hype

    A good way to evaluate LLM use cases is to start with risk, not novelty. Ask what happens if the model is wrong. If the answer is “a minor edit,” that is a good sign. If the answer is “a customer gets the wrong financial guidance,” the use case needs stronger controls or a different design.

    ROI should be measured in time saved, error reduction, and throughput, not just in excitement. A useful pilot often reduces repetitive work for a specific team. For example, if support agents spend ten minutes summarizing each ticket before routing it, an LLM that cuts that in half can create real value. The gain is easy to see and easy to measure.

    Operational readiness is the third test. Some teams have the data quality, process discipline, and review capacity to support an LLM workflow. Others do not. If the source documents are scattered across folders, if nobody owns the knowledge base, or if approvals are already slow, adding an LLM will not fix the underlying process.

    One practical filter is to separate tasks into three buckets:

    • Low risk, high volume: drafting, summarization, classification, internal search, and simple customer replies.
    • Medium risk, controlled review: sales enablement content, analyst support, policy Q&A, and workflow routing.
    • High risk, strict oversight: legal, financial, medical, compliance, and anything that directly affects external commitments.

    The first bucket is where most companies should start. It gives teams experience with the tools without putting the business in a fragile position. The second bucket can work when source data is solid and review is built in. The third bucket needs careful design, and in some cases it should stay human-led.
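    If it helps to make the filter operational, a tiny policy table like the sketch below is usually enough; the bucket names mirror the list above, and the review rules are illustrative defaults, not recommendations for your specific risk profile.

    ```python
    # Illustrative mapping from risk bucket to the review an output must pass.
    # Bucket names mirror the list above; the rules are example defaults.
    RISK_POLICY = {
        "low":    {"human_review": "spot-check", "can_auto_send": True},
        "medium": {"human_review": "required",   "can_auto_send": False},
        "high":   {"human_review": "required",   "can_auto_send": False,
                   "note": "named owner signs off; model output is draft-only"},
    }

    def review_requirements(bucket: str) -> dict:
        return RISK_POLICY[bucket]
    ```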

    What’s Coming Next

    Multimodal models are already changing how people use LLMs. These systems can work with text, images, tables, charts, audio, and sometimes video. For business users, that means a model can review a screenshot of a dashboard, read a scanned document, or summarize a meeting recording. The value is not just that the model handles more formats. It is that more of the messy real-world input becomes usable.

    Agentic workflows are the next step after simple chat. In a basic chat setup, the user asks a question and gets a response. In an agentic workflow, the model can take a sequence of actions: search for information, compare records, draft a response, update a ticket, and flag exceptions. That is powerful, but it also raises the stakes. The more steps the model can take on its own, the more important permissions, audit logs, and stop conditions become.

    Enterprise integration will matter more than model size. The winning systems will not just answer questions. They will connect to document stores, CRMs, ticketing systems, ERP tools, and analytics platforms. A model that can see the right internal context and act inside the right workflow will be more useful than a slightly smarter model that sits on its own.

    There is also a shift toward narrower, better-controlled deployments. Many companies are moving away from “let everyone experiment” and toward approved use cases with defined data sources, review steps, and measurable outcomes. That is a healthier pattern. It reduces risk and makes it easier to see what the tools are actually doing for the business.

    The near future will probably not be defined by one giant breakthrough. It will be defined by better fit. LLMs will get more useful where the work is text-heavy, repetitive, and connected to business systems. They will stay weak where precision, traceability, and hard control matter more than speed.

    What Smart Teams Are Doing Now

    The most effective teams are not asking whether to “adopt AI.” They are asking which tasks are worth speeding up, which tasks need better controls, and which tasks should stay human-reviewed. That is a more useful question, and it leads to better decisions.

    If you are evaluating LLMs in your own workflow, start small and specific. Pick one process that is repetitive, text-heavy, and easy to check. Measure the time saved. Look at the error rate. Decide who reviews the output. Then decide whether the result is good enough to scale.

    That approach is slower than buying into the hype, but it is also how teams end up with tools people actually use. The value is not in having an AI feature. The value is in making real work faster, cleaner, and easier to manage.

  • Agent design patterns for production

    Agent design patterns for production

    TL;DR:

    Build agents by composing small, well-instrumented capabilities, enforcing runtime controls, and designing for predictable state transitions. These patterns reduce surprises in production and make agents auditable, debuggable, and cost-effective.

    Intro

    CTOs evaluating agent deployments quickly run into two recurring realities: agents are powerful automation that can dramatically reduce manual toil, and they are a new class of system with emergent failure modes. Successful production agents aren’t just “smart” — they are engineered with patterns that make them observable, bounded, and interoperable with existing operational practices. This article distills pragmatic design patterns I’ve seen work across cloud orchestration, customer service automation, and payment platforms, and shows how those patterns translate into measurable operational improvements.

    Design for composable, bounded capabilities

    Start by decomposing an agent into narrowly-scoped capabilities (skills, workers, or modules). Each capability exposes a small, well-documented interface: inputs, outputs, preconditions, expected side effects, and idempotency properties. This sounds obvious, but teams often hand agents unrestricted access to resources and then wonder why debugging is impossible.

    Key design choices here are explicit schemas and idempotent actions. Use typed APIs or JSON schemas for every action the agent can request. Require an explicit “dry-run” flag for potentially destructive operations so you can validate plans before execution. Enforce idempotency by design: if an agent retries “create-user,” the back-end responds with a stable result rather than creating duplicates.
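    A minimal sketch of what that looks like in code, using a hypothetical create-user capability; the names and the in-memory store are illustrative, but the pattern (typed request, explicit dry-run flag, idempotency key) is the point.

    ```python
    # Sketch of a narrowly scoped capability: explicit schema, dry-run support,
    # and an idempotency key so retries cannot create duplicates. Names are illustrative.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CreateUserRequest:
        email: str
        role: str
        idempotency_key: str   # same key -> same result, never a second user
        dry_run: bool = True   # destructive paths must be opted into explicitly

    _created: dict[str, str] = {}  # idempotency_key -> user_id (stand-in for a real store)

    def create_user(req: CreateUserRequest) -> str:
        if req.idempotency_key in _created:
            return _created[req.idempotency_key]   # a retry gets the stable result
        if req.dry_run:
            return f"DRY-RUN: would create {req.email} as {req.role}"
        user_id = f"user-{len(_created) + 1}"      # placeholder for the real side effect
        _created[req.idempotency_key] = user_id
        return user_id
    ```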

    Benefits: smaller blast radius, simpler testing, and easier role-based permissions. In practice, teams replacing monolithic agents with composable capabilities saw mean time to resolution for agent-induced incidents fall by 30–60% because root causes were isolated to a single module.

    Operational controls: policies, limits, and circuit breakers

    Production agents must operate under operational constraints. Explicitly encode runtime policies: quota limits, cost budgets, concurrency caps, and real-time safety checks. These controls should be enforced outside the core reasoning loop so they cannot be bypassed by a clever plan. In other words, separate “what to do” (the plan) from “whether to do it” (the policy enforcement layer).

    Circuit breakers and throttles are simple but effective. A circuit breaker can open when an upstream service has elevated error rates, forcing the agent to shift to a degraded mode (notify humans, stash the plan, or run a read-only analysis). Throttling helps control cost and latency: agents that invoke downstream ML models or third-party APIs should be able to back off based on budget signals.
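    Here is a compact sketch of both controls, enforced outside the reasoning loop; the thresholds and cooldown values are illustrative.

    ```python
    # Minimal policy layer enforced outside the planning loop: a circuit breaker
    # keyed on upstream error rate plus a per-incident retry budget.
    # Thresholds and cooldowns are illustrative.
    import time

    class CircuitBreaker:
        def __init__(self, error_threshold: float = 0.25, cooldown_s: float = 300):
            self.error_threshold = error_threshold
            self.cooldown_s = cooldown_s
            self.opened_at = None

        def allow(self, recent_error_rate: float) -> bool:
            if self.opened_at and time.time() - self.opened_at < self.cooldown_s:
                return False                 # still open: stay in degraded mode
            if recent_error_rate > self.error_threshold:
                self.opened_at = time.time() # trip the breaker
                return False
            self.opened_at = None
            return True

    class RetryBudget:
        def __init__(self, max_retries_per_incident: int = 5):
            self.remaining = max_retries_per_incident

        def spend(self) -> bool:
            if self.remaining <= 0:
                return False                 # budget exhausted: escalate to a human
            self.remaining -= 1
            return True
    ```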

    Evidence: one cloud-ops team added a cost-aware throttle and saw API spend from automated remediation drop 22% without losing remediation coverage. Another team reduced the frequency of runaway reconcilers by instituting a simple retry budget per incident, turning unpredictable spikes into bounded queues.

    Observe, record, and enable deterministic replay

    Agents make decisions; production teams must be able to replay those decisions to understand what happened and why. Instrument every decision point: inputs, intermediate reasoning states, deterministic plan fragments, and final execution calls. Store those artifacts in a compact, structured format so you can reconstruct a run without storing raw model outputs or full logs.

    Deterministic replay is not about believing the agent is always right; it’s about making debugging tractable. Keep a canonical “run record” that contains the request, the validated plan, policy evaluations, and the execution log with timestamps. Add a replay mode that replays the plan in a sandboxed environment (dry-run) and verifies that the same sequence of validations and guardrails triggers the same outcomes.
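    A minimal run-record sketch follows, assuming a sandboxed executor that accepts a dry-run flag; field names are illustrative.

    ```python
    # A compact run record: enough structure to replay a plan in dry-run mode
    # without storing raw model output. Field names are illustrative.
    import json
    import time
    from dataclasses import dataclass, field, asdict

    @dataclass
    class RunRecord:
        request_id: str
        request: dict
        plan: list                       # validated plan fragments, in order
        policy_results: list             # which checks ran and what they decided
        executions: list = field(default_factory=list)

        def log_execution(self, step: dict, outcome: str) -> None:
            self.executions.append({"step": step, "outcome": outcome, "ts": time.time()})

        def to_json(self) -> str:
            return json.dumps(asdict(self))

    def replay(record: RunRecord, execute) -> list:
        """Re-run the recorded plan against a sandboxed executor in dry-run mode."""
        return [execute(step, dry_run=True) for step in record.plan]
    ```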

    Practical wins from this pattern include faster incident postmortems and safer rollbacks. Teams that implemented structured run records cut postmortem analysis time in half and were able to add unit-like regression tests that reproduce prior failures.

    • Make capabilities small and well-typed; enforce idempotency and dry-run support.
    • Separate planning from policy enforcement; implement quota, cost, and safety checks externally.
    • Add circuit breakers and retry budgets to bound failure modes.
    • Record structured run artifacts that enable deterministic replay and automated tests.
    • Design interaction models that prefer human escalation over irreversible actions in ambiguous situations.
    • Use canaries and progressive rollouts for new capabilities; monitor business SLOs as well as technical metrics.
  • Daily digest: Agents, Systems, and Mental Models (auto)

    Daily digest: Agents, Systems, and Mental Models (auto)

    Intro

    As CTOs, you are being asked to make fast, high-stakes decisions about agentic AI, orchestration platforms, and how to operationalize mental models across engineering teams. The technology is moving from research demos into production pipelines: models that can plan, delegate, and act autonomously are now tools for customer support, SRE automation, sales assistance, and data pipelines. This shift is not incremental. It changes the unit of composition from “models” to “systems of agents” and therefore demands a different set of mental models, metrics, and controls.

    This digest lays out a concise, evidence-based framework you can use this week: (1) distinguish agents from systems, (2) harden your engineering mental models, and (3) map concrete operational controls. I’ll provide two short examples of concrete trade-offs we see in production and a compact checklist you can use in design reviews.

    Agents vs. Systems: change the primitives you reason about

    An “agent” in contemporary usage is a model plus behavior: planning, state, and the ability to take actions (API calls, SQL, code changes, messages). A “system” is an arrangement of agents, data stores, procedural controls, and human feedback loops. Treating agents as isolated components and plugging them into existing microservices is a mistake; engineers must instead design interactions, incentives, failure modes, and observability at the system level.

    Evidence from early deployments shows that failures are rarely caused by a single agent model making a bad prediction. Instead, failures arise from emergent interactions: misaligned prompts, cascading retries, racing updates to shared state, or feedback loops that reinforce incorrect behavior. In other words, the right mental model is not “improve the model” but “control the interaction graph.”

    Mental models that matter for CTOs

    Adopt three engineering mental models immediately: control loops, cost-of-error classification, and epistemic boundaries. These models are lightweight, actionable, and map directly to product and risk decisions.

    Control loops: Think in terms of closed loops with explicit sensors, actuators, and latency budgets. Where is your agent observing the world (logs, telemetry, user inputs)? What actions can it take, and how quickly? If your agent can write database rows, trigger transactions, or deprovision servers, put rate limits, timeouts, and human-in-the-loop gates in those loops. Empirical deployments demonstrate that short, well-instrumented loops reduce the mean time to detect and remediate agent-induced outages.

    Cost-of-error classification: Not all errors are equal. Classify actions by their potential for irreversible harm (financial loss, privacy exposure, regulatory breach). For low-cost actions, aggressive automation and learning-from-feedback are acceptable. For high-cost actions, require explicit authorization, multi-party confirmation, or conservative rollbacks. This is where SRE practices (blast-radius control, canaries, circuit breakers) translate directly to agentic systems.

    Epistemic boundaries: Understand where your models are competent and where they extrapolate. Maintain a runtime “competence map” that records which data domains, intents, and user segments a model is validated for. When agents operate outside their competence, they should either degrade gracefully (e.g., defer to a human) or expose uncertainty in structured ways that downstream systems can act on.
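    A competence map can be as simple as a lookup keyed by domain and intent, as in this illustrative sketch; the entries and confidence floors are placeholders.

    ```python
    # Illustrative competence map: which (domain, intent) pairs a model is
    # validated for, plus a confidence floor below which the agent defers.
    VALIDATED = {
        ("billing", "refund_status"): 0.80,
        ("billing", "invoice_copy"):  0.75,
        ("shipping", "delivery_eta"): 0.70,
    }

    def decide(domain: str, intent: str, confidence: float) -> str:
        floor = VALIDATED.get((domain, intent))
        if floor is None:
            return "defer_to_human"           # outside known competence
        if confidence < floor:
            return "answer_with_uncertainty"  # expose uncertainty in a structured way
        return "act"
    ```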

    Practical controls and architecture checklist

    Below is a concise checklist you can use in design reviews to decide whether an agent, a system redesign, or policy work is the primary mitigation. These are distilled from real production incidents and controlled experiments in enterprise deployments.

    • Define action classes and attach a risk tier to each (read-only, write-safe, write-persistent, irreversible).
    • Instrument every action with a request ID and provenance metadata (model version, prompt, agent plan steps).
    • Enforce rate limits, quotas, and canary windows per tenant and per agent.
    • Design explicit human-in-the-loop gates for tiered-risk actions with auto-escalation paths.
    • Maintain a competence map and require fallback strategies when confidence < threshold.
    • Implement observability for interaction graphs: call graphs, state stores, and external side-effects.
    • Use replayable logs and deterministic replays for post-mortem analysis.

    These controls are not optional add-ons; they are the minimal scaffolding that transforms brittle proofs-of-concept into reliable services.

    Two short examples help ground the trade-offs.

    Example 1 — the autonomous customer support agent: A team deployed an agent that could read a support ticket, query account data, and issue refunds up to $50. Initially, it reduced resolution time by 60%. After a change in the billing ledger schema, the agent misread a promotional credit as a refundable balance for a subset of users, issuing thousands of low-value refunds before detection. Root cause analysis showed three failures: schema assumptions in prompts, missing provenance on transactions, and lack of a write quota. Fixes were straightforward and immediate: add schema-aware adapters, attach provenance metadata to every refund, and move refunds into a canary mode requiring human approval above $10 until automated tests passed on the new schema.

    Example 2 — the SRE automation agent: An on-call automation agent was allowed to restart failed services autonomously. It correctly restarted many instances and reduced morning wake-ups. During a latency spike, the agent retried restarts concurrently across a cluster, causing a traffic storm and a cascading failure. The mitigations were to add a global circuit breaker and introduce jittered backoff on restart retries.

  • Tiny Experiments to Reclaim Your Focus

    Tiny Experiments to Reclaim Your Focus

    Design tiny experiments to reclaim focus

    Most productivity advice asks for dramatic overhauls: new tools, new schedules, or a complete life reboot. That’s seductive because big changes feel decisive. But they’re also fragile and hard to sustain. There’s a quieter option that’s both easier to start and easier to keep: tiny experiments—small, time-boxed adjustments you run long enough to learn from, not long enough to make you miserable.

    Why tiny experiments work

    Tiny experiments sidestep two fatal flaws of typical productivity fixes. First, they reduce friction: when the ask is small, you actually do it. Second, they convert hope into data: instead of a vague promise to “be more focused,” you have a measured change and a clear signal about whether it helped.

    Think of them like short A/B tests for your life. You don’t commit to a permanent change; you try something for a week or two, measure a simple outcome, and decide. Over time, the compounding wins are enormous—because small wins stack and because you get better at designing meaningful experiments.

    Pick the right levers

    Not all experiments are created equal. To keep results readable and actionable, constrain your test along three axes:

    • Time: How long is the experiment? For micro-experiments, pick 3–14 days.
    • Scope: What exactly changes? Be specific—“no notifications on desktop until 10:00” beats “fewer distractions.”
    • Measurement: What’s your outcome metric? It can be subjective (daily focus rating) or objective (number of deep-work sessions completed).

    Simple experiments you can run this week

    Here are five low-friction tests that reliably surface useful signals.

    • Notification quarantine: Silence email/mobile notifications until after a 90-minute morning block. Measure the number of uninterrupted work sessions and a one-line end-of-day note about task progress.
    • Micro-break cadence: Take a 30–60 second physical break every 25 minutes. Track perceived energy and how many hours you felt “in the zone.”
    • One-decision morning: Reduce morning choices—preset breakfast, clothes, and the first two tasks. Track decision fatigue and morning completion rate.
    • Commit-and-cut: Promise to try a single focused task for 50 minutes, then immediately switch to 10 minutes of a different activity—no multitasking. Count how many intervals you finish.
    • Email triage block: Batch email to two 20-minute slots per day. Record total emails processed and whether urgent items felt delayed.

    How to measure without making it a second job

    If measurement feels onerous, make the outcome simple and fast. Two approaches work well:

    • One-line end-of-day: Each evening jot a single line: what went well, what didn’t. It takes ten seconds and is gold for pattern recognition.
    • Binary signal: Did you complete the intended focus block? Yes/No. Tally across days to see trends.

    Turn findings into durable improvements

    After your test window, ask three questions: Did the change improve the chosen metric? Was it sustainable? What costs did it introduce? If a test passes on metrics and cost, scale it—extend the window or embed the habit by pairing the action with a stable cue (a calendar event, a specific time, the end of a meeting).

    If it fails, treat the result as information. A failure might mean the change was the wrong lever, the measurement was off, or context (meetings, team rhythms) made it impractical. Learn and design a follow-up experiment that addresses the failure mode.

    A retail analogy: trial sizes, not product swaps

    In retail, a smart buyer tests product quantities in small batches—trial sizes—to learn demand before committing shelf space. Tiny experiments are the same: test small, learn quickly, and only when the signal is clear do you expand the commitment. This approach reduces regret and keeps your personal operating system nimble.

    When experiments collide with team norms

    Single-person hacks are easy; team changes are trickier. When your experiment affects others, start by making the ask explicit: explain the short trial, what you’ll measure, and when you’ll revert or commit. Most teammates are more tolerant of temporary changes if they understand the hypothesis and timeline.

    Two quick templates to steal

    Use these when you don’t have time to design an experiment from scratch.

    • 90-minute morning focus: Mute notifications, do two prioritized tasks in a 90-minute block, one-line summary at noon. Run 7 days.
    • Micro-break cadence: 25/5 work/break rhythm for 10 sessions per day, record energy at day’s end. Run 5 days.

    What real progress looks like

    It’s not a sudden doubling of output. It’s fewer days where you feel scattered, more afternoons where you can actually finish the hard thing, and a growing library of experiments that reliably tilt your weeks. Over months, you’ll notice the compounding effect: the practices that survive are ones you barely notice doing anymore.

    Final prompt to run an experiment

    Pick one test above. Set a 7-day window. Pick a single metric—completion yes/no or a one-line daily log. Run the experiment, then review what the data tells you. Design your next experiment from the evidence, not from good intentions.

    Small experiments aren’t glamorous, but they’re how meaningful, lasting change happens. If you want one companion experiment to try this week, pick the notification quarantine: the payoff for a tiny ask is often surprisingly large.

  • Packaging Beats Peak Performance

    Packaging Beats Peak Performance

    Packaging beats peak performance: why open-source models stall at the doorstep

    There’s a pattern I keep seeing: an engineering team furiously trains a model, they throw the weights up on a public registry, and then they wait. Silence. Not because the model is bad, but because the work that actually unlocks usage—distribution, inference packaging, and developer ergonomics—was left for later. Two recent stories brought the pattern into sharp relief: one about a large company putting together an opinionated, enterprise-focused agent platform, and another where a promising open model hit the wall because nobody could easily run it locally. Same problem, different faces: packaging, not pure performance, decides who gets adopted.

    Adoption is a packaging problem: weights → packaging → runtime → developer UX → adoption.

    Why packaging matters more than you think

    Models are dense technical achievements, but they’re not products on their own. A model is a component. For a developer or product manager to treat it like a component they can actually use, three things need to happen:

    • It must be accessible in the formats and runtimes the ecosystem already uses.
    • It must be easy to evaluate cheaply and quickly.
    • It must compose with toolchains—tokenizers, inference engines, tool calling frameworks—without a day-one deep dive.
    The hidden checklist: why “best model” often loses to “best packaged”.

    If any of those are missing, adoption stalls. People will swap to the slightly worse model that just works out of the box because velocity beats marginal quality improvement every time.

    Two contrasting signals from the field

    Big vendors packaging agents as a product

    When incumbents decide to ship an agent platform, they don’t just release models. They bundle SDKs, observability, security integrations, deployment templates, and partner connectors. That’s not accidental. Enterprises care about operational risk: isolation, audits, rollout control, and predictable infra cost. What looks like a marketing move is often a rational answer to procurement and SRE requirements. The platform sells because it reduces the integration bill—and it can be opinionated about formats and runtimes, which makes it simpler for internal teams to adopt.

    Open-source weights without the lift

    Contrast that with a model release that drops weights in a single format and asks the community to figure out the rest. If common runtimes don’t support the format, if there’s no GGUF conversion, and if chat templates or tool-calling glue are incomplete, developers run straight past it. The outcome is predictable: people pick a slightly older or smaller model that plugs into vLLM, llama.cpp, or whatever their pipeline already uses. The model itself becomes a research artifact rather than a usable building block.

    Why this is a product problem, not just engineering

    Engineers build capabilities. Products remove friction. For open models to succeed, someone has to own the “last mile”: the conversions, the hosted inference endpoint, the reference SDKs, the promotion to popular inference marketplaces. That’s product work—prioritization, docs, SDK releases, and marketing to developer communities. It’s rarely glamorous, and it rarely wins research awards, but it determines whether a model gets embedded into apps or piles up in a downloads folder.

    Three practical moves for teams that want their model adopted

    If you’ve produced a model and want people to actually use it, these are the pragmatic steps that matter more than another benchmark result.

    • Ship the formats people use: provide GGUF, safetensors, ONNX where it makes sense. If you can’t be in every runtime on day one, be in the top three for your target audience.
    • Publish a minimal inference endpoint and a tiny “playground” that runs on cheap infra. Developers will try a hosted demo before spinning up hardware.
    • Bundle a conversion and starter kit: tokenizer, chat template, and a one-click example to hook tool-calling or RAG. Make the first working app under 15 minutes.

    These are small, high-leverage bets. They don’t need perfect engineering—just enough to let people instrument, test, and prototype.
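    For the second bullet above, the bar really is this low. A minimal sketch of a hosted “playground” endpoint, assuming an HF-format checkpoint served with FastAPI and transformers; the model name and route are illustrative, not a reference implementation:

    ```python
    # Sketch of a minimal hosted inference endpoint.
    # Assumptions: HF-format checkpoint, FastAPI + transformers; the model name is a placeholder.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import pipeline

    app = FastAPI()
    generator = pipeline("text-generation", model="your-org/your-model")

    class GenerateRequest(BaseModel):
        prompt: str
        max_new_tokens: int = 128

    @app.post("/generate")
    def generate(req: GenerateRequest):
        out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
        return {"completion": out[0]["generated_text"]}
    ```

    Run it with uvicorn behind the cheapest GPU you can rent. The goal is that a stranger can hit a URL and form an opinion before committing to hardware.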

    How incumbents turn packaging into a moat

    When a large vendor builds an opinionated agent platform, a subtle lock-in happens. Not because the models are proprietary (they might not be), but because the platform owns the integration surface: observability, authentication, billing, and deployment patterns. Teams adopt the platform because it removes work and risk. Over time, the cost of switching climbs, driven not by model accuracy but by migration overhead: the switch that looked free at adoption time no longer is.

    That’s why you’ll often see vendors emphasize partner integrations and enterprise controls early: these are the levers that turn a technical capability into a repeatable operational solution.

    What builders should actually care about

    If you’re building with models or you’re on a team deciding whether to run something in-house, your job is to short-circuit false choices. Don’t treat model X vs model Y as the only axis. Ask:

    • Can I evaluate this with a $20 test bed in an afternoon?
    • Will my existing toolchain accept this format without surgery?
    • What’s the realistic path from prototype to monitored production?

    If the answer to any of those is “no,” the model is less valuable than it looks on paper.
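    The first question is usually easier to answer than people expect. A sketch of what a cheap afternoon test bed can look like, assuming the model sits behind an OpenAI-compatible endpoint (vLLM and most hosted tiers expose one); the URL, prompts, and pass criterion are placeholders you would swap for your own cases:

    ```python
    # Tiny evaluation harness: a handful of domain prompts against an
    # OpenAI-compatible endpoint. Everything below is illustrative.
    import requests

    ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder URL
    CASES = [
        ("What is 17% of 42,000? Answer with a number only.", "7140"),
        ("A SKU sells 300 units at $12.50. What is revenue? Number only.", "3750"),
    ]

    passed = 0
    for prompt, expected in CASES:
        resp = requests.post(ENDPOINT, json={
            "model": "your-model",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        })
        answer = resp.json()["choices"][0]["message"]["content"]
        passed += expected in answer.replace(",", "")

    print(f"{passed}/{len(CASES)} cases passed")
    ```

    Twenty domain-specific cases and a pass rate will tell you more about fit than a leaderboard position.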

    Retail/PPC analogy (short)

    Think of models like ad creatives. A marginally better creative that takes two weeks to QA and publish will lose to a slightly worse creative you can deploy in an hour and iterate on with A/B tests. Velocity—small, safe bets—beats theoretical win rates in fast-moving systems.

    Three bets I’d place as a PM

    If you’re the product owner for a model or for tooling around models, here’s what I’d prioritize in order:

    • Developer experience: make the first 15 minutes delightful. If someone can’t get a demo running quickly, they’ll move on.
    • Inference options: supported runtimes, small hosted tier, and a conversion pipeline.
    • Operational playbooks: simple monitoring, cost estimates, and a migration checklist for customers who want to move away later.

    Closing: productize the last mile

    There’s an asymmetry in AI adoption: the heavy lifting of training and papers gets attention, but the quiet, mundane work of packaging decides who wins. If you’re a founder or an engineering lead, your healthiest obsession should be “how do we make this trivial to try?” Because the teams that answer that question will win more users than the teams that chase the last 1–2% of benchmark performance.

    Make it trivial to try, and people will. Make it hard, and performance won’t matter.

  • Agents Need Contracts, Not More Brains

    Agents Need Contracts, Not More Brains

    Why the next decade of agents will be decided by their contracts, not their brains

    There’s a familiar pattern I keep seeing whenever a hot new agent platform shows up: breathless demos of planning and autonomy, a bunch of infrastructure scaffolding, and then—inevitably—confusion the first time that agent needs to call a real tool in a real workplace.

    Two things landed in the last 48 hours that make this obvious in a useful way. One is chatter about a big vendor shipping an open agent platform. The other is a clear, practical writeup of the ReAct/tool-calling loop where you explicitly model state, tool schemas, and transitions. Together they highlight a simple truth: agents aren’t just models + compute. They are contracts between a thinking thing and the systems it touches.

    What I mean by “contracts”

    By contract I mean the agreed shape of interaction—the inputs, the outputs, the error modes, and who is responsible for recovery. Contracts sit between three actors: the LLM (the reasoner), the tool set (APIs, DBs, UIs), and the business that owns the outcome. A good contract makes the interaction predictable. A bad one lets subtle failures hide until they become disasters.

    Think of it like a marketplace listing. A great item description tells buyers precisely what they’ll get, what’s excluded, and what happens if something’s damaged in shipping. Tools need the same thing when agents use them: clear schemas, explicit side effects, and well-defined failure semantics.
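    To make “clear schemas, explicit side effects, and well-defined failure semantics” concrete, here is a sketch of the kind of contract you would publish alongside a tool before an agent is allowed to call it. The shape and field names are illustrative, not a standard:

    ```python
    # Sketch of a tool contract an agent consumes (illustrative shape, not a standard).
    # The point: inputs, outputs, side effects, and failure modes are all explicit,
    # so neither the agent nor a reviewer has to guess.
    UPDATE_BID_CONTRACT = {
        "name": "update_bid",
        "description": "Set the max CPC bid for a single keyword in one campaign.",
        "inputs": {
            "campaign_id": "string, required",
            "keyword_id": "string, required",
            "max_cpc_micros": "integer, required; 1_000_000 micros = 1 unit of currency",
            "idempotency_key": "string, required; same key means same effect on retry",
        },
        "outputs": {
            "applied": "boolean; True only if the new bid is live",
            "previous_max_cpc_micros": "integer; needed for the compensation path",
        },
        "side_effects": "Changes live ad spend. No other entities are touched.",
        "failure_modes": {
            "BUDGET_EXCEEDED": "Bid rejected; nothing changed; do not retry.",
            "STALE_ENTITY": "Keyword changed since last read; re-read, then retry.",
            "TIMEOUT": "Outcome unknown; retry with the SAME idempotency_key.",
        },
    }
    ```

    Notice how much of that document is about what happens when things go wrong. That is the part most tool descriptions skip, and it is exactly the part the agent needs.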

    Why this matters more than model size right now

    Everyone wants to argue about parameter counts, token limits, or who trained what on which dataset. Those debates matter for capabilities, but not for production reliability. In practice, the majority of outages, hallucinations, and compliance incidents I see happen at the boundary—when an agent takes an action that touches people, money, or private data.

    Here’s the mental model: the LLM is the planner, but the world is deterministic only if you make it so. The agent’s brain can generate a plan, but unless the tool contract guarantees idempotency, transactional boundaries, and clear error codes, the plan will meet chaos. That’s not an ML problem; it’s a systems design problem.

    An everyday example

    Imagine a marketing agent that updates bids in a PPC campaign. The agent decides to raise bids on a promoted SKU because conversion metrics looked good. If the API call is retried without idempotency, your bids could double or worse. If the tool returns vague success messages, the agent may assume the change applied when it didn’t. That’s a measurable revenue leak you’ll notice on Monday morning.

    Three contract-level guarantees you should design for

    When you build agent-enabled systems, prioritize these guarantees before you tune models:

    • Idempotency: Every state-changing call should be safe to retry. If a request can’t be retried, make the contract explicit and force human confirmation.
    • Observability: Tools must emit machine-readable events for every action and every failure. The agent sees events; humans can trace them; alerting works.
    • Authority & scope: Each agent action must be scoped to an account/role and limited in blast radius. Prefer explicit capability tokens over vague “write” permissions.

    Where ReAct-style graphs help

    The recent practical guides on ReAct-style loops show that if you treat the agent’s reasoning loop as explicit state transitions, you get two big wins:

    • You can instrument and replay the loop. When something goes wrong you can reconstruct the exact decisions and tool outputs that led there.
    • You can encode stop conditions and human-handoff points. Instead of a monolithic “do it all” agent, you get a graph that can pause, ask, or escalate based on variables you control.

    That’s operational gold. When a business runs thousands of agent actions per day, being able to replay a single mistaken sequence until you understand the failure is what turns a reactive firefight into a continuous-improvement cycle.
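    Here is a sketch of what “explicit state transitions” means in practice. This is not any particular framework’s API, just the shape: every step is a record you can persist and replay, and the loop has hard stop and handoff conditions. The call_llm and run_tool arguments are placeholders for your model call and your connectors:

    ```python
    # Sketch of a ReAct-style loop with explicit, replayable state.
    from dataclasses import dataclass, field

    @dataclass
    class Step:
        thought: str
        action: str            # a tool name, "answer", or "escalate"
        action_input: dict
        observation: str = ""

    @dataclass
    class AgentState:
        goal: str
        steps: list = field(default_factory=list)
        done: bool = False

    MAX_STEPS = 8  # hard stop condition

    def run(state: AgentState, call_llm, run_tool) -> AgentState:
        while not state.done and len(state.steps) < MAX_STEPS:
            step = call_llm(state)            # proposes the next Step from the trace so far
            if step.action in ("answer", "escalate"):
                state.done = True             # "escalate" is the human-handoff point
            else:
                step.observation = run_tool(step.action, step.action_input)
            state.steps.append(step)          # full trace: persist it, replay it, diff it
        return state
    ```

    Because the whole run is just a list of Step records, “replay the mistaken sequence” stops being forensic work and becomes a unit test.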

    Design patterns that reduce risk

    I use a few patterns repeatedly when I’m driving product decisions for agent features:

    • Shadow mode first: Let the agent propose actions and write them to a queue or audit log instead of taking them. Let humans confirm or run a verification pass that replays the tools in sandboxed mode.
    • Progressive capability rollout: Start with read-only and scheduled writes, then add real-time write capabilities after you’ve observed behavior in production for a while.
    • Explicit compensation paths: Every destructive action needs a defined undo or compensation workflow. Build the undo API before you let agents touch the live system.
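    Shadow mode in particular is nearly free to build. A sketch, assuming the agent already emits structured action proposals; the in-memory queue below is a stand-in for whatever audit log or message bus you actually run:

    ```python
    # Sketch of shadow mode: the agent proposes, nothing executes until a human
    # (or a sandboxed replay harness) approves. The in-memory queue is a stand-in.
    from datetime import datetime, timezone

    shadow_queue = []

    def propose(action: str, params: dict, rationale: str) -> None:
        """Record what the agent would do, instead of doing it."""
        shadow_queue.append({
            "proposed_at": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "params": params,
            "rationale": rationale,
            "status": "pending_review",
        })

    def approve_and_execute(entry: dict, executor) -> None:
        """Only an explicit approval turns a proposal into a live write."""
        entry["status"] = "approved"
        executor(entry["action"], entry["params"])
    ```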

    Why open agent platforms raise the stakes

    Open agent frameworks and vendor platforms both make it easier to stitch together LLMs and tools. That’s great for innovation, but it increases the surface area for misunderstandings. An open platform with lots of connectors makes it easy to accidentally expose a tool without the right contract guarantees.

    Platforms will succeed when they treat connectors as first-class citizens: packaged with schemas, test harnesses, and safety gates. The platform’s job is not just to let you wire a model to an API; it is to help you ship a predictable contract that survives scale.

    Product implications

    For product folks, the practical question is: what do you ship first? My bias: ship the guardrails before the autonomy. Customers will forgive an agent that’s slow or conservative if it doesn’t break things. They will not forgive silent data leakage or thundering financial changes.

    So make autonomy a premium feature, not the default. Build visibility, role-based control, and sandboxing into the product experience. Then sell the autonomy story with a track record: “we ran this in shadow for 30 days and reduced handle time by 24% without any live write errors.” That is believably valuable.

    Short checklist for launch

    • Define the API contract (inputs, outputs, error codes).
    • Implement idempotency and audit events on every write path.
    • Run shadow-mode validation and collect replayable traces.
    • Roll out capabilities progressively with human-in-the-loop gates.
    • Document compensation workflows and test them under load.

    The long game

    In the long run we’ll get better models, and those models will make more credible plans. That’s exciting. But the thing that separates an agent demo from a sustainable product is not how clever the planner is—it’s whether the world it touches behaves in predictable, testable ways.

    If you’re building agent features this year, treat the tool boundary like a core product surface. Ship contracts, not conveniences. Build the undo before you build the action. And if you want a quick win, instrument the loop so you can replay, debug, and iterate without a blame game.

    One retail/personalization analogy

    In retail analytics, a bad data contract is like asking for nightly sales numbers but getting different definitions of “sale” from each store. Decisions computed on that data are brittle. Agents face the same trap: if each connector reports success differently, your agent’s decisions will be brittle—and customers will notice where it hurts their margins.

    What to do tomorrow

    Pick one high-impact agent action in your product and apply the checklist above. Run it in shadow for two weeks, capture traces, and see how often your contract ambiguity shows up. Fix those gaps before you turn the knob to full autonomy.

    That is boring work. It’s also what buys you a future where agents are a feature users trust instead of a liability they tolerate.