There is no shortage of AI products that can summarize a document or answer a question about a PDF. Preparing a tax return is a fundamentally different problem. It requires reading dozens of document types simultaneously, resolving conflicts and ambiguities between them, mapping thousands of individual data points into structured worksheet fields, and producing output where every value is traceable to a specific source. This article explains how accounting firms can evaluate AI tax preparation through a structured pilot, using Accrual as the working example, in a way that preserves the trust, transparency, and judgment the profession depends on.
AI adoption in accounting has accelerated dramatically. The 2025 Wolters Kluwer Future Ready Accountant report found that AI adoption in accounting firms jumped from 9% to 41% in a single year. The question for firms is no longer whether AI can help, but how to deploy it effectively.
While AI solutions are already being deployed across the accounting landscape, tax preparation is a particularly strong fit for AI deployment: the work is highly structured, quality standards are well defined, and much of the workflow is consumed by manual data entry.
The challenge for new technology adoption is timing. Tax is a seasonal business, which compresses the window for evaluating new tools. Firms often have only a few months to test technology before the next filing season begins.
At the same time, enterprise pilots frequently fail. MIT research suggests that 95% of enterprise AI pilots never translate into operational impact — not because the technology doesn’t work, but because teams lack clear evaluation criteria and execution structure.
Our team has spent years working with large organizations deploying new technology. Across those deployments, one pattern appears repeatedly: successful pilots are structured, measured, and run quickly. This guide summarizes the practices we’ve seen consistently produce clear results.
The most important step in any evaluation is getting practitioners into the product with realistic scenarios.
AI tax preparation is not just a modeling problem. Workflow design matters just as much: how documents are classified, how the system explains decisions, and how review fits existing firm hierarchies. Practitioners need to experience these workflows directly before any pilot begins.
At Accrual, we provide every prospective firm with access to a synthetic "Chris Wolff" client that contains a representative range of documents: K-1s, 1099s, unstructured expense sheets for Schedule C, and other common document types. Practitioners can work through the entire Accrual workflow, from document upload through return generation and review, without uploading any client data that might require legal review.
A common pattern in failed enterprise AI deployments is treating security review and legal onboarding as post-pilot activities. This creates a gap between successful results and the ability to act on them, which is often long enough for organizational momentum to dissipate.
Start these workstreams in parallel with your evaluation. Key items to resolve upfront include SOC 2 Type II certification, data storage jurisdiction (U.S.-only infrastructure), policies on client data and LLM training (at Accrual, no client data is used for model training under any circumstances), data retention and deletion protocols, and Form 7216 consent handling for offshore team access.
Choose 10–20 existing clients with completed prior-year returns. The specific composition matters: the sample should be representative of the complexity distribution your firm handles in practice.
Accrual clients structured their pilots across defined complexity tiers, which proved essential for setting accurate expectations and interpreting results.
Gather all source documents, organizers, and email communications used in the original preparation. The accuracy comparison is only meaningful if the agent has access to the same information your preparers had.
Select ~5 participants across different staff levels: partner, manager, senior, preparer. This cross-functional composition yields feedback from both the reviewers who will evaluate agent-generated work and the preparers whose workflow will change most significantly.
Several patterns have emerged from successful pilot teams:
Document milestones, dates, and responsibilities before the pilot begins. Without this structure, compressed evaluation windows tend to expand as scheduling conflicts and competing priorities erode momentum.
Establish your communication channel (Teams or Slack) early. In a compressed timeline, the most productive exchanges happen asynchronously, outside of scheduled meetings. Many Accrual clients establish direct channels with Accrual's engineering team for real-time support and rapid feedback during the pilot.
Goal: Confirm document coverage and extraction quality
Practitioners review agent-generated drafts, validate worksheets, and compare against filed returns. Plan for ~2 hours per return.
Goal: Assess results
Office hours with the partner team to review findings and refine workflows.
Goal: Determine usability
Pilot engagement tends to drop off after two to three weeks, which is why extending the process beyond that window rarely yields additional signal.
The most rigorous approach is a true A/B comparison: the agent receives the same client information the preparer used, generates its version of the return, and the two are compared at the worksheet level.
An important methodological note: this comparison is intentionally asymmetric. The agent's draft, before any human intervention, is measured against the preparer's final return after all review cycles. The objective is not autonomous completion; it's producing the most complete, accurate draft possible, along with preparer notes that clearly identify what still requires professional judgment.
Accrual, for example, handles the full accuracy analysis end-to-end, delivering a comprehensive report with no analytical work required from the firm. Our team reviews each return, compares it to the filed version, and produces a spreadsheet summarizing dollar-weighted accuracy across approximately 30 key line items — income, deductions, credits — along with categorized notes on every variance.
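The exact formula behind Accrual's dollar-weighted accuracy metric isn't specified here, but the idea can be sketched in a few lines: weight each line item's error by its filed dollar amount, so a large wage discrepancy dominates a small interest one. The function and field names below are illustrative, not Accrual's actual implementation.

```python
def dollar_weighted_accuracy(agent: dict, filed: dict) -> float:
    """Dollar-weighted accuracy of an agent draft against the filed return.

    Each line item's contribution is weighted by its filed dollar amount,
    so a $40,000 wage variance matters far more than a $40 interest variance.
    Field names are illustrative.
    """
    total_weight = sum(abs(v) for v in filed.values())
    if total_weight == 0:
        return 1.0
    error = sum(abs(agent.get(k, 0.0) - v) for k, v in filed.items())
    return max(0.0, 1.0 - error / total_weight)

# Agent draft vs. the filed return, over a few key line items
filed = {"wages": 120_000, "schedule_c_income": 45_000, "ira_deduction": 6_000}
agent = {"wages": 120_000, "schedule_c_income": 43_500, "ira_deduction": 0}
print(round(dollar_weighted_accuracy(agent, filed), 3))  # → 0.956
```

Note how the single missing IRA deduction dominates the score here, which is exactly why the root-cause analysis discussed below matters.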
Accuracy and time savings can be measured analytically. Workflow fit can only be assessed by practitioners using the product under realistic conditions.
Accuracy metrics should be paired with root-cause analysis. A single missing document can cascade across multiple line items, producing misleading accuracy scores.
Categorizing errors by root cause — missing documents, preparer-specific context, agent limitations, integration issues — provides a far more useful picture than the headline number.
In one Accrual pilot, a return initially showed 14% accuracy. Investigation revealed only a handful of issues: a missing K-1, an undocumented IRA contribution, and a context-specific preparer decision. Once corrected, the underlying system performance was strong.
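The recomputation described above can be made mechanical: tag each dollar variance with a root cause, then total only the variances the agent could actually have avoided. A minimal sketch, with illustrative category names and amounts (not pilot data):

```python
from dataclasses import dataclass

# Only these causes reflect the agent's own performance; missing documents
# and preparer-specific context are outside its control.
AGENT_CAUSES = {"agent_limitation", "integration_issue"}

@dataclass
class Variance:
    line_item: str
    amount: float  # absolute dollar variance vs. the filed return
    cause: str     # root-cause category

def attributable_error(variances: list) -> float:
    """Sum only the variances attributable to the agent itself."""
    return sum(v.amount for v in variances if v.cause in AGENT_CAUSES)

variances = [
    Variance("schedule_e", 30_000, "missing_document"),    # e.g. a missing K-1
    Variance("ira_deduction", 6_000, "missing_document"),  # undocumented contribution
    Variance("schedule_c", 1_200, "preparer_context"),
    Variance("interest_income", 350, "agent_limitation"),
]

raw = sum(v.amount for v in variances)
print(raw, attributable_error(variances))  # → 37550 350
```

The gap between the raw total and the attributable total is the difference between a misleading headline number and the underlying system performance.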
Performance varies by complexity tier, and this is expected; another prospective firm's 27-client pilot with Accrual showed the same pattern.
For simpler returns, the system should achieve high accuracy from the initial draft. For the most complex returns, the primary value is in time savings rather than full autonomy. A return that traditionally requires 120 hours of preparation benefits enormously from an agent-generated starting point, even if 13 items require human attention.
Discrete intervention count maps more directly to workload than accuracy percentage. "I need to address 11 flagged items on this return" is a more actionable assessment than "accuracy is 85%." Track the number of steps required to move from agent draft to finalized return.
Prospective customers compare hours from their billing system for the original preparation against time spent reviewing and finalizing the agent's draft. In a detailed case study from one Accrual pilot:
| Method | Hours |
|---|---|
| Traditional preparation (actual) | 16 |
| SurePrep-assisted preparation | 12 |
| Accrual-assisted preparation | 3 |
On a single return, this represents an 81% reduction versus the traditional process and a 75% reduction versus the incumbent extraction tool. Across this firm's broader deployment, projected savings were a 50% reduction in preparation hours and a 20% reduction in detailed review time.
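The single-return reductions follow directly from the hours in the table; a quick sketch, using the table values as the only inputs:

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction in hours from `before` to `after`."""
    return 100.0 * (before - after) / before

# Hours per return from the case-study table above
hours = {"traditional": 16, "sureprep": 12, "accrual": 3}

print(pct_reduction(hours["traditional"], hours["accrual"]))  # → 81.25
print(pct_reduction(hours["sureprep"], hours["accrual"]))     # → 75.0
```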
At scale, the compounding effect is significant: every 50 complex returns processed through Accrual effectively adds the capacity of one full-time accountant, without increasing headcount.
Accrual collaborates with the firm on an executive readout, scheduled in advance as the formal working endpoint of the pilot.
The firms that make strong decisions at this stage are those that defined success criteria before the pilot began. The readout becomes a presentation of evidence against pre-established thresholds, not a subjective assessment.
Firms that successfully adopt AI don’t start with a full rollout. They start with a disciplined pilot.
Done well, a pilot can produce clear results within weeks and give leadership the confidence to move quickly before the next tax season.
In our next post, we’ll walk through how firms transition from a successful pilot to a full operational rollout.
To discuss what a pilot would look like at your firm, get in touch.