March 2026

How to run a successful AI tax preparation pilot

A practical guide for enterprise accounting leaders

Milo Spirig and Kyla Jarrar

Top line

There is no shortage of AI products that can summarize a document or answer a question about a PDF. Preparing a tax return is a fundamentally different problem. It requires reading dozens of document types simultaneously, resolving conflicts and ambiguities between them, mapping thousands of individual data points into structured worksheet fields, and producing output where every value is traceable to a specific source. This guide explains how to structure, run, and evaluate an AI tax preparation pilot, drawing on Accrual's experience with enterprise accounting firms, in a way that preserves the trust, transparency, and judgment the profession depends on.

Context

AI adoption in accounting has accelerated dramatically. The 2025 Wolters Kluwer Future Ready Accountant report found that AI adoption in accounting firms jumped from 9% to 41% in a single year. The question for firms is no longer whether AI can help, but how to deploy it effectively.

While AI solutions are already being deployed across the accounting landscape, tax preparation offers a unique opportunity for efficient AI deployment. The work is highly structured, quality standards are well-defined, and much of the workflow is burdened by manual inputs.

The challenge for new technology adoption is timing. Tax is a seasonal business, which compresses the window for evaluating new tools. Firms often have only a few months to test technology before the next filing season begins.

At the same time, enterprise pilots frequently fail. MIT research suggests that 95% of enterprise AI pilots never translate into operational impact — not because the technology doesn’t work, but because teams lack clear evaluation criteria and execution structure.

Our team has spent years working with large organizations deploying new technology. Across those deployments, one pattern appears repeatedly: successful pilots are structured, measured, and run quickly. This guide summarizes the practices we’ve seen consistently produce clear results.

Before the pilot: level setting

Hands-on evaluation with real workflows

The most important step in any evaluation is getting practitioners into the product with realistic scenarios.

AI tax preparation is not just a modeling problem. Workflow design matters just as much: how documents are classified, how the system explains decisions, and how review fits existing firm hierarchies. Practitioners need to experience these workflows directly before any pilot begins.

At Accrual, we provide every prospective firm with access to a synthetic "Chris Wolff" client that contains a representative range of documents: K-1s, 1099s, unstructured expense sheets for Schedule C, and other common document types. Practitioners can work through the entire Accrual workflow, from document upload through return generation and review, without uploading any client data that might require legal review.

Key evaluation dimensions

  • Tax engine integration. How will the connection between the AI platform and your tax engine be maintained and what technical resources will it require?
  • Document processing breadth. Tax preparation involves highly variable inputs: government forms, brokerage statements, phone photos, client emails, password-protected PDFs, and combined files spanning hundreds of pages. How will the system handle everything a human preparer would review?
  • Transparency and auditability. Every worksheet field should cite its source document, and every agent decision should include an explanation. CPAs will not adopt systems whose reasoning they cannot trace and verify. (See the sketch after this list for one possible shape of a traceable field.)
  • Pricing structure. Understand how pricing tiers map to return complexity and projected time savings across admin time, return preparation, and review. Avoid complex structures with per-feature charges or upsells, which discourage teams from using the full breadth of the platform.
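
As a concrete illustration of the transparency point, the sketch below shows one possible shape for a traceable worksheet value. The field names and structure are assumptions made for illustration only, not Accrual's actual schema.

```python
# Illustrative shape of a traceable worksheet value (assumed structure, not Accrual's schema).
worksheet_field = {
    "worksheet": "Schedule B, Part I, line 1",
    "value": 1_250.00,
    "source_document": "2025 1099-INT - First National Bank.pdf",
    "source_page": 1,
    "agent_explanation": "Interest income matched to the payer name and TIN on the 1099-INT.",
}

# A reviewer (or an audit script) can then verify that every populated field carries a citation.
assert worksheet_field["source_document"], "every value should trace to a source document"
```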

Parallel IT and legal onboarding

A common pattern in failed enterprise AI deployments is treating security review and legal onboarding as post-pilot activities. This creates a gap between successful results and the ability to act on them, which is often long enough for organizational momentum to dissipate.

Start these workstreams in parallel with your evaluation. Key items to resolve upfront include SOC 2 Type II certification, data storage jurisdiction (U.S.-only infrastructure), policies on client data and LLM training (at Accrual, no client data is used for model training under any circumstances), data retention and deletion protocols, and Form 7216 consent handling for offshore team access.

Pilot design

Selecting returns

Choose 10–20 existing clients with completed prior-year returns. The specific composition matters: the sample should be representative of the complexity distribution your firm handles in practice (see the sketch after the tier list below for one way to think about that mix).

Accrual clients structured their pilots across defined complexity tiers, which proved essential for setting accurate expectations and interpreting results:

  • Simple returns (W-2/1099): Establishes baseline capability where automated tax prep requires few to no interventions to finalize a return.
  • Moderate complexity (multiple K-1s, rental properties, Schedule C): Where meaningful time savings begin to emerge. Automated tax prep shifts toward identifying the issues that need to be addressed in the draft return.
  • High complexity (returns with 50–120+ hours of traditional prep time, hedge fund K-1s, international filings): Where the economic case is strongest. The time savings are substantial because agent preparation time is roughly constant (minutes), while human preparation time scales linearly with complexity.
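
As a rough illustration of what a representative sample can mean in practice, the sketch below tallies a hypothetical client list by tier and allocates pilot slots proportionally. The client data, tier labels, and sample size are assumptions for illustration, not part of Accrual's methodology.

```python
from collections import Counter

# Hypothetical prior-year client list: (client_id, complexity_tier).
# In practice this would come from the firm's practice management system.
clients = [
    ("C001", "simple"), ("C002", "moderate"), ("C003", "complex"),
    ("C004", "simple"), ("C005", "moderate"), ("C006", "moderate"),
]

PILOT_SIZE = 15  # within the recommended 10-20 range

# Tally how the firm's book of business is distributed across tiers,
# then allocate pilot slots proportionally so the sample mirrors it.
tier_counts = Counter(tier for _, tier in clients)
total = sum(tier_counts.values())
allocation = {
    tier: max(1, round(PILOT_SIZE * count / total))
    for tier, count in tier_counts.items()
}
print(allocation)  # proportional pilot slots per tier
```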

Gather all source documents, organizers, and email communications used in the original preparation. The accuracy comparison is only meaningful if the agent has access to the same information your preparers had.

Assembling the pilot team

Select ~5 participants across different staff levels: partner, manager, senior, preparer. This cross-functional composition yields feedback from both the reviewers who will evaluate agent-generated work and the preparers whose workflow will change most significantly.

Several patterns have emerged from successful pilot teams:

  • Strong collaboration between the firm and the technology partner. Treat the pilot as co-evolution: fast, clear turnaround on results takes much of the burden off the firm and improves alignment with operational workflows.
  • Local office champions produce the strongest feedback. They understand firm-specific workflows, know which clients are representative, and can contextualize edge cases that a centralized evaluation team might miss.
  • Partner engagement from the outset is essential. The most skeptical partners often become the strongest advocates, but only after hands-on experience during training. Partners who learn about a new workflow after it's been decided tend to resist it; partners who participate in shaping it tend to champion it.
  • Designate a pilot project manager. Someone needs to own the timeline, coordinate across participants, and maintain pace. We always recommend a designated project leader to shepherd the pilot and provide "steady urgency and focus."

Building the pilot plan

Document milestones, dates, and responsibilities before the pilot begins. Without this structure, compressed evaluation windows tend to expand as scheduling conflicts and competing priorities erode momentum.

Establish your communication channel (Teams or Slack) early. In a compressed timeline, the most productive exchanges happen asynchronously, outside of scheduled meetings. Many Accrual clients establish direct channels with Accrual's engineering team for real-time support and rapid feedback during the pilot.

Execution: a two-week framework

Days 1 – 2

Training and document processing

  • One to three live training sessions with practitioners
  • Upload client documents
  • Generate initial draft returns

Goal: confirm document coverage and extraction quality

Days 3 – 7

Return generation and review

Practitioners review agent-generated drafts, validate worksheets, and compare against filed returns. Plan for ~2 hours per return.

Goal: assess accuracy against filed returns

Days 7 – 14

Feedback and iteration

Office hours with the partner team to review findings and refine workflows.

Goal: determine usability and workflow fit

Note

Pilot engagement tends to drop off after 2–3 weeks, so extending the process beyond 2 weeks rarely produces additional meaningful signal.

Measuring pilot success

Accuracy: The A/B comparison

The most rigorous approach is a true A/B comparison: the agent receives the same client information the preparer used, generates its version of the return, and the two are compared at the worksheet level.

An important methodological note: this comparison is intentionally asymmetric. The agent's draft, before any human intervention, is measured against the preparer's final return after all review cycles. The objective is not autonomous completion; it's producing the most complete, accurate draft possible, along with preparer notes that clearly identify what still requires professional judgment.

Accrual, for example, handles the full accuracy analysis end-to-end, delivering a comprehensive report with no analytical work required from the firm. Our team reviews each return, compares it to the filed version, and produces a spreadsheet summarizing dollar-weighted accuracy across approximately 30 key line items — income, deductions, credits — along with categorized notes on every variance.
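
Accrual's full methodology is more involved, but the basic shape of a dollar-weighted comparison is easy to picture. The sketch below uses a handful of invented line items and a simple weighting scheme; it is an illustration of the idea, not the actual report logic.

```python
# Hypothetical worksheet-level comparison: agent draft vs. the preparer's filed return.
# Values are dollar amounts for a few invented line items.
filed = {"wages": 185_000, "schedule_c_income": 42_500, "itemized_deductions": 31_200}
draft = {"wages": 185_000, "schedule_c_income": 40_100, "itemized_deductions": 31_200}

variances = {item: abs(draft.get(item, 0) - filed[item]) for item in filed}

# Dollar-weighted accuracy: a large-dollar miss counts for more than a small one.
total_filed_dollars = sum(abs(v) for v in filed.values())
accuracy = 1 - sum(variances.values()) / total_filed_dollars

flagged = [item for item, v in variances.items() if v > 0]
print(f"dollar-weighted accuracy: {accuracy:.1%}")  # 99.1% in this toy example
print(f"line items needing review: {flagged}")
```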

Interpreting the results

Note

Accuracy and time savings can be measured analytically. Workflow fit can only be assessed by practitioners using the product under realistic conditions.

Accuracy metrics should be paired with root-cause analysis. A single missing document can cascade across multiple line items, producing misleading accuracy scores.

Categorizing errors by root cause — missing documents, preparer-specific context, agent limitations, integration issues — provides a far more useful picture than the headline number.
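
One simple way to operationalize that is to tag each variance with a root cause before aggregating, so that a missing input is never read as a model error. The categories and dollar figures below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical variance log from one return comparison: (line_item, dollar_variance, root_cause).
variances = [
    ("schedule_e_income", 57_300, "missing_document"),   # e.g. a K-1 that was never uploaded
    ("ira_deduction",      7_000, "preparer_context"),   # known only from a client conversation
    ("schedule_c_expense",    850, "agent_limitation"),
]

by_cause = defaultdict(lambda: {"items": 0, "dollars": 0})
for _, dollars, cause in variances:
    by_cause[cause]["items"] += 1
    by_cause[cause]["dollars"] += dollars

# Report causes by dollar impact, largest first.
for cause, stats in sorted(by_cause.items(), key=lambda kv: -kv[1]["dollars"]):
    print(f"{cause:18s} {stats['items']} item(s)  ${stats['dollars']:,}")
```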

In one Accrual pilot, a return initially showed 14% accuracy. Investigation revealed only a handful of issues: a missing K-1, an undocumented IRA contribution, and a context-specific preparer decision. Once corrected, the underlying system performance was strong.

Performance varies by complexity tier, and this is expected. Across another prospective firm’s 27-client pilot with Accrual:

  • Simple returns: 6 average comparison issues
  • Moderate returns: 9 average comparison issues
  • Complex returns: 13 average comparison issues

For simpler returns, the system should achieve high accuracy from the initial draft. For the most complex returns, the primary value is in time savings rather than full autonomy. A return that traditionally requires 120 hours of preparation benefits enormously from an agent-generated starting point, even if 13 items require human attention.

Discrete intervention count maps more directly to workload than accuracy percentage. "I need to address 11 flagged items on this return" is a more actionable assessment than "accuracy is 85%." Track the number of steps required to move from agent draft to finalized return.

Time savings

Prospective customers compare hours from their billing system for the original preparation against time spent reviewing and finalizing the agent-generated draft. In a detailed case study from one Accrual pilot:

Method                              Hours
Traditional preparation (actual)       16
SurePrep-assisted preparation          12
Accrual-assisted preparation            3

This represents a 66% reduction versus the traditional process and 52% versus the incumbent extraction tool — on a single return. Across this firm’s broader deployment, projected savings were 50% reduction in preparation hours and 20% reduction in detailed review time.

At scale, the compounding effect is significant: every 50 complex returns processed through Accrual effectively adds the capacity of one full-time accountant, without increasing headcount.
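
That capacity claim is straightforward to sanity-check with back-of-the-envelope arithmetic. The figures below (hours saved per complex return, available hours per accountant per year) are assumptions chosen for illustration, not Accrual benchmarks.

```python
# Rough capacity math under assumed figures.
hours_saved_per_complex_return = 40  # assumed: e.g. a 50-hour preparation reduced to ~10 hours
returns_per_batch = 50
hours_per_fte_year = 2_000           # assumed available hours per accountant per year

hours_saved = hours_saved_per_complex_return * returns_per_batch
fte_equivalent = hours_saved / hours_per_fte_year
print(f"{hours_saved} hours saved ≈ {fte_equivalent:.1f} FTE of added capacity")
```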

Exiting the pilot: the decision

The executive readout

Accrual collaborates with the firm on an executive readout, scheduled in advance as the formal working endpoint of the pilot. It covers:

  • Accuracy results by complexity tier, with root cause analysis of key variances
  • Time savings benchmarked against the firm's billing data
  • Practitioner feedback on workflow fit, product experience, and specific enhancement requests
  • Product improvements delivered during the pilot
  • Rollout recommendation with proposed scope, timeline, and support plan

The firms that make strong decisions at this stage are those that defined success criteria before the pilot began. The readout becomes a presentation of evidence against pre-established thresholds, not a subjective assessment.

Bridging to rollout

Firms that successfully adopt AI don’t start with a full rollout. They start with a disciplined pilot:

  • Evaluate with real client workflows
  • Measure against your own work product
  • Build internal confidence through evidence

Done well, a pilot can produce clear results within weeks and give leadership the confidence to move quickly before the next tax season.

In our next post, we’ll walk through how firms transition from a successful pilot to a full operational rollout.

To discuss what a pilot would look like at your firm, get in touch.