From Pilot to Production: The 90-Day Playbook

The AI pilot was a success. The proof-of-concept achieved impressive results in a controlled environment, the team was enthusiastic and the boardroom gave the green light for scaling. And then nothing happened. Six months later, the pilot is still running on the same data scientist's laptop, with the same manual data entry and the same three users. Sound familiar? You are not alone. According to Gartner research, only 53% of AI projects make it from prototype to production. In our experience, the share is even lower at Dutch enterprises.

This article offers a practical, step-by-step playbook to bring your AI pilot to production within 90 days. Not a theoretical framework, but a proven approach based on our experience with dozens of AI transformations at Dutch organisations.

Why AI pilots stall

Before we address the 90-day playbook, it is essential to understand why pilots get stuck. In our experience, there are five structural causes that recur consistently, regardless of the sector or size of the organisation.

1. The infrastructure gap. A pilot typically runs in an isolated environment: a Jupyter notebook, a sandbox cloud account or a data scientist's laptop. Production requires enterprise-grade infrastructure: scalability, availability, security and integration with existing systems. The leap from a Python script to a production API with SLAs is far larger than many organisations expect.

2. The governance vacuum. During a pilot, governance is deliberately kept minimal to maintain speed. But production AI requires clear answers to questions that were not addressed during the pilot: Who is responsible if the model makes incorrect decisions? How do we ensure data quality at scale? How do we comply with the EU AI Act? Without a governance framework, the rollout stalls at legal and compliance departments that have legitimate objections.

3. Skills gaps in the team. The team that built the pilot, often a combination of data scientists and a single ML engineer, rarely possesses all the competencies needed for production. MLOps expertise, platform engineering, security knowledge and change management skills are frequently absent. Organisations systematically underestimate the breadth of expertise that production AI requires.

4. No clear owner. A pilot often has an enthusiastic sponsor and a small team. When scaling, ownership becomes diffuse. IT claims responsibility for infrastructure, the business unit for the use case, data engineering for the data pipelines, and legal for compliance. Without a clear owner with end-to-end responsibility, the project falls through the cracks.

5. Underestimation of data work. In a pilot, datasets are often manually assembled and curated. Production requires automated data pipelines, data quality checks, feature stores and real-time data integration. This data work typically constitutes 60-70% of the total effort in a production implementation, but is frequently dismissed as a detail in the planning.

Day 1–30: Foundation — laying the groundwork

The first thirty days of the playbook focus entirely on laying a solid foundation. The temptation is to immediately start rebuilding the model in a production environment, but that is a recipe for failure. Foundation first, execution second.

Week 1-2: Production readiness assessment. Begin with an honest evaluation of the current state. Document the complete architecture of the pilot, including all dependencies, data sources, model versions and manual steps. Map the gap between the current state and production requirements across five dimensions: infrastructure, data, model, governance and team. Establish a RACI matrix that records for each dimension who is responsible, accountable, consulted and informed.
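
One way to keep the RACI matrix honest is to store it as data and check it automatically. The sketch below is illustrative only: the names RaciEntry and validate_raci and the example role labels are our own assumptions, not part of any standard tool. It flags every dimension that lacks an entry, a responsible party or an accountable owner.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The five assessment dimensions named in the playbook.
DIMENSIONS = ("infrastructure", "data", "model", "governance", "team")


@dataclass
class RaciEntry:
    """RACI assignment for one dimension (hypothetical structure)."""
    responsible: list[str]
    accountable: str
    consulted: list[str] = field(default_factory=list)
    informed: list[str] = field(default_factory=list)


def validate_raci(matrix: dict[str, RaciEntry]) -> list[str]:
    """Return a list of problems; an empty list means the matrix is complete."""
    problems = []
    for dim in DIMENSIONS:
        entry = matrix.get(dim)
        if entry is None:
            problems.append(f"{dim}: no RACI entry")
            continue
        if not entry.accountable:
            problems.append(f"{dim}: nobody accountable")
        if not entry.responsible:
            problems.append(f"{dim}: nobody responsible")
    return problems


# Example: only one dimension has been filled in so far.
draft = {"infrastructure": RaciEntry(responsible=["platform team"], accountable="CTO")}
for problem in validate_raci(draft):
    print(problem)
```

A check like this can run as part of the project plan review, so that a dimension without an accountable owner is noticed in week 2 rather than in week 10.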

Week 2-3: Infrastructure and platform. Define the target architecture for production. In the Dutch enterprise context, this typically means a hybrid cloud setup with Azure or AWS as the primary platform, combined with on-premise components for sensitive data. Set up a CI/CD pipeline specifically designed for ML workloads. Make a deliberate choice regarding containerisation strategy — Docker and Kubernetes are the de facto standard, but consider managed services if your team has limited platform expertise. Ensure the infrastructure meets your organisation's security and compliance requirements, including data residency within the EU.

Week 3-4: Data architecture and governance framework. Design the production data pipeline. Identify all data sources, define data quality checks and set up monitoring for data drift. Simultaneously, draft a first version of the AI governance framework. This need not be complete at this stage, but must at minimum include: a risk assessment methodology aligned with the EU AI Act, data governance guidelines, a model validation protocol and an escalation procedure for incidents. Record this framework in documentation that is accessible to all stakeholders.
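
To make "define data quality checks" concrete, here is a minimal sketch of a batch-level check. It assumes records arrive as plain dictionaries; the function name check_batch, the field names and the valid ranges are hypothetical examples, and a real pipeline would more likely use a dedicated validation library.

```python
def check_batch(rows, required, ranges):
    """Return human-readable violations for one batch of records.

    rows     -- list of dicts, one per record
    required -- field names that must be present and not None
    ranges   -- mapping of field name to (low, high) inclusive bounds
    """
    violations = []
    for i, row in enumerate(rows):
        for name in required:
            if row.get(name) is None:
                violations.append(f"row {i}: missing {name}")
        for name, (low, high) in ranges.items():
            value = row.get(name)
            # Missing values are already reported above, so skip them here.
            if value is not None and not (low <= value <= high):
                violations.append(f"row {i}: {name}={value} outside [{low}, {high}]")
    return violations


batch = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 212, "income": 51000},
]
print(check_batch(batch, required=["age", "income"], ranges={"age": (0, 120)}))
```

Whether a violation blocks the pipeline or only raises an alert is a governance decision; the escalation procedure in the framework is the natural place to record it.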

Deliverables after 30 days: a documented target architecture, a working CI/CD pipeline, an initial governance framework, and a detailed project plan for days 31-90 with clear ownership per workstream.

Day 31–60: Integration — weaving the model into the organisation

With the foundation in place, the focus shifts to integration: connecting the AI model with the existing systems, processes and people of your organisation. This is the phase with the most complexity, as you encounter legacy systems, organisational resistance and technical debt.

Week 5-6: Model refactoring and MLOps. Rewrite the pilot code to production quality. This means more than cleaning up code; it requires a fundamental restructuring: modularisation of the pipeline, implementation of experiment tracking (MLflow or equivalent), version control for models and datasets, and automated testing. Set up a feature store if your use case justifies it, because it prevents the situation where multiple teams calculate the same features in different ways.
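
Version control for datasets can start very simply. The sketch below computes a content fingerprint that can be stored next to each trained model, so you can always tell which data a model version was trained on. The function name dataset_fingerprint is our own, and the approach assumes the dataset fits in memory as JSON-serialisable records; dedicated data versioning tools cover larger cases.

```python
import hashlib
import json


def dataset_fingerprint(rows: list[dict]) -> str:
    """Return a short, stable hash of the dataset contents.

    Keys are sorted so that column order does not change the result.
    Row order does matter: reordered rows give a different fingerprint.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = [{"customer": 1, "churned": False}, {"customer": 2, "churned": True}]
v2 = [{"customer": 1, "churned": False}, {"customer": 2, "churned": False}]

print(dataset_fingerprint(v1) == dataset_fingerprint(v1))  # same data, same hash
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))  # changed label, new hash
```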

Week 6-7: API development and system integration. Build the API layer that exposes the model to consuming applications. Design this API with production requirements in mind: rate limiting, authentication, versioning, error handling and monitoring. Integrate with the existing system landscape through the appropriate integration paths — in practice, this often means collaboration with enterprise architects who manage the current API gateway, ESB or event-driven architecture. Test the integration thoroughly with production-like data and volumes.
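
Rate limiting is usually enforced in the API gateway rather than in the model service itself, but it helps to understand the mechanism. The following is a minimal token-bucket sketch under that caveat; the class name TokenBucket and its parameters are illustrative assumptions, not the API of any particular gateway product.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per API client: 5 requests per second, bursts of up to 10.
bucket = TokenBucket(rate_per_sec=5, capacity=10)
accepted = sum(bucket.allow() for _ in range(25))
print(f"{accepted} of 25 immediate requests accepted")
```

A throttled request would typically receive an HTTP 429 response, which consuming applications should treat as a signal to back off and retry.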

Week 7-8: Monitoring, observability and feedback loops. Implement comprehensive monitoring at four levels: infrastructure (compute, memory, latency), application (API availability, throughput, error rates), model (prediction drift, feature drift, accuracy degradation) and business (KPI impact, user adoption, error reports). Set up alerts that reach the right team when anomalies occur. Build feedback loops through which end users can report incorrect predictions, and ensure this feedback flows back to the data science team for model improvement.
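
For the model level, one commonly used drift measure is the Population Stability Index (PSI), which compares the distribution of a feature or prediction score in production against a baseline. The sketch below is a plain-Python illustration; the function name psi, the equal-width binning and the alert thresholds mentioned afterwards are assumptions on our part, and other drift metrics are equally valid.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # avoid zero width for constant baselines

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range fall into the first or last bin.
            idx = max(0, min(bins - 1, int((v - lo) / width)))
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]        # scores seen during validation
shifted = [0.5 + i / 200 for i in range(100)]   # scores concentrated in the upper half

print(round(psi(baseline, baseline), 3))
print(psi(baseline, shifted) > 0.25)
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions rather than guarantees; calibrate the alert threshold on your own data.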

Week 8: User acceptance testing (UAT). Conduct a structured UAT with a representative group of end users. This is not only a technical test, but also a validation of the user experience and process integration. Document all findings, categorise them by impact and urgency, and resolve critical issues before proceeding to the next phase.

Deliverables after 60 days: a production-ready model in an MLOps pipeline, working system integrations, comprehensive monitoring and alerting, and a successfully completed UAT with documented results.

Day 61–90: Scale — responsible scaling

The final thirty days focus on controlled scaling to the full production environment. This is also the phase in which you organise the handover from the project team to the operational organisation.

Week 9-10: Phased rollout. Roll out the system in phases using a canary deployment or blue-green deployment strategy. Start with a limited group of users (10-15%) and monitor the impact intensively. Compare production performance with pilot results and business case expectations. Scale stepwise to 50% and then 100%, validating at each step that performance remains stable. Account for seasonal effects and peak loads — many Dutch organisations experience quarterly cycles that influence usage patterns.
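
A phased rollout needs a stable way to decide which users see the new model. The sketch below shows one common approach, deterministic hash-based bucketing; the function name in_rollout and the salt value are hypothetical. Because each user always lands in the same bucket, raising the percentage from 10 to 50 to 100 only ever adds users and never removes anyone who already had access.

```python
import hashlib


def in_rollout(user_id: str, percent: int, salt: str = "model-v2") -> bool:
    """Return True if this user belongs to the first `percent` of buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket 0..99
    return bucket < percent


users = [f"user-{i}" for i in range(10_000)]
for percent in (10, 50, 100):
    share = sum(in_rollout(u, percent) for u in users) / len(users)
    print(f"target {percent}%: actual share {share:.1%}")
```

Changing the salt reshuffles the buckets, which is useful when a later rollout should not always start with the same group of users.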

Week 10-11: Operational model and team handover. Define the operational model for the production environment. Who is responsible for daily monitoring? Who handles incidents? What is the process for model updates? How are feature requests prioritised? Document this in an operations runbook and train the operations team. The handover from project to operational team is a critical moment that is often underestimated — allocate sufficient time for it.

Week 11-12: Governance validation and compliance check. Conduct a formal governance review of the production system. Validate that all EU AI Act requirements have been addressed, that data privacy measures are effective and that the audit trail is complete. Have this review performed by a party that is independent of the project team: your own legal department, an external auditor or a specialised AI governance consultant. Document the results and any residual risks that have been accepted by the appropriate governance level.

Week 12: Retrospective and next steps. Close the 90-day programme with a comprehensive retrospective. Evaluate what went well, what could have been better and what lessons you take forward to future AI projects. Document the entire programme as a reusable playbook for your organisation — the experience you have built is a valuable asset. Plan the next iteration of the model and identify additional use cases that can benefit from the infrastructure and governance framework you have established.

Deliverables after 90 days: a fully operational AI system in production, an operations runbook, a completed governance review, and an organisation-specific playbook for future AI implementations.

Technical architecture considerations

A robust production architecture for AI systems has a number of essential components that are often overlooked during the pilot phase.

Separation of concerns. Maintain a strict separation between the training pipeline, the serving pipeline and the monitoring pipeline. These three components have fundamentally different requirements in terms of compute, latency and availability. A training pipeline runs in batch mode and is cost-optimised; a serving pipeline is latency-critical and requires high availability; a monitoring pipeline is near-real-time and writes to analytical data stores.

Model serving patterns. Make a deliberate choice between synchronous (real-time) serving and batch serving based on use case requirements. Real-time serving via REST or gRPC is suitable for interactive applications, but brings higher infrastructure costs and complexity. Batch serving via scheduled jobs is simpler and cheaper, but limits the timeliness of predictions. In practice, many enterprise use cases are better served by an asynchronous, near-real-time pattern that processes events via a message queue.
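
The near-real-time pattern can be illustrated with an in-process queue standing in for a real message broker. Everything here is a simplified assumption: the predict function is a placeholder for the model call, the event fields are invented, and in production the queue would be a managed service rather than Python's queue module.

```python
import queue
import threading


def predict(event: dict) -> float:
    # Placeholder standing in for the real model call (assumption).
    return 1.0 if event["amount"] > 1000 else 0.0


def worker(events: queue.Queue, results: list) -> None:
    """Consume events one by one until the shutdown sentinel arrives."""
    while True:
        event = events.get()
        if event is None:          # sentinel: producer is done
            break
        results.append((event["id"], predict(event)))


def score_events(incoming: list[dict]) -> list[tuple]:
    """Push events through a bounded queue and collect the scores."""
    events: queue.Queue = queue.Queue(maxsize=1000)   # bound gives backpressure
    results: list[tuple] = []
    consumer = threading.Thread(target=worker, args=(events, results))
    consumer.start()
    for event in incoming:
        events.put(event)          # blocks when the queue is full
    events.put(None)
    consumer.join()
    return results


print(score_events([{"id": 0, "amount": 50}, {"id": 1, "amount": 5000}]))
```

The producer never waits for a prediction, which decouples the consuming application from model latency; the price is that results arrive slightly later and must be delivered through a separate channel.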

Fallback and graceful degradation. Design the system with a clear fallback strategy for situations where the model is unavailable or produces unreliable results. This can be a rule-based system, a previous model version or a manual process. The user experience must remain smooth, even when the model is temporarily non-functional.
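
A fallback strategy can be expressed as a thin wrapper around the model call. The sketch below assumes the model returns a label together with a confidence score and that a rule-based function is available as the fallback; the names predict_with_fallback and min_confidence, and the 0.6 default, are illustrative choices rather than recommendations.

```python
def predict_with_fallback(features, model, fallback, min_confidence=0.6):
    """Return (decision, source) so that downstream systems can see
    whether the model or the fallback produced the answer."""
    try:
        label, confidence = model(features)
    except Exception:
        # Model unavailable or failing: degrade to the fallback path.
        return fallback(features), "fallback:error"
    if confidence < min_confidence:
        # Model answered but is not sure enough to be trusted.
        return fallback(features), "fallback:low_confidence"
    return label, "model"


def rule_based(features):
    # Hypothetical rule: anything the model cannot decide goes to a human.
    return "manual_review"


def unavailable_model(features):
    raise TimeoutError("model endpoint did not respond")


print(predict_with_fallback({"amount": 120}, unavailable_model, rule_based))
```

Returning the source alongside the decision also feeds the monitoring layer: a rising share of fallback responses is an early signal that the model or its infrastructure needs attention.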

The human factor: team structure for production AI

Technology is only half the story. The team structure and organisational embedding largely determine whether your AI implementation is sustainably successful.

The core team. A production AI team for an enterprise use case requires at minimum the following roles: a product owner who bridges business and technology, one or more ML engineers who build and maintain the model, a data engineer who manages the data pipelines, a platform engineer who maintains the infrastructure, and an AI governance specialist who ensures compliance and risk management. In smaller organisations, some roles may be combined, but do not underestimate the breadth of expertise required.

Embedded vs. centralised. Organise your AI capability not as an isolated department, but as a capability embedded in business units. A central platform team delivers shared infrastructure and tooling, while embedded teams operate close to the business and build domain knowledge. This hybrid model combines economies of scale with business relevance.

Preserving knowledge. Prevent critical knowledge from being locked in the minds of individual team members. Invest in documentation, pair programming, knowledge-sharing sessions and cross-training. The bus factor — how many people can leave before the project stalls — must be at least two for every critical component.

Conclusion: speed through structure

The 90-day playbook may seem paradoxical at first glance: by investing time in foundation, governance and team structure, you accelerate time-to-production rather than slowing it down. The pilots that stall do not stall because of technical problems but because of a lack of structure. The organisations that successfully scale AI are not those with the best models but those with the best processes.

Ninety days is ambitious but achievable, provided you are willing to make the right choices and have the right expertise at the table. The difference between a successful and a failed production implementation is rarely in the model, but almost always in the approach around it.

Does your organisation have AI pilots that are ready for production, but lack the structure and expertise to make the leap? Our AI Steward works as an embedded transformation leader within your team to realise the transition from pilot to production.
