Improving High Availability (HA) through FMEA measurement

Improve Trimble Service Level Agreement performance via Sustained Investment into PaaS/SaaS High Availability Design Controls

Summary

There are three fundamental inputs to SaaS/PaaS High Availability (HA). To support competitive Service Level Agreements (SLA), weaknesses in each should be enumerated and remediated via sustained investment.

The Physical Resiliency of the application to be available for all designed use cases.
- This is a continuum of design controls that contribute to uninterrupted SaaS/PaaS uptime.
The Performance and Scalability of the application’s components.
- This is another continuum of design controls that prevent application latency by remediating potential bottlenecks and providing for scalable and efficient peak-demand capacity.
The application’s Mean-Time-To-Recover (MTTR) should an interruption to uptime occur.
- This is a separate continuum of design controls that contribute to detecting, managing and recovering from failures that result in downtime or excessive performance latency.
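The effect of MTTR on uptime can be illustrated with the standard steady-state relation, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The following is a minimal sketch only: the function name and the sample figures are assumptions chosen for illustration, not Trimble measurements.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to recover (MTTR), both expressed in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure every 720 hours (about 30 days).
slow_recovery = availability(mtbf_hours=720, mttr_hours=4)    # 4-hour recovery
fast_recovery = availability(mtbf_hours=720, mttr_hours=0.5)  # 30-minute recovery

print(f"4 h MTTR:   {slow_recovery:.4%}")
print(f"0.5 h MTTR: {fast_recovery:.4%}")
```

With the failure rate held constant, shortening recovery time alone raises availability, which is why MTTR is treated here as its own continuum.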

Each of these fundamental continuums is made up of attributes (see Appendix A: Descriptions of HA Continuum Attributes) whose quality and quantity can be scaled up or down depending on the level of investment. In other words, they are continuums unto themselves. Although these attributes are all characteristics of good software engineering practice, it’s not uncommon for software teams to trade off best-in-class implementations against budget and schedule constraints (e.g. the technical debt accumulated over cycles of Minimum Viable Product delivery). The same legacy exists at Trimble Cloud Core Platform.

As PaaS adoption increases to become the critical path for Connect and Scale, service downtime caused by this technical debt can become acute and severely impact legacy contracted service levels and the productivity of Trimble customers. This scenario is common to PaaS and SaaS products generally.

Initially, the scenario leads to reactive investments. For example, the well-socialized TiD outage at the end of Q3 2022, caused by degradation of one AWS service in a single US region, resulted in the definition and funding of a Disaster Recovery (DR) enhancement plan to enable NextGen PaaS service failover to secondary regions. By the end of 2023, Core Platform will deliver region failover capability to 80 percent of the NextGen service portfolio. The increase in operational expenses to maintain this infrastructure is now reflected in ongoing budgets.

That single large 2023 investment in DR capability removes risk in only part of the Physical Resiliency continuum. Other downtime-causing failure modes continue to manifest as unpredictable downtime for Trimble users. For example, in Q2 2023, the TiD dependency on the external service HCaptcha failed due to a downstream outage of that vendor’s CDN service. This resulted in an effective TiD outage localized to German Transportation customers.

Although reactive improvements do remediate realized risk, they do not address the “known-unknown” risk associated with the technical debt backlog. In August 2023, Cloud Platform will begin converting that risk into known risk by instituting Failure Mode and Effects Analysis (FMEA) as a tool to objectively measure our progress and remaining debt against the HA continuums of Physical Resiliency, Performance and Scalability, and Mean-Time-To-Recover.

FMEA will produce a scored priority for each backlog item and provide the inputs necessary to estimate the downtime contribution of each residual failure mode. In the near future, a predictable risk burndown and epic backlog of HA capabilities can be produced to depict the opportunity cost of regular investment.
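As an illustration of how a scored priority can be derived, conventional FMEA rates each failure mode for severity, occurrence and detection (commonly on a 1 to 10 scale) and multiplies the three into a Risk Priority Number (RPN). The sketch below assumes that conventional scheme; the scoring method Cloud Platform will actually adopt is not specified here, and the failure mode names and ratings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible impact) to 10 (catastrophic impact)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always detected) to 10 (almost never detected)

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 1-10 scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical backlog entries with illustrative ratings.
backlog = [
    FailureMode("single-region CSP service degradation", 9, 3, 4),
    FailureMode("third-party CDN outage", 7, 4, 6),
    FailureMode("database node failure without failover", 8, 2, 3),
]

for mode in prioritize(backlog):
    print(f"{mode.rpn:4d}  {mode.name}")
```

Sorting by RPN gives a first-pass backlog order; teams often also review any item with a very high severity rating regardless of its product score.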

Such an inventory will also enable real traceability from Cloud Core Platform’s remediated HA risks and residual failure modes to the components of internal and customer-facing SLAs. As a result of continued investment in HA controls, SLAs could be attached to product roadmaps.
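To show how residual failure modes might be traced to an SLA, the sketch below converts an availability target into a downtime budget and subtracts the estimated downtime of each residual failure mode. The function names, the 30-day period and all figures are assumptions for illustration only.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime budget, in minutes, implied by an availability target over a period."""
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be in the range (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def residual_budget_minutes(sla_percent: float,
                            expected_downtime_by_mode: dict,
                            period_days: int = 30) -> float:
    """Budget left after subtracting estimated downtime per residual failure mode.
    A negative result means the estimates exceed what the SLA allows."""
    budget = allowed_downtime_minutes(sla_percent, period_days)
    return budget - sum(expected_downtime_by_mode.values())


# Hypothetical estimates of expected downtime (minutes per 30 days).
estimates = {
    "third-party CDN outage": 12.0,
    "single-region CSP service degradation": 20.0,
}

print(f"Budget at 99.9%: {allowed_downtime_minutes(99.9):.1f} min")
print(f"Remaining:       {residual_budget_minutes(99.9, estimates):.1f} min")
```

A failure mode whose estimated downtime consumes most of the budget is a natural candidate for the next HA investment.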

Appendix A: Descriptions of HA Continuum Attributes

The Continuum of Physical Resiliency

Redundancy: This includes deploying multiple instances of the application and its topology components to ensure that if one component fails, another can take over. Redundancy is achieved at several distinct levels within our architecture:
1. Cloud Service Provider (CSP) region redundancy - Rollover capability for the entire PaaS topology from one CSP region to another.
  1. This provides faster MTTR in the event of catastrophic loss or degradation of CSP services as well as catastrophic loss of application performance stemming from any unmitigated failures in Trimble PaaS topology components.
  2. Trimble depends less on AWS and Azure recovery times when one of their regions suffers an outage or degradation.
  3. CSP region redundancy also has a sub-continuum of MTTR capability between “Active-Passive” and “Active-Active” failover with trade-offs between cost and speed of recovery.
2. Platform application topology redundancy - Rollover capability for PaaS topology components.
  1. This provides business continuity in the event of degradation or failure of individual compute, database, storage or network nodes within a Trimble PaaS application.
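The benefit of redundancy can be approximated with the standard formula for parallel components: if any one of N independent instances is enough to serve traffic, combined availability is 1 - (1 - a)^N. This sketch assumes independent failures and instantaneous failover, which real Active-Passive designs do not achieve, so it should be read as an upper bound. The function name and figures are hypothetical.

```python
def redundant_availability(instance_availability: float, instances: int) -> float:
    """Availability of N independent redundant instances where any one suffices.
    Assumes failures are uncorrelated and failover takes no time."""
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return 1.0 - (1.0 - instance_availability) ** instances


# Hypothetical example: a region that is 99% available, alone and with one peer.
single_region = redundant_availability(0.99, 1)
two_regions = redundant_availability(0.99, 2)

print(f"One region:  {single_region:.4%}")
print(f"Two regions: {two_regions:.4%}")
```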
Reliability of Trimble PaaS design.
1. Microservices architecture
  1. Breaking the application into smaller independent services that can be updated, scaled and maintained separately reduces downtime, because issues in one service need not affect others.
2. Network design
  1. Vertical and horizontal scaling
  2. Isolation
  3. Interoperability
3. Security Controls:
  1. These include encryption, firewalls, intrusion detection systems, and secure coding practices. Good security practices are essential to prevent breaches that could lead to downtime.
Reliability of External Connections.
1. Reliability of connections between Trimble PaaS applications supporting end-to-end integrator or end-user workstreams.
2. Reliability of connections and APIs between Trimble CSP instances (AWS to and from Azure).
3. Reliability of partner integrations.
Reliability of 3rd Party Components.
1. Reliability of licensed services, tools and their own external connections.
  1. CDN dependencies.
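External connections and third-party components matter because hard dependencies combine in series: a service is only up when everything it requires is up, so availabilities multiply. The sketch below assumes independent failures and no fallback path; the component names and figures are hypothetical and do not describe any real vendor.

```python
from math import prod


def serial_availability(dependency_availabilities) -> float:
    """Availability of a chain in which every dependency is required.
    Assumes independent failures and no fallback or degraded mode."""
    values = list(dependency_availabilities)
    if not values:
        raise ValueError("at least one availability is required")
    if any(not 0.0 <= value <= 1.0 for value in values):
        raise ValueError("each availability must be between 0 and 1")
    return prod(values)


# Hypothetical chain: own service -> licensed third-party service -> that vendor's CDN.
chain = {
    "own service": 0.9995,
    "licensed third-party service": 0.999,
    "vendor CDN": 0.999,
}

end_to_end = serial_availability(chain.values())
print(f"End-to-end availability: {end_to_end:.4%}")
```

The end-to-end figure is always at or below the weakest link, which is why a failure two steps downstream can still surface as an outage for end users.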

The Continuum of Performance and Scalability

Load Balancing effectiveness: Load balancers distribute network traffic across multiple servers to ensure no single server becomes a bottleneck, which helps to maintain optimal performance and availability.
Auto-scaling effectiveness: Auto-scaling allows the system to automatically adjust resources based on demand. This means the system can scale up during peak usage times and scale down during off-peak times, ensuring consistent performance and efficient resource usage.
Database effectiveness: Tuning of data workflows via efficient query structure and database structure such as indexing, partitioning, and caching.
Regular and effective performance and scalability testing.
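The auto-scaling behaviour described above can be sketched with a proportional rule of the kind used by the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas x current metric / target metric), clamped to configured bounds. This is a simplified illustration; the function name, parameters and default bounds are assumptions, and a production autoscaler adds tolerances and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling: grow or shrink the replica count so that the
    per-replica metric (e.g. CPU percent) moves toward the target value."""
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_metric <= 0 or current_metric < 0:
        raise ValueError("target_metric must be positive and current_metric non-negative")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds so the system never scales to zero or without limit.
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical example: 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```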

The Continuum of enabling Mean-Time-To-Recovery

Coverage of Robust Infrastructure as Code (IAAC) to automate infrastructure replacement when required.
Backup and Recovery: Regular backups help ensure that data can be restored in the event of data loss or corruption. Additionally, having a robust disaster recovery plan helps to minimize downtime and data loss during critical incidents.
Monitoring and Alerting: Continuous monitoring of system health and performance, backed by L1-L3 technical support, helps detect and address issues before they affect availability. Alerting mechanisms notify the appropriate team members when potential problems are detected.
Replication Effectiveness: Sharing information to ensure consistent and current synchronization between redundant data stores.
Containerization and Orchestration: Tuning of Docker and Kubernetes allows for easier scaling, deployment, and management of applications, contributing to high availability and robust performance.
Quality and availability of vendor technical support.
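As a small illustration of the alerting attribute, a common way to limit false alarms is to notify only after several consecutive failed health probes. The sketch below shows that rule in isolation; the function name and the threshold of three are assumptions, not a description of Trimble's monitoring configuration.

```python
def should_alert(probe_results, threshold: int = 3) -> bool:
    """Return True when the most recent `threshold` probes all failed.
    `probe_results` is an ordered sequence of booleans (True = healthy)."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    recent = list(probe_results)[-threshold:]
    # Require a full window of results, all of them failures.
    return len(recent) == threshold and not any(recent)


# Hypothetical probe history, oldest first.
history = [True, True, False, False, False]
print(should_alert(history))
```

A lower threshold shortens detection time, and therefore MTTR, at the cost of more alerts triggered by transient failures.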