
Halcyon Days: Finding Faster Solutions for Availability and Performance

How Q2 is using AI in hosting to transform the way we identify and fix availability and performance problems

When it comes to AI, everyone wants to know how to use it internally to get faster and more efficient — to do 10 times the volume and achieve 10 times the impact with current staff. This is nowhere more true than in digital banking, where the complexity of operations and expectations that those operations run perfectly threaten to overwhelm the technicians who need to identify and respond to failures and threats.

There’s a lot of talk about using AI to achieve these goals, but much of it is theoretical. Here’s our story about how we’re actually doing it.

Q2 supports the online financial experience of 23.5 million end users across approximately 500 banks and credit unions in the United States, facilitating more than $3 trillion in transactions annually. This means that 1 in 10 people who bank online in the United States log in to a Q2-powered experience.

Despite our massive reach, we maintain unmatched availability and performance. Embedded within our Global Technology Operations (GTO) team is the Integrated Operations Center (IOC), which combines Tier 1 responders, Tier 3 SMEs and millions of dollars in monitoring and alerting tools. The IOC conducts 1.7 billion health checks and synthetic tests annually — achieving nearly four nines (99.99%) uptime across the platform.
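
To put the uptime figure in perspective, here is a small illustrative calculation of how much downtime a given availability percentage allows in a year. It is plain arithmetic, not a description of how the IOC measures uptime, and the function name is ours. Because the article says "nearly" four nines, the 99.99% line is a reference point rather than a reported result.

```python
# Illustrative arithmetic only: convert an availability percentage into
# the downtime it permits over one (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the minutes of downtime allowed per year at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At exactly four nines, the budget works out to about 52.6 minutes a year, which is why every minute of detection and triage time matters.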

Here’s how we do it. When the monitoring tools detect an issue, they alert a technician, who acknowledges the alert, validates that it’s not a false positive and begins to work the appropriate playbook. That entails understanding the context of the issue, gathering and reviewing the meaningful logs, mapping the issue to the proper response playbook and beginning triage and repair. In the 3 minutes it takes to get that process under way, an issue can already have disrupted over 25,000 logins and nearly 3,000 transactions worth more than $25 million.

The challenge is that, as our solutions and the customer journey through the applications have become more complex, there are many more use cases and edge cases that can fail. Additionally, there are more customers to monitor and their usage has exploded. The upshot is that the surface area of what we need to monitor and respond to has grown exponentially, and the volume of noise coming in is growing faster than humans can evaluate it.

Given the scale, we’ve had to decrease the amount of noise coming in and filter events by severity before alerting the IOC and engaging our technicians. This leaves more than 800,000 momentary blips and single test failures muted. The filtering approach has allowed us to focus on the impact of events without overwhelming the IOC. Our industry-leading uptime is a validation of that effectiveness.

Unfortunately, the filtering also removes the ability to identify anomalies where the individual failure may not be meaningful, but where a pattern over time could point to an area for improvement. If we want to raise our uptime, we have to evaluate more of the noise and signals, but no matter how much we increase our staff, it will never be enough. As a result, we’re missing issues because we don’t have the human capacity to chase down every single anomaly.
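
To make the idea of "a pattern over time" concrete, here is a minimal sketch of one simple way to find it: count how often the same check fails within a sliding time window, even when each failure alone would be muted. This is an illustration under our own assumptions, not Halcyon's actual method. The `Blip` record, the `recurring_checks` function, the one-hour window and the threshold of three are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Blip:
    """One muted, low-severity event (hypothetical shape)."""
    check_id: str     # which health check or synthetic test failed
    timestamp: float  # seconds since some reference point


def recurring_checks(blips, window_seconds=3600.0, min_count=3):
    """Return {check_id: peak_count} for checks that failed at least
    `min_count` times within any window of `window_seconds`."""
    by_check = defaultdict(list)
    for blip in blips:
        by_check[blip.check_id].append(blip.timestamp)

    flagged = {}
    for check_id, times in by_check.items():
        times.sort()
        start = 0
        peak = 0
        # Two-pointer sliding window over the sorted timestamps.
        for end, t in enumerate(times):
            while t - times[start] > window_seconds:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= min_count:
            flagged[check_id] = peak
    return flagged


if __name__ == "__main__":
    # Invented sample data for illustration only.
    sample = [
        Blip("login-synthetic", 0),
        Blip("login-synthetic", 900),
        Blip("login-synthetic", 1800),
        Blip("transfer-health", 100),
        Blip("transfer-health", 90_000),
    ]
    print(recurring_checks(sample))  # {'login-synthetic': 3}
```

In this invented sample, no single failure would page anyone, but three failures of the same check inside an hour stand out. Two failures a day apart do not.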

Enter Halcyon

This is where AI solutions go from the theoretical to the practical. Using automation, robotics and large language models (LLMs), our Observability Team partnered with the IOC (think of an advanced network operations center, or NOC) to develop Halcyon. Halcyon is an in-house solution that can sort through millions of noisy signals, determine whether there are patterns or something more latent in the noise, and surface the problem so it can be addressed.

Halcyon uses an open source LLM, installed in-house and trained on all of our software, data, alerts, runbooks and customers. It knows even more than a human technician because it knows source code, so it can identify a problem, look up the code where it failed, devise recommended solutions and drive those solutions. With its speed and power, Halcyon can investigate every single anomaly — each one of those 800,000 blips we discussed earlier.

If the problem warrants human intervention, Halcyon surfaces it to a technician and explains why it’s a problem, what’s happening and what should be done about it. When Halcyon recommends a course of action, it includes the train of thought that led to the recommendation and the specific logs that validate it. It can do this analysis for several components concurrently, dramatically reducing the time a technician would otherwise spend walking through the flow for each component one at a time.
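
As a rough picture of what such a handoff could contain, the sketch below models the pieces named above (what is happening, why it matters, the recommended action, the reasoning and the supporting logs) as a single record. The `Escalation` class and its field names are hypothetical, and the sample values are invented. Halcyon's real output format is not described in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Escalation:
    """Hypothetical shape of a finding handed to a technician."""
    component: str
    what_is_happening: str
    why_it_matters: str
    recommended_action: str
    reasoning_steps: list[str] = field(default_factory=list)
    supporting_logs: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render the finding as plain text a technician can scan quickly."""
        lines = [
            f"Component: {self.component}",
            f"What is happening: {self.what_is_happening}",
            f"Why it matters: {self.why_it_matters}",
            f"Recommended action: {self.recommended_action}",
            "Reasoning:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.reasoning_steps, 1)]
        lines.append("Supporting logs:")
        lines += [f"  - {entry}" for entry in self.supporting_logs]
        return "\n".join(lines)


if __name__ == "__main__":
    # Invented example values, for illustration only.
    finding = Escalation(
        component="login service",
        what_is_happening="Intermittent synthetic login failures",
        why_it_matters="Failures recur on the same check within an hour",
        recommended_action="Review the most recent configuration change",
        reasoning_steps=["Three failures in one hour", "All on the same check"],
        supporting_logs=["(log excerpt placeholder)"],
    )
    print(finding.summary())
```

Keeping the reasoning and the logs in the same record as the recommendation lets the technician verify the conclusion without repeating the investigation.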

In fact, Halcyon can eliminate the need for human intervention altogether, including in interactions with customers. If it identifies a problem at the customer’s end, Halcyon can put together an email with evidence of the problem and proposed solutions, and send it directly to the customer.

There’s a closed feedback loop on Halcyon’s recommendations, allowing the model to improve over time. It can identify software defects and create cases for development, or identify issues and create cases for the IOC. And we’re now enabling Halcyon to self-heal the infrastructure, meaning it can initiate action on its own recommendations.

Halcyon will help us analyze all of the data more than 50% faster. That is a meaningful improvement: every minute it saves means nearly 10,000 users accessing their money and moving a total of nearly $10 million.
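
The per-minute figures can be roughly sanity-checked against the 3-minute numbers quoted earlier (over 25,000 logins and nearly 3,000 transactions worth more than $25 million). The sketch below simply divides those numbers by three. They are stated as "over" and "more than", so the results are rough estimates, not exact rates, and the constant and function names are ours.

```python
# Figures quoted in the article for a 3-minute window; the logins and
# dollar figures are stated as lower bounds ("over", "more than").
WINDOW_MINUTES = 3
LOGINS_IN_WINDOW = 25_000
TRANSACTIONS_IN_WINDOW = 3_000
DOLLARS_IN_WINDOW = 25_000_000


def per_minute(total: float, minutes: float = WINDOW_MINUTES) -> float:
    """Average a total over the window to get a per-minute rate."""
    return total / minutes


if __name__ == "__main__":
    print(f"Logins per minute: about {per_minute(LOGINS_IN_WINDOW):,.0f}")
    print(f"Transactions per minute: about {per_minute(TRANSACTIONS_IN_WINDOW):,.0f}")
    print(f"Dollars per minute: about ${per_minute(DOLLARS_IN_WINDOW):,.0f}")
```

Those rough estimates are in the same range as the "nearly 10,000 users" and "nearly $10 million" per minute cited above.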

Although Halcyon will allow us to investigate another quarter million events, we are targeting an overall reduction of 25% in alerts escalated to humans in Q1/2025, and 50% by the end of Q2/2025. It’s important to note that we’re not just using our innovations to increase efficiencies; we’re also improving customer service.

More important, Halcyon will serve as a platform for future innovation — allowing us to extend its impact further into intelligent automation.

In short, Halcyon is doing all the things we wish we could do if we just had a hundred more people — for less than $25,000 annually. That’s the power of AI.