Cloud Infrastructure Debugging with AI Assistant
One of my responsibilities at GoodAI is deploying backend applications for the AI People game, which involves setting up and maintaining cloud infrastructure on Azure. Over the past few months, we encountered several infrastructure-related issues, and using a large language model (LLM) helped me identify root causes, resolve issues more efficiently, and improve infrastructure stability.
My Debugging Approach
The approach is straightforward:

- Describe the Issue and Brainstorm Causes - I start by describing the problem to the LLM, brainstorming possible root causes, and validating my assumptions.
- (Optional) Provide Logs or Metrics - When applicable, I provide system logs or application metrics. LLMs are useful for summarizing and identifying patterns in noisy or verbose data.
- Ask for Resolution Steps - After identifying the probable cause, I ask the model to recommend specific configuration changes, troubleshooting steps, or best practices to resolve the issue.
Real-World Examples
Container Crashes
We have deployed the backend applications for AI People using Docker containers in Azure Container Apps. A few months ago, we started receiving alerts about intermittent container crashes. The system logs included various infrastructure events and container lifecycle messages, but they didn’t provide a clear explanation for the crashes. It wasn’t obvious which events were relevant, and manually reviewing the logs didn’t help me identify the root cause.
After I provided the logs to the LLM, it analyzed the sequence of events and identified that the issue was caused by health checks with overly aggressive thresholds: the applications occasionally failed to respond quickly enough, triggering restarts even though they were functioning correctly. The logs and alerts misleadingly suggested that the applications were crashing. Based on the LLM's suggestions, I resolved the issue by relaxing the health check thresholds in our Terraform configuration.
Optimizing Container Resource Allocation
Another case involved optimizing resource allocation for our containers. I provided CPU and memory usage metrics from Azure to the LLM and asked it to recommend updates to our Terraform configuration. The suggestions helped us achieve a more balanced and cost-effective setup without impacting performance.
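As a sketch of the kind of pre-processing this involves (the metric values and the 1.0 vCPU allocation below are illustrative, not our real numbers), usage samples exported from the monitoring system can be summarized into percentiles before handing them to the LLM, which makes over- or under-provisioning easy to spot:

```python
import statistics

def summarize(samples, allocated):
    """Summarize a series of usage samples against the allocated capacity."""
    cuts = statistics.quantiles(samples, n=20)  # percentiles in 5% steps
    p50, p95 = cuts[9], cuts[18]
    peak = max(samples)
    return {
        "p50": p50,
        "p95": p95,
        "peak": peak,
        "p95_utilization_pct": 100 * p95 / allocated,
        "peak_utilization_pct": 100 * peak / allocated,
    }

# Illustrative numbers only: CPU usage in cores, sampled over time,
# for a container allocated 1.0 vCPU.
cpu_samples = [0.08, 0.10, 0.12, 0.11, 0.09, 0.35, 0.10, 0.13, 0.12, 0.10,
               0.11, 0.09, 0.10, 0.40, 0.12, 0.10, 0.11, 0.13, 0.09, 0.10]
report = summarize(cpu_samples, allocated=1.0)
print(report)  # low p95 utilization suggests the allocation can be reduced
```

A summary like this is far easier for both a human and an LLM to reason about than thousands of raw data points.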
Connectivity Issues
One of the trickier issues involved occasional HTTP request failures to backend applications deployed in Azure Container Apps behind an Azure Front Door load balancer. Although the failures were rare (about 1% of requests), they had a noticeable impact on gameplay.
At first, it wasn’t clear which part of the system was responsible. Was it Azure Front Door, Container Apps, or DNS? The lack of clear indicators made debugging particularly difficult.
Here’s how I approached it:
- I brainstormed possible causes with the LLM.
- I then used the LLM to generate a Python script to ping our servers and log detailed connection data.
- After running the script, I analyzed the results to identify the failure stage (during TCP connection establishment), frequency (1%), and the affected URLs (only some production URLs were affected).
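A minimal sketch of such a connection-probing script (the hostname in the usage comment is hypothetical; the original script is not shown in this post):

```python
import socket
import time

def probe(host, port, timeout=5.0):
    """Attempt one TCP connection; return (success, elapsed_seconds, error)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)

def run_probes(host, port, attempts=200):
    """Probe repeatedly and return the failed attempts for later analysis."""
    failures = []
    for i in range(attempts):
        ok, elapsed, err = probe(host, port)
        if not ok:
            failures.append({"attempt": i, "elapsed": elapsed, "error": err})
    rate = 100 * len(failures) / attempts
    print(f"{host}:{port} -> {len(failures)}/{attempts} failures ({rate:.1f}%)")
    return failures

# Example usage (hypothetical hostname):
# run_probes("example-prod.azurefd.net", 443)
```

Because each failure records the elapsed time and the exact error, the output distinguishes timeouts during TCP connection establishment from DNS or application-level errors, which is what narrowed the problem down.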
After several rounds of investigation, I confirmed that the issue was not related to our Azure setup and eventually traced the root cause to a VPN split tunneling configuration. The problem was tricky because the tunneling rules were defined only for certain staging URLs, yet they ended up affecting some production URLs as well.
LLM Evaluation
I tested multiple LLMs during this process, including GPT-4o, Claude Sonnet (3.5 and 3.7), and DeepSeek V3. Overall, I found Claude Sonnet to be the most effective for infrastructure- and code-related tasks. Its responses were consistently clear, accurate, and actionable.
Final Thoughts
AI assistants have become essential tools for software engineers, including for tasks like cloud infrastructure debugging. They provide a broad knowledge base and serve as collaborative partners. By integrating LLMs into my workflow, I’ve been able to resolve complex issues more quickly and improve the reliability and performance of our systems.