Our client is a publicly traded, Fortune 50 general merchandise retailer with more than 1,800 stores spread across every U.S. state and the District of Columbia. It ranks among the 10 largest retailers in the U.S. and employs more than 350,000 people. In 2019, it reported more than $75 billion in revenue. The client also owns 45 brands unique to its stores and operates several subsidiaries that range from a grocery delivery service to a skin care and beauty e-commerce site.
The client had acquired a software company with a proprietary, cloud-based platform to simplify the process of selling and activating connected products such as mobile phones. The platform integrates with the largest cellular network providers and supports ancillary processes like payment programs, insurance, fraud protection, and warranties.
The software company ran its applications on virtual machines in a data center. As part of the acquisition, the client wanted to segregate these servers from the legacy company, moving them to a reliable and scalable platform.
The client’s new division also needed to mature its processes to handle the demands of joining such a large corporation, including stricter security requirements, more redundancies, faster deployments, and the capacity to handle a steady stream of new feature sets.
The product platform had been migrated to the Amazon Web Services (AWS) cloud but did not take full advantage of cloud capabilities, which left the division facing several challenges:
- Site Reliability Engineers (SRE) required a better ability to handle the surging volume of work in a fast-paced, agile environment that supported upwards of 10 development teams, hundreds of AWS instances, and weekly deployments.
- Important but lower-priority items suffered constant delays.
- Expected cost savings were often delayed or never realized.
- Legacy operations, monolithic applications, and outdated real-time system health dashboards were difficult to maintain, requiring many manual processes and manual intervention.
- When issues occurred, manual escalation practices delayed responses and restoration activities.
- Tracking high volumes of infrastructure changes manually was difficult and time-consuming.
- A lack of modular or customizable building blocks led to inefficient environment builds.
To achieve the stable, reliable, and highly available service the client expected from the platform, the new division’s operations needed to mature to a higher standard. More cohesive, heavily automated, efficient processes were required to accommodate the larger pipeline of work and manage the various environments from development to production. The client also needed to ensure the platform could handle peak periods during the holiday shopping season, when transaction volume on the platform increased as much as 40x.
The client engaged Auxis as a key partner to drive automation and to mature SRE, DevOps, security, and overall operations.
Auxis provided DevOps support for the division, serving as the central knowledge base for the platform, engineering tasks, deployments, and production support. In that role, Auxis gained valuable insight into the platform’s struggles and identified automation opportunities that improved overall results by streamlining processes, controlling costs, and delivering faster, more consistent deployments.
Key transformations included:
Increasing the size of the system to handle the exponentially higher load during peak holiday seasons. Auxis worked to streamline the scalability of the platform to meet business needs.
Managing cloud infrastructure using a DevOps mindset. In collaboration with the client, Auxis DevOps experts standardized on a unified Chef/Terraform approach to automate builds. This combination not only allows more agility but also ensures that changes are applied consistently across environments. Auxis also created new Lambda functions that scanned for the frequent infrastructure changes and automatically created or modified performance dashboards.
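As a rough illustration of the dashboard automation described above, the following Python sketch rebuilds a CloudWatch dashboard from the instances currently running. It is not the client's actual function: the dashboard name `platform-health`, the two-column layout, and the choice of a single CPU widget per instance are assumptions made for the example. The `boto3` calls (`describe_instances`, `put_dashboard`) are standard, and the import is deferred so the layout logic can be exercised without AWS access.

```python
import json


def build_dashboard_body(instance_ids, region="us-east-1"):
    """Return a CloudWatch dashboard body (JSON string) with one CPU widget per instance."""
    widgets = []
    for index, instance_id in enumerate(sorted(instance_ids)):
        widgets.append({
            "type": "metric",
            "x": (index % 2) * 12,   # two widgets per 24-column dashboard row
            "y": (index // 2) * 6,
            "width": 12,
            "height": 6,
            "properties": {
                "title": f"CPU {instance_id}",
                "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
                "stat": "Average",
                "period": 300,
                "region": region,
            },
        })
    return json.dumps({"widgets": widgets})


def lambda_handler(event, context):
    """Hypothetical scheduled handler: regenerate the dashboard from running instances."""
    import boto3  # deferred so build_dashboard_body can be tested offline

    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")

    instance_ids = []
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_ids.append(instance["InstanceId"])

    cloudwatch.put_dashboard(
        DashboardName="platform-health",  # assumed name, not from the case study
        DashboardBody=build_dashboard_body(instance_ids, ec2.meta.region_name),
    )
    return {"instances": len(instance_ids)}
```

Because `put_dashboard` replaces the whole dashboard body, rerunning the function on a schedule keeps the dashboard in step with whatever instances exist at that moment.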
Automating CI/CD (Continuous Integration/Continuous Delivery) and monitoring/availability for immediate awareness of potential issues. Adopting a modular, integrated solution that combined CloudWatch metrics, alarms, and automated actions ensured that all changes across environments would be monitored automatically using statistics like CPU usage, network throughput, memory and disk utilization, and unhealthy instances.
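To make the metrics-plus-alarms idea concrete, here is a minimal Python sketch that generates standard alarm definitions for a single instance. The thresholds, alarm names, and the SNS topic used as the alarm action are assumptions for illustration, not values from the engagement. Only CPU and status-check alarms are shown; memory and disk utilization are not published by EC2 by default and would typically come from the CloudWatch agent.

```python
def alarm_definitions(instance_id, sns_topic_arn, cpu_threshold=80.0):
    """Return keyword arguments for CloudWatch put_metric_alarm, one dict per alarm."""
    common = {
        "Namespace": "AWS/EC2",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Period": 300,                      # seconds per evaluation window
        "AlarmActions": [sns_topic_arn],    # where state changes are published
    }
    return [
        {
            **common,
            "AlarmName": f"cpu-high-{instance_id}",
            "MetricName": "CPUUtilization",
            "Statistic": "Average",
            "EvaluationPeriods": 2,
            "Threshold": cpu_threshold,
            "ComparisonOperator": "GreaterThanThreshold",
        },
        {
            **common,
            "AlarmName": f"status-check-failed-{instance_id}",
            "MetricName": "StatusCheckFailed",
            "Statistic": "Maximum",
            "EvaluationPeriods": 1,
            "Threshold": 1.0,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        },
    ]


def apply_alarms(instance_id, sns_topic_arn):
    """Create or update the alarms in CloudWatch (requires AWS credentials)."""
    import boto3  # deferred so alarm_definitions can be tested offline

    cloudwatch = boto3.client("cloudwatch")
    for definition in alarm_definitions(instance_id, sns_topic_arn):
        cloudwatch.put_metric_alarm(**definition)
```

Generating the definitions from one function is one way to keep every new instance on the same monitoring baseline; `put_metric_alarm` updates an existing alarm of the same name, so reapplying is safe.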
Integrating alerts with the PagerDuty incident notification solution, generating immediate notification and responses. The teams also integrated the communications tool, Slack, with GitHub, heightening awareness of potential issues like build failures and triggering faster resolutions.
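The case study does not say how the alerts were wired to PagerDuty (PagerDuty also offers a native CloudWatch integration). As one possible shape, the sketch below assumes alarms are published to an SNS topic and a small Lambda function forwards them to the PagerDuty Events API v2. The environment variable `PAGERDUTY_ROUTING_KEY`, the fixed `critical` severity, and the use of the alarm name as the deduplication key are all assumptions for the example.

```python
import json
import os
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_pagerduty_event(alarm, routing_key):
    """Map a CloudWatch alarm state change to an Events API v2 body, or None to skip."""
    actions = {"ALARM": "trigger", "OK": "resolve"}
    action = actions.get(alarm["NewStateValue"])
    if action is None:  # e.g. INSUFFICIENT_DATA: do not page anyone
        return None
    return {
        "routing_key": routing_key,
        "event_action": action,
        "dedup_key": alarm["AlarmName"],  # lets the OK event resolve the same incident
        "payload": {
            "summary": f"{alarm['AlarmName']}: {alarm['NewStateReason']}"[:1024],
            "source": alarm.get("Region", "aws"),
            "severity": "critical",       # assumed; could be derived per alarm
        },
    }


def send_event(event_body):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event_body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def lambda_handler(event, context):
    """Hypothetical SNS-triggered handler that forwards alarm notifications."""
    routing_key = os.environ["PAGERDUTY_ROUTING_KEY"]  # assumed variable name
    sent = 0
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])
        body = build_pagerduty_event(alarm, routing_key)
        if body is not None:
            send_event(body)
            sent += 1
    return {"sent": sent}
```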
Establishing operational efficiencies with automated content pushes and health checks. Auxis wrote Ansible plays that automated the scripts used to copy databases and swap them into the production environment with minimal effort. Other plays ran health checks and verified that code had been updated properly for each deployment. Additional scripts automatically copied the artifacts linked to specific releases from development to production without manual intervention.
Creating modules to simplify deployment of changes. To establish a routine and reusable foundation for building new infrastructure, Auxis created modules for all major resources, including AWS instances, load balancers, base applications, databases, and monitoring. Every module was created with versioning capabilities, making the rollout of changes as simple as increasing the version of the module.
By using automation and other best practices, Auxis was able to streamline delivery management, system maintenance, and application deployments for the client, creating a more scalable model moving forward. Key benefits include:
100%+ faster deployment times with more successful outcomes
The time it took to create new environments plummeted. Deployments now run more than 100% faster, taking an average of less than 30 minutes to complete instead of several hours. Automation also eliminated the manual errors common with high volumes of work and repetitive processes, delivering more accurate, consistent results.
Quicker realization of cost savings
As just one example, Auxis DevOps experts leveraged serverless computing to automate the weekly cleanup of unused volumes at scale in production and non-production environments. The result: more than $1,200 per month in savings and the elimination of hours of manual work.
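A cleanup of this kind could look roughly like the Python sketch below. It assumes that "unused" means EBS volumes in the `available` (unattached) state, and it adds a minimum-age guard and a dry-run default that are illustrative choices, not details from the engagement. The handler would be invoked on a weekly schedule.

```python
from datetime import datetime, timedelta, timezone


def select_unused_volumes(volumes, now, min_age_days=7):
    """Return IDs of unattached volumes created at least min_age_days ago."""
    cutoff = now - timedelta(days=min_age_days)
    return [
        volume["VolumeId"]
        for volume in volumes
        if volume["State"] == "available"      # not attached to any instance
        and not volume.get("Attachments")
        and volume["CreateTime"] <= cutoff     # guard against brand-new volumes
    ]


def lambda_handler(event, context):
    """Hypothetical weekly handler: list, and optionally delete, unused volumes."""
    import boto3  # deferred so select_unused_volumes can be tested offline

    ec2 = boto3.client("ec2")
    volumes = []
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        volumes.extend(page["Volumes"])

    candidates = select_unused_volumes(volumes, datetime.now(timezone.utc))
    dry_run = event.get("dry_run", True)       # deleting requires an explicit opt-in
    if not dry_run:
        for volume_id in candidates:
            ec2.delete_volume(VolumeId=volume_id)
    return {"candidates": candidates, "deleted": not dry_run}
```

Note that `CreateTime` is the volume's creation time, not the time it was detached, so the age check is only a coarse safety margin.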
Dramatic reduction in expensive manual labor hours
Automating repetitive, time-consuming manual tasks that existed throughout the division was key to maturing operations to a higher standard. As an example, automating health dashboard maintenance alone saved 20+ hours in manual labor in a single month. With new systems constantly spun up into the AWS environment, manually following the steps to gain visibility into the health of each one was a major challenge for the new division. Automation eliminated the need for manual intervention, adding new systems automatically to standardized monitoring protocols and dashboards to achieve real-time insight into their performance.
With these and other routine, time-draining tasks off their plates, technical teams gained the capacity to focus on more complex, higher-value initiatives that drive strategy and growth.
Greater efficiency and less complexity in managing resources, with significantly fewer errors
Creating modules for major resources forced consistency and eliminated differences between environments, leading to a significant reduction in errors. Previously, managing the environment was incredibly complex, with multiple versions of the same function. With no controls or processes for how they were built, each version was slightly different and changes had to be applied separately to each one.
Auxis streamlined the process for creating and updating resources and security roles/assignments. By rewriting security groups and policies so that permissions spanned modules, Auxis reduced the bottlenecks that had made it difficult to roll out and launch new environments.
Faster incident resolutions
Automating alerts led to immediate awareness of potential issues and significantly faster resolutions, minimizing downtime, improving customer service, and maximizing the platform’s reliability.