We work with many large enterprise clients in the financial and capital markets sectors. Their information systems are broadly managed by two teams: the “application” team and the “infra” team. The application teams manage the various business applications; there is often a sub-group for each major application. The infra team manages the physical servers, network components, network links, and so on. Even the data in the production databases falls within the infra team’s scope of work. Security is usually a separate team, which reports to the CISO (Chief Information Security Officer), not the CTO.
Therefore, for any actual business application to do its thing, part of the technology stack (and the people, and the headaches) sits with the application team, and the other part sits with the infra team. And we have seen these two teams operate in non-cooperative silos in terrible ways. It is astonishing that almost no CTO seems to break this separation down.
I’ll narrate some stories here. Note that these are not half-dead organisations with pathetic processes or budgets; they are leaders in their fields and handle large percentages of their respective sectors’ transaction loads.
#1: The Zabbix story
One of our clients was facing severe instability and performance problems with their application servers; they were rebooting some of their servers more than once a day. We were called in by the application team to try to help.
When we asked them how they were monitoring system health parameters, they vaguely murmured something about “Zabbix”. This turns out to be a very reputed system health monitoring product. The application team, which was fighting to keep their application alive, had no clear idea of what this “Zabbix” was; they did not even know how to see the Zabbix dashboard. When we asked to see it, a young engineer from the infra team had to come into the conference room with his laptop and show us the dashboard. We asked various questions about which health parameters were being monitored, and he tried his best to explain. The application was running in a Java VM, and we asked this infra engineer whether the monitoring agents to pull data out of Java VMs had been installed. We realised that no one in the infra team had done this.
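For context, this is not exotic work either. Here is a minimal sketch of what JVM-level monitoring typically involves with Zabbix, assuming the standard Zabbix Java gateway; we never saw the client’s actual configuration, and the hostnames, ports and flags below are illustrative only (a production setup would also enable JMX authentication and SSL).

```
# 1. Expose JMX from the application's Java VM (illustrative flags only):
java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=12345 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -jar application.jar

# 2. Point the Zabbix server at a running Zabbix Java gateway
#    (zabbix_server.conf):
JavaGateway=192.0.2.10
JavaGatewayPort=10052
StartJavaPollers=5

# 3. In the Zabbix frontend, add a JMX interface to the host and attach
#    JMX items: heap usage, GC counts, thread counts, and so on.
```

With something like this in place, heap exhaustion or runaway garbage collection shows up on the same dashboard as CPU and disk usage, instead of remaining invisible to everyone.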
About a month after we started working with the application team, we suddenly discovered one day that the Zabbix dashboard was inaccessible. We inquired, and discovered that the Zabbix maintenance team, who worked out of an office in another city about 1,100 km away, had started an upgrade project for Zabbix, and had shut down all Zabbix instances in the organisation. How long would the upgrade take? A month to six weeks, we were told.
We saw glaring problems here:
- No one in the application team had demanded greater collaboration from the infra team. This included the application owner, who was a senior manager with 20+ years of experience.
- No one from the infra team had felt any urge to work with the application team or ask how they could help.
- The infra team shut down the one and only monitoring system for more than a month because they were “upgrading”, without informing any of the application teams. The assumption, apparently, was that health monitoring is of no relevance to application teams.
- No one in the top management team hauled up these two teams, shook them by the collars and shouted at them to “Grow up!”
#2: The random disconnects story
One of our project teams was deploying a large application at a customer site to start system integration testing. All the infrastructure components had been set up, and two sets of users were beginning testing: one set from our team and another from the client.
Within a day, both teams began reporting that front-end screens would throw errors when connecting to web services, totally at random. Sometimes the client’s QA team would report these errors, sometimes our own team. The pattern was erratic: one run through a set of screens would go through without errors, a second run through the same screens would trigger the error, and a third run would trigger errors at a different point.
The client was escalating these reports as high as they could, alleging that our software was so poor and untested that even basic stability for simple operations was impossible. Our team had run a copy of the full stack independently on our AWS server cluster and had not faced any such problem. Every request to the client’s infra team got the usual answer: “We have checked and re-checked everything, the problem must be with the application code.”
Our team was in no position to do systematic debugging, because the problems were being reported on the client’s stack, running in their DC, where no vendor team had access to so much as a shell prompt. So our project manager brought in our DevOps team, who started monitoring the hits on our application server with the minimal instrumentation available to us. After some painstaking correlation, it became clear that when the front end was reporting errors, the requests were not even reaching our application server code: they were getting cut off somewhere in between.
We escalated this to the client’s infra team, who denied that there could be any problem with the infrastructure or network. Meanwhile, CSat (customer satisfaction) continued to fall.
Then, after a few days, the infra team quietly told us that if we saw any further problems, we should report them to them. We had been reporting problems to them all this while, so this seemed redundant. In the next few days, the problem incidents dropped to literally zero.
Our team used their access to friendly members of our client’s team to extract the story off the record. The client’s infra team had set up two web servers as reverse proxies to act as the end-points for our web services. One of the reverse proxies had been malfunctioning all along; it would fail frequently. The infra team did not know why, and had not even investigated carefully until the problem reports had been hitting them for a few weeks. They finally just shut down the malfunctioning one, and the remaining one handled the full load of testing. The problem had been with the client’s infra setup all along. The client’s application team, plus our own project team, were set up as fall guys for as long as the fiction could be sustained.
#3: The disk space story
This story has been repeated several times at various organisations.
Our application would be given a server by the client’s infra team. The server’s storage space would not be a single large partition; there would be many small fragments on many partitions. For instance, / would be one partition, /var a second, /data a third, /var/log a fourth, and so on. Almost all the data and log areas were too small, and our early warnings about insufficient data space were ignored. There were no automated tools for log rotation or archiving. Since the client’s infra team had full freedom to set things up the way they liked, all inputs from our team were brushed aside.
Our application started failing because Apache logs would fill /var/log. This would happen every week or two, and our attempts to move data around or delete data using automated scripts were proving insufficient. Every time our application failed, the SLA between our team and the client would be triggered, and since the entire application would stop working (Apache itself would choke), each such incident would be a Severity 1 incident.
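For perspective, containing this did not need anything exotic: a standard logrotate policy for the Apache logs, plus adequately sized partitions, would have removed most of the pain. The stanza below is only a sketch; the paths assume a Debian-style Apache layout, the reload command depends on the distribution, and the client’s actual layout and retention requirements may well have differed.

```
# /etc/logrotate.d/apache2 -- illustrative log rotation policy.
# Rotates daily, keeps two weeks of compressed history, and reloads
# Apache so it reopens its log files after each rotation.
/var/log/apache2/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        systemctl reload apache2 > /dev/null 2>&1 || true
    endscript
}
```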
It took us several weeks to persuade the client’s management to put pressure on their own infra team to give us larger partitions. And during this period, each Severity 1 incident was contested by us for SLA compliance, leading to sharply increased project overheads.
#4: The load balancer story
Another application at another client had a redundant pair of servers to handle the load of web service calls coming in. There were two load balancer appliances upstream of the application servers, distributing the incoming requests to one or other app server. But the application team could clearly see that one app server was getting exactly 10% of the load. On drilling down, we realised that the two app servers were not running on identical hardware — one was much lower-end than the other.
Just to explain things: the load balancers were entirely in the ambit of the infra team. The application server hardware and OS, ditto. The Java VM and application code on the app servers were in the scope of the application team.
We asked who had decided to deploy an asymmetric pair of app servers. The infra and application teams pointed at each other and launched into story-telling. We moved on.
We asked the infra team how they had configured their load balancers to distribute the requests in this asymmetric ratio. They said something about distributing them this way because the app servers were unequal. We asked them how they had arrived at the precise 9:1 ratio; we got no answer. We asked them whether the load balancers had any agent running on the app servers to report back CPU and RAM load, so that the load balancers could decide their request distribution in real time based on actual load conditions. We got hand-waving answers, broad generalities about what sort of capabilities modern load balancers have, and so on. We pinned them down to specifics: “Are you using those features? If yes, can you show us what agent you’re running on each app server?” We got the classic corporate response: “We will have to get back to you on this one.” This was the sort of answer we were getting from the technical head of the infra team, who is reputed to be a very competent and very hands-on engineer with 20+ years of experience.
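To make the question concrete: the difference between a hard-coded split and load-aware distribution is visible in a few lines of configuration. The client used hardware appliances whose configuration we never saw, so the HAProxy-style sketch below is purely illustrative; the addresses, ports and the agent assumed on port 9999 are our own placeholders, not their setup.

```
# Illustrative haproxy.cfg fragments (not the client's actual appliance config)

# Option A: static weights -- a fixed 9:1 split, blind to actual load
backend app_static
    balance roundrobin
    server app1 10.0.0.11:8080 weight 9 check
    server app2 10.0.0.12:8080 weight 1 check

# Option B: agent-driven weights -- each app server runs a small agent
# (assumed here on port 9999) that replies with its current capacity,
# e.g. "75%", and the load balancer adjusts the effective weight from
# that reply instead of relying on a hard-coded ratio
backend app_dynamic
    balance roundrobin
    server app1 10.0.0.11:8080 weight 100 check agent-check agent-port 9999 agent-inter 5s
    server app2 10.0.0.12:8080 weight 100 check agent-check agent-port 9999 agent-inter 5s
```

Either approach can be legitimate; the point is that whoever runs the load balancer should be able to say which one is in effect, and why.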
We saw the following:
- The application was under-performing so severely that the Managing Director of the organisation was personally taking status updates, yet the infra team was not being held accountable even for answering some commonsense questions.
- No one owned up to the highly peculiar decision of deploying an unequal pair of servers to distribute the load of one application at one end-point. This of course made load distribution a much hairier thing to configure correctly.
- The infra team had no clue how their own load balancers were configured and why they were distributing the requests in a very specific 9:1 ratio.
#5: The disk upgrade story
This story is from 1999, just to show how things don’t change with time or scale.
We had built a small Internet-facing application for a very large financial institution; it was their first mission-critical application on Linux. The application had great strategic importance and was being accessed by a community of about 700-800 institutional users in real time, but it did not require heavy hardware, and was therefore set up on a small server.
Over the next few months, data accumulated on that server, and its small local disk began to fill up. The server was using a small local hard disk of 20 GB or 40 GB; I don’t remember the figure now. We asked the client for a disk upgrade. We kept asking them, and reminding them, for almost three months. We saw the familiar story: the infra team, responsible for all hardware, was living on a different planet, and the application team had no way to make them respond.
The server began malfunctioning occasionally, data loss began to get logged, and the application team started tightening their archival data purging calendars to free up disk space. Finally, after three months, a replacement disk was made available. To put things into perspective, the cost of one of these disks was small compared to the monthly salary of any of the senior managers in the organisation; this was not an expensive procurement decision. Nor was it a specialised item: it was an ordinary, off-the-shelf disk. One could literally take a cab to the city’s electronic hardware market, buy the disk over the counter, and be back in an hour.
We discovered once again the astonishing separation between the application and infra silos, and how indifference by the infra team hurts overall service delivery and is ignored by the organisation’s management.
#6: The server provisioning story
Recently, we were building a large business application for another very large client. Once the size and scale of the system was clear to our team, they submitted a detailed infrastructure sizing document, specifying how many servers we would need and for what purpose. This information was discussed thoroughly with our client’s team, and everyone was on board. Or so we thought.
Several months elapsed, and it was time to set up the production servers to deploy the application. When we asked the client to give us access to the new server cluster, they said that their infra team “had some questions”. Over the last 25 years, we had learned not to ask “What were the infra team doing these last six months?” We just nodded and met the infra team.
We discovered that some of the servers had been given to us with 250 GB of disk space on their SAN when we had asked for 2 TB volumes. Some of the servers were showing 32 GB of RAM when we had asked for 128 GB. We asked for the correct sizing, and the infra team came back and asked us to explain “why we needed those sizes”.
We had clarified all sizing issues in our original document. And as any solution architect knows, sizing is an art and a science, so the question of “why” gets answered only through heuristics.
The back-and-forth which ensued between us and their infra team delayed our software deployment by about two months.
The sordid picture
What we see over and over, in otherwise large and successful organisations:
- There is astonishing ignorance in the application teams we have worked with about their need to get a grip on the infrastructure their applications run on. They do not need root passwords for every server and firewall, but they certainly need the logs.
- There is astonishing indifference in the infra teams about the trauma suffered by application teams. They are neither involved nor curious, let alone forthcoming or cooperative.
- There is no awareness at the CTO level about how this is hurting overall service delivery and the business revenues. I have not seen any CTO step in and fix these systemic problems.
Multi-functional teams are mandatory
FIRST: the infra teams must be split into many units:
- one for core common infra
- one for the infra of each application
One unit will be in charge of the core shared infra: the hardware, Ethernet switches, Internet gateways, and other components used by all applications. This core team will also operate the overall observability frameworks, which will collect extensive data from all components. And then there must be small application-specific infra teams, one for each application.
The application-specific infra team must be empowered to reboot their own servers, upgrade their application’s server OS, administer their application’s databases, install and remove their own logging and monitoring tools, etc. And they must have full read-only access to all company-wide observability dashboards.
The core infra team must serve the application-specific infra teams. They must treat the application-specific teams as clients, and their performance evaluation must depend on the CSat ratings the application owners give them.
SECOND: Each application must be owned by a product owner, and the application management team (those who mess with the source code) and the infra team (those who mess with the servers and networks) for a specific application must report to this product owner.
The performance evaluation of both teams must be tied to application availability and stability.
The application-specific infra teams:
- must have full access to all logs from all infra components, including detailed logs of load balancers, firewalls, slow-query logs from production databases, and all other logs needed for diagnosing application health
- must dictate what monitoring agents are to be installed on production infrastructure, and what data the application teams must receive
- must have access to the overall system-wide infra monitoring dashboards, for instance to see if the overall data centre Internet links are reporting packet loss. If this is happening, it hurts all applications, so all application-specific teams must monitor these health parameters like hawks.
- must get time-bound responses to all their queries submitted to the core infra team.
- must monitor the health parameters of the tech stacks of their own applications, and not depend on the core infra team to do any such monitoring. If this requires automated alerts, they must instruct the core infra team to configure such alerts to be sent to the application team’s own cellphones.
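To make the alerting point concrete: in a Prometheus-based observability stack (one possible tool choice among many, not something any client above was necessarily running; the same idea can be expressed as a Zabbix trigger), an application-specific infra team could own a rule like the sketch below. The team label, mount point and threshold are illustrative assumptions.

```
# Illustrative Prometheus alerting rule owned by an application-specific
# infra team; metric names are the standard node_exporter ones.
groups:
  - name: app-xyz-infra-alerts
    rules:
      - alert: DataPartitionFillingUp
        expr: |
          node_filesystem_avail_bytes{mountpoint="/data"}
            / node_filesystem_size_bytes{mountpoint="/data"} < 0.15
        for: 30m
        labels:
          severity: warning
          team: app-xyz-infra
        annotations:
          summary: "/data is over 85% full on {{ $labels.instance }}"
          description: "Less than 15% free space left; purge or extend before the application starts failing."
```

The essential point is ownership: the alert routes to the application-specific infra team’s phones, not to a distant shared queue.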
And to hold this structure in place, the CTO must hold both the overall infra function and the application teams accountable whenever there are application performance issues. This is completely missing in most of the organisations we have worked with: the application teams are penalised, while the infra teams are a law unto themselves.
We feel that our recommendations are commonsense measures. But they do not seem to be common at all.