Software Architecture Refactoring: Case Studies

Written with Hari Kiran.

What is software architecture? An interesting and useful definition of architecture is from Grady Booch:

“Architecture represents the significant design decisions that shape a system, where significant is measured by cost of change”.

The process of creating architecture (and later evolving it) involves making a series of design decisions. All design decisions are important, but some design decisions are more significant than others, and are known as “Architectural Decisions” (ADs).

In the last couple of years, Hari Kiran and I have conducted numerous workshops on software architecture. During our discussions with the architects in our workshops, we came across many interesting real-world problems related to architecture decisions and the refactoring needed to address them. In this article, we present three of those case studies. These examples show that wrong architecture decisions often make architecture refactoring very hard. Let us illustrate this with a real-world analogy from civil engineering.

We dread crossing the notorious KR Puram flyover in Bangalore – it has sometimes taken us hours just to cross this flyover! We were surprised and shocked when we came to know that its design had won awards in the past. But from our experience (which is shared by everyone who dreads crossing this flyover), it is a living example of poor functional design. What is the meaning of aesthetic design when the flyover creates a traffic mess because it is unusable?

We believe that the design of this bridge is architecturally flawed when it comes to functionality: it's hard to make a U-turn, it connects only one road and leaves out another altogether, there is no scope for extension, etc. But “refactoring” it is too costly because of its location (it is just outside the gate of the KR Puram railway station) and the extensive discomfort it would cause to people.

In many ways, architecting software is analogous to architecting cities. We need to continuously adapt cities and the key elements of their infrastructure; otherwise, they will cause great discomfort to their users. However, refactoring cities (and large software) is very hard in practice.

With that, we discuss three case studies on architecture decisions and architecture refactoring; they are not so different from the KR Puram flyover analogy!

Case Study 1: Dealing with Security Vulnerabilities

A large company had a successful networking product that was a cash cow for the company. Over a decade, the company had sold the product to a large number of customers worldwide. The team responsible for the product within the organization was working on adding many new features for the upcoming release. During that period, a customer reported a security attack on the product. The management had to pacify that customer and also had to assure their other customers that fixing security holes and enhancing the security of the product would be the highest priority in the next release. The management team then asked the engineering team to address the security concerns along with the promised critical features, without changing the release timelines.

An in-depth analysis showed that the only support for security at architectural level in the product was the “security layer” that authenticated the users. Code-level security vulnerabilities were spread throughout the code base. If the users could bypass the security layer, the software would become vulnerable to attacks. Though security was identified as one of the key requirements from the early days of the software, the software’s architecture provided poor support for security.

In other words, the software’s support for security was sub-optimal at the architectural level, making it vulnerable to attacks. Since security was the highest priority, the management asked the engineering team to address all the security-related concerns discovered so far in the product. Because the engineering team spent its time only on addressing security concerns and testing the changes, it missed the release deadline and could not deliver the promised critical features. Though code-level security concerns were mostly addressed, the software continued to be vulnerable because the architecture did not support security.

Extensive architectural refactoring had to be taken up in later releases of the software to make it “secure-by-design”.
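To make the contrast with a bypassable “security layer” concrete, here is a minimal C++ sketch of one possible way to design authentication into component interfaces. It is not taken from the product in the case study: the names Session, Authenticator, RouteTable and demo are hypothetical, and the hard-coded credentials exist only to keep the example short. The idea is that an inner component cannot be called at all without proof of authentication, so skipping the front door does not open an unauthenticated code path.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical proof-of-authentication token. The constructor is private,
// so only Authenticator can create a Session.
class Session {
public:
    const std::string& user() const { return user_; }

private:
    friend class Authenticator;
    explicit Session(std::string user) : user_(std::move(user)) {}
    std::string user_;
};

// Stand-in for the "security layer". Credentials are hard-coded purely
// for illustration; a real system would verify them properly.
class Authenticator {
public:
    Session login(const std::string& user, const std::string& password) const {
        if (user != "admin" || password != "s3cret") {
            throw std::runtime_error("authentication failed");
        }
        return Session(user);
    }
};

// A component deeper in the system (hypothetical). Every operation demands
// a Session, and inputs are still validated locally.
class RouteTable {
public:
    void add_route(const Session& session, const std::string& route) {
        assert(!session.user().empty());  // Authenticator never issues anonymous sessions
        if (route.empty()) {
            throw std::invalid_argument("empty route");
        }
        entries_.emplace_back(session.user(), route);  // record who added the route
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::vector<std::pair<std::string, std::string>> entries_;
};

// Usage sketch: the only way to reach RouteTable is through a login.
inline void demo() {
    Authenticator auth;
    Session session = auth.login("admin", "s3cret");
    RouteTable routes;
    routes.add_route(session, "10.0.0.0/8");
    assert(routes.size() == 1);
}
```

This only illustrates the “designed-in” idea at the interface level. Making a real product secure-by-design also involves concerns such as authorization, input handling, and auditing across all components.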

Key takeaways:

  • Inadequately addressing critical concerns such as security at the architectural level can cause considerable business losses.
  • Quality attributes like security must be “designed-in” and cannot be “added later at code level”.

Case Study 2: Using Structured Exception Handling

A large network monitoring application evolved over a period of more than two decades. The software was initially meant for use only on the Windows platform and was written in C++. Error or Exception Handling (EH) was a challenge since it was a large application spanning millions of lines of code across numerous components. These were the early days of C++ – the language did not have good support for exception handling. The architects decided to create an exception handling strategy by making use of the Structured Exception Handling (SEH) feature supported in Microsoft Visual C++ for the Windows platform. Within a few years, the whole application used SEH.

As the software evolved, it had to be extended to support other platforms, mainly Unix and later Linux variants. By this time, C++ had added support for exception handling in the language. The architects of the application found the need to move to standard C++ exception handling features for two reasons: a) portability of the source code across different C++ compilers and operating systems, and b) avoiding reliance on compiler/language extensions in favor of standard language features that most C++ developers were now familiar with. However, rewriting the existing code to use standard C++ EH features was considerably effort-intensive (and risky). Hence, the architects decided to create and use a new EH strategy based on standard C++ EH language features only for the newly written components. With this move, the earlier AD of having an EH strategy based on SEH became obsolete, though the vast majority of the code still used SEH.

With this architectural decision to use standard C++ features for new components, how to handle throwing and catching exceptions between the old components and the new components became a thorny issue. Also, over time, developers found it difficult to maintain the code that used SEH in older components because they were not familiar with the SEH approach. These factors caused major maintenance problems for the software.
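One common way to contain this kind of problem is to translate errors at the boundary between old and new components, so that each side sees only its own EH style. The sketch below shows the idea in portable C++. It is an illustration under assumptions, not the approach the team in this case study took: LegacyFault, ComponentError, legacy_read_counter, call_legacy, read_counter and demo are hypothetical names, and LegacyFault merely models an old-style failure (real SEH exceptions are not C++ objects). On Microsoft Visual C++, `_set_se_translator` is one documented mechanism for turning SEH exceptions into C++ exceptions; the sketch avoids it so that it compiles on any platform.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a failure raised by an old (SEH-based) component.
struct LegacyFault {
    unsigned code;
};

// Standard exception type used by the newly written components.
class ComponentError : public std::runtime_error {
public:
    explicit ComponentError(unsigned code)
        : std::runtime_error("legacy fault code " + std::to_string(code)),
          code_(code) {}

    unsigned code() const noexcept { return code_; }

private:
    unsigned code_;
};

// Hypothetical legacy routine that fails in the old style.
inline int legacy_read_counter(int port) {
    if (port < 0) {
        throw LegacyFault{0xC0000005u};
    }
    return port * 10;
}

// Boundary wrapper: the single place where new code calls into old code.
// Every legacy failure is translated into a standard C++ exception.
template <typename Fn>
auto call_legacy(Fn&& fn) -> decltype(fn()) {
    try {
        return fn();
    } catch (const LegacyFault& fault) {
        throw ComponentError(fault.code);
    }
}

// New-style component: its callers see only std::exception-derived errors.
inline int read_counter(int port) {
    return call_legacy([port] { return legacy_read_counter(port); });
}

// Usage sketch: the caller handles standard exceptions only.
inline void demo() {
    assert(read_counter(3) == 30);
    try {
        read_counter(-1);
        assert(false && "expected ComponentError");
    } catch (const ComponentError& e) {
        assert(e.code() == 0xC0000005u);
    }
}
```

Keeping the translation in one wrapper means the old EH style does not leak into new components, and the wrapper is the only code that has to change when an old component is eventually rewritten.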

Key takeaways:

  • Real-world software often survives for many decades. In such software, it is inevitable that some of the architectural decisions made earlier become obsolete.
  • Architects have to carefully consider the ramifications of making ADs obsolete. It is impossible to foresee all potential problems that can arise in the future, but it is important to find a transition path and make the transition to realizing the new ADs as smooth as possible.

Case Study 3: Illicit Dependencies

This is the case of a start-up that focused on getting a working product to its customers ahead of the competition to gain the “first-mover advantage” in the market. The founder of the start-up was a domain expert and an experienced architect. He created the design, left the implementation to his engineering team, and focused on management aspects such as getting funding from investors. To reduce time-to-market, the development team (mostly consisting of fresh or inexperienced engineers) reused commercial or open source libraries. The bet paid off: the product was a roaring success in the market and the first of its kind in that domain. This success attracted the attention of major players in the domain, and a large company acquired the start-up.

The large company planned to add new features and start selling the product to its huge customer base worldwide within the next two quarters. When the engineering team tried to add features, they discovered that the software had improper dependencies on many commercial and open source libraries. The start-up had not procured the licenses for most of the commercial libraries. The way the software internally used open source libraries violated their licensing terms. As a large company, the acquirer could not afford to violate legal and regulatory requirements. Hence the focus of the engineering team now turned towards removing all the “red-flagged” libraries and getting “clean”.

The engineering team found that the use of libraries was tightly entangled within the code base. It would take a huge effort if all such “illicit dependencies” were to be removed. As a remedial measure, the company had to procure relevant licenses for these libraries. The cost of the libraries in turn increased the product cost.

When it came to open source libraries, most of them did not have equivalents. They had to be either replaced with existing in-house libraries or rewritten from scratch. With these changes, the testing effort drastically increased.

Instead of the product hitting the market in two quarters, it took two years for the company to finally get rid of all the “illicit dependencies” and to add new features. When the product finally hit the market, it was too late: competitors had already garnered the market share.

Key takeaways:

  • During the acquisition itself, a company should assign experienced architects to evaluate the technical and legal aspects of acquiring the software products. Legal implications can have considerable business impact – hence it is better to be “safe than sorry” and do the due diligence before acquisition.
  • “Illicit dependencies” is a serious architectural problem. In general, an engineering team (especially the architect) must carefully analyze the legality of using a library before starting to use it in the project. Once a library is used as an API, the code gets “entangled” with the use of the API. Removing the dependency later can turn out to be costly in terms of effort required and time.
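One way to limit this entanglement is to let application code depend only on an interface the product owns and to confine each third-party call to a single adapter. The C++ sketch below illustrates this under assumptions: Checksum, VendorChecksum, InHouseChecksum, PacketLogger, demo and the vendor::vnd_sum function are all hypothetical, with vnd_sum standing in for a call into a licensed library.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Interface owned by the product. Application code depends only on this.
class Checksum {
public:
    virtual ~Checksum() = default;
    virtual unsigned compute(const std::string& data) const = 0;
};

// Hypothetical third-party function, standing in for a licensed library call.
namespace vendor {
inline unsigned vnd_sum(const char* bytes, std::size_t length) {
    unsigned total = 0;
    for (std::size_t i = 0; i < length; ++i) {
        total += static_cast<unsigned char>(bytes[i]);
    }
    return total;
}
}  // namespace vendor

// Adapter: the only class that knows about the vendor API.
class VendorChecksum : public Checksum {
public:
    unsigned compute(const std::string& data) const override {
        return vendor::vnd_sum(data.data(), data.size());
    }
};

// In-house replacement that can be swapped in if the library must go.
class InHouseChecksum : public Checksum {
public:
    unsigned compute(const std::string& data) const override {
        unsigned total = 0;
        for (unsigned char byte : data) {
            total += byte;
        }
        return total;
    }
};

// Application code: unaware of which implementation it was given.
class PacketLogger {
public:
    explicit PacketLogger(std::unique_ptr<Checksum> checksum)
        : checksum_(std::move(checksum)) {}

    std::string tag(const std::string& payload) const {
        return payload + "#" + std::to_string(checksum_->compute(payload));
    }

private:
    std::unique_ptr<Checksum> checksum_;
};

// Usage sketch: swapping the implementation does not touch PacketLogger.
inline void demo() {
    PacketLogger with_vendor(std::make_unique<VendorChecksum>());
    PacketLogger in_house(std::make_unique<InHouseChecksum>());
    assert(with_vendor.tag("abc") == in_house.tag("abc"));
}
```

With this structure, replacing a red-flagged library means writing one new adapter and re-testing against the interface, rather than hunting for library calls spread across the code base. It does not remove the need to check the license before adopting the library in the first place.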

Summary

It is practical and useful to view architecture as a representation of the Architectural Decisions (ADs) that shape a system. If we get some of those ADs wrong, we have a significant cost to pay. The cost could be in terms of increased effort or resources, or it could be a negative impact on quality or business losses. Hence, it is important for architects to focus on getting the ADs right.

As Philippe Kruchten notes, “key architectural choices cannot be easily retrofitted on an existing system, by means of simple refactorings.”
