Modernize the Legacy — Software Archaeology

Thilo Hermann
4 min read · Apr 29, 2021

When I was young, I was fascinated by archaeology and looked for part-time jobs to get a closer look at the field. Eventually, while still at school, I started to work as an attendant in a museum of local history. Later, during my computer science studies, I had a part-time job in an osteology department (osteology is the scientific study of bones), where I developed software to analyze human and animal bone finds from archaeological excavations.

When I started to work at #Capgemini, my first job was to analyze an "ancient" C program that performed duplicate matching on addresses. Everything was handled in one huge nested "if" statement, and I wondered why there were so many seemingly blank lines. While analyzing the source code I realized that this was caused by the extremely deep indentation. It took me quite some time to understand the business logic behind it. The final documentation ran to more than ten pages of text.

Later I learned that it could have been worse: for some applications even the source code was gone. In those cases you need to reverse engineer the machine code (or assembly language), which can be extremely tricky.

All this reminded me of my work experience in archaeology. Archaeologists typically study the material remains of the past to understand how people lived. To achieve this, archaeologists ask questions and develop hypotheses. They choose a dig site and observe, record, categorize, and interpret what they find.

This happened more than 20 years ago … and I was wondering whether it is still relevant. The answer is yes, because we still have to modernize the legacy application landscapes of huge organizations.

After some research I found out that I'm not the first one to come up with this analogy. There is a nice definition of the first law of software archaeology by Mitch Rosenberg (see https://en.wikipedia.org/wiki/Software_archaeology):

Everything that is there is there for a reason, and there are 3 possible reasons:

  • It used to need to be there but no longer does
  • It never needed to be there and the person that wrote the code had no clue
  • It STILL needs to be there and YOU have no clue

The corollary to this “law” is that, until you know which was the reason, you should NOT modify the code.

What can we do to survive this? A software engineer needs to master at least the following analysis techniques to be a good "Software Archaeologist" for source code:

Static

  • Analyze the source code repository to identify the authors and the latest changes.
  • Create a heat map that shows "hot" and "cold" code based on the number of changes over time (see the sketch after this list).
  • Use the built-in features of your IDE (e.g. full-text search, refactoring tools, …) to navigate through the code.
  • Use reverse-engineering tools on the source code (Sonargraph, Imagix, …) to generate metrics and diagrams (e.g. dependency graphs) for better understanding.
  • Use a disassembler (e.g. IDA Pro, Ghidra, …) if the source code is missing, to start the analysis on machine code or assembly language.
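
As an illustration of the heat-map idea, here is a minimal sketch in Python that counts how often each file in a Git repository changed within a time window. The repository path, time window, and output size are assumptions for the example, not part of any original tooling.

```python
import subprocess
from collections import Counter

def change_heat_map(repo_path=".", since="2 years ago", top_n=20):
    """Rank files by change frequency to spot "hot" and "cold" code."""
    # List the files touched by each commit in the given time window.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter(line for line in log.splitlines() if line.strip())
    return counts.most_common(top_n)

if __name__ == "__main__":
    for path, changes in change_heat_map():
        print(f"{changes:5d}  {path}")
```

Fed into a visualization of your choice, these counts quickly show where the activity (and usually the risk) is concentrated.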

Dynamic

  • Use a debugger to analyze execution and data flow: inspect variables and understand the control flow.
  • Monitor the interfaces to understand external dependencies and the external data flow.
  • Analyze the flow of messages (incoming and outgoing).
  • Analyze logs and traces to understand the flow (see the tracing sketch after this list).
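
If the legacy code has no useful logging of its own, you can often retrofit a little tracing while you analyze it. The following is a minimal sketch in Python; the traced function and its logic are purely illustrative stand-ins, not the original system.

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(message)s")
log = logging.getLogger("trace")

def traced(func):
    """Log every call and its result to reconstruct the runtime flow."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("ENTER %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        log.debug("EXIT  %s -> %r", func.__name__, result)
        return result
    return wrapper

@traced
def match_address(a, b):
    # Stand-in for a legacy routine under analysis.
    return a.strip().lower() == b.strip().lower()

match_address("Main St. 1 ", "main st. 1")
```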

Please note that it's important to document your findings during the analysis: sooner or later your implementation will be the legacy, and someone else will need to analyze it. Otherwise you are as bad as your predecessors!

Once you have understood what is going on, you are ready to move on and change the code so that it fits the new requirements.

The challenge is that your changes might have unwanted side effects. Even the best analysis can't guarantee zero side effects. The value of automated tests can hardly be overstated in this context. Typically there are no automated tests at all for the legacy code, so a good "Software Archaeologist" starts by writing automated tests that verify the assumptions made during the analysis (see the sketch below). These tests are also very handy once you start to modify or refactor the code. Besides the tests, it is a best practice to add logs and traces to the changed code. They help you identify side effects during testing and, if enabled, even in production.
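
One common way to pin down those assumptions is a characterization (or "golden master") test: record what the legacy code does today and fail if a later change alters it. Here is a minimal sketch using pytest; the match_address function and the recorded cases are hypothetical stand-ins, not the original application.

```python
import pytest

def match_address(a: str, b: str) -> bool:
    """Stand-in for a legacy duplicate-matching routine (illustrative only)."""
    return a.strip().lower() == b.strip().lower()

# Inputs and the results the current implementation was observed to produce,
# recorded during the analysis phase.
RECORDED_CASES = [
    (("Main St. 1, Stuttgart", "main st. 1, stuttgart"), True),
    (("Main St. 1, Stuttgart", "Main St. 2, Stuttgart"), False),
    (("", ""), True),
]

@pytest.mark.parametrize("inputs, expected", RECORDED_CASES)
def test_match_address_keeps_observed_behavior(inputs, expected):
    # If a refactoring changes any recorded result, the test fails and
    # flags a potential unwanted side effect.
    assert match_address(*inputs) == expected
```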

Once you're done with the changes, cross your fingers and watch out for unwanted effects in production caused by broken assumptions!

Source code is only one dimension. I also observed similar patterns for data, devices, protocols, interfaces, operations, and processes in the legacy world … so watch out for the upcoming series of blog posts around Software Archaeology!

Thilo Hermann

Thilo has more than 25 years of experience in IT architecture and has worked for several clients in Germany. He is located in Stuttgart and works at Capgemini.