Data Dependency Analysis in Backend Applications

Hidden Patterns In Your Backend Code

Every application somehow deals with data. After all, this is why applications are necessary in the first place.

In the case of backend applications, there are well-known, stable patterns in how code deals with data. Everyone knows about CRUD, for example.

But there is another set of patterns that is present in the vast majority of backend applications. Unlike traditional software design patterns, which revolve around code far more than around data, these patterns are purely data-driven. To be precise, they are defined by dependencies between the different pieces of data circulating inside the application.

A very high-level look at a backend application shows almost the same picture regardless of the application itself and the technologies used to implement it. There is a set of entry points (Inputs), some kind of persistence layer and/or external services (Outputs), and a blob of code which transforms data between these two sets (Application Logic):

The large blob in the middle is where all the magic happens, and it is exactly this part that the discussion below focuses on.

Each entry point takes some data (parameters, perhaps optional) as input and produces some data (a response) as output. Let’s forget about parameters for the moment (we’ll return to them soon) and take a look at how an entry point generates the response. In the vast majority of cases, the entry point requests one or more pieces of data from external sources (databases, internal or external services, etc.) and then transforms them into a response. Once we abstract away the details of data retrieval and focus only on the data itself, we immediately notice that the response depends on the data received from external sources. This dependency can be one of two kinds:

  • The response can be created only when data from all external sources has been successfully obtained. This is a dependency of type All.
  • The response can be created as soon as any one of the external sources successfully returns data. This is a dependency of type Any.
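The two kinds of dependency can be sketched as a pair of small Python functions. This is only an illustration: the names `all_of`, `any_of`, and `SourceError` are hypothetical and do not come from any particular library, and each external source is modelled as a plain callable which either returns data or raises `SourceError`.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class SourceError(Exception):
    """Hypothetical error raised when an external source fails."""


def all_of(sources: Sequence[Callable[[], T]]) -> List[T]:
    """All: every source must succeed; the first failure propagates."""
    return [source() for source in sources]


def any_of(sources: Sequence[Callable[[], T]]) -> T:
    """Any: return data from the first source that succeeds."""
    failures = 0
    for source in sources:
        try:
            return source()
        except SourceError:
            failures += 1  # try the next source
    raise SourceError(f"all {failures} sources failed")
```

With this sketch, `all_of` fails as soon as one source fails, while `any_of` fails only when every source has failed.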

Note that the dependencies described above do not require that the data obtained from an external source is actually used to generate the response. For example, CRUD’s Update or Delete entry points usually need only the operation status: success or failure. Nevertheless, the data dependency still exists, since we can’t generate the response until we receive data from the external source.

What is even more interesting is that these patterns repeat inside the external sources we call. In most cases they retrieve data from other external sources, transform it, and return it to us. This happens again and again until we reach an external source which actually holds the necessary data. If we look at the whole picture, from the application entry point down to every source of data, we will see a tree of data dependencies, or Data Dependency Graph (DDG). The external sources which actually hold data and don’t depend on other external sources are the leaves of this tree. For convenience, I’ll call them terminal sources.

So, the whole tree has a root (the application entry point) and goes down through intermediate nodes (services’ methods) to terminal sources.

The diagram below shows an example of such data dependencies:

Each rounded rectangle on the diagram contains the name of the service and the name of the data structure (or table) it returns to the requester. The direction of the arrows represents the direction of data transfer.

The diagram shown above is good for illustration, and something similar is convenient to draw on a whiteboard. In some cases, though, a pure text representation is more convenient. We can write the All and Any patterns as functions which take dependencies as parameters and return our response. For example, the DDG nodes from the diagram shown above can be written as follows:

UserProfileResponse = All(UserProfile, LastUserLogin)

UserProfile = All(UserData)

LastUserLogin = All(UserLogin)

UserLogin = All(user_logins)

Sometimes for brevity it might be convenient to substitute intermediate dependencies and write the whole DDG as a single function:

UserProfileResponse = All(All(UserData), All(All(user_logins)))

While this representation omits some useful information (names of intermediate data, for example), it’s more explicit as it shows DDG depth and contains only terminal sources as dependencies. The DDG depth is an important property for architecture analysis, as we’ll see below.
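The text notation maps naturally onto a small tree structure. The sketch below is an assumption-laden illustration, not a reference implementation: `Terminal`, `Node`, `render`, `depth`, and `terminals` are hypothetical names, and depth is taken here to mean the number of hops from the entry point to the deepest terminal source.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Terminal:
    """A terminal source: holds data and depends on nothing else."""
    name: str


@dataclass(frozen=True)
class Node:
    """An entry point or intermediate node with an All/Any dependency."""
    name: str
    kind: str  # "All" or "Any"
    deps: Tuple[Union["Node", Terminal], ...]


def render(node) -> str:
    """Write the DDG as a single function with intermediate names substituted."""
    if isinstance(node, Terminal):
        return node.name
    return f"{node.kind}({', '.join(render(d) for d in node.deps)})"


def depth(node) -> int:
    """Assumed definition: hops from this node to its deepest terminal source."""
    if isinstance(node, Terminal):
        return 0
    return 1 + max(depth(d) for d in node.deps)


def terminals(node) -> List[str]:
    """Collect the names of all terminal sources, left to right."""
    if isinstance(node, Terminal):
        return [node.name]
    return [name for d in node.deps for name in terminals(d)]


user_profile = Node("UserProfile", "All", (Terminal("UserData"),))
user_login = Node("UserLogin", "All", (Terminal("user_logins"),))
last_user_login = Node("LastUserLogin", "All", (user_login,))
response = Node("UserProfileResponse", "All", (user_profile, last_user_login))

print(render(response))     # All(All(UserData), All(All(user_logins)))
print(depth(response))      # 3
print(terminals(response))  # ['UserData', 'user_logins']
```

Keeping the node names in the structure preserves the information that the single-function form drops, while `render` can still produce that compact form on demand.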

Some Terminology

  • Data Dependency Forest (DDF): a set of DDGs where each DDG represents one application entry point
  • Data Dependency Analysis (DDA): the process of building a DDG or DDF

It’s easy to notice that a DDG is rather abstract. It does not depend on the implementation language, framework, or packaging (monolith or microservices). Legacy applications often have neither documentation nor tests and very often are written … let’s say, in an outdated style. Building a DDF for such an application is somewhat tedious work, but once it’s done we obtain a valuable source of information:

  • We get information about the application’s internal structure
  • We get information about how different pieces of data interact with each other

With this information in hand, it is much simpler to start reworking an application into something more modern and maintainable, or even to rewrite the whole application from scratch.

Because a DDG is abstract, we can also build it for an application which does not yet exist. Once the architecture is detailed enough that we can discover dependencies between data, we can build a DDF for the application. From the DDF we can find:

  • How to split the whole application into services
  • How dependencies interact with each other and where potential bottlenecks may be
  • Estimates of how long the processing of each request will take (discussed below)

Needless to say, this information is very useful during design, as it allows us to avoid many pitfalls and costly mistakes. By rearranging data it is possible to optimize the architecture before the first line of code is written.

Obviously, obtaining dependencies takes time. Some time is needed to retrieve data from a terminal source. Then, as the data propagates up the DDG, some communication overhead is added at each step. By collecting these times across the whole DDG we can estimate how long processing will take. Note that the time is calculated differently for synchronous and asynchronous handling of All and Any dependencies.

Synchronous Handling

The estimate for obtaining a dependency of type All is the sum of the estimates for each data dependency. Since processing is synchronous, we obtain the dependencies one by one, hence the total time is the sum of all the individual times. If processing is parallelized somehow, the time is reduced to the same estimate as for asynchronous processing.

The estimate for obtaining a dependency of type Any is the time to retrieve the first dependency plus the time for each subsequent dependency that has to be tried because the previous one failed. This happens because we have to retrieve dependencies one by one until one of them succeeds.
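The two synchronous rules can be written down as a short sketch. The function names and the `failures_before_success` parameter are hypothetical, introduced only to make the Any rule concrete; the sample numbers in the usage note are illustrative as well.

```python
from typing import Sequence


def sync_all_estimate(times_ms: Sequence[float]) -> float:
    """All, synchronous: dependencies are fetched one by one, so times add up."""
    return sum(times_ms)


def sync_any_estimate(times_ms: Sequence[float], failures_before_success: int) -> float:
    """Any, synchronous: pay for every failed attempt plus the first success.

    times_ms is given in the order in which the dependencies are tried.
    """
    attempts = failures_before_success + 1
    if attempts > len(times_ms):
        raise ValueError("more failures than there are dependencies to try")
    return sum(times_ms[:attempts])
```

For dependencies taking 4, 6, and 10 time units, `sync_any_estimate` gives 4 if the first attempt succeeds and 20 if only the last one does, so the order in which sources are tried matters in the synchronous case.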

Asynchronous Handling

Asynchronous processing mechanisms usually provide dedicated methods to obtain All and Any results in parallel. This significantly reduces the processing time.
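As one possible illustration, Python's `asyncio` offers `asyncio.gather` for the All case, and the Any case can be assembled from `asyncio.as_completed`. The wrapper names `gather_all` and `first_success` below are hypothetical, and the sketch assumes that a failing source raises an ordinary exception.

```python
import asyncio
from typing import Any, Awaitable, List


async def gather_all(*sources: Awaitable[Any]) -> List[Any]:
    """All: run every source concurrently and wait for all of them."""
    return await asyncio.gather(*sources)


async def first_success(*sources: Awaitable[Any]) -> Any:
    """Any: run every source concurrently and return the first success."""
    tasks = [asyncio.ensure_future(source) for source in sources]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception:
                continue  # this source failed; wait for the next one to finish
        raise RuntimeError("all sources failed")
    finally:
        for task in tasks:
            task.cancel()  # no-op for finished tasks, stops the remaining ones
```

Cancelling the remaining tasks once a result is available is a design choice in this sketch; an application might instead let them finish, for example to warm a cache.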

The estimate for obtaining a dependency of type All is the longest estimate among its data dependencies. All dependencies are retrieved in parallel, so all of them are available once the slowest one has been obtained.

The estimate for obtaining a dependency of type Any is equal to the estimate of the first data dependency to succeed. Again, all of them are processed in parallel, and once one of them returns successfully there is no need to wait for the remaining ones.
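The asynchronous rules reduce to a maximum and a minimum. In the sketch below the function names and the `succeeds` flags are hypothetical; the flags simply mark which dependencies are assumed to return successfully.

```python
from typing import Sequence


def async_all_estimate(times_ms: Sequence[float]) -> float:
    """All, asynchronous: everything runs in parallel, the slowest one decides."""
    return max(times_ms)


def async_any_estimate(times_ms: Sequence[float], succeeds: Sequence[bool]) -> float:
    """Any, asynchronous: the fastest dependency that succeeds decides."""
    successful = [t for t, ok in zip(times_ms, succeeds) if ok]
    if not successful:
        raise ValueError("no dependency succeeds, so there is no response time")
    return min(successful)
```

For dependencies taking 4, 6, and 10 time units where the first one fails, the Any estimate is 6, because the failed dependency does not delay the others.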

Simple Example of Request Processing Time Estimation

Let’s try to estimate the request processing time for the DDG provided above and repeated here for convenience:

UserProfileResponse = All(UserProfile, LastUserLogin)

UserProfile = All(UserData)

LastUserLogin = All(UserLogin)

UserLogin = All(user_logins)

Let’s assume the following average access times for terminal sources and network delays:

- Access time for Auth SaaS = 5ms

- Access time for Activity Repository = 2ms

- Average communication delay between services = 3ms

Synchronous Version

UserLogin time = 2ms (access) + 3ms (network) = 5ms

LastUserLogin time = UserLogin time + 3ms (network) = 8ms

UserProfile time = 5ms (access) + 3ms (network) = 8ms

UserProfileResponse time = LastUserLogin time + UserProfile time = 16ms

Asynchronous Version

UserLogin time = 2ms (access) + 3ms (network) = 5ms

LastUserLogin time = UserLogin time + 3ms (network) = 8ms

UserProfile time = 5ms (access) + 3ms (network) = 8ms

UserProfileResponse time = max(LastUserLogin time, UserProfile time) = 8ms

Even in this simple example the asynchronous version is twice as fast as the synchronous one.
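The arithmetic above can be reproduced with a small recursive estimator. This is a sketch under stated assumptions: the `Terminal` and `All` classes and the `estimate` function are hypothetical names, only All dependencies are modelled, and the 3ms communication delay is attached to every node except the entry point, which matches how the figures above are added up.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Terminal:
    """A terminal source with its average access time."""
    access_ms: int


@dataclass(frozen=True)
class All:
    """A node with an All dependency; hop_ms is the delay to reach its caller."""
    deps: Tuple[Union["All", Terminal], ...]
    hop_ms: int = 0


def estimate(node, parallel: bool) -> int:
    """Estimate processing time: max of dependencies if parallel, sum otherwise."""
    if isinstance(node, Terminal):
        return node.access_ms
    times = [estimate(dep, parallel) for dep in node.deps]
    combined = max(times) if parallel else sum(times)
    return combined + node.hop_ms


NETWORK_MS = 3
user_login = All((Terminal(2),), hop_ms=NETWORK_MS)       # Activity Repository
last_user_login = All((user_login,), hop_ms=NETWORK_MS)
user_profile = All((Terminal(5),), hop_ms=NETWORK_MS)     # Auth SaaS
user_profile_response = All((user_profile, last_user_login))

print(estimate(user_profile_response, parallel=False))  # 16
print(estimate(user_profile_response, parallel=True))   # 8
```

Changing an access time or the network delay and re-running the estimator is a quick way to see which branch of the DDG dominates the response time.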

Data Dependency Analysis is a new approach to architecture design and analysis. While it already looks very interesting and useful, research in this direction is still incomplete. In particular, the part which allows systematic synthesis of code for DDG entry points is still in progress. DDA might also be useful for refactoring and code reviews, but these areas need deeper research too.
