Is scary data pipeline technical debt haunting your business?

Here are some horror stories (fictional, of course): cautionary tales for Halloween.

The hacky spaghetti monster – the lone junior analyst who had to build the company’s first data pipeline

A junior analyst with no data engineering training beyond a few DataCamp courses and some Stack Overflow Googling put together some data pipelines in the form of DAGs on Airflow. This analyst was the lone keeper of the data and built these pipelines amid a flood of Slack messages from angry stakeholders wondering why the data was taking so long. When the DAGs were done, the analyst was happy not to have to query the product database directly anymore and showed the boss that there was now a new schema in the data warehouse. All seemed well.

Months went by. The junior data analyst found another job and left behind only a few comments in the code. Nothing else was documented.

A new team was formed. This team queried the data warehouse and made some dashboards. Soon angry stakeholders started messaging them on Slack. The dashboards were failing. The numbers were unbelievable. The team of analysts tried their best to build fixes on top of the original DAGs, but they were not sure how the DAGs worked.

Eventually some data engineers were hired. After eight months and six-figure salaries, the DAGs were re-engineered to function properly.

The lone wizard of the data pipeline – the quietly brilliant data engineer who disappeared without proper documentation

There once was a data engineer who was responsible for the data pipelines on their team. Whenever there was a need, the analysts asked this person to make the data available. This person worked in mysterious ways: naming tables with mixtures of letters and numbers, using a different data model for each table, and documenting nothing. The pipeline ran through the different tables unbeknownst to the rest of the team. Then this person left the team for an offer that paid 20% more at another company.

One day the Head of Data decided to clean up the tables that had not been used in the last two years.

All of a sudden, the pipelines started to break.

External consultants had to be called in to solve the puzzle. The project cost six figures, but now the average Google Cloud Platform (GCP) costs are 2K lower.
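The cleanup in this tale broke pipelines because nobody checked which jobs still read the "unused" tables. A minimal sketch of such a check follows, assuming the pipeline SQL is available as plain text. The function names, table names, and queries are all hypothetical, and a text scan is only a heuristic: it will miss table names that are built dynamically, so treat its result as a list of candidates to investigate, not as proof.

```python
import re


def find_table_references(pipeline_sql, candidate_tables):
    """Map each candidate table to the pipelines whose SQL mentions it.

    pipeline_sql: dict of pipeline name -> SQL text (hypothetical input).
    candidate_tables: table names someone wants to drop.
    """
    references = {table: [] for table in candidate_tables}
    for pipeline_name, sql in pipeline_sql.items():
        for table in candidate_tables:
            # Match the name only as a whole identifier, so "b42"
            # does not match inside "b42_final".
            pattern = r"(?<!\w)" + re.escape(table) + r"(?!\w)"
            if re.search(pattern, sql, flags=re.IGNORECASE):
                references[table].append(pipeline_name)
    return references


def tables_without_references(pipeline_sql, candidate_tables):
    """Return the candidates that no known pipeline mentions, sorted."""
    refs = find_table_references(pipeline_sql, candidate_tables)
    return sorted(table for table, users in refs.items() if not users)


# Hypothetical example data, in the spirit of the wizard's naming scheme.
pipelines = {
    "daily_revenue": "SELECT * FROM staging.tmp_x9k2 JOIN staging.a17 USING (id)",
    "weekly_churn": "SELECT user_id FROM staging.b42_final",
}
candidates = ["tmp_x9k2", "a17", "b42", "zz_old"]

print(tables_without_references(pipelines, candidates))
```

Here only `b42` and `zz_old` come back as unreferenced; `tmp_x9k2` and `a17` are still read by `daily_revenue`, so dropping them would have broken that pipeline.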

Infinite loop of debugging and fixing – the repeating nightmare when analysts run the pipelines

There was a team of junior data analysts who made their own pipelines on Airflow. They put all of the analytical scripts into one pipeline that would run daily to ensure that the dashboards were working every morning.

The small team of data engineers was occupied building out what the Head of Data called the “core infrastructure,” which was ingestion. That meant that while the trained data engineers focused on the mechanisms that pulled data from APIs, built event streams, and so on, the data analysts were responsible for the data warehouse and the business logic.

Because the analysts were happy whenever a script actually worked and moved straight on to the next card in their endless backlog, the pipelines all ended up with different definitions and logic. As a consequence, the numbers often did not line up.

The analysts would Google how to write the scripts in Airflow. The next morning, when the scripts broke, they would scramble to debug and fix the problem. In the afternoon, they would Google again for the next script.

The infinite loop of producing something and then fixing it the next morning continued until one day the analysts were fired.

It turned out that they spent so much time figuring out how to write and debug scripts that they did not have enough time for analysis and for delivering impactful insights. The company did not see a return on investment sufficient to keep an analyst team on board.
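The misaligned numbers in this tale came from every pipeline carrying its own copy of the business logic. One common remedy is to keep each definition in a single shared function that all pipelines import. The sketch below is only an illustration: the name `is_active_customer` and the 30-day rule are hypothetical, not taken from any real team.

```python
from datetime import date, timedelta

# Hypothetical business rule: a customer counts as "active" if their
# last order was placed within the past 30 days.
ACTIVE_WINDOW_DAYS = 30


def is_active_customer(last_order_date, as_of):
    """The one shared definition of 'active' that every pipeline reuses."""
    age = as_of - last_order_date
    # Orders dated after as_of (negative age) are not counted.
    return timedelta(0) <= age <= timedelta(days=ACTIVE_WINDOW_DAYS)


def count_active(last_order_dates, as_of):
    """Count active customers using the shared definition."""
    return sum(is_active_customer(d, as_of) for d in last_order_dates)


# Hypothetical usage inside any dashboard script.
as_of = date(2023, 10, 31)
last_orders = [date(2023, 10, 30), date(2023, 10, 1), date(2023, 9, 30)]
print(count_active(last_orders, as_of))
```

If the rule changes, it changes in one place, and every dashboard that calls `count_active` moves with it.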

The thousand-headed data warehouse – when a new table is made for every task

There once was a team of junior analysts who used BigQuery. For every new customer and every new assignment, they would make a new table. As time went on, the data warehouse in BigQuery grew huge. Each query was horrendously expensive and took too long.

The analysts each had their own naming convention for the tables, and comments in the scripts were their only form of documentation. The company worked ad hoc and did not pay the analysts particularly well, so there was a lot of churn. With nothing properly documented, new generations of analysts could not figure out what their predecessors had done.

One day auditors came and asked the firm many difficult questions, none of which the firm could answer.

Moral of the stories:

Even for a startup, technical debt can cost eight figures or more to overcome.

Control your technical debt before it controls you by ruining your financing round or audit.

You might even cling to your technical debt, as improving the quality of your data will deflate the faulty KPIs you sold to your investors. Bringing data quality to your organization could mean admitting that your CLV/CAC is low or that your retention figures are way off. Scary, scary conversations will follow.

Yes, technical debt can haunt you.

Blogs

What can stop the cycle of chaos, underinvestment, attrition and overhiring in data teams? – an interview with Stevan Lazic

Operational KPIs that will let you know when your data team is adding value

Data strategy is part of corporate strategy

Podcasts

Data Operations on Analytics Anonymous with Valentin Umbach

More background about Conway’s Law.