Is the love of artisanal SQL the downfall of data teams?

The artisanal approach to SQL querying prevalent in the data industry is a not-so-silent killer of many data teams. Many analysts and even data leaders see writing long queries as an intellectually and technically pure form of analysis. However, if crafting a 200-something-line SQL query takes hours or even days (a scenario all too familiar), it's a sign that your data team is operating like artistic masters. Let's face it: data teams don't have royal sponsors to create masterpieces; they often run on cash-burn budgets. In a tech landscape that preaches agility and growth, indulging in artisanal coding becomes a costly impediment rather than a driver of data-driven progress.

Over-reliance on SQL can indeed lead to inefficiencies in certain scenarios. Here are some reasons why SQL might be overused and how that results in inefficiency:

  1. Limited Toolset Awareness: SQL has long been the default lingua franca of data teams. Technical intellectual snobbery sometimes keeps teams from reaching for more user-friendly means of conducting analysis, such as Tableau, Power BI, numerous other tools, and even Excel. Sometimes importing the data into Excel and manipulating it there is genuinely the more efficient, and therefore better, option.
  2. Complex Data Manipulation: SQL can become cumbersome for complex data manipulation tasks, especially when dealing with multiple data sources, transformations, and calculations. Other tools like Python with Pandas offer more flexibility and expressiveness for such tasks.
  3. Performance Issues: Writing inefficient SQL queries can lead to performance bottlenecks, especially with large datasets or complex queries. Without proper indexing, query optimization, or an understanding of database internals, SQL queries may not perform well (see the index sketch after this list).
  4. Data Integration Challenges: SQL is designed for querying structured data within relational databases. When dealing with semi-structured or unstructured data, or when integrating data from disparate sources, SQL may not be the most suitable tool for the job.
  5. Lack of Scalability: Scaling SQL-based solutions to handle growing data volumes and increasing complexity can be challenging. While SQL databases offer scalability options, they may not always be sufficient for rapidly evolving data requirements.
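
For point 3, the remedy is often mundane rather than artisanal. As a minimal sketch, assuming the hypothetical your_table, date_column, and story_points names used in the example below, an index on the date column lets the engine seek directly to the requested range instead of scanning every row:

-- Index the date column so range filters can use an index seek
-- (table and column names are hypothetical).
CREATE INDEX idx_your_table_date ON your_table (date_column);

-- Most engines will then show the difference in the query plan:
EXPLAIN
SELECT SUM(story_points)
FROM your_table
WHERE date_column BETWEEN start_date AND end_date;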

I will take a simple example from my world as an agile-minded data leader: I want a burn-down chart for my team.

SQL Query for a Burn-Down Chart:

This is a simple query that illustrates a stumbling block for so many teams: the nested subquery.

This query aggregates story points by date and calculates the cumulative sum of story points over time, which can be used to create a burn-down chart. While this simple query is easy to follow, complex queries often come with multiple layers of subqueries.

SELECT
    date_column AS Date,
    SUM(story_points) AS Total_Points,
    (
        SELECT SUM(story_points)
        FROM your_table AS t2
        WHERE t2.date_column <= t1.date_column
    ) AS Cumulative_Points
FROM
    your_table AS t1
WHERE
    date_column BETWEEN start_date AND end_date
GROUP BY
    date_column
ORDER BY
    date_column;

This query:

  • Uses a subquery in the SELECT clause to calculate the cumulative sum of story points up to each date.
  • For each row in the outer query (t1), the subquery sums the story points from all rows in the same table (your_table) whose date is less than or equal to the date of the current outer row.
  • The rest of the query aggregates story points by date and filters for the relevant date range.

Such a query can achieve the desired result. But with complex datasets, junior analysts under pressure often produce large, layered queries full of correlated subqueries like this one, which the database must re-execute for every row of the outer query, slowing processing considerably.

Too many data teams rely on multilevel nested queries, long and complex CASE WHEN statements, and LEFT JOIN upon LEFT JOIN.

For this query, a window function would be the better tool, and a CTE keeps the logic readable; ROLLUP and CUBE are likewise great tools for tables, views, and materialized views where a query is defined once and repeated. A sketch of the window-function version follows below.
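
As a minimal sketch, assuming the same hypothetical your_table, date_column, and story_points names as above, here is the burn-down query rewritten with a CTE and a window function:

-- Aggregate once per day in a CTE, then let a window function
-- compute the running total in a single pass.
WITH daily_points AS (
    SELECT
        date_column AS Date,
        SUM(story_points) AS Total_Points
    FROM your_table
    WHERE date_column BETWEEN start_date AND end_date
    GROUP BY date_column
)
SELECT
    Date,
    Total_Points,
    SUM(Total_Points) OVER (ORDER BY Date) AS Cumulative_Points
FROM daily_points
ORDER BY Date;

The table is scanned once and the running total is computed in a single ordered pass, instead of re-executing the inner SELECT for every date.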

What about using Pandas?

import sqlite3  # assuming a SQLite database purely for illustration

import matplotlib.pyplot as plt
import pandas as pd

# Connect to the database (hypothetical file name) and reuse the SQL query from above
conn = sqlite3.connect('your_database.db')
query = "..."  # the burn-down SQL query shown earlier

# Execute the query and load the results into a Pandas DataFrame
df = pd.read_sql_query(query, conn)

# Close the database connection
conn.close()

# Calculate the cumulative sum of story points
df['Cumulative_Points'] = df['Total_Points'].cumsum()

# Plot the burn-down chart (the query aliases date_column AS Date)
plt.plot(df['Date'], df['Cumulative_Points'], marker='o')
plt.xlabel('Date')
plt.ylabel('Cumulative Story Points')
plt.title('Burn-Down Chart')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

  • After executing the SQL query and loading the results into a Pandas DataFrame (df), the code calculates the cumulative sum of story points using the cumsum() function.
  • It then plots the burn-down chart with Matplotlib, with dates on the x-axis and the cumulative sum of story points on the y-axis.

Which is more efficient?

It depends. The size of the dataset, complexity of the query, and the capabilities of the underlying database system are all factors that determine the efficiency. 

Efficiency of SQL:

  • SQL queries are typically executed directly on the database server. Window functions enable parallel processing and perform complex analytical tasks, such as ranking, aggregation, and moving averages, efficiently over partitioned subsets of data. Poorly written nested queries, by contrast, can become overly complex and badly optimized, leading to performance issues, especially with large datasets or multiple levels of nesting.

Efficiency of Pandas:

  • Pandas processes data in memory, which can be advantageous for small to medium-sized datasets that fit into memory. A burn-down chart for a single team is exactly this kind of dataset, so Pandas handles it comfortably.
  • Pandas provides an interface for data manipulation and analysis, making it easy to write and understand code for complex data transformations.
  • For certain tasks, Pandas may offer more flexibility and expressiveness compared to SQL, allowing for custom logic and calculations that might be challenging to implement in SQL queries.
  • Pandas allows you to visualize the data in the tool itself.

In summary, the efficiency of Pandas versus SQL depends on the specific requirements of the task, the characteristics of the dataset, and the performance considerations of the underlying database system. It’s essential to evaluate the trade-offs and choose the appropriate tool based on the context of the data analysis project.

Sometimes bringing data into Excel or Tableau with a SELECT * FROM a table or view (a subset of your dataset), for example, to explore and analyze the data is also a good option; a small sketch follows.
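
As one more hedged sketch, assuming the same hypothetical table, a view gives spreadsheet and BI users a stable, pre-shaped subset to connect to:

-- A reusable view for Excel or Tableau users (names are hypothetical).
CREATE VIEW team_story_points AS
SELECT
    date_column AS Date,
    story_points
FROM your_table;

-- Excel or Tableau can then simply pull:
-- SELECT * FROM team_story_points;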

In any case, choices about which method to use should be based on efficiency, effectiveness, and the needs of the organization, not on artisanal-craftsmanship technocracy or technological affinity.
