By Elizabeth Press and Irina Brudaru
Photo by Irina Brudaru. Each greenfield project is unique. Just as no two fields on the side of a hill are the same, no two greenfield data projects are either. They all come with their own mixture of burned earth, islands of trees, fenced-in areas and more. This picture was taken in Iași, Romania.
This blog post amalgamates the experiences of others with our own to guide business leaders, founders and managers who want to build a data competence in their company. We compiled experiences and advice from greenfield veterans with different backgrounds from a wide variety of companies, ranging from startups to industrial mid-market companies. The common factor was that each of these people built a new data function in a given company.
The data field is a jumble of overlapping, trendy and obfuscating terms. In this article “data” means the broad spectrum of tools, professional roles and processes that are associated with managing and analyzing data. “Business Intelligence” (BI) is a spectrum of cross-functional disciplines that are associated with managing the journey of data from creation to a business decision, and the impact thereof. Technical debt is a debt in effort and money you create for your future business operations as a result of sub-optimal technical decisions in the short-run.
At what stage and age was the company when they started a data team?
The answers ranged from close to day 1 to after 100 years. Generally, there was always data somewhere in the functions, be it limited to founders, or an employee turned ad-hoc data analyst. A central team with dedicated professionals often came after a significant investment round.
Who is the first hire?
Data engineer was the dominant answer. All of the first hires had technical backgrounds, formal engineering, mathematical or computer science education and hands-on experience in tech. The caveat with a sole data engineer is that they need to be good at self-managing and able to produce the first analytics, which is not always the case. One trap the first engineering hire may fall into, depending on higher management support, is a lack of focus on change management.
When do you ramp up and with whom?
Ramp ups generally took place within the first two months. The people who waited for more than six months were generally data engineers who worked extremely long hours.
We found two main tactics regarding the second hire:
- Data engineer: The pro of this route was that infrastructure was built quickly, and former technical debt was often able to be managed to differing extents, depending on seniority of hires and headcount. The cons were that the focus of the team was on the technology of the pipeline, which seemed good at the time, but resulted in a technocratic focus. As we will discuss later on, overly technocratic teams can focus on technological experimentation rather than business value, which results in wasted resources and breeds technical debt. Another caveat can be lack of focus on political undertones, change management and building relationships with the functional leaders.
- Data analyst: The pro of this route was complementary skills and quicker time to value through quicker analytical output. The value to the organization is realized through the ability to create good use cases and KPIs, combined with stakeholder usage of analytical output in decision making. The cons were greater strain on a sole data engineer and slower infrastructural scale-up, unless easier (low-code / no-code) tools are used. Even when using easier tools, there will be rapidly paced questions placed on thin infrastructure, which might lead to ad-hoc, messy infrastructural solutions. The organization will see quicker analytical results at the expense of infrastructure architecture.
Every further role, whether filled by an analyst, a data engineer or someone in between, such as an analytics engineer, brings the above-mentioned trade-off. Generally people said to stay away from data science in the first year, mostly because the infrastructure, use case development and data maturity of companies in the initial phases of a greenfield project are still too nascent.
The respondents were split between hiring contributors or managers first in a specific data competency, but the tendency leaned towards senior contributors. Many companies who are budget limited make the mistake of hiring people who are good coders, but too junior. Hiring good coders who are too junior often results in a solution that works initially, but is not scalable – or even understandable – for future situations and team members.
What did the team look like at the end of the first year?
The average headcount was 6 people, although the answers ranged from 2 to 10.
How long did it take to get results?
First KPIs: 3 months, although the timeline varied from a few weeks to never based on team size, existing infrastructure and relationships with stakeholders.
Conceptualize infrastructure architecture: 3 months, although the answers varied based on team size, data and business model complexity.
Implement infrastructure architecture: 3-6 months to 1 year, depending on scope of architecture, team ramp up and existing assets, as well as technical debt inherited.
What were successful tactics to get results with limited resources?
Work horizontally and iterate
Working horizontally creates quicker time to first results. Horizontal means creating data pipelines and analytics for a specific problem or domain with high business impact, rather than expanding across the entire company at once.
Design pipeline units that can be refined at a later time, in order to speed up delivery of the first milestones. Depending on the data and business model complexity, the Transformation part of the ELT can be left for later. Also, challenge the adoption of trendy tools in favor of considering open source, or of making the decision at a later time. Too many new tools can be overwhelming in the beginning. Start lean.
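To illustrate a pipeline unit that defers the Transformation step, here is a minimal sketch in Python. It is an assumption for illustration only, not a method reported by the respondents: the table name `raw_events`, the `shop` source and the sample orders are hypothetical, and an in-memory SQLite database stands in for whatever warehouse a team actually uses. Records are loaded untouched, and a transformation is added later on top of the raw data.

```python
import json
import sqlite3


def load_raw(conn, source, records):
    """Extract-Load step: store each record untouched as a JSON string."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (source TEXT, payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_events (source, payload) VALUES (?, ?)",
        [(source, json.dumps(record)) for record in records],
    )
    conn.commit()


def revenue_by_country(conn):
    """Transformation added later: computed from the raw payloads on demand."""
    totals = {}
    query = "SELECT payload FROM raw_events WHERE source = 'shop'"
    for (payload,) in conn.execute(query):
        order = json.loads(payload)
        totals[order["country"]] = totals.get(order["country"], 0) + order["amount"]
    return totals


# Hypothetical sample data standing in for a real source system.
conn = sqlite3.connect(":memory:")
load_raw(conn, "shop", [
    {"country": "DE", "amount": 40},
    {"country": "RO", "amount": 25},
    {"country": "DE", "amount": 10},
])
print(revenue_by_country(conn))
```

Because the raw payloads are kept as they arrived, the first KPI can be refined or replaced later without re-extracting the data.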
Move data logic to DBT
The tool is flexible and SQL-based, so it is easier for analysts to use. Moving the data logic there improved query and dashboard performance. DBT also enables visualization of data lineage, which is very important and is a gap in many BI stacks.
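To make the idea of data lineage concrete, the following is a minimal Python sketch of a lineage graph. It does not use DBT or its API; it only models the underlying concept of a dependency graph between models. All model names (`raw_orders`, `fct_revenue` and so on) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each one maps to the upstream models it reads from.
LINEAGE = {
    "raw_orders": set(),
    "raw_customers": set(),
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "fct_revenue": {"stg_orders", "stg_customers"},
    "dash_revenue_kpi": {"fct_revenue"},
}


def build_order(lineage):
    """Return an order in which every model comes after its upstream models."""
    return list(TopologicalSorter(lineage).static_order())


def upstream(lineage, model):
    """Return all models that `model` depends on, directly or transitively."""
    seen = set()
    stack = [model]
    while stack:
        for parent in lineage.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(build_order(LINEAGE))
print(sorted(upstream(LINEAGE, "dash_revenue_kpi")))
```

Asking which sources feed a dashboard KPI then becomes a graph lookup rather than a search through undocumented queries.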
Low-code / No-code
The focus should be on the analytics. Teams often spend too much time on code and technical issues with the pipeline when the business value lies in the analytics. Low-code and no-code tools enable workable pipelines, minimize complexity and enable focus on analytics. You can also hire a broader spectrum of people if knowledge of code is no longer a hiring bottleneck. This choice allows for faster delivery of results and KPIs.
What were some first wins?
Getting the stakeholders to embrace you
Convincing stakeholders and getting their buy-in for data projects is an important first win, as data projects can be seen as a back-end nice-to-have. Performant and impactful dashboards were also cited as first wins, as was being able to show important metrics for specific stakeholders.
Auditing the data
While building the data stack, one may also perform a data quality audit. Are all values of a certain variable type documented? Do they exist at all? This is also helpful in defining the KPI coverage of the data and may lead to the discovery of bugs, QA work, the creation of a data dictionary and feature requests in case the desired KPIs are based on untracked data.
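The two audit questions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a complete audit: the `events` sample, the column names and the `DATA_DICTIONARY` of documented values are all hypothetical.

```python
from collections import Counter


def audit_column(rows, column, documented_values):
    """Summarize one column: missing values and values not in the data dictionary."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "column": column,
        "rows": len(rows),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "undocumented": sorted(set(counts) - set(documented_values)),
    }


# Hypothetical tracked events and the values the data dictionary documents.
events = [
    {"event_type": "signup", "plan": "free"},
    {"event_type": "purchase", "plan": "pro"},
    {"event_type": "purchse", "plan": None},
    {"event_type": "", "plan": "free"},
]
DATA_DICTIONARY = {
    "event_type": {"signup", "purchase"},
    "plan": {"free", "pro"},
}

report = [audit_column(events, col, vals) for col, vals in DATA_DICTIONARY.items()]
for line in report:
    print(line)
```

In this sample the audit surfaces one misspelled event type and one missing value per column, which is the kind of finding that turns into a bug report or a data dictionary entry.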
The major learnings
Stakeholders! Return on investment depends on this!
Stakeholder buy-in is important for continued success and investment. A top-down directive to become more data-driven is not enough. Department heads and middle management need to embrace the data team. The initial person and team members need to communicate progress, results and positive impact for the business. Moreover, they need to understand the decision-making culture and engage stakeholders enough to be able to capture their enthusiasm and feedback. Building a data function is a cross-functional project that depends on change management driven from the top down by company leadership. A successful data team does not work in isolated silos.
Focus on usability instead of technically interesting
Managers of greenfield teams should not be afraid to be directive about technology decisions. Often managers try to accommodate team members’ technical curiosities to retain talent. Accommodating technical curiosity can lead to overly complex data pipelines and a lack of scalability, and it breeds technical debt.
Don’t spend too much time on the pipeline. Build an MVP and iterate
Engineers like using complex tools in order to learn, and that makes scaling, hiring and similar tasks harder in the future. Plug-and-play tools are important for moving quickly and remaining competitive in the dynamism of the data world.
Sometimes you don’t get the tools you would like due to external factors.
Spend more time on documentation, governance and data quality issues
Avoid custom code. Custom code gets hidden in places such as dashboards and is often not documented. Tools such as DBT are useful for unwinding and understanding the business logic.
Garbage in means garbage out
Everything a data team does is built on trust in data. There is a point at which data is of such poor quality that gut feeling is actually better than trusting the data. Sad but true.
Did our veterans create technical debt?
Yes, they did…
A couple of the respondents said that the engineers found Spark appealing, but Spark skillsets are rare, which made hiring and upskilling difficult. They regretted starting with Spark instead of an easier alternative.
The budgets were tight, so some of the tool choices were suboptimal. For example, one respondent made a choice to use Redshift due to budget constraints and the ability to convince stakeholders.
Other respondents chose Postgres and BigQuery and did not govern.
Lack of governance, overly focusing on speed rather than quality
Custom code, bugs, hacks, inconsistency. All of these issues were mentioned multiple times and had to be cleaned up expensively later on, often by another data leader. It is important to take care of metadata, data quality, governance and documentation.
How much will the first year cost?
The average was 800K euros: 700K for personnel costs and 100K for tools. The lowest cost was 300K, with huge amounts of overtime. The largest cost was 1.1 million euros, for a company that scaled up its team after receiving a large investment.
Data functions are an artifact of their organization, thus there is no cookbook for how to successfully execute a greenfield project. With a limited budget and mediocre political clout, decisions will focus on trade offs rather than building a world-class organization. Building a data function that is strategic and represents a competitive advantage is a long-term investment whose return on investment will likely still remain negative in the first year, despite some early wins. How long it will take to eventually break even or even realize a positive ROI depends on the ability of the team to produce impactful use cases and KPIs while maintaining quality data and engaging stakeholders.