Part of me started this journey because data engineering is one of the hottest and highest-paying careers right now. I’m not going to pretend that wasn’t a factor.
But there’s more to it than that.
I’ve been learning data analytics for a while now. SQL, Power BI, Python (Pandas, NumPy, a little Polars), data cleaning, EDA. You name it, I’ve been in the weeds with it. And I genuinely enjoy it. But somewhere along the way, I started getting curious about what happens before the data lands on my desk. How does it move? Who builds these pipelines? What does the infrastructure behind all of this actually look like?
That curiosity planted a seed.
Then AI started making a lot of what I do faster and easier. Which is great. But it also made me think: if AI can handle the analysis, what’s my edge? What can I build and understand that goes deeper? I work as an IT System Analyst at a startup, and while I enjoy the work, I realized I wasn’t challenging myself the way I wanted to. I was ready for more.
The final push came from a video by Data With Baraa, where he laid out a complete data engineering roadmap. Something about seeing it structured and broken down made it feel real and doable. So here I am.
I’m learning data engineering in public. And this article is the beginning of that journey.
Also, just leaving a disclaimer that I’m not affiliated with Data With Baraa. I’m just sharing my personal journey. Hope it helps.
Why Data Engineering Specifically
I want to spend a moment here because I think this question deserves a real answer.
Data analytics taught me how to work with data after it arrives. Clean it, explore it, visualize it, draw insights from it. That skillset is genuinely valuable. But the more I learned, the more I kept bumping into the same wall. The data I was working with had already been shaped and moved by someone else. Someone had built the pipeline that brought it to me. Someone had decided how it was stored, how it was structured, how often it refreshed.
I wanted to be that person.
Data engineering sits upstream from analytics. It’s about building the systems that make analysis possible in the first place. Data pipelines, storage architecture, workflow orchestration, large-scale data processing. These are the foundations everything else is built on. And honestly, that kind of infrastructure work appeals to me in a way that pure analysis no longer does.
There’s also a practical argument. Data engineering roles consistently rank among the highest paying in the data industry. As AI tools get better at automating the analytical layer, the demand for people who can build and maintain reliable data infrastructure is only going to grow. I’d rather be building the pipes than just using them.
And one more thing. The startup I work at doesn’t use any of the tools I’m about to learn. Which means every hour I put into this is entirely self-directed. No team to learn from, no work projects to apply it on. Just me, the internet, and whatever I can build on my own. That’s a challenge I’m choosing on purpose.
Why I’m Doing This in Public
Writing about what I learn is something I already believe in deeply. It forces you to actually understand something before you explain it. It keeps you accountable. And over time, it builds something that a resume alone never could.
But I’ll be honest about my fears too, because I think that’s the point of doing this publicly.
I have shiny object syndrome. There, I said it. I’ve explored graphic design, animation, writing, marketing, and IT before landing in data. There’s always something new and exciting pulling my attention. Data engineering could easily get replaced by the next flashy thing in my feed if I’m not intentional about it.
Consistency is another one. I work a 9-5 where I barely touch the tools I’ll be learning. There’s no natural reinforcement at work, no colleague I can bounce Airflow questions off of. I’m building this entirely on my own time, outside of my job responsibilities.
And balance. Three to four hours a day is the goal. Some days that will feel easy. Other days it will feel impossible.
Publishing this journey is my accountability system. If I go quiet, you’ll know I slipped. And I’d rather not slip.
What I’m Starting With
I’m not starting from zero, which helps. I already have beginner to intermediate SQL knowledge from my data analytics work, basic Python fundamentals, and some hands-on experience with Pandas. That gives me a foundation to build on rather than rebuild from scratch.
Here’s the full learning stack, roughly in the order I’ll be tackling it.
1. SQL: Going Deeper Than Analytics
I know SQL. But analytics SQL and engineering SQL are different animals. I’ll be going deeper into query optimization, indexing, working with very large datasets, and writing SQL that’s built for performance rather than just exploration. If you’ve only ever used SQL to pull and filter data, there’s a whole other layer underneath worth understanding.
Why it’s first: Everything in data engineering eventually touches SQL. Getting sharp here before layering in more complex tools makes the rest of the journey easier.
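To make the indexing point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the index name are made up for illustration, and SQLite stands in for whatever database you actually use. It asks the database how it plans to run the same query before and after an index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


# Hypothetical table, filled with synthetic rows purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(50_000)],
)

sql = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"

# Without an index, the filter forces a scan of the whole table.
plan_before = query_plan(conn, sql, (42,))

# With an index on the filter column, SQLite can look up matching rows directly.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

total = conn.execute(sql, (42,)).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("total: ", total)
```

The query and its result are identical in both cases. Only the plan changes, from a full scan to an index search, and that difference is what starts to matter on very large tables. The exact wording of the plan text varies between SQLite versions.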
2. Python: From Exploratory to Production-Ready
I have the basics. Pandas, NumPy, some Polars. But the Python I’ve been writing lives mostly in notebooks. Exploratory, messy, not built to last. The goal now is to write cleaner, more structured, reusable code. Functions, modules, error handling, scripting. The kind of Python you’d actually put in a pipeline.
Why it matters: Python is the glue that holds most modern data engineering stacks together. Airflow pipelines are written in it. PySpark is Spark’s Python API. Getting comfortable here is non-negotiable.
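As a rough illustration of the gap between notebook code and pipeline code, here is a small sketch of a parsing step written as a reusable function with validation, logging, and explicit error handling. The Order type, the column names, and the logger name are hypothetical, chosen only for this example.

```python
import csv
import io
import logging
from dataclasses import dataclass

logger = logging.getLogger("pipeline.orders")  # hypothetical logger name


@dataclass(frozen=True)
class Order:
    order_id: int
    amount: float


def parse_orders(lines):
    """Parse CSV rows into Order objects, skipping rows that fail validation.

    Returns (orders, rejected) so the caller decides what to do with bad rows.
    """
    orders, rejected = [], []
    for row in csv.DictReader(lines):
        try:
            order = Order(order_id=int(row["order_id"]), amount=float(row["amount"]))
            if order.amount < 0:
                raise ValueError("negative amount")
        except (KeyError, ValueError, TypeError) as exc:
            # Record the problem and keep going instead of crashing the whole run.
            logger.warning("rejected row %r: %s", row, exc)
            rejected.append(row)
            continue
        orders.append(order)
    return orders, rejected


if __name__ == "__main__":
    sample = io.StringIO("order_id,amount\n1,19.99\n2,not-a-number\n3,-5\n4,7.5\n")
    good, bad = parse_orders(sample)
    print(len(good), "parsed,", len(bad), "rejected")
```

Returning the rejected rows alongside the good ones is one design choice among several. A real pipeline might instead write them to a quarantine table or fail the run once the rejection rate crosses a threshold.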
3. Git and GitHub: Version Control Done Properly
I’ll be honest. My Git knowledge is currently “copy the command, hope it works.” That has to change. Version control is fundamental to working like an engineer rather than just an analyst. I’ll be learning branching, pull requests, and how to manage code properly across projects.
Why it matters: Every project I build from here on goes on GitHub. It’s portfolio, it’s discipline, and it’s how real teams work.
4. Apache Spark and PySpark: Big Data Processing
This is where things get genuinely exciting. Apache Spark is one of the most widely used engines for processing large-scale data. PySpark is the Python API for it, which means I can use a language I’m already somewhat familiar with to work with distributed data at scale.
The jump from Pandas to Spark is a mindset shift. Pandas works on a single machine. Spark is built to run across clusters. Learning to think in that distributed way is one of the skills that separates data engineers from analysts.
Why it matters: If you want to work with big data in a production environment, Spark is almost unavoidable. It shows up in job descriptions constantly and is core to the Databricks ecosystem I’ll be building toward.
5. Apache Airflow: Orchestrating Data Pipelines
Data pipelines don’t run themselves. You need something to schedule them, monitor them, and handle failures gracefully. That’s where workflow orchestration tools come in, and Airflow is my pick.
I considered a few options here. Databricks Workflows is great if you’re already deep in the Databricks ecosystem. Azure Data Factory makes sense for Azure-heavy environments. But Airflow is free, open-source, cloud-agnostic, and widely used across the industry. It also teaches you the core concepts of orchestration in a way that transfers to other tools. Starting with Airflow felt like the right call, especially since I’m trying to keep costs low.
Why it matters: Orchestration is what turns a collection of scripts into an actual pipeline. Understanding Airflow is understanding how production data workflows are managed.
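The core idea behind orchestration can be sketched in plain Python. This is not Airflow's API, only a toy illustration built on the standard library's graphlib module (Python 3.9+). The run_pipeline function and the task names are hypothetical. It runs tasks in dependency order and retries a task that fails, which is a small slice of what a real orchestrator does.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying failures a limited number of times.

    tasks: dict of name -> zero-argument callable
    dependencies: dict of name -> set of names that must finish first
    Returns the task names in the order they completed.
    """
    completed = []
    # static_order() yields every task after all of its prerequisites.
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception as exc:
                print(f"{name} failed on attempt {attempt}: {exc}")
        else:
            # Every attempt failed, so stop before running downstream tasks.
            raise RuntimeError(f"{name} failed after {max_retries + 1} attempts")
    return completed


# Hypothetical three-step pipeline where the middle step fails once.
attempts = {"transform": 0}


def extract():
    pass


def transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise ValueError("temporary glitch")


def load():
    pass


order = run_pipeline(
    {"extract": extract, "transform": transform, "load": load},
    {"transform": {"extract"}, "load": {"transform"}},
)
print(order)
```

A tool like Airflow adds scheduling, monitoring, and much more on top of this, but the dependency graph and the retry policy are the concepts that carry over from tool to tool.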
6. Databricks: The Data Platform
At some point you have to pick a data platform and go deep on it. I’m going with Databricks. It’s built on top of Spark, it’s in high demand, and it has a free Community Edition that lets you practice without paying for cloud credits.
The alternatives are solid too. Snowflake is a clean, fast SQL warehouse that a lot of companies love. BigQuery is Google’s fully managed, serverless option and genuinely excellent if you’re leaning toward Google Cloud. But Databricks sits at the intersection of big data, machine learning, and data engineering in a way that matches where I want to go. It made the most sense for my goals.
Why it matters: Employers want you to have platform experience. Going deep on one is more valuable than knowing a little about all of them.
How I’m Structuring the 12 Months
The honest answer is that this might take longer than 12 months. And I’m okay with that. I’d rather take 15 months and really understand what I’m doing than rush through in 12 and come out shaky on the fundamentals.
The general approach is to move through each skill in order and not advance until I’ve built something with what I just learned. Tutorials are great for orientation but projects are where real learning happens. My plan is to document each phase here on Towards Data Science: the concepts, the projects, the frustrations, and the wins.
For tracking progress, I’m using the Notion roadmap from Data With Baraa as my backbone. It breaks down each skill into core topics and lets me track where I am without getting overwhelmed by the full picture all at once.
As for time commitment, three to four hours a day is the target. Some of that will be structured learning. Some will be building. Some will be writing about what I just learned, which is its own form of studying.
What Success Looks Like
Landing a high-paying data engineering role is the goal. That’s real and I’m not going to dress it up.
But alongside that, I want to become a credible voice in this space. Someone who builds things worth talking about, documents the journey without filtering out the hard parts, and maybe makes the path a little clearer for someone coming up behind me.
The writing and the learning feed each other. The portfolio becomes the proof. The proof builds the brand. That’s the vision.
Starting Today
This article is my official start date. I’m not waiting until I feel ready or until everything is perfectly planned. I’m starting now, writing as I go, and letting the process be public and a little messy.
If you’re somewhere on a similar path, whether you’re in analytics and curious about engineering, in IT wondering what’s next, or just someone trying to build skills that hold their value in an AI-accelerated world, follow along.
I think we’ll have a lot to talk about. I’ll also be sharing my learnings on my YouTube channel. So feel free to subscribe below and follow along.
This is the first article in an ongoing series documenting my data engineering journey. I’ll be publishing regularly on my progress, the projects I’m building, and everything I learn along the way.
And if you want to get access to the Notion template, in case you’re on the same journey as I am, you can access it here.
Follow along on my journey below.
