I built a data warehouse in Postgres and I want to convince my boss that we should use it. Looking for a reality check.
Get your bingo cards ready, r/dataengineering. I'm about to confess to every data engineering sin and maybe invent a couple new ones. I'm a complete noob with no formal training, but I have enough dev knowledge to be a threat to myself and others around me. Let's jump into it.
I rolled my own data warehouse in a Postgres database. Why?
I was tasked with migrating our business to a new CRM and Accounting software. For privacy, I'll avoid naming them, but they are well-known and cloud-based. Long story short, I successfully migrated us from the old system that peaked in the late 90's and was on its last leg. Not because it was inherently bad. It just had to endure 3 generations of ad-hoc management and accrued major technical debt. So 3 years ago, this is where I came in. I learned how to hit the SQL back-end raw and quickly became the go-to guy for the whole company for anything data related.
Now these new systems don't have an endpoint for raw SQL. They have "reports". But they are awful. Any time you need to report on a complex relationship, you have to go through point-and-click hell. So I'm sitting here like wow. One of the biggest CRMs in the world can't even design a reporting system that lets you do what a handful of lines of sql can do. Meanwhile management is like "you're the data guy & there's no way this expensive software can't do this!" And I'm like "YEAH I THOUGHT THE SAME THING" I am baffled at the arbitrary limitations of the reporting in these systems and the rediculous learning curve.
To recap: We need complex joins, pivots and aggregations, but the cloud systems can't transform the data like that. I needed a real solution. Something that can make me efficient again. I need my SQL back.
So I built a Linux server and spun up Postgres. The plan was to find an automated way to load our data onto it. Luckily, working with APIs is not a tall order, so I wrote a small python script for each system that effectively mirrors all of the objects & fields in their raw form, then upserts the data to the database. It was working, but needed some refinement.
After some experimenting, I settled on a dumbed-down lake+warehouse model. I refined my code to only fetch newly created and modified data from the systems to respect API limits, and all of the raw data goes into the "data lake" db. The lake has a schema for each system to keep the raw data siloed. This alone is able to power some groundbreaking reports... or at least reports comparable to the good old days.
The data warehouse is structured to accommodate the various different reporting requirements from each department in our business. So I made each department their own schema. I then began to write a little library of python scripts that transforms and normalizes the data so that it is primed for quick and efficient reports to meet each department's needs. (I'm not done with them all, but I have good momentum, and it's proving to be really pleasant to work with. Especially with the PostgreSQL data connector from Excel PowerQuery.)
Now the trick is adoption. Reactions to this system were first met rather indifferently by my boss. But it seemed to have finally dawned on him (and he is 100% correct) that a homebrew database on the network LAN just feels kind of sketchy. But our LAN is secure. We're an IT company after all. And my PSQL DB has all the basic opsec locked down. I also store virtually nothing locally on my machine.
Another contention he raised was that just because I think it's a good solution, that doesn't mean my future replacement is going to think the same thing (early retirement?? 😁 (Anyone hiring??)). He's not telling me to tear it down per-se, but he wants me to move away from this "middleware".
His argument to me is that my "single source of truth" is a vulnerability and a major time sink that I have not convinced him of any future value. He suggested that for any custom or complex reports, I write a script that queries within the scope of that specific request. No database. Just a file that, idk, I guess I run it as needed or something.
I know this post is trailing off a bit. It's getting late.
My question to you all are as follows.
Is my approach worth continuing? My boss isn't the type to "forbid" things if it works for the human, but he will eventually choke out the initiative if I can't strongly justify what I'm doing.
What is your opinion of my implementation. What could I do to make it better?
There's a concern about company adoption. I've been trying to boil my system's architecture and process design down to a simple README so that anybody with a basic knowledge in data analytics and intermediate programming skills could pick this system right up and maintain it with no problems. -> Are there any "gold standard" templates for writing this kind of documentation?
I am of the opinion that we need a Warehouse because the reporting on the cloud systems are not built for intense data manipulation. Why the hell shouldn't I be able to use this tool? It saves me time and is easier to build automations on. If I'm not rocking in SQL, I'm gonna be rocking in PowerQuery so all this sensitive data ends up on a 2nd party system regardless!
What do you think?
Any advice is greatly appreciated! (Especially ideas on how to prove that a data warehouse system can absolutely be a sustainable option for the comoany.)