PBT Group: Why businesses must rethink their data lakes

While businesses are racing to implement Artificial Intelligence (AI), many overlook a critical factor in success: the quality and structure of the data feeding these models. The reality is that a model is only as good as the data it is trained on. This is according to Julian Thomas, Principal Consultant at PBT Group.

“If your data lake is unmanaged or full of unstructured, incomplete, insufficient, or unreliable data, even the most sophisticated AI will not deliver value,” he emphasises.

Thomas explains that too many organisations treat their data lakes as passive repositories, places to store everything, rather than as curated resources. This approach undermines governance, hinders usability, and creates downstream issues for data teams tasked with developing AI and machine learning solutions.

“To get AI right, we need to shift the mindset around data lakes. They should be active environments governed by frameworks like the Medallion architecture, which helps teams clean, refine, and enrich data in a structured, layered way.”

PBT Group often uses the Medallion architecture to bring structure to a data lake. It separates data into three layers: Bronze for raw, unfiltered data; Silver for data that has been cleaned and enriched and is more analytics-friendly; and Gold for the curated, trusted datasets that are fully governed and ready for use in Business Intelligence or machine learning. This progression helps teams work from a consistent base, trace where data comes from, and ensure that what is delivered matches the needs of the people using it.
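To make the three layers concrete, here is a minimal Python sketch of data moving from Bronze to Silver to Gold while recording which layer each dataset was derived from. It is an illustration only, not PBT Group's implementation: the `Layer` class, the `to_silver` and `to_gold` functions, and the sample customer records are all hypothetical, and a real data lake would use a storage and processing platform rather than in-memory lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layer:
    """One Medallion layer: its name, its rows, and the layer it came from."""
    name: str
    rows: list
    source: Optional[str] = None  # lineage: name of the upstream layer


def to_silver(bronze: Layer) -> Layer:
    # Clean: drop rows without a customer id, trim text, convert amounts to numbers.
    cleaned = [
        {"customer_id": r["customer_id"].strip(), "amount": float(r["amount"])}
        for r in bronze.rows
        if r.get("customer_id") and r["customer_id"].strip()
    ]
    return Layer("silver", cleaned, source=bronze.name)


def to_gold(silver: Layer) -> Layer:
    # Curate: aggregate to one row per customer, ready for reporting or modelling.
    totals = {}
    for r in silver.rows:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["amount"]
    rows = [{"customer_id": c, "total_amount": t} for c, t in sorted(totals.items())]
    return Layer("gold", rows, source=silver.name)


# Hypothetical raw records, kept exactly as they arrived (Bronze).
bronze = Layer("bronze", [
    {"customer_id": " C1 ", "amount": "10.50"},
    {"customer_id": "C2", "amount": "4"},
    {"customer_id": "", "amount": "99"},
    {"customer_id": "C1", "amount": "2.5"},
])
silver = to_silver(bronze)
gold = to_gold(silver)
```

Because each layer keeps a `source` field, a consumer of the Gold data can follow the chain back to the raw Bronze records, which is the traceability the layered approach is meant to provide.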

But a layered structure is only part of the solution. The real differentiator, according to Thomas, is data wrangling.

“Data wrangling is not just a technical clean-up. It is a deliberate, skilled process of transforming messy, inconsistent data into something reliable and fit for purpose. That includes everything from deduplication to validation and enrichment.”
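The three activities Thomas names (deduplication, validation, and enrichment) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `wrangle` function, the email pattern, and the `REGION_BY_COUNTRY` lookup table are hypothetical and far cruder than the rules a production team would apply.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical reference data used for enrichment.
REGION_BY_COUNTRY = {"ZA": "Africa", "DE": "Europe"}


def wrangle(records):
    """Return (clean, rejected) after validating, deduplicating and enriching."""
    seen = set()
    clean, rejected = [], []
    for rec in records:
        email = (rec.get("email") or "").strip().lower()

        # Validation: set aside records whose email is malformed.
        if not EMAIL_RE.match(email):
            rejected.append(rec)
            continue

        # Deduplication: keep only the first record per normalised email.
        if email in seen:
            continue
        seen.add(email)

        # Enrichment: add a region derived from the country code.
        country = (rec.get("country") or "").strip().upper()
        clean.append({
            "email": email,
            "country": country,
            "region": REGION_BY_COUNTRY.get(country, "Unknown"),
        })
    return clean, rejected


records = [
    {"email": "Thandi@Example.com", "country": "za"},
    {"email": "thandi@example.com ", "country": "ZA"},
    {"email": "not-an-email", "country": "DE"},
    {"email": "kurt@example.de", "country": "de"},
]
clean, rejected = wrangle(records)
```

Rejected records are returned rather than silently discarded, so that someone can inspect why they failed and decide whether the rule or the source data needs fixing.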

This approach is particularly important in industries like financial services, where it is essential to know exactly where your data comes from and how it has been handled. It is also crucial when training AI models, which depend on accurate historical data to perform reliably and fairly.

As part of the wider data wrangling process, Thomas emphasises that it is important to understand the main difference between data wrangling and the process of Extract, Transform and Load (ETL). “Data wrangling can be considered as ‘informal ETL’, done in the context of machine learning for a given initiative. ETL is effectively the same activity; however, it is automated for long-term use. Once data wrangling has been completed with the resulting training model approved for production implementation, the data wrangling solution must be handed over to a formal engineering team where it can be converted into formal ETL.”

Thomas also cautions against viewing data quality as a once-off project.

“Data governance must be embedded into daily operations. From ingestion to output, quality controls, validation steps, and metadata tracking need to be built into every phase.”
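One way to picture quality controls and metadata tracking being built into every phase is a small wrapper that runs each pipeline step, applies its checks, and writes an audit record. The Python sketch below is an assumption-laden illustration: `run_step`, the check functions, and the shape of the audit record are hypothetical names chosen for this example, not part of any framework mentioned in the article.

```python
from datetime import datetime, timezone


def run_step(name, func, rows, checks, audit_log):
    """Run one pipeline step, apply its quality checks, and record metadata."""
    out = func(rows)
    failed = [check.__name__ for check in checks if not check(out)]
    audit_log.append({
        "step": name,
        "rows_in": len(rows),
        "rows_out": len(out),
        "failed_checks": failed,
        "ran_at": datetime.now(timezone.utc).isoformat(),
    })
    if failed:
        # Stop the pipeline rather than pass bad data downstream.
        raise ValueError(f"{name} failed quality checks: {failed}")
    return out


# Hypothetical quality checks.
def no_missing_ids(rows):
    return all(r.get("id") is not None for r in rows)


def amounts_non_negative(rows):
    return all(r["amount"] >= 0 for r in rows)


# Hypothetical pipeline steps.
def ingest(rows):
    return [dict(r) for r in rows]


def drop_negative(rows):
    return [r for r in rows if r["amount"] >= 0]


audit_log = []
raw = [{"id": 1, "amount": 20.0}, {"id": 2, "amount": -5.0}]
staged = run_step("ingest", ingest, raw, [no_missing_ids], audit_log)
output = run_step(
    "output", drop_negative, staged,
    [no_missing_ids, amounts_non_negative], audit_log,
)
```

The audit record is written before the failure is raised, so even a run that stops on a failed check leaves a trace of which step failed and how many rows were involved.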

The payoff? A structured data lake combined with rigorous wrangling makes data more accessible and AI-ready. It enables teams to experiment with confidence, deliver faster iterations, and avoid the costly rework that comes from poor input data.

“As AI becomes more integrated into business decisions, the pressure on data teams will only increase. Getting the fundamentals right now, especially how we wrangle and structure our data, will determine who actually succeeds in turning AI into value.”
