Enterprise data management has come a long way in the past few decades. The relational model was introduced in 1970, and organizations built themselves around relational/transactional databases, with SQL becoming the de facto standard for working with data. Over the last few years, we have seen a foundational shift in the movement of data to the cloud, and Snowflake (SNOW) and Databricks (DBX) have cracked a powerful cloud/data flywheel: as more data moves to the cloud, more tools are written for processing that data in the cloud; and as more tools are written for the cloud, more data moves to the cloud.
Today we are witnessing another foundational shift with Generative AI (Gen AI), which is yet again revolutionizing our relationship with data. While Gen AI’s primary impact has been on unstructured data, in this article we speculate on a) the role of Gen AI for structured and relational data and b) how the modern enterprise data landscape will evolve with Gen AI.
Gen AI use cases for structured data
With data becoming the new oil, enterprises have invested significantly in transactional as well as operational database systems. These are often the systems of record in an organization. Organizations also have a pile of unstructured data sources like emails, chats, and documents, which are often the systems of insight. Gen AI is going to play a vital role in bringing together the systems of record and systems of insight to augment the knowledge worker. Listed below are three use cases where we expect Gen AI to have its greatest impact in the world of structured enterprise data.
Humanizing interaction with structured data:
Thanks to ChatGPT, a bot no longer feels like a bot. LLMs have paved the way for powerful conversational interfaces. While SQL continues to be the de facto language among data practitioners, Gen AI can provide a chat-based interface that converts natural language into complex database queries. This will not displace business intelligence professionals, but it will make data more accessible and open it up to more stakeholders across the organization. It also makes building and adopting data products more accessible and efficient.
For example, cube.dev recently launched a GPT-powered interface that lets you interact with your semantic layer through Slack. Glean Chat is another example of an enterprise intelligent assistant powered by Gen AI. Unlike conventional enterprise search, Glean Chat understands the context and sequence of queries, facilitating more intuitive and efficient interactions.
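Under the hood, such interfaces largely come down to assembling a prompt that pairs the database schema with the user's question and asking an LLM for a query. A minimal sketch of that pattern follows; `call_llm` is a hypothetical stub standing in for whichever completion API a real product uses, and the `orders` schema is invented for illustration:

```python
# Sketch of a natural-language-to-SQL interface.
# The schema below is an invented example; real systems introspect
# the warehouse or a semantic layer to get this context.
SCHEMA = """
CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    customer   TEXT,
    amount     REAL,
    created_at DATE
);
"""

def build_sql_prompt(schema: str, question: str) -> str:
    """Assemble the prompt the LLM sees: schema context plus the question."""
    return (
        "You are a SQL assistant. Given this schema:\n"
        f"{schema}\n"
        "Write a single SQL query answering the question. Return only SQL.\n"
        f"Question: {question}"
    )

def call_llm(prompt: str) -> str:
    # Hypothetical stub: a real implementation would call an LLM API here.
    return "SELECT customer, SUM(amount) FROM orders GROUP BY customer;"

def ask(question: str) -> str:
    return call_llm(build_sql_prompt(SCHEMA, question))
```

In practice the generated SQL should be validated (for example, run with `EXPLAIN` against the target database) before execution, since LLM output is not guaranteed to be correct.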
LLMs and data engineering:
Just like GitHub Copilot, we expect data engineering copilots to emerge that increase the efficiency of data engineers. Generative AI is already being used to automate data cleaning and preparation, generate code for data pipelines, and create visualizations of data. For example, Gen AI can generate code for ETL tasks such as extracting data from sources, transforming the data, and loading it into a destination. It can also be used to test ETL pipelines, improve their reliability, and generate documentation that improves understanding and compliance of data pipelines. From a DataOps perspective, it can reduce misconfiguration errors in data warehouses by automatically generating schemas and by identifying and correcting data errors.
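To make the ETL pattern concrete, here is a hand-written sketch of the kind of small extract–transform–load flow with basic type validation that such a copilot might produce. The `RAW` CSV and the table layout are invented for the example:

```python
# Toy ETL flow: extract rows from CSV text, transform (type casting,
# dropping rows that fail validation), and load into SQLite.
import csv
import io
import sqlite3

RAW = """order_id,customer,amount
1,alice,19.99
2,bob,not_a_number
3,carol,5.00
"""

def extract(text):
    """Extract: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast types, silently dropping rows that fail validation."""
    clean = []
    for r in rows:
        try:
            clean.append((int(r["order_id"]), r["customer"], float(r["amount"])))
        except ValueError:
            continue  # e.g. bob's "not_a_number" amount is rejected
    return clean

def load(rows, conn):
    """Load: insert validated rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The value a copilot adds is less in any one of these functions than in wiring them together, generating the validation rules, and documenting them consistently across hundreds of pipelines.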
Prophecy, for example, recently launched a data copilot that converts natural language queries into suggested data pipelines that bring data together for desired reports. Another example is SnapLogic, which recently launched SnapGPT to simplify data transformation. As access to data expands, the criticality and complexity of that data make it harder to govern. Existing governance and data security solutions like Immuta will see heightened demand for their offerings.
As new sources of unstructured data become relevant to an organization, that data needs to be prepared in a way that LLMs can understand and leverage. It will also need to be stored as a system of record for governance and compliance purposes, creating the need for data pipelines for unstructured data. Unstructured.io is a good example of a company addressing this problem.
Augmenting data sets:
Data is AI and AI is data. Data augmentation is a critical step in enhancing the quality and diversity of data sets which is essential for improving the performance of machine learning models. Missing values in data sets are a common issue that data practitioners have to grapple with. While there are conventional methods to fill in these gaps, like using mean values or forward-fill methods, they often fall short in capturing the true complexity or relationships within the data. Gen AI has the ability to conduct more meaningful imputation, filling in missing data in a way that is statistically coherent with the rest of the dataset. This increases the reliability of the data for both analytics and further machine learning applications.
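To make the contrast concrete, the classical sketch below fills a missing value first with the column mean and then with a least-squares linear fit that respects the relationship between columns. Generative models push this same idea much further, but the principle of imputing consistently with the rest of the data is the same; the toy dataset is invented for illustration:

```python
# Naive mean imputation vs. model-based imputation.
# (x, y) pairs with None marking a missing y; here y is roughly 2x.
data = [(1, 2.0), (2, 4.1), (3, None), (5, 10.2), (6, 11.9)]

observed = [(x, y) for x, y in data if y is not None]

# Naive: fill with the mean of observed y, ignoring x entirely.
mean_y = sum(y for _, y in observed) / len(observed)

# Model-based: fit y = a*x + b by least squares on the observed pairs,
# then impute the missing y from its x.
n = len(observed)
sx = sum(x for x, _ in observed)
sy = sum(y for _, y in observed)
sxx = sum(x * x for x, _ in observed)
sxy = sum(x * y for x, y in observed)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

filled = [(x, y if y is not None else a * x + b) for x, y in data]
```

Here the mean fill puts roughly 7.05 at x = 3, while the fit respects the y ≈ 2x relationship and imputes a value near 6, which is what "statistically coherent with the rest of the dataset" means in miniature.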
Solutions like Ikigai, Hacarus, and Nomad Data are using sparse-data modeling techniques to build solutions for use cases where data is sparse. In addition, synthetic data startups are using Generative Adversarial Network (GAN) technology to augment existing data sets. You can read our deep dive on synthetic data and its impact here.
In conclusion, as Gen AI continues to integrate with structured data, it stands poised to transform enterprise data landscapes by making data more accessible, streamlining data operations, and enhancing the quality of datasets. These developments will undoubtedly play a pivotal role in shaping the future of data-driven decision-making in the corporate world.