Data and AI leaders have been working feverishly on generative AI (gen AI) use cases for more than a year. Their experience has provided promising glimpses of the considerable value at stake in gen AI but has also exposed a variety of challenges in getting to scale. Managing data remains one of the main barriers to value creation from gen AI. In fact, 70 percent of top performers in a recent McKinsey survey said they have experienced difficulties integrating data into AI models, citing issues that range from data quality and data-governance processes to a lack of sufficient training data.
In our experience, organizations have been held back by a still-maturing understanding of both how to evolve data capabilities to support gen AI use cases at scale and how to use gen AI to improve data practices. This article covers three actions that data and AI leaders can consider to help them move from gen AI pilots to scaling data solutions. The first focuses on how organizations can strengthen the quality and readiness of their data for gen AI use cases. The second looks at how organizations can use gen AI to build better data products with their modernized data platforms. The third explores key data-management considerations that enable reuse and accelerate the development of data solutions.
It starts at the source: Improve your data
While data quality has long been an important concern for data and AI leaders, the risks and costs of feeding poor data into gen AI models cannot be overstated; they range from poor outcomes and costly fixes to cyber breaches and a loss of user trust in the outputs. The 2024 McKinsey survey cited above found, in fact, that 63 percent of respondents (seven percentage points more than in the 2023 survey) said that output inaccuracy was the greatest risk they saw in their organizations' use of gen AI.
Traditional methods of ensuring data quality aren’t enough; leaders should consider the following ways of improving and expanding their source data.
Obtain better and more-accurate source data from complex data types
Organizations are struggling to handle the increased complexity of unstructured data sets. For example, banks might want to look at both structured financial information, such as transaction history, and unstructured sources, such as financial statements and market analyses, to determine the creditworthiness of a corporate client. But processing combinations of structured and unstructured data often increases the chance of errors because, while internal teams and subject-matter experts have the relevant knowledge, they generally struggle to codify that knowledge so that data pipeline processes can be easily replicated.
Tools have evolved to handle the relationship between different types and sources of data. For example, knowledge graphs can help capture complex relationships between entities, providing meaningful context for large language models (LLMs) and their downstream data sets. These kinds of capabilities make it easier to accurately map data points from unstructured to structured data.
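To make the idea concrete, the Python sketch below builds a small knowledge graph with the networkx library and serializes an entity's relationships into plain-text facts for an LLM prompt. The entities, relations, and prompt wording are illustrative assumptions, not a reference implementation.

```python
# Illustrative sketch: a small knowledge graph supplying context to an LLM.
# Entities, relations, and the prompt format are assumptions for illustration.
import networkx as nx

graph = nx.DiGraph()
graph.add_edge("AcmeCorp", "Q4 10-K filing", relation="filed")
graph.add_edge("AcmeCorp", "Credit facility 1142", relation="holds")
graph.add_edge("Q4 10-K filing", "Revenue table", relation="contains")

def context_for(entity: str) -> str:
    """Serialize the entity's outgoing relationships as plain-text facts."""
    facts = []
    for neighbor in graph.successors(entity):
        relation = graph.edges[entity, neighbor]["relation"]
        facts.append(f"{entity} --{relation}--> {neighbor}")
    return "\n".join(facts)

# The serialized facts give the model structured context alongside raw text.
prompt = (
    "Assess the creditworthiness of AcmeCorp. Known relationships:\n"
    + context_for("AcmeCorp")
)
```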
Even when data engineers understand the relationship between data sets, they still need to apply different interpretation methods based on attributes such as the data format (PDF, PowerPoint, Word, or image files, for example). This challenge grows as companies integrate increasingly complex formats into their systems. Multimodal models are now sophisticated enough to parse more complex types of documents that combine disparate data formats, such as extracting tabular data from unstructured documents.
While these models are becoming easier to use, they can still make mistakes (and, in some cases, are expensive). Accuracy issues require constant review, which is often still manual. Some data engineers, for example, spend a lot of time checking two screens of an integrated development environment to observe the differences between outputs. As concurrent use cases increase, this manual approach quickly hits its limits. Data leaders need to focus resources on implementing automated evaluation methods, mechanisms to manage versioning, and data-relevancy scoring to enhance multimodal model output accuracy and consistency.
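One way to move beyond side-by-side manual review is an automated evaluation gate on labeled samples. The Python sketch below is a minimal version of the idea; the accuracy threshold, version tag, and scoring heuristics are assumptions an organization would replace with its own benchmarks.

```python
# Minimal sketch of an automated evaluation gate for extraction outputs.
# Thresholds, version tags, and scoring heuristics are illustrative assumptions.
from difflib import SequenceMatcher

def cell_accuracy(extracted: list[list[str]], expected: list[list[str]]) -> float:
    """Share of cells matching a hand-labeled reference table exactly."""
    total = sum(len(row) for row in expected)
    correct = sum(
        x == e
        for row_x, row_e in zip(extracted, expected)
        for x, e in zip(row_x, row_e)
    )
    return correct / total if total else 0.0

def relevancy_score(output_text: str, source_text: str) -> float:
    """Crude relevancy heuristic: string similarity between output and source."""
    return SequenceMatcher(None, output_text, source_text).ratio()

ACCURACY_THRESHOLD = 0.95  # assumed bar; set from business requirements

def gate(model_version: str, samples: list[tuple]) -> None:
    """Block a parser version from the pipeline if it fails labeled samples."""
    scores = [cell_accuracy(out, ref) for out, ref in samples]
    if min(scores) < ACCURACY_THRESHOLD:
        raise ValueError(f"{model_version} failed evaluation gate: {scores}")
```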
An investment firm knew it needed to improve its data access and usage to implement a virtual assistant. In order to use product information from structured and unstructured data sources, it had to build data pipelines for parsing and processing unstructured data, identify which version of each document was most recent, and adapt the length of articles for mobile users. The firm’s data engineers used multimodal model capabilities to parse tabular data from documents into structured data and build a medallion architecture (a popular design pattern for organizing data that supports modular pipeline development). Additionally, they introduced versioning and relevancy scores to improve output accuracy. As a result, the company was able to quickly start work on use cases, such as due-diligence activities, with a production-grade gen AI environment within two weeks.
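For readers unfamiliar with the pattern, the PySpark sketch below shows what a minimal medallion layout with document versioning and relevancy scoring might look like. The storage paths, Delta format, and column names are illustrative assumptions, not the firm's actual implementation.

```python
# Illustrative medallion layout: bronze (raw), silver (deduplicated/validated),
# gold (consumption-ready). Paths, Delta format, and columns are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land parsed document output as-is, with ingestion metadata.
bronze = (spark.read.json("s3://bucket/parsed_docs/")  # hypothetical source
          .withColumn("ingested_at", F.current_timestamp()))
bronze.write.mode("append").format("delta").save("s3://bucket/bronze/docs")

# Silver: keep only the most recent version of each document.
latest = Window.partitionBy("doc_id").orderBy(F.col("doc_version").desc())
silver = (spark.read.format("delta").load("s3://bucket/bronze/docs")
          .withColumn("rn", F.row_number().over(latest))
          .filter("rn = 1").drop("rn"))
silver.write.mode("overwrite").format("delta").save("s3://bucket/silver/docs")

# Gold: filter by relevancy score so downstream gen AI use cases see clean data.
gold = silver.filter(F.col("relevancy_score") >= 0.8)
gold.write.mode("overwrite").format("delta").save("s3://bucket/gold/product_info")
```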
Create data when they aren’t available
Some gen AI use cases are difficult to pursue because the required data are difficult to obtain and process, which is often an issue in healthcare, life sciences, and other sectors with stringent data security regulations. In some cases, a data engineer can overcome these challenges by manually generating a file to test the efficacy of a use case, but that process is time-consuming and inefficient.
Instead, data and AI leaders are investing in gen AI tools to generate synthetic data as test data or to produce new values based completely on the column descriptions and context of the table, allowing them to either create a new data set or make revisions to an existing one. Some companies have already used synthetic data generators to create statistically similar data sets.
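A lightweight version of this approach is to prompt a model with nothing but the column descriptions and ask for synthetic rows. In the Python sketch below, call_llm stands in for whatever model client an organization uses, and the schema is a hypothetical healthcare example.

```python
# Sketch: synthesizing test rows from column descriptions alone.
# `call_llm` is a placeholder for the organization's model client;
# the schema is a hypothetical healthcare example.
import json

COLUMNS = {
    "patient_id": "opaque identifier, format P-#####",
    "admission_date": "ISO 8601 date within the last two years",
    "diagnosis_code": "plausible ICD-10 code",
    "length_of_stay_days": "integer between 1 and 60",
}

def synthesis_prompt(columns: dict[str, str], n_rows: int) -> str:
    spec = "\n".join(f"- {name}: {desc}" for name, desc in columns.items())
    return (
        f"Generate {n_rows} rows of entirely synthetic test data as a JSON "
        f"array of objects with exactly these fields:\n{spec}\n"
        "Do not reproduce any real individual's information."
    )

def synthesize(call_llm, n_rows: int = 50) -> list[dict]:
    """Parse the model's JSON reply into rows ready for pipeline testing."""
    return json.loads(call_llm(synthesis_prompt(COLUMNS, n_rows)))
```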
Use gen AI to accelerate the building of reusable data products
Data products, such as a 360-degree view of individual customers, are the cornerstone of how companies use data to generate value at scale for the business. But such data products can be difficult and time-consuming to develop. With better data and new gen AI tools, however, companies are finding they can accelerate development and improve outputs. For example, one hospitality company expedited the creation of customer domain data models by up to 60 percent while increasing productivity in feature engineering by 50 percent. It was able to hit those marks by focusing on automatically generating both end-to-end data transformation pipelines in PySpark and robust documentation of all the complex transformations that occurred.
Shift to end-to-end creation of data products
Until recently, available technology limited the creation of data pipelines (such as a medallion architecture) to a laborious step-by-step approach. While using gen AI to perform tasks such as generating an individual table from natural language may make data engineers more efficient, engineers still must complete a series of other upstream and downstream steps, such as combining all the tables.
Data and AI leaders instead are starting to take an end-to-end approach to building data pipelines by automating all the steps, achieving, in some cases, time savings of 80 to 90 percent and enhanced scalability for specific use cases.
Writing the data pipeline code to generate data products has traditionally been one of the most time-consuming tasks for data engineers. We are now seeing the automated creation of data pipelines, written in languages such as SQL or Python, to create entire models that can solve for multiple use cases at once. Rather than a modest scope of work, such as generating an individual table from a natural-language prompt, teams can now generate dozens of tables as a cohesive target data model capable of providing solutions to multiple use cases.
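The shift is easiest to see in code. In the hedged Python sketch below, a single request asks a model for a cohesive multi-table target model rather than one table at a time; call_llm and the staging schema stg are assumptions, and generated SQL is staged for review rather than executed directly.

```python
# Sketch: requesting a cohesive multi-table target data model in one pass.
# `call_llm` and the `stg` staging schema are assumptions; generated SQL is
# staged for human review and CI tests, never executed directly.
MODEL_REQUEST = """
From the business description below, propose a target data model of related
tables (for example: customers, accounts, transactions). For each table return:
1. a CREATE TABLE statement with keys and constraints, and
2. an INSERT ... SELECT transformation from the staging schema `stg`.
Return one commented SQL script.
"""

def generate_data_model(call_llm, business_description: str) -> str:
    sql_script = call_llm(MODEL_REQUEST + "\n" + business_description)
    # Guardrail: write to a review location instead of running the DDL.
    with open("generated_model.sql", "w") as f:
        f.write(sql_script)
    return sql_script
```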
Before an organization can begin generating these types of capabilities, however, it needs to ensure it has trustworthy, easily understandable, and available data. For companies that have been building their data estate for many years, an important element of this process is understanding their legacy code bases and existing data. Many companies struggle, however, because of poor data lineage or cataloging, leading to a limited understanding of how their data are generated. In response, some companies are employing a variety of agents (gen AI applications) across multiple LLMs to analyze legacy code bases and generate natural-language text descriptions. This approach not only improves the organization’s understanding of its code base but also facilitates the creation of data catalog features, streamlining the identification and removal of redundant code segments.
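A simple form of this agent pattern is a loop that walks the legacy code base and asks a model to describe each job for the catalog. In the sketch below, call_llm, the directory layout, and the catalog format are illustrative assumptions.

```python
# Sketch: generating plain-language catalog descriptions for legacy jobs.
# `call_llm`, the directory layout, and the catalog format are assumptions.
from pathlib import Path

def describe_legacy_jobs(call_llm, root: str = "legacy_etl/") -> dict[str, str]:
    """Map each legacy SQL script to a description for the data catalog."""
    catalog = {}
    for script in Path(root).rglob("*.sql"):
        source = script.read_text()
        catalog[str(script)] = call_llm(
            "In two sentences, describe what data this job reads, how it "
            "transforms the data, and what it writes:\n\n" + source
        )
    return catalog
```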
Enhance consistency with better orchestration and data management
Developing gen AI applications requires a level of orchestration and modularization that enables easy reuse of specific capabilities. Traditional continuous integration/continuous delivery (CI/CD) methods are often not up to the task: they cannot maintain the necessary consistency across gen AI programs once gen AI–specific activities, such as prompt engineering, enter the workflow.
In response, some data and AI leaders are using agent-based frameworks, a structure that facilitates collaboration and coordination among multiple gen AI agents. These frameworks orchestrate gen AI agents and the complexities involved with scaling their use (and reuse). Agent-based frameworks are equipped with reasoning, code execution, tool usage, and planning abilities as well as enhanced workflow management. They can help address limitations associated with LLMs, such as process-management challenges, cross-verification errors, and end-to-end workflow design constraints. By incorporating these agents into a gen AI architecture, organizations can better manage complex tasks and improve overall performance, reliability, value, and user satisfaction. Some companies are employing agent-based frameworks in consumer-facing chatbots or enterprise knowledge retrieval systems.
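Stripped to its essentials, the planner-worker-verifier loop behind these frameworks can be sketched in a few lines of Python. Everything here, including call_llm and the PASS/FAIL convention, is an illustrative assumption; production frameworks add tool use, memory, and retries on top of this loop.

```python
# Sketch of agent-style orchestration: a planner decomposes the task, a worker
# executes each step, and a verifier cross-checks before results are combined.
# `call_llm` and the PASS/FAIL convention are illustrative assumptions.
def run_task(call_llm, task: str) -> str:
    plan = call_llm(f"Break this data task into short numbered steps: {task}")
    results = []
    for step in filter(str.strip, plan.splitlines()):
        output = call_llm(f"Execute this step and return the result: {step}")
        verdict = call_llm(
            "Does this output correctly complete the step? Answer PASS or "
            f"FAIL with a reason.\nStep: {step}\nOutput: {output}"
        )
        if verdict.startswith("FAIL"):  # one bounded retry on failure
            output = call_llm(f"Redo the step, fixing this issue: {verdict}")
        results.append(output)
    return call_llm("Combine these step results into a final answer:\n"
                    + "\n".join(results))
```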
To better manage their data products, many companies are turning to a range of tools. Some work with off-the-shelf tools, though these often struggle with complex scenarios, such as automatically generating insights from unstructured data. Organizations that use gen AI–augmented data catalogs can enable real-time metadata tagging, including automatically generating metadata from structured and unstructured content and creating smart tags. This improves data discovery and helps teams select the appropriate structured and unstructured data for gen AI models.
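As a sketch of what gen AI–augmented tagging can look like, the Python below asks a model to summarize and tag a sample of each new asset against a controlled vocabulary; call_llm, the vocabulary, and the JSON contract are illustrative assumptions.

```python
# Sketch: auto-tagging assets as they land so the catalog stays current.
# `call_llm`, the tag vocabulary, and the JSON contract are assumptions.
import json

TAG_VOCABULARY = ["pii", "financial", "customer", "contract", "marketing"]

def smart_tags(call_llm, asset_name: str, sample_content: str) -> dict:
    reply = call_llm(
        f"Given this sample from asset '{asset_name}', return JSON with keys "
        f"'summary' (one sentence) and 'tags' (subset of {TAG_VOCABULARY}):\n"
        + sample_content[:2000]
    )
    metadata = json.loads(reply)
    # Keep only tags from the controlled vocabulary to avoid tag sprawl.
    metadata["tags"] = [t for t in metadata.get("tags", []) if t in TAG_VOCABULARY]
    return metadata
```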
Migrate and modernize data products
Before beginning the process of using gen AI capabilities, such as code translation, to migrate data products and their underlying pipelines from one platform to another, companies need to first determine the right LLM for the job. While many organizations use LLMs supplied by their cloud service provider, certain LLMs may be trained more proficiently on one set of coding languages than on others. For example, one LLM may be better suited to write PySpark code for pipelines, while another is more efficient at Terraform for developing infrastructure as code. Organizations can use these LLMs to facilitate smoother migration to platforms that use PySpark or SQL, though, in some cases, depending on the coding language or framework, fine-tuning a model may still be necessary.
By understanding which LLMs to use for given coding languages, and how to automate code translation across languages, companies can better migrate pipelines from mainframes and legacy managed services already in the cloud to more-modern cloud resources. Identifying the appropriate LLM, however, may require additional testing time, which data and AI leaders should account for in their project road maps.
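In practice this often takes the form of a routing table maintained from internal benchmark runs, as in the Python sketch below; the model identifiers and language pairings are placeholders, since each organization's testing will produce its own mapping.

```python
# Sketch: routing translation jobs to the model that benchmarked best per
# target language. Model identifiers and pairings are placeholders; each
# organization's own testing produces the real mapping.
ROUTING_TABLE = {
    "pyspark":   "model-a",   # hypothetical: strongest on PySpark pipelines
    "sql":       "model-a",
    "terraform": "model-b",   # hypothetical: strongest on infrastructure code
}

def translate_code(call_model, source: str, source_lang: str, target_lang: str) -> str:
    model = ROUTING_TABLE.get(target_lang.lower())
    if model is None:
        raise ValueError(f"No benchmarked model for {target_lang}; test one first")
    return call_model(
        model,
        f"Translate this {source_lang} job to idiomatic {target_lang}, "
        f"preserving behavior:\n\n{source}",
    )
```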
Scale gen AI with security and coding standards
Data and AI leaders face big challenges in managing and governing the rapidly expanding use of unstructured data. The proliferation of gen AI models and applications not only introduces risks but also hampers getting to scale because teams often end up using different—and sometimes conflicting—tools and approaches.
By protecting data at every stage of the development process and automating the integration of coding best practices, companies can mitigate risk as well as enforce standards to scale their gen AI solutions.
Protect data at each step
Unstructured data such as PDFs, video, and audio files hold a wealth of information for gen AI models, but they create significant security issues and require strong data-protection controls. Traditional access controls, however, may not suffice. Unstructured data, for example, must be converted into a format that a gen AI application can analyze to understand the context and to then generate metadata that help determine access rights to the data.
To mitigate security risks, some data and AI leaders are designing modularized pipelines capable of automatically securing data. For example, extracting a revenue table whose notes span multiple pages of a PDF requires applying traditional role-based access control to the extracted content, including redacting related sentences in the surrounding text. Because gen AI outputs are still often inconsistent, data and AI leaders should carefully build consistent, secure access controls and guardrails at each checkpoint in the data pipeline, from ingestion to vectorization to retrieval-augmented generation (RAG) to consumption by gen AI models.
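The Python sketch below illustrates one such checkpoint: role-based redaction applied to document chunks before they are vectorized for RAG. The role model and the keyword-based sensitivity check are deliberately simplistic assumptions; real deployments would use a proper classifier and repeat the gate at retrieval and consumption.

```python
# Sketch: role-based redaction at the pre-vectorization checkpoint of a RAG
# pipeline. The roles and keyword-based sensitivity check are simplistic
# assumptions; production systems would use a real classifier and repeat the
# gate at retrieval and consumption.
PRIVILEGED_ROLES = {"finance_analyst", "auditor"}

def is_sensitive(chunk: str) -> bool:
    """Placeholder check; in practice a model or rules engine tags chunks."""
    return "revenue" in chunk.lower()

def redact_for_role(chunks: list[str], role: str) -> list[str]:
    """Mask sensitive chunks for roles without access before embedding them."""
    if role in PRIVILEGED_ROLES:
        return chunks
    return ["[REDACTED]" if is_sensitive(c) else c for c in chunks]

# Applied at ingestion; the same gate runs again when chunks are retrieved.
safe = redact_for_role(["Revenue rose 12% ...", "Office locations ..."], "marketing")
```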
Integrate coding best practices into gen AI outputs
A key feature of scale is ensuring consistent adherence to approved standards and best practices when engineering data. This can be an issue with code sourced directly from LLMs, whose quality may fall short because, for example, it lacks organizational context or does not fit the standard frameworks an organization uses. To help overcome these issues and improve data quality, some organizations are integrating coding best practices into all their gen AI–generated code.
Another approach is to use gen AI to analyze column values, determine appropriate rules for data quality based on existing rules, and then seamlessly integrate them into the pipeline generation process. Companies generally have a common set of data quality rules for data products, often with only slight changes across use cases. Organizations that define what those rules are—with the correct parameters for adjustments to different situations—can develop gen AI solutions that allow them to automatically add the rules to their pipelines.
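The sketch below shows the mechanics under stated assumptions: a model proposes rules from a column profile, and a small runner attaches them to the pipeline as executable checks. call_llm, the profile fields, and the rule schema are all illustrative.

```python
# Sketch: generating data quality rules from a column profile and running them
# as pipeline checks. `call_llm`, the profile fields, and the rule schema
# ({'column', 'check', optional 'min'/'max'}) are illustrative assumptions.
import json

def propose_rules(call_llm, column_profile: dict) -> list[dict]:
    """Ask the model for rules using the checks 'not_null' or 'range'."""
    return json.loads(call_llm(
        "Given this column profile, return a JSON list of data quality rules "
        "with fields 'column', 'check' ('not_null' or 'range'), and, for "
        "'range', numeric 'min' and 'max':\n" + json.dumps(column_profile)
    ))

def apply_rules(rows: list[dict], rules: list[dict]) -> list[str]:
    """Run generated rules; return readable violations for the pipeline log."""
    violations = []
    for rule in rules:
        col, check = rule["column"], rule["check"]
        for i, row in enumerate(rows):
            value = row.get(col)
            if check == "not_null" and value is None:
                violations.append(f"row {i}: {col} is null")
            elif check == "range" and value is not None \
                    and not rule["min"] <= value <= rule["max"]:
                violations.append(f"row {i}: {col}={value} out of range")
    return violations
```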
Gen AI tools are available to accelerate the development of data products and data platforms and improve their performance. But to use them effectively, companies will have to address a broad range of technical challenges. Focusing on orchestration capabilities, automating data-development programs, and improving usability will allow data and AI leaders to help their organizations move from gen AI pilots to scaling solutions that drive real value.