
Data is choking AI. Here’s how to break free.


May 5, 2023

This article is part of a VB Lab Insights series on AI sponsored by Microsoft and Nvidia. Don’t miss additional articles in this series providing new industry insights, trends and analysis on how AI is transforming organizations. Find them all here.

AI is a voracious, data-hungry beast. Unfortunately, problems with that data — quality, quantity, velocity, availability and integration with production systems — continue to persist as a major obstacle to successful enterprise implementation of the technology.

The requirements are easy to understand, notoriously hard to execute: Deliver usable, high-quality inputs for AI applications and capabilities to the right place in a dependable, secure and timely (often real-time) way. Nearly a decade after the challenge became apparent, many enterprises continue to struggle with AI data: Too much, too little, too dirty, too slow and siloed from production systems. The result is a landscape of widespread bottlenecks in training, inference and wider deployment, and most seriously, poor ROI.

According to the latest industry studies, data-related issues underlie the low and stagnant rate of success (around 54%, according to Gartner) in moving enterprise AI proofs of concept (POCs) and pilots into production. Data issues are often behind related problems with regulatory compliance, privacy, scalability and cost overruns. These can have a chilling effect on AI initiatives, just as many organizations are counting on technology and business groups to quickly deliver meaningful business and competitive benefits from AI.

The key: Data availability and AI infrastructure

Given the rising expectations of CEOs and boards for double-digit gains in efficiencies and revenue from these initiatives, freeing data’s chokehold on AI expansion and industrialization must become a strategic priority for enterprises.

But how? The success of all types of AI depends heavily on availability: the ability to access usable and timely data. That, in turn, depends on an AI infrastructure that can supply data and easily enable integration with production IT. Emphasizing data availability and fast, smooth meshing with enterprise systems will help organizations deliver more dependable, more useful AI applications and capabilities.

To see why this approach makes sense, before turning to solutions let’s look briefly at the data problems strangling AI, and the negative consequences that result.

Data is central to AI success — and failure

Many factors can torpedo or stall the success of AI development and expansion: lack of executive support and funding, poorly chosen projects, security and regulatory risks and staffing challenges, especially with data scientists. Yet in numerous reports over the last seven years, data-related problems remain at or near the top of AI challenges in every industry and geography. Unfortunately, the struggles continue.

A major new study by Deloitte, for example, found that 44% of global firms surveyed faced major challenges both in obtaining data and inputs for model training and in integrating AI with organizational IT systems (see chart below).

50% Managing AI-related risks
50% Executive commitment
46% Integrating AI into daily operations and workflows
42% Implementing AI technologies
50% Maintaining or ongoing support after initial launch
44% Integrating with other organizational/business systems
40% Proving business value
44% Training to support adoption
44% AI solutions were too complex or difficult for end users to adopt
44% Obtain needed data or input to train model
42% Alignment between AI developers and the business need/problem/mission
42% Identifying the use cases with the greatest business value
41% Technical skills
38% Choosing the right AI technologies
38% Funding for AI technology and solutions
Credit: Deloitte

The seriousness and centrality of the problem is obvious. Data is both the raw fuel (input) and refined product (output) of AI. To be successful and useful, AI needs a reliable, available, high-quality source of data. Unfortunately, an array of obstacles plagues many enterprises.

Lack of data quality and observability. GIGO (garbage in, garbage out) has been identified as a problem since the dawn of computing. The impact of this truism is amplified in AI, which is only as good as the inputs used to train and run its algorithms. One measure of the current impact: Gartner estimated in 2021 that poor data quality costs the typical organization an average of $12.9 million a year, a loss that’s almost certainly higher today.

Data observability refers to the ability to understand the health of data and related systems across data, storage, compute and processing pipelines. It’s crucial for ensuring data quality and reliable flow for AI data that’s ingested, transformed or pushed downstream. Specialized tools can provide an end-to-end view needed to identify, fix and otherwise optimize problems with quality, infrastructure and processing. The task, however, becomes much more challenging with today’s larger and more complex AI models, which can be fed by hundreds of multi-layered data sources, both internal and external, and interconnected data pipelines.

Nearly 90% of respondents in the Gartner study say they have or plan to invest in data observability and other quality solutions. At the moment, both remain a big part of AI’s data problem.
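To make the idea of observability concrete, here is a minimal sketch of the kind of health check such tools automate for each batch of data flowing through a pipeline: volume, completeness and freshness. It is an illustration only, not any vendor's product. The record layout, the `updated_at` field and the function name are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class BatchReport:
    row_count: int     # volume: how many records arrived
    null_rate: float   # completeness: share of records missing a required field
    stale: bool        # freshness: True if the newest record is too old


def check_batch(rows, required_fields, max_age, now=None):
    """Summarize the health of one batch of records (dicts).

    `updated_at` is a hypothetical timestamp field assumed for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    total = len(rows)
    # A record counts as incomplete if any required field is absent or None.
    missing = sum(
        1 for r in rows if any(r.get(f) is None for f in required_fields)
    )
    null_rate = missing / total if total else 1.0
    # Freshness is judged by the most recent timestamp in the batch.
    timestamps = [r["updated_at"] for r in rows if r.get("updated_at") is not None]
    newest = max(timestamps, default=None)
    stale = newest is None or (now - newest) > max_age
    return BatchReport(total, null_rate, stale)
```

In practice, a report like this would be compared against thresholds (for example, alert if the null rate exceeds a few percent or the batch is stale) before the data is allowed downstream.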

Poor data governance. The ability to effectively manage the availability, usability, integrity and security of data used throughout the AI lifecycle is an important but under-recognized aspect of success. Failure to adhere to policies, procedures and guidelines that help ensure proper data management — crucial for safeguarding the integrity and authenticity of data sets — makes it much more difficult to align AI with corporate goals. It also opens the door to compliance, regulatory and security problems such as data corruption and poisoning, which can produce false or harmful AI outputs.

Lack of data availability. Accessing data for building and testing AI models is emerging as perhaps the most important data challenge to AI success. Recent studies by the McKinsey Global Institute and U.S. Government Accountability Office (GAO) both highlight the issue as a top obstacle for broader expansion and adoption of AI.

A study of enterprise AI published in the MIT Sloan Management Review entitled “The Data Problem Stalling AI” concludes: “Although many people focus on the accuracy and completeness of data, … the degree to which it is accessible by machines — one of the dimensions of data quality — appears to be a bigger challenge in taking AI out of the lab and into the business.”

Strategies for data success in AI

To help avoid these and other data-based showstoppers, enterprise business and technology leaders should consider two strategies:

Think about big-picture data availability from the start. Many accessibility problems result from how AI is typically developed in organizations today. Specifically, end-to-end availability and data delivery are seldom built into the process. Instead, at each step, different groups have varying requirements for data. Rarely does anyone look at the big picture of how data will be delivered and used in production systems. In most organizations, that means the problem gets kicked down the road to the IT department, where late-in-the-process fixes can be more costly and slow.

Focus on AI infrastructure that integrates data and models with production IT systems. The second crucial part of the accessibility/availability challenge involves delivering quality data in a timely fashion to the models and systems where it will be processed and used. An article in the Harvard Business Review, “The Dumb Reason Your AI Project Will Fail”, puts it this way:

“It’s very hard to integrate AI models into a company’s overall technology architecture. Doing so requires properly embedding the new technology into the larger IT systems and infrastructure — a top-notch AI won’t do you any good if you can’t connect it to your existing systems.”

The authors go on to conclude: “You want a setting in which software and hardware can work seamlessly together, so a business can rely on it to run its daily real-time commercial operations… Putting well-considered processing and storage architectures in place can overcome throughput and latency issues.”

A cloud-based infrastructure optimized for AI provides a foundation for unifying development and deployment across the enterprise. Whether deployed on-premises or in a cloud-based data center, a “purpose-built” environment also helps with a crucial related function: enabling faster data access with less data movement.

As a key first step, McKinsey recommends shifting part of spend on R&D and pilots towards building infrastructure that will allow you to mass produce and scale your AI projects. The consultancy also advises adoption of MLOps and ongoing monitoring of data models being used.

Balanced, accelerated infrastructure feeds the AI data beast

As enterprises deepen their embrace of AI and other data-driven, high-performance computing, it’s critical to ensure that performance and value are not starved by underperforming processing, storage and networking. Here are key considerations to keep in mind.

Compute. When developing and deploying AI, it’s crucial to look at computational requirements for the entire data lifecycle: starting with data prep and processing (getting the data ready for AI training), then during AI model building, training, and inference. Selecting the right compute infrastructure (or platform) for the end-to-end lifecycle and optimizing for performance has a direct impact on the TCO and hence ROI for AI projects.

End-to-end data science workflows on GPUs can be up to 50x faster than on CPUs. To keep GPUs busy, data must be moved into processor memory as quickly as possible. Depending on the workload, optimizing an application to run on a GPU, with I/O accelerated in and out of memory, helps achieve top speeds and maximize processor utilization.

Since data loading and analytics account for a huge part of AI inference and training processing time, optimization here can yield 90% reductions in data movement time. For example, because many data processing tasks are parallel, it’s wise to use GPU acceleration for Apache Spark data processing queries. Just as a GPU can accelerate deep learning workloads in AI, speeding up extract, transform and load pipelines can produce dramatic improvements here.
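As a rough illustration of what GPU acceleration for Spark queries can look like in practice, the sketch below builds the settings typically used to enable the RAPIDS Accelerator for Apache Spark. The article does not name a specific plugin, so treating RAPIDS as the accelerator is an assumption, as are the helper names and the default values. The plugin jar must already be on the cluster's classpath, and `build_session` needs `pyspark` installed.

```python
def gpu_spark_conf(gpus_per_executor=1, tasks_per_gpu=4):
    """Return Spark settings that turn on GPU-accelerated SQL/DataFrame work.

    Assumes the RAPIDS Accelerator plugin is available to the cluster;
    the default values here are illustrative, not tuned recommendations.
    """
    return {
        # Load the RAPIDS Accelerator SQL plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Allow supported query operators to run on the GPU.
        "spark.rapids.sql.enabled": "true",
        # Standard Spark 3 resource scheduling: GPUs per executor...
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # ...and the fraction of a GPU each task claims.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def build_session(app_name, conf):
    """Create a SparkSession with the given settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the sketch loads without pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session built this way, existing DataFrame and SQL code runs unchanged; operators the plugin supports are placed on the GPU and the rest fall back to the CPU.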

Storage. Storage I/O (input/output) performance is crucial for AI workflows, especially in the data acquisition, preprocessing and model training phases. How quickly data can be read from varied sources and transferred to storage media further differentiates performance. Storage throughput is critical to keep GPUs from waiting on I/O. Be aware that AI training (time-consuming) and inference (I/O-heavy and latency-sensitive) have different requirements for processing and storage access behavior. For most enterprises, local NVMe + Blob is the best, most cost-effective choice here. Consider Azure Managed Lustre and Azure NetApp Files if there’s not enough local NVMe SSD capacity or if the AI needs a high-performance shared filesystem. Choose Azure NetApp Files over Azure Managed Lustre if the I/O pattern requires a very low-latency shared file system.
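A back-of-the-envelope calculation shows why storage throughput can leave GPUs waiting. The sketch below estimates the sustained read rate a training job needs and the share of time GPUs can stay busy if storage delivers less. The function names and the sample figures in the usage note are hypothetical, and the model ignores caching, compression and prefetching.

```python
def required_read_gb_per_s(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Sustained read throughput (GB/s) needed to keep every GPU fed."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9


def io_bound_utilization(available_gb_per_s, required_gb_per_s):
    """Upper bound on GPU utilization when storage is the only bottleneck."""
    if required_gb_per_s <= 0:
        return 1.0
    return min(1.0, available_gb_per_s / required_gb_per_s)
```

For instance, under the assumed figures of 8 GPUs each consuming 2,000 samples per second at 150 KB per sample, `required_read_gb_per_s(8, 2000, 150_000)` works out to 2.4 GB/s. A storage tier that sustains only half of that would cap GPU utilization at about 50%, however fast the processors are.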

Networking. Another high-impact area for optimizing data accessibility and movement is the critical link and transit path between storage and compute. Traffic clogs here are disastrous. High-bandwidth, low-latency networking such as InfiniBand is crucial to enabling training at scale. It’s especially important for deep learning with large language models (LLMs), where performance is often limited by network communication.

When harnessing multiple GPU-accelerated servers to cooperate on large AI workloads, communication patterns between GPUs can be categorized as point-to-point or collective. Many point-to-point communications may happen simultaneously across a system between senders and receivers, so it helps if data can travel fast on a “superhighway” and avoid congestion. Collective communications, generally speaking, are patterns in which a group of processes participates, such as a broadcast or a reduction operation. High-volume collective operations are common in AI algorithms, which means that intelligent communication software must repeatedly deliver data to many GPUs during a collective operation, taking the fastest, shortest path and minimizing bandwidth use. That’s the job of communication acceleration libraries such as NCCL (NVIDIA Collective Communications Library), which is used extensively in deep learning frameworks for efficient neural network training.
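To show what a reduction-style collective does, the sketch below simulates a ring all-reduce in plain Python: every "rank" (standing in for a GPU) ends up with the element-wise sum of all ranks' data, while each rank only ever sends to its immediate neighbor. The ring scheme is one well-known approach chosen for illustration; this is not a description of how NCCL is implemented internally, and the simplification that each rank holds exactly one value per rank is an assumption of the example.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring of len(buffers) ranks.

    buffers[r] is rank r's data, split into exactly n chunks (one number
    per chunk here, for simplicity), where n is the number of ranks.
    Returns the per-rank buffers after the collective completes.
    """
    n = len(buffers)
    data = [list(b) for b in buffers]  # copy so inputs are not mutated

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the fully
    # summed chunk with index (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so every rank "sends" simultaneously.
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, idx, value in sends:
            data[(r + 1) % n][idx] += value

    # Phase 2, all-gather: each finished chunk is passed around the ring
    # until every rank has a copy of every summed chunk.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, idx, value in sends:
            data[(r + 1) % n][idx] = value

    return data
```

Each rank transmits only one chunk per step rather than its whole buffer, which is why ring-style schemes use link bandwidth efficiently as the number of GPUs grows.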

High-bandwidth networking optimizes the network infrastructure to allow multi-node communications in one hop or less. And since many data analysis algorithms use collective operations, in-network computing can double the network bandwidth efficiency. Having a high-speed network adapter per GPU allows AI workloads (think large, data-dependent models like recommender engines) to scale efficiently and lets GPUs work cooperatively.

Adjacent technologies. Beyond setting up a strong foundational infrastructure to support the end-to-end lifecycle of putting data to use with AI, regulated industries like healthcare and finance face another barrier to accelerating adoption. The data they require to train AI/ML models are often sensitive and subject to a rapidly evolving set of protection and privacy laws (GDPR, HIPAA, CCPA, etc.). Confidential computing secures in-use data and AI/ML models during computations. This ability to protect against unauthorized access helps ensure regulatory compliance and unlocks a host of cloud-based AI use cases previously deemed too risky.

To address the challenge of data volume and quality, synthetic data, generated by simulations or algorithms, can save time and reduce the costs of creating and training accurate AI models requiring carefully labeled and diverse datasets.
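As a toy illustration of algorithmically generated synthetic data, the sketch below produces a labeled, reproducible dataset by sampling points around two class centers. Real synthetic-data pipelines (simulators, generative models) are far more elaborate; the function name, class centers and parameters here are assumptions made for the example.

```python
import random


def make_synthetic_points(n_per_class, seed=0, spread=1.0):
    """Generate labeled 2D points: (x, y, label) with label 0 or 1.

    Each class is drawn from a Gaussian around its own (hypothetical)
    center, so labels come for free with no manual annotation.
    """
    rng = random.Random(seed)  # seeded for reproducible datasets
    centers = {0: (-2.0, 0.0), 1: (2.0, 0.0)}
    samples = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            x = rng.gauss(cx, spread)
            y = rng.gauss(cy, spread)
            samples.append((x, y, label))
    rng.shuffle(samples)  # avoid handing the model one class at a time
    return samples
```

Because the generator controls both the inputs and the labels, class balance and dataset size become parameters to tune rather than constraints imposed by what could be collected.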

Bottom line

Data-related problems remain a dangerous AI killer. By focusing on data accessibility and integration through AI-optimized cloud infrastructure and accelerated, full-stack hardware and software, enterprises can increase their success rate in developing and deploying applications and capabilities that deliver business value faster and more surely. To this end, investing in research and development to define and test scalable infrastructure is key to scaling a data-dependent AI project into profitable production.

Learn more about AI-first infrastructure at Make AI Your Reality.

VB Lab Insights content is created in collaboration with a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
