Data is the lifeblood of machine learning. Without it, you can’t build anything related to AI. Yet many organizations are still struggling to get good, clean data to sustain their AI and machine learning initiatives, according to Appen's State of AI and Machine Learning report released this week.
Among the four stages of AI (data sourcing, data preparation, model training and deployment, and human-guided model evaluation), data sourcing consumes the most resources, costs the most, takes the longest, and is the most challenging, according to Appen's survey of 504 business leaders and technology experts.
On average, data sourcing consumes 34% of an organization’s AI budget, while data preparation and model testing and deployment each account for 24%, and model evaluation for 15%, according to Appen’s survey, which was conducted by The Harris Poll and included IT decision-makers, business leaders and managers, and technology practitioners from the United States, the United Kingdom, Ireland, and Germany.
In terms of time, data sourcing consumes approximately 26% of an organization’s time, while data preparation takes 24% and model testing, deployment, and model evaluation 23%. Finally, 42% of technologists consider data sourcing to be the most challenging stage of the AI lifecycle, ahead of model evaluation (41%), model testing and deployment (38%), and data preparation (34%).
According to technology experts, data sourcing is the biggest challenge facing artificial intelligence. But business leaders see things differently...
Despite the challenges, organizations are making it work. According to Appen, four-fifths (81%) of respondents said they are confident they have enough data to support their AI initiatives. Perhaps the key to this success: The vast majority (88%) are augmenting their data by using external AI training data providers such as Appen.
However, the accuracy of that data is questionable. Appen found that only 20% of respondents reported data accuracy of more than 80%, and just 6% said their data was 90% accurate or better. In other words, for more than 80% of organizations, at least one in every five data records contains an error.
With that in mind, it’s perhaps not surprising that nearly half (46%) of respondents agree that data accuracy is important “but we can fix it,” according to Appen’s survey. Only 2% said data accuracy is not a big need, while 51% agreed it is a critical need.
Appen CTO Wilson Pang’s view on the importance of data quality clearly does not match that of the roughly 48% of customers who stop short of calling data accuracy a critical need.
“Data accuracy is critical to the success of AI and ML models, as quality-rich data results in better model output and consistent processing and decision-making,” Pang said in the report. “To achieve good results, data sets must be accurate, comprehensive, and scalable.”
Over 90% of Appen respondents said they use pre-labeled data
The rise of deep learning and data-centric AI has shifted the driver of AI success from good data science and machine learning modeling to good data collection, management, and labeling. This is especially true for today’s transfer learning techniques, in which AI practitioners start from a large pre-trained language or computer vision model and retrain a small set of layers with their own data.
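To make that transfer-learning pattern concrete, here is a minimal sketch, not taken from the Appen report, using PyTorch and torchvision: a pre-trained ResNet-18 backbone is frozen and only a small, newly added classification head is retrained on the practitioner's own labeled data. The choice of ResNet-18 and the five-class head are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a vision model pre-trained on ImageNet (illustrative choice).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers so their weights stay fixed.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer with a new, trainable head sized for the
# downstream task (a hypothetical 5-class problem here).
num_classes = 5
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Training then proceeds as usual, but only the small head is learned,
# which is why far less labeled data is needed than training from scratch.
```

The point of the pattern, and the reason data quality dominates the conversation, is that the handful of labeled examples used to retrain the head carry outsized weight in the resulting model.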
Better data can also help keep unwanted bias from creeping into AI models and head off other undesirable AI outcomes. This is especially true for large language models, said Ilia Shifrin, senior director of AI at Appen.
“Companies face another challenge with the rise of large language models (LLMs) trained on multilingual web crawler data,” Shifrin said in the report. "These models often exhibit bad behavior due to the abundance of toxic language, as well as racial, gender, and religious biases in the training corpora."
Bias in web data raises some thorny issues. There are workarounds (changing training regimens, filtering training data and model output, and learning from human feedback and testing), but more research is needed to establish a good standard for "human-centered" LLM benchmarks and model evaluation methods, Shifrin said.
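As a purely illustrative sketch of one of those workarounds, filtering web-crawled training text before it reaches the model, the following Python snippet drops documents whose toxicity score exceeds a threshold. The toxicity_score() heuristic here is a placeholder assumption; in practice it would be a trained toxicity classifier or a moderation service, neither of which is named in the Appen report.

```python
from typing import Iterable, List

def toxicity_score(text: str) -> float:
    """Crude keyword heuristic standing in for a real toxicity classifier.

    Returns a score in [0, 1]; higher means more likely to be toxic.
    """
    blocklist = {"offensive_term_1", "offensive_term_2"}  # placeholder terms
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in blocklist)
    return min(1.0, 10.0 * hits / len(words))

def filter_corpus(documents: Iterable[str], threshold: float = 0.5) -> List[str]:
    """Keep only documents scoring below the toxicity threshold."""
    return [doc for doc in documents if toxicity_score(doc) < threshold]

# Example: screen a small web-crawled sample before it enters training data.
sample = [
    "a benign sentence about data quality",
    "offensive_term_1 offensive_term_1 offensive_term_1",
]
clean = filter_corpus(sample)  # only the benign document survives
```

The same kind of screening can be applied to model outputs as well, which is the other half of the "filtering training data and model output" workaround Shifrin mentions.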
Data management remains the biggest obstacle facing AI, according to Appen. The survey found that 41% of those involved in the AI lifecycle consider data management to be the biggest bottleneck. A lack of data ranked fourth, with 30% citing it as the biggest obstacle to AI success.
But there’s some good news: the time organizations spend managing and preparing data is trending downward. It sits at just over 47% this year, compared with 53% in last year's report, Appen said.
Data accuracy levels may not be as high as some organizations would like
“A majority of respondents use external data providers, and it can be inferred that, by outsourcing data sourcing and preparation, data scientists are saving the time required to properly manage, clean, and label their data,” the data labeling company said.
However, judging by the relatively high error rates in the data, organizations perhaps should not scale back their data sourcing and preparation processes, whether internal or external. There are many competing needs when it comes to building and maintaining AI processes; hiring qualified data professionals was another top need identified by Appen. But until significant progress is made in data management, organizations should keep pressing their teams to drive home the importance of data quality.
The survey also found that 93% of organizations strongly or somewhat agree that ethical AI should be the “foundation” of AI projects. Appen CEO Mark Brayan said it was a good start, but there was more work to be done. "The problem is that many people face the challenge of trying to build great AI with poor data sets, which creates a significant obstacle to achieving their goals," Brayan said in a press release.
In-house, custom-collected data still makes up the bulk of the data sets organizations use for AI, accounting for 38% to 42% of the data, according to Appen’s report. Synthetic data made a surprisingly strong showing, accounting for 24% to 38% of an organization's data, while pre-labeled data (typically sourced from a data service provider) accounted for 23% to 31%.
Synthetic data in particular has the potential to reduce the incidence of bias in sensitive AI projects, with 97% of Appen respondents saying they use synthetic data “when developing inclusive training datasets.”
Other interesting findings from the report include: