


Data sourcing is still the main bottleneck for artificial intelligence
According to Appen’s “State of Artificial Intelligence and Machine Learning” report released this week, organizations are still struggling to obtain good, clean data to sustain their artificial intelligence and machine learning programs.
Appen surveyed 504 business leaders and technology experts about the four stages of the AI life cycle: data sourcing; data preparation; model training and deployment; and human-led model evaluation. Of these, data sourcing consumes the most resources, takes the longest, and is the most challenging.
According to the survey, data sourcing consumes an average of 34% of an organization’s AI budget, with data preparation and model testing and deployment each accounting for 24%, and model evaluation accounting for 15%. The survey was conducted by Harris Poll and included IT decision-makers, business leaders and managers, and technology practitioners from the United States, United Kingdom, Ireland and Germany.
In terms of time, data sourcing takes approximately 26%, data preparation 24%, and model testing and deployment and model evaluation 23% each. Finally, 42% of technologists consider data sourcing the most challenging stage of the AI life cycle, followed by model evaluation (41%), model testing and deployment (38%), and data preparation (34%).
Despite the challenges, organizations are finding ways to make it work. According to Appen, about four-fifths (81%) of respondents said they have enough data to support their AI initiatives. The key to that success may be this: the vast majority (88%) of companies augment their data by using external AI training data providers such as Appen.
However, the accuracy of the data is still open to question. Appen found that only 20% of respondents reported data accuracy of more than 80%. Only 6% (roughly one in 20 people) said their data was 90% accurate or better.
Against that backdrop, nearly half (46%) of respondents consider data accuracy important, according to Appen’s survey. Only 2% say data accuracy is not a big need, while 51% say it is a critical need.
Appen’s Chief Technology Officer Wilson Pang appears to take a different view of data quality than the 48% of his customers who do not regard it as critical.
“Data accuracy is critical to the success of AI and ML models, as quality-rich data yields better model output and consistent processing and decision-making,” the report said. “For good results, data sets must be accurate, comprehensive, and scalable.”
The rise of deep learning and data-centric artificial intelligence has shifted the driver of AI success from good data science and machine learning modeling to good data collection, management and labeling. This is especially true of today's transfer learning techniques, in which AI practitioners take a large pre-trained language or computer vision model and retrain a small part of it on their own data.
Better data can also help prevent unwanted bias from seeping into AI models and prevent the bad outcomes that AI can produce. This is especially true for large language models.
The report says: “With the rise of large language models (LLMs) trained on multilingual web-scraped data, enterprises are facing another challenge. Because the training corpora are full of toxic language and racial, gender, and religious bias, these models often exhibit undesirable behavior.”
Bias in web data raises thorny issues. There are some workarounds (changing training regimens, filtering training data and model outputs, and learning from human feedback and testing), but more research is needed to create a good standard for “human-centered LLM” benchmarks and model evaluation methods.
Appen said data management remains the biggest obstacle facing artificial intelligence. The survey found that 41% of respondents consider data management the biggest bottleneck in the AI life cycle. In fourth place is a lack of data, which 30% of respondents cited as the biggest obstacle to AI success.
But there’s some good news: the share of time enterprises spend managing and preparing data is falling. This year's figure was just over 47%, compared with 53% in last year's report, Appen said.
“Since the majority of respondents use external data providers, it can be inferred that by outsourcing data sourcing and preparation, data scientists are saving time required to properly manage, clean, and label their data,” the data-labeling company said.
However, judging by the relatively high error rates in the data, perhaps organizations should not scale back their data sourcing and preparation processes (whether internal or external). There are many competing needs when it comes to building and maintaining AI processes; the need to hire qualified data professionals was another top need identified by Appen. Still, until significant progress is made in data management, organizations should keep pressing their teams to prioritize data quality.
The survey also found that 93% of organizations strongly or somewhat agree that AI ethics should be the “foundation” of AI projects. Appen CEO Mark Brayan said it was a good start but there was still much work to be done. “The problem is that many people are facing the challenge of trying to build great AI with poor data sets, which creates huge obstacles to achieving their goals,” Brayan said in a press release.
According to Appen’s report, custom-collected data within enterprises remains the primary data set used for AI, accounting for 38% to 42% of data. Synthetic data showed surprisingly strong performance, accounting for 24% to 38% of an organization's data, while pre-labeled data (usually from data service providers) accounted for 23% to 31% of the data.
In particular, synthetic data has the potential to reduce the incidence of bias in sensitive AI projects, with 97% of Appen’s survey participants saying they used synthetic data in “developing inclusive training datasets.”
Other interesting findings from the report include:
- 77% of organizations retrain their models monthly or quarterly; (Interpretation from the forefront of the AI era: Artificial intelligence is not a one-time solution. It keeps improving as application needs change and must be updated continually.)
- 55% of American companies claim that they are ahead of their competitors, while in Europe this proportion is 44%; (Interpretation from the forefront of the AI era: Europeans are slightly more low-key than Americans.)
- 42% of organizations reported that artificial intelligence was “widely” rolled out, compared with 51% in the “2021 State of Artificial Intelligence Report”; (Interpretation from the forefront of the AI era: Artificial intelligence applications are becoming more and more widespread.)
- 7% of organizations reported that their artificial intelligence budget exceeded US$5 million, compared with 9% last year. (Interpretation from the forefront of the AI era: On the one hand, this may reflect the gradual maturing of artificial intelligence, which reduces costs; on the other, it shows that artificial intelligence is no longer a “luxury product” and is gradually becoming a “must-have” for enterprises.)