The discipline of artificial intelligence originated in 1956 but made little progress over the following half century, because computing power and data lagged far behind the algorithms. With the arrival of the Internet era around 2000, the limits on computing power were lifted, artificial intelligence gradually spread into every industry, and the era of large models began. High-quality data, however, now appears to be the final "bottleneck" in the development of artificial intelligence.
Huawei OceanStor Pacific won the "Best Innovation Award for AI Storage Base" at the recently held National High-Performance Computing Academic Annual Conference (CCF HPC China 2023).
The emergence of the concept of "AI storage capacity" reflects the steadily rising value of data for AI.
01
Data determines the intelligence level of artificial intelligence
The development of artificial intelligence is a process of continuously collecting and analyzing data. Data, as the carrier of information, is the basis on which artificial intelligence learns about and understands the world. General intelligence, the ultimate goal of artificial intelligence, is expected to learn, understand, reason, and solve problems autonomously, and data is the biggest driving force behind its development.
So does more data always make AI smarter? Given a large enough amount of data, can AI outperform human experts?
Take artificial intelligence systems in the medical field as an example. Many diagnostic cases do not have a single correct answer: each set of symptoms has a range of possible causes with varying probabilities, and AI-assisted decision-making helps clinicians narrow down those causes until a diagnosis is reached. In this setting, medical artificial intelligence depends less on large amounts of data than on accurate, high-quality data. Only then can it ensure that the true cause is not missed during "screening".
This example illustrates how much data quality matters to the intelligence of AI.
The artificial intelligence industry has long agreed on the principle of "garbage in, garbage out": without high-quality input data, no algorithm, however advanced, and no amount of computing power can produce high-quality results.
We are now on the cusp of the large-model era, and large AI models are springing up like mushrooms after rain. A number of Chinese large models, such as Huawei's Pangu, iFlytek's Spark, and Zidong Taichu, are developing rapidly and aim to build cross-industry, general-purpose artificial intelligence platforms that power the digital transformation of every industry.
According to the "China Artificial Intelligence Large Model Map Research Report" released at the end of May by the New Generation Artificial Intelligence Development Research Center of the Ministry of Science and Technology of China, 79 large models with more than one billion parameters have been released in China. Although a "Battle of 100 Models" has taken shape, it has also prompted deeper reflection on how large models should develop.
The expressive power of a model trained on small-scale data is limited by the size of that data. Such a model can only perform coarse-grained simulation and prediction, which is not good enough where accuracy requirements are high. Improving the accuracy of the model further requires training it on massive amounts of data.
This means that the amount of data also determines how intelligent AI can become. Both the quality and the quantity of data are therefore key areas of focus when building "AI storage capacity".
02
Challenges facing data in the era of big data
As artificial intelligence develops towards large models and multi-modality, enterprises face many challenges when developing or deploying large model applications.
First, the data preprocessing cycle is very long. Because the data is scattered across different data centers, applications, and systems, collection is slow, and preprocessing 100 TB of data takes about 10 days. System utilization needs to improve from this very first stage.
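To put the preprocessing figure in perspective, the short calculation below converts "100 TB in about 10 days" into an average throughput. It is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes decimal units (1 TB = 10^12 bytes) and continuous processing over the whole period.

```python
def required_throughput_mb_per_s(data_tb, days):
    """Average throughput in MB/s for processing data_tb terabytes in `days` days.

    Assumes decimal units: 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
    """
    total_bytes = data_tb * 10**12
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 10**6


# Figures quoted in the article: 100 TB preprocessed in about 10 days.
rate = required_throughput_mb_per_s(100, 10)
print(f"{rate:.1f} MB/s")  # prints "115.7 MB/s"
```

An end-to-end average of roughly 116 MB/s is consistent with the article's point that slow collection from scattered systems, rather than the sheer volume of data, is what stretches the cycle.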
Second, training sets load inefficiently. Large models keep growing, with parameter counts reaching hundreds of billions or even trillions, and training them requires large amounts of computing resources and storage space. Multi-modal large models, for example, use massive collections of text and images as training sets, but loading huge numbers of small files is currently slow, which makes training set loading inefficient.
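Why do many small files load slowly? A simple cost model makes the effect visible: every file pays a fixed per-file cost (metadata lookup, open, close) in addition to the time needed to transfer its bytes. The sketch below is an illustration only. The function name, the 1 ms per-file overhead, and the 1 GB/s bandwidth are assumed values, not measurements from the article.

```python
def estimate_load_seconds(num_files, avg_file_bytes, per_file_overhead_s, bandwidth_bytes_per_s):
    """Estimate sequential load time as per-file overhead plus raw transfer time."""
    overhead = num_files * per_file_overhead_s
    transfer = num_files * avg_file_bytes / bandwidth_bytes_per_s
    return overhead + transfer


# The same 100 GB of training data, stored in two different ways.
# Assumed: 1 ms fixed overhead per file, 1 GB/s sustained bandwidth.
many_small = estimate_load_seconds(1_000_000, 100_000, 0.001, 1e9)   # 1M files of 100 KB
few_large = estimate_load_seconds(100, 1_000_000_000, 0.001, 1e9)    # 100 files of 1 GB

print(f"many small files: {many_small:.0f} s, few large files: {few_large:.0f} s")
```

Under these assumptions the small-file layout takes roughly ten times longer, and almost all of the extra time is per-file overhead rather than data transfer. That is why small-file IOPS, and not only bandwidth, matters for training set loading.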
Third, large model parameters are tuned frequently and training platforms are unstable, with a training interruption occurring on average every two days. Resuming training relies on a checkpoint mechanism, and recovery from a failure takes more than one day, which seriously threatens business continuity.
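The checkpoint mechanism mentioned above can be sketched in a few lines. This is a minimal, framework-independent illustration and not the implementation of any particular training platform: the state is a small dictionary standing in for model parameters, and the function names and the `fail_at_step` parameter are hypothetical. The checkpoint is written to a temporary file and then renamed, so an interruption during the write never leaves a half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at_step=None):
    """Toy training loop; 'weight' stands in for the model parameters."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at_step is not None and state["step"] == fail_at_step:
            raise RuntimeError("simulated training interruption")
        state["weight"] += 0.5  # stand-in for one optimizer update
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `train(path, 10, 3, fail_at_step=7)` fails after the checkpoint at step 6 has been written; a second call `train(path, 10, 3)` resumes from step 6 instead of step 0. With real models the checkpoint can be very large, so the time to write and read it back is bounded by storage performance, which is one reason recovery time is a storage concern.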
Succeeding in the era of large AI models requires attention to both the quality and the quantity of data, together with a large-capacity, high-performance storage infrastructure. This has become a decisive factor.
03
The key to the AI era lies in the storage base
As big data, artificial intelligence, and other technologies combine with high-performance computing, high-performance data analysis (HPDA) has become a new way of realizing the value of data. By drawing on more historical data, multiple kinds of heterogeneous computing power, and a wider range of analysis methods, HPDA can improve the accuracy of analysis. This marks a new, intelligent stage for scientific research, in which artificial intelligence will accelerate the application of cutting-edge results.
Today, a new paradigm based on "data-intensive science" is emerging in scientific research. This paradigm focuses on combining big data knowledge mining with artificial intelligence training and inference to obtain new knowledge and discoveries through calculation and analysis. It also means that the requirements for the underlying data infrastructure will change fundamentally. Both high-performance computing and the future development of artificial intelligence need advanced storage infrastructure to meet these data challenges.
Solving data challenges has to start with innovation in data storage. As the proverb goes, whoever tied the bell must be the one to untie it.
The AI storage base is built on OceanStor Pacific distributed storage and follows an AI Native design concept to meet storage needs at every stage of AI. AI systems challenge storage on all fronts, including accelerating data computing, managing stored data, and moving data efficiently between storage and computing. Combining "large-capacity storage and high-performance storage" ensures consistent scheduling and coordination of storage resources, so that every stage runs efficiently and the value of the AI system is fully released.
How does OceanStor Pacific distributed storage demonstrate its core capabilities?
First, its technical architecture is unique in the industry. The storage system supports unlimited horizontal expansion and handles mixed loads, efficiently serving both the IOPS of small files and the high-bandwidth reading and writing of large files. It provides intelligent tiered data flow between the performance layer and the capacity layer, and supports full-process AI data management, covering the collection, preprocessing, training, and inference of massive data. It also offers the same data analysis capabilities for HPC and big data.
Second, storage innovation delivers industry-leading efficiency gains. The first innovation is data weaving: the GFS global file system accesses raw data scattered across different regions and provides a unified global data view and scheduling across systems, regions, and multiple clouds, which simplifies data collection. The second is near-storage computing: computing power embedded in the storage preprocesses data close to where it is stored, which reduces the transmission of invalid data and the time the preprocessing server spends waiting, and so significantly improves preprocessing efficiency.
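The idea of preprocessing data close to where it is stored can be illustrated with a toy model. The sketch below is not OceanStor Pacific's interface: the `StorageNode` class, its methods, and the validity rule are all hypothetical, chosen only to show how filtering on the storage side reduces the bytes sent to the preprocessing server.

```python
class StorageNode:
    """Hypothetical storage node that counts bytes sent to the preprocessing server."""

    def __init__(self, records):
        self.records = list(records)
        self.bytes_sent = 0

    def read_all(self):
        """Conventional path: ship every record; the server filters afterwards."""
        self.bytes_sent += sum(len(r) for r in self.records)
        return list(self.records)

    def read_filtered(self, keep):
        """Near-data path: apply the filter on the node and ship only what passes."""
        kept = [r for r in self.records if keep(r)]
        self.bytes_sent += sum(len(r) for r in kept)
        return kept


def is_valid(record):
    # Assumed validity rule for this sketch: non-empty after stripping whitespace.
    return bool(record.strip())


records = [b"sample-1", b"   ", b"sample-2", b"", b"sample-3"]

conventional = StorageNode(records)
server_side = [r for r in conventional.read_all() if is_valid(r)]

near_data = StorageNode(records)
node_side = near_data.read_filtered(is_valid)

print(conventional.bytes_sent, near_data.bytes_sent)  # prints "27 24"
```

Both paths yield the same valid records, but the near-data path never transmits the invalid ones. The saving is small in this toy data set and grows with the share of invalid data in the raw input.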
In fact, the "Battle of 100 Models" is not a "sign" of the development of large AI models. In the future, every industry will use the capabilities of large AI models to deepen its digital transformation, and the construction of data infrastructure will accelerate accordingly. With its innovative technical architecture and high efficiency, OceanStor Pacific distributed storage has proven itself to be the industry's first choice.
Data has become a new factor of production alongside land, labor, capital, and technology, and many traditional definitions and operating models of the past digital market will be rewritten. Only with the necessary capabilities in place beforehand can the era of data-driven large AI models progress steadily.