


Detailed explanation of Qingyun Technology's launch of AI computing power products and services to address computing power challenges
At the Qingyun Technology AI Computing Power Conference, Miao Hui, product manager, introduced in detail the Qingyun AI computing power scheduling platform and Qingyun AI computing power cloud services. The following is the full text of the speech:
Artificial intelligence users face computing power challenges
With the explosion of the artificial intelligence industry, AIGC, large models, scientific research computing, enterprise-level big data and artificial intelligence have put forward higher demands on computing power centers. Especially in the face of data centers with a single computing power, it is no longer able to meet the growing demand for computing power in all walks of life. Therefore, more intelligent computing centers, supercomputing centers and general cloud computing services are needed to provide computing power services to the whole society.
However, the AI industry, AI infrastructure and users of AI computing power also face a series of challenges:
Bottleneck of unified management of multiple resources. Faced with users' requirements for multiple computing power, multiple storage, the entire computing network, and nearby services, Qingyun provides a multiple resource management unified service scheduling platform to solve the chaotic situation of multiple resource management.
High-speed network bottleneck. In terms of AI high-speed network construction, Qingyun uses high-speed networks to interconnect computing and storage devices, and uses general-purpose networks to publish application services, that is, Qingyun solves multi-region high-speed networking problems through Qingyun's platform.
Environment construction is a tedious bottleneck. Algorithm engineers and R&D engineers may waste a lot of time on setting up basic environments such as hardware servers and storage servers. Through Qingyun AI intelligent computing services, training platforms and inference model platforms, the environment construction is simplified and one-click deployment can be achieved.
Multi-service integration bottleneck. Qingyun integrates multiple businesses and combines traditional cloud computing, super computing and intelligent computing to provide panoramic computing services for more businesses and more customers.
Lack of operational services. Qingyun also provides comprehensive operations and operation and maintenance management services to the computing power operation center and computing power management department.
Qingyun AI computing power scheduling platform
The full-stack product architecture of Qingyun AI scheduling products is multi-AZ and multi-zone, that is, products in multiple regions can be unified and integrated to provide computing power services to the whole society with a global service. Specifically, it will manage the underlying infrastructure, make the infrastructure logical and business-oriented through the data logic layer, and form an AI computing power cluster through specific products or services, including GPU hosts, bare metal, virtualization, sharing forms, etc. , container inference services, model markets and other related businesses, providing computing power scheduling and application scenario implementation capabilities for customers across the industry.
Covering all aspects, new model of computing power construction center
Overall, the AI computing power scheduling platform capabilities provided by Qingyun Technology are mainly based on the following four aspects:
First, the entire platform is compatible with all computing chips on the market (including newly produced Xinchuang chips), as well as GPU-related graphics cards and network cards.
Second, carry out unified management, distribution, monitoring and scheduling of the above adaptation resources, and provide full life cycle online management functions from user application to release after use.
Third, for the management side and the user side, Qingyun unified management platform allows users and administrators to fully operate AI infrastructure and AI computing power cloud services.
Facing the field of intelligent computing, Qingyun will commercialize and scenario-based more services, such as large language model training and reasoning, and load balancing services based on text generation. Qingyun can also use the AI computing power scheduling platform to provide customers with Provides convenient operations such as one-click deployment, one-click expansion, and one-click load balancing. In terms of load balancing, especially in network, public network and computing infrastructure, it can achieve second-level delivery and second-level capacity expansion.
Finally, based on the above three capabilities, Qingyun can support computing in various industries including high-performance computing, artificial intelligence computing and general computing models, and create a unified user management, distribution and distribution system with independent innovation and complete functions for customers. Operation platform.
Nine Abilities Unlock AI Computing Power Freedom
Through years of industry accumulation, Qingyun AI computing power scheduling platform has formed nine key capabilities:
1. Multi-region and multi-business resource integration capabilities
Especially for the diversification of computing power services in western Sichuan or northwest regions, when providing computing power services to the eastern region, scientific research institutions, and universities, Qingyun can manage multi-region resources in a unified manner and build effective high-speed networks through cooperation with telecom operators. network.
2. Distributed scheduling and management capabilities
According to the principle of nearby use, Qingyun manages and distributes all infrastructure (including computing resources and storage resources) in different regions, computing centers, and data centers, and configures scheduling priorities, including affinity and non-affinity. . In VMs, hosts and bare metal servers (including containers based on Container and Pod), affinity and non-affinity data configuration can be performed on the management side of Qingyun AI computing power scheduling platform to ensure the priority of data scheduling. , the purpose is to ensure that users get a consistent experience in the final use of data, application for computing resources, business training, and business reasoning.
3. Resource scheduling capability
In terms of resource scheduling capabilities, Qingyun has the following six major advantages:
1) Immediately schedule and expand resources of tens of thousands of cards
Mainly oriented to AI computing scenarios, especially large model inference. Some model scenarios require inference several times a year, which requires the instant construction of a training platform with dozens of cards or even tens of thousands of cards. According to this requirement, the built-in, adaptation and resource management can be carried out on the Qingyun AI computing power scheduling platform to ensure that the computing power cluster can immediately support the resources of tens of thousands of cards and can be released immediately after use. In terms of resource environment and configuration, Qingyun AI computing power scheduling platform has done a lot of automation to ensure that Wanka resources can be scheduled uniformly.
2) Communication link shortest priority scheduling
Keeping data from detours is also the main purpose of Qingyun AI computing power scheduling platform. In the scenarios of AI training and AI inference, there will be a large amount of data interaction between nodes and between nodes and storage. In this case, Qingyun performs some configurations on the switch at the same time to ensure that computing and storage resources can be on one switch. Prioritize scheduling within a computer room or cabinet to prevent data from being detoured and reduce the constraints of difficult network transmission during AI training.
3) Support heterogeneous platforms
When building a cluster, users can choose different services to run on different cards. Qingyun Technology also carried out domestic adaptation and domestic replacement of the chips. 4) Improve the granularity of the scheduling system
The first is a scheduling system based on Slurm, and the second is a scheduling system based on K8s. In terms of the granularity of the scheduling system, users can perceive true job-level accuracy. When every training task is run on every process on every card, it can be implemented through large-scale data monitoring, business scheduling, etc. Monitoring of job anomalies ensures that users can handle abnormal situations of training tasks in a timely manner to maximize resource scheduling and reduce waste at this level. If something is wrong, modify it immediately and run it immediately.
5) The management side implements scheduling priority configuration
Because different computing power centers will operate different computing power services, especially in the case of multiple data centers, users can prioritize the scheduling through the Qingyun AI computing power scheduling platform. All are built-in in the early stage, and users can also configure it in the later stage. Make reservations, pauses, resumes, priority settings, queuing and other settings to increase priority. At the management level, Qingyun can prioritize resource allocation for users who apply for special applications or users with high priority.
6) Flexible scheduling and allocation of resources for the intelligent computing industry
Qingyun enables dynamic and flexible resource scheduling and configurability to address challenging priorities in AI systems. This is why Qingyun continues to discover new problems in AI scheduling computing power or AI scenarios, constantly uses the platform to solve new problems, and uses new products to solve some major problems in the industry.
4. High-speed parallel storage capability
Qingyun’s computing and storage products are diverse and diverse, providing the following three types of storage:
1) Qingyun U10000 Object Storage
Storage models, codes and commonly used data calls are mainly used for large-scale data backup and data reading operations.
2) Parallel file storage EPFS
In terms of large-scale parallel writing of data, Qingyun provides parallel file storage EPFS, which mainly provides all-flash parallel file storage for MPI-level data writing operations.
3) File storage NAS
You can store some common documents, texts, etc. All Qingyun's storage products can be internally interconnected with its own computing products to perform data transmission, distribution, backup, etc. on the internal high-speed network.
5. Hybrid networking capability
Different high-speed networks can be provided for different computing scenarios, such as computing IB network and storage IB network. How to optimally configure them?
Qingyun interconnects high-configuration computing products and high-configuration storage products, and interconnects medium- and low-configuration products for training scenarios, inference scenarios, and general application service scenarios.
6. Algorithm development support capabilities
For algorithm developers, Qingyun provides more comprehensive cloud service products. Especially in the algorithm development stage, a large amount of parameter adjustment and large-scale code writing are required. During training and deployment, due to the operations on and off the cloud, it may cause problems. For large-scale data upload, download or code copy, it is not suitable for online editing and immediate operation.
Therefore, Qingyun provides an algorithm development platform in terms of algorithm development. It can launch an online development environment based on cloud services, completely build Python projects and VC projects, and use project files and engineering environments online to conduct code research and development.
During the development process, if there is any need for debugging, it can be expanded immediately; if training is needed, the job task can be immediately assigned to the training cluster; if inference is needed, it can be placed on the inference cluster.
At the same time, during the algorithm development process, there may be some forms of joint development or mixed development. Qingyun also provides code warehouses and mirror warehouses for model management. Different personnel use different permissions to carry out unified algorithm development and service integration. .
In a nutshell, Qingyun mainly provides computing products and scheduling products for all development scenarios for algorithm developers, ensuring that the entire algorithm development business can be effectively operated on the cloud and reducing large-scale upload and download operations.
7. AI training platform
If the algorithm development is nearing completion or needs debugging, a large amount of computing power infrastructure needs to be activated for development and training. Based on the infrastructure, Qingyun provides an AI training platform to empower users.
After the GPU resources, storage resources and network resources are constructed, users can build independently through the cloud platform and achieve one-click operation. Qingyun AI training platform mainly builds clusters online based on its own GPU resources. After the construction is completed, a certain storage will be mounted by default, and users can choose by themselves.
The Qingyun AI training platform will also have a built-in online development environment. Some commonly used training frameworks will also be built into the development environment. It will provide users with full scenarios and full application environments through clusters, allowing users to distribute online across multiple machines. style training.
8. Container Inference Service Platform After the large model training is almost completed, the Qingyun container inference service platform can play a role when providing inference services to the public.
Through the Qingyun container inference service platform, after users deploy the inference service, they can then use the configured load balancing and automatic scaling to ensure that user visits can be called immediately. At the same time, Qingyun provides online monitoring services to customers. If there is a problem with the inference service, users can immediately monitor what went wrong with the container inference, and Qingyun can solve it online. For concurrent operations and large-scale call operations, Qingyun can also perform load balancing and automatic scaling, greatly reducing manual configuration operations.
9. Model warehouse (MaaS)
Qingyun Model Warehouse (MaaS) is mainly aimed at AI computing power service customers and general computing customers. Model service providers can put products on the application market and model market according to their own model needs, making it convenient for customers of various enterprises to call, One-click fine-tuning and one-click deployment.
3: Stimulate diverse values and accelerate the implementation of scenarios
In general, the purpose of Qingyun AI computing power scheduling platform is to manage AI infrastructure like local resources, which is mainly reflected in five major aspects:
1. Provide unified scheduling of multiple computing power
Faced with GPU resources, CPU resources, domestic chips, application frameworks, applications and user business scenarios, Qingyun uses a unified platform for scheduling and management, including storage facilities and network facilities.
2. Realize intelligent computing power scheduling based on infrastructure
In terms of computing power scheduling priority and affinity, based on VM, host and container, users can realize intelligent computing power scheduling and configuration, as well as management services through Qingyun's platform.
3. Quick and effective adaptation to domestic chips. Qingyun can effectively and quickly adapt to domestic chips, ensuring that localized algorithm services and localized codes can run immediately on domestic chips.
4. Visualization service
In terms of intelligent operation and maintenance for the management side, Qingyun's monitoring and alarm services provide customers and administrators with visual operations through a large operation and maintenance platform.
5. Rich application market
Qingyun Technology actively builds an ecosystem and creates a rich application market, so that applications and customers from all walks of life can get the computing resources and business resources they want on the Qingyun AI computing platform.
At present, Qingyun AI computing power scheduling platform has been implemented in Jinan supercomputing applications, and Sunward Cloud has been online to provide operational services. Based on Jinan's tens of thousands of supercomputing hardware infrastructure, various computing networks, servers, etc., Qingyun provides listing, management, and scheduling services. It provides services for different computer rooms, supercomputing businesses, intelligent computing businesses, GPUs, and various storage and network-based services. information, conduct unified management, integration, management and distribution, and provide computing power scheduling products and computing power cloud service products to customers from all walks of life.
Qingyun AI Computing Cloud Service
Qingyun AI computing power cloud service products are also launched on Qingyun public cloud to provide services, mainly for large model training scenarios.
For cards with relatively high priority and high configuration, Qingyun provides public cloud computing service products. In the AI scenario, Qingyun builds distributed GPU computing clusters with underlying resources and binds them to the public network environment, allowing user access.
Users can upload data to parallel file storage based on this, or they can integrate parallel file storage and GPU computing clusters into the same network to ensure data security and cloud service security through private networks. . You can also run your business through online training and remote SSH access to distributed computing clusters and parallel file storage.
In terms of business, users can use AI computing clusters and container inference services, and their infrastructure is A800 resources, bare metal servers, and virtualized servers. All Qingyun AI computing power cloud service products use high-speed interconnected networks and adopt the online environment, development environment, training and reasoning environment required by the AI computing power industry. Everyone is welcome to apply for registration and trial.
The above is the detailed content of Detailed explanation of Qingyun Technology's launch of AI computing power products and services to address computing power challenges. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



As the demand for artificial intelligence surges, optical module products will continue to be updated to adapt to the development trend of high speed and new technologies. Zhongji InnoLight (300308.SZ), as a leader in the optical module industry, recently announced its important AI customers The demand for 1.6T optical modules has been proposed to meet future GPU needs with higher bandwidth and computing power. The company is expected to achieve mass production of 1.6T optical modules in 2025. The demand for AI computing power has exploded, promoting the development of the optical module industry. With the rise of industries such as artificial intelligence, big data, cloud computing, 5G and the Internet of Things, optical modules have become a One of the core components in optical fiber communication systems, its functional requirements in realizing photoelectric signal conversion are also growing rapidly. With the continuous expansion of the telecommunications market and data centers, optical communications

At the Qingyun Technology AI Computing Power Conference, Miao Hui, product manager, introduced in detail the Qingyun AI computing power scheduling platform and Qingyun AI computing power cloud services. The following is the full text of the speech: Artificial Intelligence users face computing power challenges. With the explosion of the artificial intelligence industry, AIGC, large models, scientific research computing, enterprise-level big data and artificial intelligence have put forward higher demands on computing power centers. Especially in the face of data centers with a single computing power, it is no longer able to meet the growing demand for computing power in all walks of life. Therefore, more intelligent computing centers, supercomputing centers and general cloud computing services are needed to provide computing power services to the whole society. However, users of the AI industry, AI infrastructure, and AI computing power also face a series of challenges: the bottleneck of unified management of multiple resources. Facing users with multiple computing power, multiple storage, and the entire computing network

(Editor/Lu Dong) The integration and innovation of large models and large computing power is causing a major shift in the production paradigm, pushing scientific research and industrial applications towards the era of general artificial intelligence (AGI) driven by intelligent computing. In the early stages of rapid technology iteration, how to build a new generation of infrastructure, lower the application threshold, shorten the research and development cycle, and improve innovation efficiency has become a new challenge that the government and upstream and downstream industries need to jointly solve. "Currently, when artificial intelligence companies come to Lingang, they do not simply focus on support policies such as subsidies, but ask whether they can solve the computing power needs." On June 2, at the Intelligent Computing Conference of Shanghai Lingang New Area, Lingang New Film Lu Yu, director of the high-tech department of the district management committee, said. On the same day, the "Action Plan to Accelerate the Construction of Computing Industry Ecosystem in Lingang New Area" (hereinafter referred to as: Plan

There is no doubt that artificial intelligence is not a simple technological revolution for human society. It symbolizes the arrival of an era, just like the industrial era to the agricultural era. It will bring about earth-shaking changes and affect human society for hundreds of years, even for centuries. The process of thousands of years. There is no doubt about the importance of AI computing power in promoting artificial intelligence applications, especially with the explosion of the entire generative AI market this year, which has brought huge demand for AI computing power. In order to ensure the further rapid development of artificial intelligence applications, AI computing power can be said to have become a crucial factor. Against this background, at the 2023 China Computing Power (Infrastructure) Conference held on August 18, Lenovo responded to the current demand for AI computing power and officially released the Lenovo AI computing power strategy.

At the recent Wind3C conference, Hou Bin, the head of the Great Wall Securities Industrial Finance Research Institute and TMT Research Center, shared a speech titled "AI Computing Power and Satellite Internet". Hou Bin believed that computing power is the productivity of the development of artificial intelligence. With the rapid development of large artificial intelligence models and the ever-expanding demand for computing power, there has been a mismatch between computing power demand and chip capabilities, which will promote the development of the artificial intelligence computing power industry. As a core product with the highest degree of industrialization and the most cutting-edge technology reserves in China, optical modules will see rapid growth due to the continuous upgrade of computing power driven by the development of large-scale artificial intelligence models. The satellite Internet industry is still in its infancy, and the medium- and long-term development potential in the future It is huge and has broad development prospects. The following is the core point shared by Hou Bin: People

After officially investing in OpenAI, Microsoft's related software announced one by one that it would be equipped with ChatGPT. With the increase in users, OpenAI and many users have discovered that if they want better feedback and learning effects of the robot system, they need to add more machines. The ability to learn means that the overall computing power needs to be improved. But just today, during an interview with OpenAI CEO in Europe, he discussed OpenAI's API and product plans, which attracted the attention of many netizens. What attracted the attention of many friends was his discussion on the progress of AI. According to him, in addition to normal AI iterations, GPT-3 is also in OpenAI’s open source plan, but currently GPUs can no longer keep up. And according to previous introduction, GPT-

Coming to the 2023 Amazon Cloud Technology China Summit on the second day, there were still many surprises. The activities and theme discussions on this day cannot escape one theme: AI. Focusing on the continued hot trend of AI, Amazon Cloud Technology has made a lot of deployments and will come up with more products and technologies. Among them, Chen Xiaojian, general manager of Amazon Cloud Technology Greater China, mentioned that Amazon Cloud Technology focuses on helping customers get rid of the constraints of infrastructure and focus on technological innovation. In order to cope with the high computing power requirements of the AI era, Amazon Cloud Technology will develop products and technologies such as self-developed chips, elastic computing storage combinations, and Serverless architecture to provide more cost-effective services. Chen Xiaojian said that Amazon Cloud Technology has comprehensive and in-depth infrastructure construction capabilities, and cooperates with Intel,
