At a developer meeting at Facebook headquarters, engineers from the social networking giant revealed that they are using Presto, a new query engine built in-house, to run interactive analysis on the company's massive 250PB data warehouse.
According to engineer Martin Traverso, more than 850 Facebook engineers use it every day, scanning more than 320TB of data daily. "In the past, our scientists and analysts have relied on Hive for data analysis," Traverso said. "But Hive is designed for batch processing, and with more and more data it can no longer meet our needs. While we have other tools that are faster than Hive, they are either too limited in functionality or too simple to operate against our massive data warehouse. Over the past few months, we've been using Presto to fill this gap."
Hive is a data warehouse tool that Facebook created specifically for Hadoop a few years ago. Because it relies mainly on MapReduce, its speed has not kept up with growing data volumes. Scanning a complete data set can take anywhere from minutes to hours, which is impractical for interactive work.
Traverso also said that simple queries in Presto take only a few hundred milliseconds, and even very complex queries finish within minutes. Presto runs in memory and does not write to disk.
Presto may look like Facebook's version of Cloudera's Impala SQL query engine, or like what Hortonworks is doing with Project Stinger, but it is tailored to run faster at Facebook's scale. Presto will not compete with commercial products, but it could soon shake up the big data industry: Facebook plans to release Presto as open source this fall.
Ravi Murthy, an engineering manager at Facebook, said that as the number of users continues to grow, the data warehouse is growing rapidly as well. It is 4,000 times larger than it was four years ago, and Murthy expects the data to reach exabyte scale within the next few years. "So in order to accommodate this kind of data scale, we had to rethink a lot of things," he said.
Presto is one of them. In addition to improving query speed, the engine is seven times more CPU-efficient than Hive. Another ongoing project aims to shrink the space that analytics takes up in Facebook's data centers.
What do the experts on Weibo think of Presto, the latest query engine launched by Facebook?
Big Data Pi Dong, former head of the Big Data Laboratory of EMC China Research Institute: Facebook's latest interactive big data query system, Presto, is similar to Cloudera's Impala and Hortonworks' Stinger, and addresses the need for fast queries over Facebook's rapidly expanding massive data warehouse. Facebook is developing a new generation of big data systems for exabyte-scale data. Presto is the interactive query system for the data warehouse, and there should also be a mass storage system to go with it. At this scale, there is a lot of design to consider!
Sina CTO and Co-President Jack Xu Liangjie: Social networks and social media have given birth to true big data platforms. Sina Weibo is no exception...
vinW, a computer science and linguistics researcher at the University of Leeds, UK, and a postdoctoral researcher on a search project: 1. Presto will be open-sourced in the autumn; 2. It is seven times faster than Hive; 3. It runs in memory.
Launch_Bruce: Facebook is not a search engine and has higher requirements for real-time performance. Even when Hive was first launched, it could only ever be a stopgap. That is in Hadoop's genes. Hadoop will eventually cause trouble for many projects that were launched blindly without careful thought. Clearly, Hadoop's successful ecosystem will also hurt a lot of people.
TeslaElon: Come on! Big data will generate many business opportunities. In particular, potential cooperation with Alibaba, the largest e-commerce platform, and YOKU, the largest video platform, is worth looking forward to. In addition, Sina has invested in many popular applications on Weibo and has many opportunities. We will see how well Sina does in R&D, management and sales.
Henry, who carries big data: We were doing big data analysis about five years ago, and our MPP product already had these strategies. At that time, the biggest big data problems were at Internet companies, but those star companies did not like to spend money on products and preferred to reinvent the wheel. Telecom customers are better: they are willing to pay for a product rather than reinvent the wheel.
English from: gigaom.com