Apache Spark is an open-source cluster computing framework originally developed at AMPLab at the University of California, Berkeley. Whereas Hadoop MapReduce writes intermediate results to disk between stages, Spark keeps intermediate data in memory, writing to disk only when necessary.
According to the project's own benchmarks, Spark can run programs up to 100 times faster than Hadoop MapReduce when the data fits in memory, and up to 10 times faster when processing data on disk. Spark lets users load a dataset into cluster memory and query it repeatedly, which makes it well suited to iterative machine learning algorithms.
Running Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone mode (a native Spark cluster), Hadoop YARN, and Apache Mesos.
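The choice of cluster manager is expressed through the master URL passed to `spark-submit`; the host names below are placeholders:

```shell
# Standalone mode: a native Spark cluster
spark-submit --master spark://master-host:7077 app.py

# Hadoop YARN (cluster location is read from HADOOP_CONF_DIR)
spark-submit --master yarn app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.py
```

The application code itself stays the same; only the `--master` URL changes.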
For distributed storage, Spark can interface with HDFS, Cassandra, OpenStack Swift, and Amazon S3, among others. Spark also supports a pseudo-distributed local mode, usually used only for development and testing, in which the local file system stands in for the distributed storage system and Spark runs on the CPU cores of a single machine.
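Local mode is selected with a `local[...]` master URL, where the bracketed value sets the number of worker threads:

```shell
# Pseudo-distributed local mode: no cluster manager or distributed
# storage required; files are read from the local file system.

# Use all available CPU cores on this machine
spark-submit --master "local[*]" app.py

# Or pin the number of worker threads, e.g. 4
spark-submit --master "local[4]" app.py
```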
In 2014, more than 465 contributors took part in Spark development, making it one of the most active projects both within the Apache Software Foundation and among open-source big data projects generally.