Using Spark to Ignite Data Analytics-mysql教程-PHP中文網

What is Apache Spark?

Apache Spark at eBay

The Future of Spark at eBay

首頁

資料庫

mysql教程

Using Spark to Ignite Data Analytics

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 07, 2016 pm 04:34 PM

data ignite spark using

At eBay we want our customers to have the best experience possible. We use data analytics to improve user experiences, provide relevant offers, optimize performance, and create many, many other kinds of value. One way eBay supports this value creation is by utilizing data processing frameworks that enable, accelerate, or simplify data analytics. One such framework is Apache Spark. This post describes how Apache Spark fits into eBay’s Analytic Data Infrastructure.

spark_logo

What is Apache Spark?

The Apache Spark web site?describes Spark as “a fast and general engine for large-scale data processing.” Spark is a framework that enables parallel, distributed data processing. It offers a simple programming abstraction that provides powerful cache and persistence capabilities. The Spark framework can be deployed through Apache Mesos, Apache Hadoop via Yarn, or Spark’s own cluster manager. Developers can use the Spark framework via several programming languages including Java, Scala, and Python. Spark also serves as a foundation for additional data processing frameworks such as Shark, which provides SQL functionality for Hadoop.

Spark is an excellent tool for iterative processing of large datasets. One way Spark is suited for this type of processing is through its Resilient Distributed Dataset (RDD). In the paper titled Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, RDDs are described as “…fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.” By using RDDs, ?programmers can pin their large data sets to memory, thereby supporting high-performance, iterative processing. Compared to reading a large data set from disk for every processing iteration, the in-memory solution is obviously much faster.

The diagram below shows a simple example of using Spark to read input data from HDFS, perform a series of iterative operations against that data using RDDs, and write the subsequent output back to HDFS.

spark_example_diagram

In the case of the first map operation into RDD(1), not all of the data could fit within the memory space allowed for RDDs. In such a case, the programmer is able to specify what should happen to the data that doesn’t fit. The options include spilling the computed data to disk and recreating it upon read. We can see in this example how each processing iteration is able to leverage memory for the reading and writing of its data. This method of leveraging memory is likely to be 100X faster than other methods that rely purely on disk storage for intermittent results.

Apache Spark at eBay

Today Spark is most commonly leveraged at eBay through Hadoop via Yarn. Yarn manages the Hadoop cluster’s resources and allows Hadoop to extend beyond traditional map and reduce jobs by employing Yarn containers to run generic tasks. Through the Hadoop Yarn framework, eBay’s Spark users are able to leverage clusters approaching the range of 2000 nodes, 100TB of RAM, and 20,000 cores.

The following example illustrates Spark on Hadoop via Yarn.

spark_hadoop_diagram

The user submits the Spark job to Hadoop. The Spark application master starts within a single Yarn container, then begins working with the Yarn resource manager to spawn Spark executors – as many as the user requested. These Spark executors will run the Spark application using the specified amount of memory and number of CPU cores. In this case, the Spark application is able to read and write to the cluster’s data residing in HDFS. This model of running Spark on Hadoop illustrates Hadoop’s growing ability to provide a singular, foundational platform for data processing over shared data.

The eBay analyst community includes a strong contingent of Scala users. Accordingly, many of eBay’s Spark users are writing their jobs in Scala. These jobs are supporting discovery through interrogation of complex data, data modelling, and data scoring, among other use cases. Below is a code snippet from a Spark Scala application. This application uses Spark’s machine learning library, MLlib, to cluster eBay’s sellers via KMeans. The seller attribute data is stored in HDFS.

/**
 * read input files and turn into usable records
 */
 var table = new SellerMetric()
 val model_data = sc.sequenceFile[Text,Text](
   input_path
  ,classOf[Text]
  ,classOf[Text]
  ,num_tasks.toInt
 ).map(
   v => parseRecord(v._2,table)
 ).filter(
   v => v != null
 ).cache
....
/**
 * build training data set from sample and summary data
 */
 val train_data = sample_data.map( v =>
   Array.tabulate[Double](field_cnt)(
     i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
   )
 ).cache
/**
 * train the model
 */ 
 val model = KMeans.train(train_data,CLUSTERS,ITERATIONS)
/**
 * score the data
 */
 val results = grouped_model_data.map( 
   v => (
     v._1
    ,model.predict(
       Array.tabulate[Double](field_cnt)(
         i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
       )
     )
   )
 ) 
 results.saveAsTextFile(output_path)

登入後複製

In addition to ?Spark Scala users, several folks at eBay have begun using Spark with Shark to accelerate their Hadoop SQL performance. Many of these Shark queries are easily running 5X faster than their Hive counterparts. While Spark at eBay is still in its early stages, usage is in the midst of expanding from experimental to everyday as the number of Spark users at eBay continues to accelerate.

The Future of Spark at eBay

Spark is helping eBay create value from its data, and so the future is bright for Spark at eBay. Our Hadoop platform team has started gearing up to formally support Spark on Hadoop. Additionally, we’re keeping our eyes on how Hadoop continues to evolve in its support for frameworks like Spark, how the community is able to use Spark to create value from data, and how companies like Hortonworks and Cloudera are incorporating Spark into their portfolios. Some groups within eBay are looking at spinning up their own Spark clusters outside of Hadoop. These clusters would either leverage more specialized hardware or be application-specific. Other folks are working on incorporating eBay’s already strong data platform language extensions into the Spark model to make it even easier to leverage eBay’s data within Spark. In the meantime, we will continue to see adoption of Spark increase at eBay. This adoption will be driven by chats in the hall, newsletter blurbs, product announcements, industry chatter, and Spark’s own strengths and capabilities.

原文地址：Using Spark to Ignite Data Analytics, 感谢原作者分享。

本網站聲明

本文內容由網友自願投稿，版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容，請聯絡admin@php.cn

熱AI工具

Undresser.AI Undress

人工智慧驅動的應用程序，用於創建逼真的裸體照片

AI Clothes Remover

用於從照片中去除衣服的線上人工智慧工具。

Undress AI Tool

免費脫衣圖片

Clothoff.io

AI脫衣器

Video Face Swap

使用我們完全免費的人工智慧換臉工具，輕鬆在任何影片中換臉！

熱工具

記事本++7.3.1

好用且免費的程式碼編輯器

SublimeText3漢化版

中文版，非常好用

禪工作室 13.0.1

強大的PHP整合開發環境

Dreamweaver CS6

視覺化網頁開發工具

SublimeText3 Mac版

神級程式碼編輯軟體(SublimeText3)

熱門話題

Java教學

1665

CakePHP 教程

1424

Laravel 教程

1322

PHP教程

1270

C# 教程

1249

Related knowledge

十個AI演算法常用函式庫Java版 Jun 13, 2023 pm 04:33 PM

今年ChatGPT火了半年多，熱度絲毫沒有降下來。深度學習和NLP也重新回到了大家的視線中。公司裡有一些小夥伴都在問我，身為Java開發人員，如何入門人工智慧，是時候拿出壓箱底的私藏的學習AI的Java庫來介紹給大家。這些函式庫和框架為機器學習、深度學習、自然語言處理等提供了廣泛的工具和演算法。根據AI專案的具體需求，可以選擇最合適的函式庫或框架，並開始嘗試使用不同的演算法來建立AI解決方案。 1.Deeplearning4j它是一個用於Java和Scala的開源分散式深度學習函式庫。 Deeplearning

探索Java在大數據領域的應用：Hadoop、Spark、Kafka等技術堆疊的了解 Dec 26, 2023 pm 02:57 PM

Java大數據技術堆疊：了解Java在大數據領域的應用，如Hadoop、Spark、Kafka等隨著資料量不斷增加，大數據技術成為了當今網路時代的熱門話題。在大數據領域，我們常聽到Hadoop、Spark、Kafka等技術的名字。這些技術起到了至關重要的作用，而Java作為一門廣泛應用的程式語言，也在大數據領域發揮著巨大的作用。本文將重點放在Java在大

在Go語言中使用Spark實現高效率的資料處理 Jun 16, 2023 am 08:30 AM

隨著大數據時代的到來，資料處理變得越來越重要。對於各種不同的資料處理任務，不同的技術也應運而生。其中，Spark作為一種適用於大規模資料處理的技術，已被廣泛地應用於各個領域。此外，Go語言作為一種高效的程式語言，在近年來也得到了越來越多的關注。在本文中，我們將探討如何在Go語言中使用Spark實現高效率的資料處理。我們將首先介紹Spark的一些基本概念和原理

PHP入門指南：PHP和Spark May 20, 2023 am 08:41 AM

PHP是一種非常受歡迎的伺服器端程式語言，因為它簡單易學、開放原始碼和跨平台。目前，許多大企業都採用PHP語言來建立應用程序，例如Facebook和WordPress等。 Spark是一種快速且輕量級的開發框架，可用於建立Web應用程式。它是基於Java虛擬機器（JVM）並且與PHP結合使用。本文將介紹如何使用PHP和Spark建立Web應用程式。什麼是PHP？ PH

利用PHP實現大規模資料處理：Hadoop、Spark、Flink等 May 11, 2023 pm 04:13 PM

隨著資料量的不斷增加，大規模資料處理已經成為了企業必須面對和解決的問題。傳統的關聯式資料庫已經無法滿足這種需求，而對於大規模資料的儲存與分析，Hadoop、Spark、Flink等分散式運算平台成為了最佳選擇。在資料處理工具的選擇過程中，PHP作為一種易於開發和維護的語言，越來越受到開發者的歡迎。在本文中，我們將探討如何利用PHP來實現大規模資料處理，以及如

PHP中的資料處理引擎(Spark, Hadoop等) Jun 23, 2023 am 09:43 AM

在目前的網路時代，海量資料的處理是各個企業和機構都需要面對的問題。作為一種廣泛應用的程式語言，PHP同樣需要在資料處理方面跟上時代的腳步。為了更有效率地處理大量數據，PHP開發引入了一些大數據處理工具，如Spark和Hadoop等。 Spark是一款開源的資料處理引擎，可用於大型資料集的分散式處理。 Spark的最大特點是具有快速的資料處理速度和高效的資料存

data資料夾裡面是什麼數據 May 05, 2023 pm 04:30 PM

data資料夾裡面是系統及程式的數據，例如軟體的設定和安裝包等，Data資料夾中各個資料夾則代表的是不同類型的資料存放資料夾，無論Data資料指的是檔案名稱Data還是擴充名data，都是系統或程式自訂的資料文件，Data是資料保存的備份類別文件，一般可以用meidaplayer、記事本或word開啟。

mysql load data亂碼怎麼辦 Feb 16, 2023 am 10:37 AM

mysql load data亂碼的解決方法：1、找到出現亂碼的SQL語句；2、修改語句為「LOAD DATA LOCAL INFILE "employee.txt" INTO TABLE EMPLOYEE character set utf8;」即可。

See all articles

Using Spark to Ignite Data Analytics

What is Apache Spark?

Apache Spark at eBay

The Future of Spark at eBay

熱AI工具

Undresser.AI Undress

AI Clothes Remover

Undress AI Tool

Clothoff.io

Video Face Swap

熱門文章

熱工具

記事本++7.3.1

SublimeText3漢化版

禪工作室 13.0.1

Dreamweaver CS6

SublimeText3 Mac版

熱門話題