Spark Study Notes
Overview

Spark provides two main abstractions (a short sketch follows this list):

- RDD: a resilient distributed dataset (RDD) is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
- Shared variables: Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
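A minimal sketch of both abstractions in PySpark (the data and variable names here are made up for illustration):

<code>
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("demo").setMaster("local[2]")
sc = SparkContext(conf=conf)

# RDD: partition a local collection across the nodes and operate in parallel
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Broadcast variable: cache a read-only value in memory on all nodes
factor = sc.broadcast(10)

# Accumulator: workers may only add to it; the driver reads the result
total = sc.accumulator(0)

def scale(x):
    total.add(x)            # "added to" on the workers
    return x * factor.value

print(rdd.map(scale).collect())  # [10, 20, 30, 40, 50]
print(total.value)               # 15, read back on the driver
</code>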
Getting Started

Take Python as the example:

- bin/spark-submit submits an application (a sample invocation follows this list)
- bin/pyspark starts an interactive shell
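For example, a script could be submitted to a local master like this (my_app.py is a hypothetical file name):

<code>
./bin/spark-submit --master local[4] my_app.py
</code>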
Core module:
<code>
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
</code>
- appName: the name of your application
- master: a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode (a filled-in example follows)
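Putting the two together, a runnable skeleton might look like this; the application name and master URL are placeholder values:

<code>
from pyspark import SparkContext, SparkConf

# "MyApp" and "local[4]" are placeholders: any name works, and local[4]
# runs Spark locally with 4 worker threads
conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(100)).sum())  # 4950

sc.stop()  # shut the context down when finished
</code>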
The shell:

- ./bin/pyspark --master local[4] starts a local shell with 4 worker threads
- ./bin/pyspark --master local[4] --py-files code.py also ships dependency code to the workers (a sketch follows below)
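A sketch of the --py-files workflow, using a hypothetical utils.py in place of code.py (a module actually named code.py would shadow Python's standard-library code module):

<code>
# utils.py -- hypothetical helper shipped with:
#   ./bin/pyspark --master local[4] --py-files utils.py
def tokenize(line):
    return line.lower().split()
</code>

Inside the shell, sc is created automatically and the shipped module is importable on both the driver and the executors:

<code>
import utils

words = sc.textFile("README.md").flatMap(utils.tokenize)
print(words.take(5))
</code>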