Spark study notes


== Overview ==

Spark is built around two main abstractions:

* RDD: a resilient distributed dataset (RDD) is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
* Shared variables: Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only "added" to, such as counters and sums.
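
The two abstractions can be tried together in a few lines. The following is a minimal sketch rather than an official example: scale and run_demo are hypothetical names, and sc is assumed to be an existing SparkContext (for instance the one the pyspark shell provides).

```python
def scale(x, factor):
    """Pure per-element function used inside the map step."""
    return x * factor


def run_demo(sc):
    """Return (sum of scaled elements, number of elements seen).

    `sc` is assumed to be a live SparkContext supplied by the caller.
    """
    rdd = sc.parallelize([1, 2, 3, 4, 5], 2)  # RDD split into 2 partitions
    factor = sc.broadcast(10)                 # broadcast variable: read-only on the workers
    seen = sc.accumulator(0)                  # accumulator: tasks can only add to it

    # foreach is an action, so the accumulator updates are applied here.
    rdd.foreach(lambda _: seen.add(1))

    # map is lazy; sum() is the action that triggers the computation.
    total = rdd.map(lambda x: scale(x, factor.value)).sum()
    return total, seen.value
```

In the pyspark shell, run_demo(sc) should return (150, 5): the five elements are each multiplied by the broadcast factor 10, and the accumulator is incremented once per element.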

== Getting started ==

The examples below use Python.

* bin/spark-submit submits an application to run
* bin/pyspark starts an interactive Python shell

Core module (creating the SparkContext):

<code>
from pyspark import SparkContext, SparkConf

# appName and master are placeholders; both are described below.
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
</code>

* appName: the name of your application, shown in the cluster UI.
* master: a Spark, Mesos or YARN cluster URL, or the special string "local" to run in local mode.
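
To make the two parameters concrete, here is a minimal sketch with the placeholders filled in. The helper names local_master and make_context and the application name "spark-notes-demo" are assumptions chosen for illustration; only SparkConf, setAppName, setMaster and SparkContext come from pyspark itself.

```python
def local_master(cores=None):
    """Build a local-mode master string.

    local_master(4) gives "local[4]"; with no argument it gives "local[*]",
    which asks Spark to use as many worker threads as there are logical cores.
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return "local[%d]" % cores


def make_context(app_name="spark-notes-demo", master="local[2]"):
    """Create a SparkContext from an application name and a master string."""
    # Imported here so that local_master can be used without pyspark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext(conf=conf)


# Typical use inside a script passed to bin/spark-submit:
#   sc = make_context(master=local_master(4))
#   print(sc.parallelize(range(100)).count())
#   sc.stop()
```

This is meant for standalone scripts. In the pyspark shell a context is already available as sc, so creating a second one there will not work.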

Using the shell:

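* ./bin/pyspark --master local[4] starts a local shell on 4 cores
* ./bin/pyspark --master local[4] --py-files code.py also adds the dependency file code.py to the search path, so that it can be imported later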
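
When the shell is launched from a script, the same command lines can be assembled programmatically. The sketch below is only an illustration: pyspark_command is a hypothetical helper, and the default launcher path assumes the current directory is the Spark installation directory, as in the commands above.

```python
def pyspark_command(cores, py_files=(), launcher="./bin/pyspark"):
    """Return the argument list for starting a local pyspark shell.

    cores    -- number used in the local[N] master string
    py_files -- dependency files; --py-files expects them comma-separated
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    cmd = [launcher, "--master", "local[%d]" % cores]
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    return cmd


# To actually start the shell:
#   import subprocess
#   subprocess.call(pyspark_command(4, ["code.py"]))
```

pyspark_command(4, ["code.py"]) reproduces the second command in the list above as a list of arguments.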

== RDD ==