Spark study notes

=== RDD operations ===

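Operations fall into two categories:

* Transformations: conversions from one RDD to another, expressed as higher-order functions such as map, reduceByKey, join, and union. All of these are lazy; an action is needed to get the final result. The design is much like Clojure's reducers library.
* Actions: produce a result, e.g. collect, take, reduce, count.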

Example:

<pre>
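# Read the file as an RDD of lines (sc is an existing SparkContext)
lines = sc.textFile("data.txt")

# Transformations only: nothing is computed until an action such as counts.collect() runs
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)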
</pre>
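
To make the lazy/eager split concrete without a cluster, here is a rough plain-Python analogue of the pipeline above. It is only a sketch: the in-memory `lines` list, the `to_pair` helper, and the `evaluated` log are illustrative stand-ins and not part of Spark. Python's built-in `map` is lazy in the same spirit as a transformation, and consuming it plays the role of an action.

```python
from collections import Counter

# Stand-in for sc.textFile("data.txt"); an in-memory list, not an RDD
lines = ["a", "b", "a"]

# Log of inputs actually processed, used to observe when work happens
evaluated = []

def to_pair(s):
    evaluated.append(s)
    return (s, 1)

# "Transformation": map() returns a lazy iterator, so to_pair has not run yet
pairs = map(to_pair, lines)
ran_before_action = len(evaluated)

# "Action": consuming the iterator forces evaluation and sums the values per key
counts = Counter()
for key, n in pairs:
    counts[key] += n

print(ran_before_action, dict(counts))
```

Nothing is evaluated until the loop consumes `pairs`, which mirrors how Spark only records the lineage of `pairs` and `counts` until an action is called.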

* Closure pitfall: code like the following does not work correctly in cluster mode:
<pre>
counter = 0
rdd = sc.parallelize(data)

# Wrong: Don't do this!!
def increment_counter(x):
    global counter
    counter += x
rdd.foreach(increment_counter)

print("Counter value: ", counter)
</pre>

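The reason is that counter is copied to every executor along with the closure, so each task updates the copy held by its executor and the counter on the driver is never updated. If the result happens to come out right, it is only because the driver and the executor are running in the same JVM.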
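
The copy semantics can be imitated in plain Python. This is a simplified sketch, not how Spark is implemented: `run_task` and the `driver_vars` dict are hypothetical names, and `copy.deepcopy` stands in for serializing the closure and shipping it to an executor.

```python
import copy

def run_task(closure_vars, partition):
    # Each task works on its own copy of the captured variables,
    # the way a closure is serialized and shipped to an executor
    local = copy.deepcopy(closure_vars)
    for x in partition:
        local["counter"] += x
    # The only way the result reaches the driver is by being returned
    return local["counter"]

driver_vars = {"counter": 0}
partitions = [[1, 2], [3, 4, 5]]

# Every "executor" updates its private copy
executor_totals = [run_task(driver_vars, p) for p in partitions]

# Merging the returned partial results on the driver gives the real total
merged = sum(executor_totals)

print(driver_vars["counter"], executor_totals, merged)
```

The driver-side value stays at 0 while the merged partial results give the expected sum. In real PySpark the usual remedies follow the same idea: aggregate with an action such as `rdd.reduce(lambda a, b: a + b)`, or use an accumulator created with `sc.accumulator(0)` and read its `value` on the driver.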


==== Shuffle stage ====

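Like Hadoop MapReduce, Spark has a shuffle phase. Operations such as the xxxByKey family, join, and cogroup have to hand the map output over to the reduce side, which means moving data between nodes. That incurs partitioning, network, disk, and serialization costs, so the shuffle is a performance-critical spot.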
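
As a rough illustration of what connecting map output to the reduce side involves, the sketch below hash-partitions key/value pairs into per-reducer buckets in plain Python. It is a toy model under stated assumptions: the `shuffle` function and the sample data are made up for illustration, everything stays in one process, and the network, disk, and serialization steps of a real shuffle are left out.

```python
from collections import defaultdict

def shuffle(map_outputs, num_reducers):
    # One bucket per reducer; each bucket groups values by key
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partition in map_outputs:
        for key, value in partition:
            # Hash partitioning: all records with the same key go to the same reducer
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets

# Output of three hypothetical map tasks
map_outputs = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("a", 1), ("c", 1)],
]

buckets = shuffle(map_outputs, num_reducers=2)

# Reduce side: each reducer sums the values of the keys it owns
reduced = {}
for bucket in buckets:
    for key, values in bucket.items():
        reduced[key] = sum(values)

print(reduced)
```

Every record has to be routed by its key before any reducer can start, which is why the amount of data crossing this boundary dominates the cost of xxxByKey and join operations.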

A slide deck that introduces it:

http://www.slideshare.net/colorant/spark-shuffle-introduction