查看Spark 学习笔记的源代码

== Shared Variables ==

Spark operation 一般都运行在一个一个远程节点上，函数里用到的变量也都会拷贝过去，他们的更新都是基于拷贝，不会去修改 drviver 上的变量。为了解决变量共享问题，提供了两种受限模型：

*  broadcast variables，只读型，在每台机器上缓存，避免 tasks 之间拷贝的开销。creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

<pre>
>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>

>>> broadcastVar.value
[1, 2, 3]
</pre>

* accumulators，累计型，比如计数或者求和。

<pre>
>>> accum = sc.accumulator(0)
Accumulator<id=0, value=0>

>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
10
</pre>

只有 driver 可以读取它的值。

遵循 AccumulatorParam 协议，自定义 accumulator，两个方法：zero 和 addInPlace，对应初始值和『加法』。

<pre>
class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, initialValue):
        return Vector.zeros(initialValue.size)

    def addInPlace(self, v1, v2):
        v1 += v2
        return v1

# Then, create an Accumulator of this type:
vecAccum = sc.accumulator(Vector(...), VectorAccumulatorParam())
</pre>

Spark 保证 task 每次对  accumulator 的更新只会执行一次，哪怕 task 重启。但是如果重复执行 task 或者 stage，那是可能被多次更新的。

accumulators 同样是lazy的， map 等 transform 操作不会『真正』去加。