Spark Study Notes
From Dennis's knowledge base
Revision as of 07:00, 28 July 2016 (Thu) by Dennis zhuang (talk | contribs)
Overview
Spark provides two core abstractions:
- RDD: a resilient distributed dataset is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
- Shared variables: Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only "added" to, such as counters and sums.
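To make the two shared-variable types concrete, here is a minimal sketch in plain Python (no Spark installation needed): it simulates what broadcast variables and accumulators mean across partitions. In real PySpark the equivalents would be `sc.broadcast(...)` and `sc.accumulator(...)`; the names `stopwords`, `dropped`, and `filter_partition` below are illustrative, not Spark API.

```python
# "Broadcast" value: a read-only lookup shared by every task.
stopwords = {"the", "a", "an"}

# "Accumulator": a counter that tasks only add to; the driver reads the total.
dropped = 0

def filter_partition(partition):
    # Simulates one task: filter a partition, bumping the shared counter.
    global dropped
    kept = []
    for word in partition:
        if word in stopwords:
            dropped += 1  # accumulator semantics: tasks may only add
        else:
            kept.append(word)
    return kept

partitions = [["the", "spark", "engine"], ["a", "fast", "cluster"]]
results = [filter_partition(p) for p in partitions]
print(results, dropped)
```

The key constraint this models: tasks never read the accumulator's value (only the driver does), and tasks never modify the broadcast value.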
Getting Started
Using Python as an example:
- bin/spark-submit: submits an application to the cluster
- bin/pyspark: starts an interactive Python shell
Core module import:
```
from pyspark import SparkContext, SparkConf
```
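With that import in hand, the classic first program is a word count over an RDD. Since running real Spark requires an installation and a master, the sketch below simulates each RDD step in plain Python so the data flow is visible; the commented-out lines show what the actual PySpark calls would look like (real API, but the app name and data are made up for illustration).

```python
# Real PySpark version (needs Spark installed):
#   conf = SparkConf().setAppName("wordcount").setMaster("local[2]")
#   sc = SparkContext(conf=conf)
#   counts = sc.parallelize(lines).flatMap(str.split) \
#              .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
from collections import defaultdict

lines = ["spark makes big data simple", "big data needs spark"]

# flatMap: split each line into words, flattening into one list
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for identical keys
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # → {'spark': 2, 'makes': 1, 'big': 2, 'data': 2, 'simple': 1, 'needs': 1}
```

Each local step corresponds one-to-one with an RDD transformation, which is why the chained PySpark version reads almost the same as the loop version.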