Spark Study Notes
Overview

Spark provides two main abstractions (a short sketch follows this list):

- RDD: a resilient distributed dataset (RDD) is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
- Shared variables: Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
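A minimal sketch of both abstractions in PySpark (the data and variable names here are made up for illustration):

<code>
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("demo").setMaster("local[2]")
sc = SparkContext(conf=conf)

# RDD: partition a local collection across the nodes and operate in parallel
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Broadcast variable: cache a read-only value in memory on all nodes
factor = sc.broadcast(10)

# Accumulator: workers may only add to it; the driver reads the result
total = sc.accumulator(0)

def scale(x):
    total.add(x)            # "added to" on the workers
    return x * factor.value

print(rdd.map(scale).collect())  # [10, 20, 30, 40, 50]
print(total.value)               # 15, read back on the driver
</code>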
Getting Started

Take Python as the example:

- bin/spark-submit submits an application (a sample invocation follows this list)
- bin/pyspark starts an interactive shell
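For example, a script could be submitted to a local master like this (my_app.py is a hypothetical file name):

<code>
./bin/spark-submit --master local[4] my_app.py
</code>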
Core module:
<code>
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
</code>
- appName: the name of your application
- master: a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode (a filled-in example follows)
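Putting the two together, a runnable skeleton might look like this; the application name and master URL are placeholder values:

<code>
from pyspark import SparkContext, SparkConf

# "MyApp" and "local[4]" are placeholders: any name works, and local[4]
# runs Spark locally with 4 worker threads
conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(100)).sum())  # 4950

sc.stop()  # shut the context down when finished
</code>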
The shell:

- ./bin/pyspark --master local[4] starts a local shell with 4 worker threads
- ./bin/pyspark --master local[4] --py-files code.py also ships dependency code to the workers (a sketch follows below)
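A sketch of the --py-files workflow, using a hypothetical utils.py in place of code.py (a module actually named code.py would shadow Python's standard-library code module):

<code>
# utils.py -- hypothetical helper shipped with:
#   ./bin/pyspark --master local[4] --py-files utils.py
def tokenize(line):
    return line.lower().split()
</code>

Inside the shell, sc is created automatically and the shipped module is importable on both the driver and the executors:

<code>
import utils

words = sc.textFile("README.md").flatMap(utils.tokenize)
print(words.take(5))
</code>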