This article is about 2,516 characters; estimated reading time: 8 minutes.
A Resilient Distributed Dataset (RDD) is Spark's most fundamental data abstraction: an immutable, partitioned collection of elements that can be operated on in parallel. An RDD has the following characteristics:
A list of partitions (Partitions): an RDD is made up of one or more partitions, which are the basic units of parallelism. Each partition is processed by one task, so the number of partitions determines the degree of parallelism; by default it is related to the number of available CPU cores.
A function for computing each partition: every RDD implements a compute method that produces the data of a given partition.
A list of dependencies on other RDDs: each transformation on an RDD produces a new RDD, forming a lineage of dependencies. When a partition is lost, Spark can recompute it from its ancestors instead of recomputing the whole dataset, which is the basis of RDD fault tolerance.
Optionally, a Partitioner for key-value RDDs, which decides how records are assigned to partitions (e.g. HashPartitioner and RangePartitioner).
Optionally, a list of preferred locations for computing each partition. Following the principle of "moving computation to the data", Spark tries to schedule each task on a node where its data resides.
These characteristics correspond to the following members of the abstract RDD class:

```scala
// compute a given partition
def compute(split: Partition, context: TaskContext): Iterator[T]

// the set of partitions of this RDD
protected def getPartitions: Array[Partition]

// dependencies on parent RDDs
protected def getDependencies: Seq[Dependency[_]] = deps

// preferred locations for computing a given partition
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

// optional partitioner for key-value RDDs
@transient val partitioner: Option[Partitioner] = None
```
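As an illustration of the optional partitioner, here is a sketch of setting it explicitly on a key-value RDD with partitionBy (assuming a running SparkContext named sc; the sample data is made up):

```scala
import org.apache.spark.HashPartitioner

// a key-value RDD; with a HashPartitioner, records whose keys
// hash to the same value end up in the same partition
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(2))

partitioned.partitioner // now Some(HashPartitioner)
```

Pre-partitioning a dataset this way can avoid later shuffles for operations such as joins on the same key.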
Creating RDDs

There are two ways to create an RDD: by parallelizing an existing collection in the driver program, or by referencing a dataset in an external storage system.

From an existing collection
Start the Spark shell, specifying the master URL:

```shell
spark-shell --master local[4]
```
In a Scala program, the equivalent setup is:
```scala
val conf = new SparkConf().setAppName("Spark shell").setMaster("local[4]")
val sc = new SparkContext(conf)
```

When creating an RDD from a collection, you can specify the number of partitions:
```scala
val data = Array(1, 2, 3, 4, 5)
val dataRDD = sc.parallelize(data)    // use the default number of partitions
val dataRDD = sc.parallelize(data, 2) // explicitly use 2 partitions
```
From external storage
Reference a dataset in an external storage system, such as a local file system or HDFS:

```scala
val fileRDD = sc.textFile("/usr/file/emp.txt")
fileRDD.take(1) // returns the first line of the file
```
textFile and wholeTextFiles
```scala
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {...}

def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = {...}
```

As the signatures show, textFile reads a file line by line and returns an RDD[String], while wholeTextFiles reads entire files and returns an RDD of (fileName, fileContent) pairs.

Transformations and actions

RDDs support two kinds of operations: transformations, which create a new RDD from an existing one, and actions, which return a result to the driver after running a computation. All transformations are lazy: they are only executed when an action is triggered.
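The difference between the two readers can be sketched as follows (assuming sc is available; the paths are illustrative):

```scala
// textFile: one record per line of text
val lines = sc.textFile("/usr/file/emp.txt")   // RDD[String]

// wholeTextFiles: one record per file, as (fileName, fileContent)
val files = sc.wholeTextFiles("/usr/file/")    // RDD[(String, String)]
files.map { case (name, content) => (name, content.length) }
```

wholeTextFiles is convenient for directories of many small files, where the file name itself carries information.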
```scala
val list = List(1, 2, 3)
sc.parallelize(list).map(_ * 10).foreach(println)
// output: 10 20 30
```
Caching

One reason Spark is fast is that it can cache a dataset in memory: when an RDD is reused across multiple operations, caching avoids recomputing it from its lineage each time.
Spark provides two methods for caching an RDD: persist(StorageLevel), which accepts an explicit storage level, and cache(), which is shorthand for persist(StorageLevel.MEMORY_ONLY).
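A minimal sketch of both methods (assuming sc and the example file path used earlier):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("/usr/file/emp.txt")

// explicit storage level: spill to disk if memory is insufficient
rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count() // the first action materializes and caches the RDD
rdd.count() // subsequent actions read from the cache
```

Note that persist only marks the RDD for caching; the data is actually stored the first time an action computes it.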
To release the cache, call RDD.unpersist().

Shuffle
A shuffle is the mechanism for redistributing data across partitions. Because it involves disk I/O, data serialization, and network I/O, a shuffle is an expensive operation; the intermediate files it produces are written under the directory configured by spark.local.dir.
Operations that can cause a shuffle include repartition operations such as repartition and coalesce, ByKey operations such as groupByKey and reduceByKey (but not countByKey), and join operations such as cogroup and join.
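A sketch of two shuffle-inducing operations (assuming a SparkContext sc; the sample data is made up):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey shuffles records so that equal keys meet in one partition,
// but combines values within each partition first, reducing network traffic
val counts = pairs.reduceByKey(_ + _)

// repartition always performs a full shuffle of the dataset
val repartitioned = counts.repartition(4)
```

When only decreasing the number of partitions, coalesce can often avoid a full shuffle and is cheaper than repartition.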
The dependencies between RDDs form a directed acyclic graph (DAG), which Spark's scheduler uses to plan the execution of a job.
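The lineage that makes up this DAG can be inspected with RDD.toDebugString; a sketch, assuming sc and an illustrative input file:

```scala
val wordCounts = sc.textFile("/usr/file/emp.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// prints the chain of parent RDDs (the lineage) behind this RDD
println(wordCounts.toDebugString)
```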
This article has introduced the basic concepts and usage of RDDs, which are the foundation for working with Spark.
Reprinted from: http://ogmk.baihongyu.com/