Spark Series 3: Resilient Distributed Datasets (RDDs)
Published: 2019-02-26


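Introduction to RDD

A Resilient Distributed Dataset (RDD) is Spark's most basic data abstraction: a read-only, partitioned collection of records that can be operated on in parallel. An RDD has the following characteristics: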

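  • Partitions: an RDD consists of one or more partitions (Partitions). Each partition is processed by one task. The number of partitions can be specified when the RDD is created; if it is not specified, it defaults to the number of CPU cores allocated to the application.

  • Compute function: an RDD has a compute function that computes each partition.

  • Dependencies: an RDD records its dependencies on other RDDs. Each transformation creates a new dependency, so when part of the partition data is lost, Spark can recompute only the lost partitions from these dependencies instead of recomputing every partition of the RDD.

  • Partitioner: a Key-Value RDD also has a partitioner that decides which partition each record is stored in. Spark currently supports HashPartitioner and RangePartitioner.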

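  • Preferred locations: an optional list of preferred locations for each partition. When scheduling tasks, Spark tries to assign each computation to the location where its data is stored.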

  • The RDD[T] abstract class is defined as follows:

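    // Implemented by subclasses to compute a given partition
    def compute(split: Partition, context: TaskContext): Iterator[T]
    // Returns the list of partitions
    protected def getPartitions: Array[Partition]
    // Returns the dependencies on parent RDDs
    protected def getDependencies: Seq[Dependency[_]] = deps
    // Returns the preferred locations of a partition
    protected def getPreferredLocations(split: Partition): Seq[String] = Nil
    // Partitioner, present for key-value RDDs
    @transient val partitioner: Option[Partitioner] = None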
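
To make the role of these methods concrete, here is a minimal sketch of a custom RDD. The names `RangeSlice`, `SimpleRangeRDD`, and `SimpleRangeDemo` are hypothetical and not part of Spark; the sketch assumes Spark is on the classpath and runs in local mode. It overrides only `getPartitions` and `compute`, and passes `Nil` as the dependency list because the RDD has no parents.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: remembers the integer range [start, end) it covers.
class RangeSlice(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD producing the integers 0 until n, split into numSlices partitions.
class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) {

  // One RangeSlice per partition, each covering a contiguous sub-range.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new RangeSlice(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  // Computes the elements of a single partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val slice = split.asInstanceOf[RangeSlice]
    (slice.start until slice.end).iterator
  }
}

object SimpleRangeDemo {
  // Returns the partition count and the collected elements.
  def run(): (Int, Seq[Int]) = {
    val conf = new SparkConf().setAppName("simple-range-rdd").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = new SimpleRangeRDD(sc, 10, 3)
      (rdd.getNumPartitions, rdd.collect().toSeq)
    } finally sc.stop()
  }
}
```

Calling `SimpleRangeDemo.run()` should report 3 partitions and the numbers 0 to 9 in order, since `collect()` concatenates partitions by index.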

    Creating RDDs

    An RDD can be created in two ways:

  • From an existing collection

    Start the Spark shell in local mode:

    spark-shell --master local[4]

    The equivalent Scala code:

    val conf = new SparkConf().setAppName("Spark shell").setMaster("local[4]")
    val sc = new SparkContext(conf)

    When creating the RDD, the number of partitions can be specified:

    val data = Array(1, 2, 3, 4, 5)
    val dataRDD = sc.parallelize(data) // default number of partitions
    val dataRDD2 = sc.parallelize(data, 2) // explicit number of partitions
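  • From an external storage system

    Read a file from the file system:

    val fileRDD = sc.textFile("/usr/file/emp.txt")
    fileRDD.take(1)

    Notes:

    • When reading from the local file system on a cluster, the file must exist at the same path on every node.
    • Directories, compressed files, and wildcards are supported.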
  • textFile vs. wholeTextFiles

    • textFile: returns RDD[String], where each element is one line of a file.
    • wholeTextFiles: returns RDD[(String, String)], where each element is a (file path, file content) pair.
      Their signatures:
    def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {...}
    def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = {...}
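
The creation paths above can be compared side by side. The following is a sketch only: `CreateRddDemo` and the temporary files `a.txt` and `b.txt` are made up for illustration, and it assumes Spark running in local mode with write access to the system temp directory.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CreateRddDemo {
  // Returns (partitions of the parallelized RDD, line count, file count).
  def run(): (Int, Long, Long) = {
    val conf = new SparkConf().setAppName("create-rdd").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // From a collection, with an explicit partition count.
      val numbers = sc.parallelize(Array(1, 2, 3, 4, 5), 2)

      // Hypothetical input: a temporary directory holding two small text files.
      val dir = Files.createTempDirectory("rdd-demo")
      Files.write(dir.resolve("a.txt"), "a1\na2\n".getBytes("UTF-8"))
      Files.write(dir.resolve("b.txt"), "b1\n".getBytes("UTF-8"))
      val path = dir.toUri.toString // file: URI, so the local file system is used

      val lines = sc.textFile(path)       // one element per line
      val files = sc.wholeTextFiles(path) // one (path, content) pair per file

      (numbers.getNumPartitions, lines.count(), files.count())
    } finally sc.stop()
  }
}
```

With the two files above, `textFile` yields three elements (one per line) while `wholeTextFiles` yields two (one per file), which is the practical difference between the two methods.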

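  • Operating on RDDs

    RDDs support two types of operations: transformations and actions. Transformations are lazy: they only record the operation, and the computation actually runs only when an action is invoked.

    val list = List(1, 2, 3)
    sc.parallelize(list).map(_ * 10).foreach(println)
    // Output: 10 20 30 (the order may vary when there are several partitions)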
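
Lazy evaluation can be observed directly. The sketch below uses a hypothetical `LazyDemo` object and counts invocations of the `map` function with a `LongAccumulator`; it assumes Spark in local mode with no task retries, since accumulators updated inside transformations can be incremented again when a task is re-executed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  // Returns (map calls before the action, map calls after it, collected result).
  def run(): (Long, Long, Seq[Int]) = {
    val conf = new SparkConf().setAppName("lazy-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("map calls")

      // Transformation only: nothing is executed yet.
      val mapped = sc.parallelize(List(1, 2, 3)).map { x =>
        calls.add(1)
        x * 10
      }
      val before = calls.sum

      // Action: triggers the actual computation.
      val result = mapped.collect().toSeq
      val after = calls.sum

      (before, after, result)
    } finally sc.stop()
  }
}
```

The counter stays at zero until `collect()` runs, which shows that `map` alone does not process any data.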

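    Caching RDDs

    Spark supports several storage levels for cached data:

  • MEMORY_ONLY: the default level. Stores the RDD as deserialized objects in the JVM; partitions that do not fit in memory are not cached and are recomputed when needed.
  • MEMORY_AND_DISK: stores partitions in memory and spills those that do not fit to disk.
  • MEMORY_ONLY_SER: stores the RDD as serialized objects, which saves space; available only in Java and Scala.
  • MEMORY_AND_DISK_SER: like MEMORY_ONLY_SER, but spills partitions that do not fit in memory to disk.
  • DISK_ONLY: stores the RDD only on disk.
  • Replicated levels such as MEMORY_ONLY_2 and MEMORY_AND_DISK_2: same as the levels above, but each partition is replicated on two nodes.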
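  • Caching methods:

    • persist(StorageLevel): caches with the specified storage level.
    • cache(): equivalent to persist(MEMORY_ONLY).

    Removing cached data:

    • Spark automatically monitors cache usage and evicts old partitions using a least-recently-used (LRU) policy.
    • To remove data manually, call RDD.unpersist().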
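
As a small sketch of the caching methods above, the hypothetical `CacheDemo` object below inspects `getStorageLevel` after `cache()`, after `persist(...)`, and after `unpersist()`. It assumes Spark in local mode; the data and names are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  // Returns the storage level after cache(), after persist(), and after unpersist().
  def run(): (StorageLevel, StorageLevel, StorageLevel) = {
    val conf = new SparkConf().setAppName("cache-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // cache() marks the RDD for caching at the default level.
      val cached = sc.parallelize(1 to 100).map(_ * 2).cache()
      val cacheLevel = cached.getStorageLevel
      cached.count() // the first action materializes the cached partitions

      // persist() takes an explicit storage level.
      val persisted = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK)
      val persistLevel = persisted.getStorageLevel
      persisted.count()

      // unpersist() removes the cached blocks and resets the level.
      persisted.unpersist()
      val afterUnpersist = persisted.getStorageLevel

      (cacheLevel, persistLevel, afterUnpersist)
    } finally sc.stop()
  }
}
```

Note that `cache()` and `persist()` only mark the RDD; the data is stored when the first action computes it.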

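    Understanding Shuffle

  • Shuffle overview

    A shuffle redistributes data across partitions. It is an expensive operation because it involves disk and network I/O as well as data serialization. The intermediate files it produces are kept until the corresponding RDD is no longer used and is garbage collected; the temporary storage directory is set by spark.local.dir.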

  • Operations that cause a shuffle

    • Repartitioning operations: repartition and coalesce.
    • ByKey operations: groupByKey and reduceByKey, but not countByKey.
    • Join operations: cogroup and join.
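  • Wide and narrow dependencies

    • Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
    • Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD, which requires a shuffle.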
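
The two kinds of dependency can be checked on a concrete lineage. This is a sketch under assumptions: `DependencyDemo` and the sample pairs are hypothetical, and Spark runs in local mode. It inspects `dependencies` on the result of `mapValues` (no shuffle) and of `reduceByKey` (shuffle).

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyDemo {
  // Returns (mapValues is narrow, reduceByKey is wide, reduced result).
  def run(): (Boolean, Boolean, Map[String, Int]) = {
    val conf = new SparkConf().setAppName("dependency-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

      // mapValues works partition by partition: narrow dependency.
      val mapped = pairs.mapValues(_ + 1)
      // reduceByKey must bring equal keys together: wide dependency.
      val reduced = mapped.reduceByKey(_ + _)

      val narrow = mapped.dependencies.head.isInstanceOf[OneToOneDependency[_]]
      val wide = reduced.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]

      (narrow, wide, reduced.collect().toMap)
    } finally sc.stop()
  }
}
```

`reduceByKey` combines values within each partition before the shuffle, so it usually moves less data across the network than `groupByKey` followed by a reduction.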

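  • The DAG and stage division

    The dependencies between RDDs form a DAG (directed acyclic graph), which Spark divides into stages according to the type of dependency:

    • Narrow dependencies: the operations are pipelined and computed within the same stage.
    • Wide dependencies: the child can only be computed after the shuffle completes, so a new stage begins.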

    Understanding partitions, dependencies, and caching of RDDs is the basis for tuning the performance of Spark jobs.
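
    One way to look at the lineage Spark builds is `toDebugString`. The sketch below uses a hypothetical `LineageDemo` object with a small word count in local mode; the exact text of the output differs between Spark versions, so treat it as a debugging aid rather than a stable format.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  // Returns the textual lineage of a small word-count job.
  def run(): String = {
    val conf = new SparkConf().setAppName("lineage-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val counts = sc.parallelize(Seq("a b", "b c"), 2)
        .flatMap(_.split(" ")) // narrow
        .map((_, 1))           // narrow
        .reduceByKey(_ + _)    // wide: introduces a shuffle

      counts.toDebugString
    } finally sc.stop()
  }
}
```

    In the printed lineage, the `ShuffledRDD` entry marks where the shuffle happens, which is also where the job is split into two stages; the `flatMap` and `map` steps before it run together in one stage.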

    Reposted from: http://ogmk.baihongyu.com/
