博客
关于我
Spark 系列3—— 弹性式数据集RDDs
阅读量:117 次
发布时间:2019-02-26

本文共 2516 字,大约阅读时间需要 8 分钟。

RDD??

Resilient Distributed Dataset?RDD??Spark??????????????????????????????RDD????????

  • ???????RDD?????????Partitions???????????????????????????????????????????????CPU????

  • ?????RDD??compute???????????

  • ?????RDD??????????????????????????????????????????????????????????????????????????????

  • ??????Key-Value???RDD??????????????????????HashPartitioner?RangePartitioner???

  • ?????????????????????Spark?????????????????????????????

  • RDD[T]???????

    // ?????????def compute(split: Partition, context: TaskContext): Iterator[T]// ??????protected def getPartitions: Array[Partition]// ??????protected def getDependencies: Seq[Dependency[_]] = deps// ????????protected def getPreferredLocations(split: Partition): Seq[String] = Nil// ??????????@transient val partitioner: Option[Partitioner] = None

    ??RDD

    RDD???????????

  • ???????

    ??Spark shell??????????

    spark-shell --master local[4]

    ??Scala????

    val conf = new SparkConf().setAppName("Spark shell").setMaster("local[4]")val sc = new SparkContext(conf)

    ??RDD????????

    val data = Array(1, 2, 3, 4, 5)val dataRDD = sc.parallelize(data) // ?????val dataRDD = sc.parallelize(data, 2) // ?????
  • ????????

    ??????????

    val fileRDD = sc.textFile("/usr/file/emp.txt")fileRDD.take(1)

    ?????

    • ???????????????????????
    • ??????????????
  • textFile?wholeTextFiles

    • textFile???RDD[String]???????????
    • wholeTextFiles???RDD[(String, String)]???????????
      ???
    def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {...}def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = {...}

  • RDD??

    RDD???????transformations?????actions??????????????????action???????

    val list = List(1, 2, 3)sc.parallelize(list).map(_ * 10).foreach(println)// ???10 20 30

    ??RDD

    Spark??????????????????

  • MEMORY_ONLY??????????JVM??????????????
  • MEMORY_AND_DISK????????????????????
  • MEMORY_ONLY_SER????????????Java?Scala????????
  • MEMORY_AND_DISK_SER???MEMORY_ONLY_SER????????????
  • DISK_ONLY????????
  • ????????MEMORY_ONLY_2?MEMORY_AND_DISK_2???????????????
  • ?????

    • persist(StorageLevel)??????
    • cache()????persist(MEMORY_ONLY)?

    ?????

    • Spark????????????LRU?????
    • ?????RDD.unpersist()?

    Shuffle???

  • Shuffle??

    Shuffle???????????????I/O?????????????RDD?????????????spark.local.dir???????

  • ??Shuffle???

    • ?????repartition?coalesce?
    • ByKey???groupByKey?reduceByKey??countByKey???
    • ?????cogroup?join?
  • ???????

    • ??????????????????????????
    • ????????????????????Shuffle?????????

  • DAG?????

    RDD????????DAG?Spark???????????

    • ??????????????????
    • ???????Shuffle????????

    ??????????????RDD????????Spark?????

    转载地址:http://ogmk.baihongyu.com/

    你可能感兴趣的文章
    oracle删除重复数据保留第一条记录
    查看>>
    oracle判断空值的函数nvl2,【PL/SQL】 NVL,NVL2,COALESCE 三种空值判断函数
    查看>>
    Oracle发布VirtualBox 7.1稳定版!支持ARM、优化了UI、支持Wayland等
    查看>>
    oracle启动三步
    查看>>
    oracle启动关闭服务,启动关闭oracle服务.bat
    查看>>
    Oracle命令行创建数据库
    查看>>
    Oracle和SQL server的数据类型比较
    查看>>
    oracle和sybase的一些区别
    查看>>
    oracle在日本遇到的技术问题
    查看>>
    Oracle在线重定义
    查看>>
    oracle基础 管理索引
    查看>>
    ORACLE多表关联UPDATE 语句
    查看>>
    Oracle多表查询与数据更新
    查看>>
    oracle如何修改单个用户密码永不过期
    查看>>
    oracle字符集
    查看>>
    oracle存储参数(storage子句)含义及设置技巧
    查看>>
    Oracle学习
    查看>>
    Oracle学习第五课
    查看>>
    Oracle安装、Navicat for Oracle、JDBCl连接、获取表结构
    查看>>
    ORACLE客户端连接
    查看>>