Spark RDD distinct

- 1 min

distinct

返回一个该RDD中每一个item仅仅包含一次的新RDD

注:这是一个action操作,会触发实际计算

函数原型:

def distinct(): RDD[T] def distinct(numPartitions: Int): RDD[T]

例子:


val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res16: Int = 2

a.distinct(3).partitions.length
res17: Int = 3
comments powered by Disqus
rss facebook twitter github youtube mail spotify lastfm instagram linkedin google google-plus pinterest medium vimeo stackoverflow reddit quora