Skip to content

Latest commit

 

History

History
78 lines (54 loc) · 2.25 KB

spark-mllib-vector.adoc

File metadata and controls

78 lines (54 loc) · 2.25 KB

Vector

Vector sealed trait represents a numeric vector of values (of Double type) and their indices (of Int type).

It belongs to org.apache.spark.mllib.linalg package.

Note

To Scala and Java developers:

Vector class in Spark MLlib belongs to org.apache.spark.mllib.linalg package.

It is not the Vector type in Scala or Java. Train your eyes to see two types of the same name. You’ve been warned.

A Vector object knows its size.

A Vector object can be converted to:

  • Array[Double] using toArray.

  • a dense vector as DenseVector using toDense.

  • a sparse vector as SparseVector using toSparse.

  • (1.6.0) a JSON string using toJson.

  • (internal) a breeze vector as BV[Double] using toBreeze.

There are exactly two available implementations of Vector sealed trait (that also belong to org.apache.spark.mllib.linalg package):

  • DenseVector

  • SparseVector

Tip
Use Vectors factory object to create vectors, be it DenseVector or SparseVector.
import org.apache.spark.mllib.linalg.Vectors

// You can create dense vectors explicitly by giving values per index
val denseVec = Vectors.dense(Array(0.0, 0.4, 0.3, 1.5))
val almostAllZeros = Vectors.dense(Array(0.0, 0.4, 0.3, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))

// You can however create a sparse vector by the size and non-zero elements
val sparse = Vectors.sparse(10, Seq((1, 0.4), (2, 0.3), (3, 1.5)))

// Convert a dense vector to a sparse one
val fromSparse = sparse.toDense

scala> almostAllZeros == fromSparse
res0: Boolean = true
Note
The factory object is called Vectors (plural).
import org.apache.spark.mllib.linalg._

// prepare elements for a sparse vector
// NOTE: It is more Scala rather than Spark
val indices = 0 to 4
val elements = indices.zip(Stream.continually(1.0))
val sv = Vectors.sparse(elements.size, elements)

// Notice how Vector is printed out
scala> sv
res4: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,1.0,1.0,1.0,1.0])

scala> sv.size
res0: Int = 5

scala> sv.toArray
res1: Array[Double] = Array(1.0, 1.0, 1.0, 1.0, 1.0)

scala> sv == sv.copy
res2: Boolean = true

scala> sv.toJson
res3: String = {"type":0,"size":5,"indices":[0,1,2,3,4],"values":[1.0,1.0,1.0,1.0,1.0]}