Xxhash64 supports nested types [databricks] #11859
Conversation
After compiling and installing the JNI with the patch NVIDIA/spark-rapids-jni#2575, all the cases for nested types passed.
Signed-off-by: Chong Gao <[email protected]>
Force-pushed from 725c25a to 5bf3c78
build
build
A map is a list of structs in cuDF.

/**
 * Convert map to list of struct.
 * Note: does not support UserDefinedType or other irregular types.
 * Does not retain nullability or other metadata.
 */
private def flatMap(inputType: DataType): DataType = {
  inputType match {
    case mapType: MapType =>
      ArrayType(StructType(Array(
        StructField("key", flatMap(mapType.keyType)),
        StructField("value", flatMap(mapType.valueType))
      )))
    case arrayType: ArrayType => ArrayType(flatMap(arrayType.elementType))
    case structType: StructType =>
      StructType(structType.map(f => StructField(f.name, flatMap(f.dataType))).toArray)
    case nullType: NullType => nullType
    case atomicType: AtomicType => atomicType
    case other => throw new RuntimeException(s"Unsupported type: $other")
  }
}

Yan Feng will file a follow-up PR to do this improvement.
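To make the map-to-list-of-struct rewrite concrete, here is a minimal self-contained sketch of the same idea. The `DType` names below are toy stand-ins for `org.apache.spark.sql.types`, not the PR's actual code:

```scala
// Toy stand-ins for Spark's DataType hierarchy (illustrative only).
sealed trait DType
case object IntT extends DType
case object StringT extends DType
case class ArrayT(elem: DType) extends DType
case class StructT(fields: Seq[(String, DType)]) extends DType
case class MapT(key: DType, value: DType) extends DType

// cuDF has no map column type: a map is stored as list<struct<key, value>>,
// so the flatten pass rewrites every map recursively, including maps nested
// inside arrays and structs.
def flatten(t: DType): DType = t match {
  case MapT(k, v) =>
    ArrayT(StructT(Seq("key" -> flatten(k), "value" -> flatten(v))))
  case ArrayT(e)   => ArrayT(flatten(e))
  case StructT(fs) => StructT(fs.map { case (n, ft) => (n, flatten(ft)) })
  case atomic      => atomic
}
```

For example, `Map[String, Int]` becomes `Array[Struct(key: String, value: Int)]`, matching cuDF's column layout.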
case ArrayType(c: DataType, _) => computeMaxStackSizeForFlatten(c)
case st: StructType =>
  1 + st.map(f => computeMaxStackSizeForFlatten(f.dataType)).max
case _ => 1 // primitive types
Perhaps we don't need the flatMap function and can directly compute the depth of the map type using:

case mt: MapType =>
  2 + math.max(computeMaxStackSizeForFlatten(mt.keyType), computeMaxStackSizeForFlatten(mt.valueType))
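Under that suggestion, the whole depth computation could look roughly like the sketch below. The type names are toy stand-ins, not the PR's code; the constant 2 reflects that a map is flattened to list&lt;struct&lt;key, value&gt;&gt;, so the list and the struct each consume one stack level:

```scala
// Toy model of Spark's DataType hierarchy (illustrative only).
sealed trait DType
case object IntT extends DType
case object StringT extends DType
case class ArrayT(elem: DType) extends DType
case class StructT(fieldTypes: Seq[DType]) extends DType
case class MapT(key: DType, value: DType) extends DType

def computeMaxStackSize(t: DType): Int = t match {
  // A map is list<struct<key, value>> in cuDF: the list and the struct
  // each take one stack level, hence the constant 2.
  case MapT(k, v) =>
    2 + math.max(computeMaxStackSize(k), computeMaxStackSize(v))
  // list(struct) needs one extra level; list(anything else) reuses
  // the element's level (see the JNI optimization discussed below).
  case ArrayT(s: StructT) => 1 + computeMaxStackSize(s)
  case ArrayT(e)          => computeMaxStackSize(e)
  case StructT(fts)       => 1 + fts.map(computeMaxStackSize).max
  case _                  => 1 // primitive types
}
```

For example, `Map[String, Int]` flattens to list&lt;struct&lt;string, int&gt;&gt;, so its stack size is 2 + max(1, 1) = 3, which is why a later comment asks that the fallback message explain that a map counts as a depth of 2.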
+1 for not rewriting the data type to check the depth.
Thanks, let's post a follow-up PR to do the improvement.
@res-life why? It takes 5 mins, and I want the fallback message updated too.
Updated. commit
val maxDepth = a.children.map(
  c => XxHash64Utils.computeMaxStackSize(c.dataType)).max
if (maxDepth > Hash.MAX_STACK_DEPTH) {
  willNotWorkOnGpu(s"The data type requires a stack size of $maxDepth, " +
This is not very clear because Maps count as a depth of 2. Users who see this will be confused, because they will try to add things up manually and it will not work out. At a minimum we need to mention that Maps count as a depth of 2 here.
Updated. commit
build
case ArrayType(c: StructType, _) => 1 + computeMaxStackSize(c)
case ArrayType(c: DataType, _) => computeMaxStackSize(c)
Can this be combined like this?

- case ArrayType(c: StructType, _) => 1 + computeMaxStackSize(c)
- case ArrayType(c: DataType, _) => computeMaxStackSize(c)
+ case ArrayType(c: _, _) => computeMaxStackSize(c)
No. In JNI, list(other type) is optimized to use less stack depth: it only uses depthOf(other).
But list(struct) should be 1 + depthOf(struct).
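That asymmetry can be checked with a tiny self-contained sketch (toy type names, assumed for illustration, matching the JNI behavior described above):

```scala
// Toy model of the relevant Spark types (illustrative only).
sealed trait DType
case object IntT extends DType
case class ArrayT(elem: DType) extends DType
case class StructT(fieldTypes: Seq[DType]) extends DType

def depth(t: DType): Int = t match {
  case ArrayT(s: StructT) => 1 + depth(s) // list(struct): one extra stack level
  case ArrayT(e)          => depth(e)     // list(other): JNI reuses the element's level
  case StructT(fts)       => 1 + fts.map(depth).max
  case _                  => 1            // primitive types
}
```

With these rules, even a deeply nested `Array[Array[Int]]` stays at depth 1, while `Array[Struct[Int]]` costs 1 (list) + 1 (struct) + 1 (element) = 3.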
NIT:

- case ArrayType(c: StructType, _) => 1 + computeMaxStackSize(c)
- case ArrayType(c: DataType, _) => computeMaxStackSize(c)
+ case ArrayType(c: StructType, _) => 1 + computeMaxStackSize(c)
+ case ArrayType(c, _) => computeMaxStackSize(c)
build
The JNI PR "Add xxhash64 support for nested types" is initially required by Hyper Log Log Plus Plus (HLLPP).
This PR is used to
This PR contains:
Signed-off-by: Chong Gao [email protected]