
Xxhash64 supports nested types [databricks] #11859

Merged
merged 11 commits into NVIDIA:branch-25.02 on Dec 19, 2024

Conversation

res-life
Collaborator

@res-life res-life commented Dec 11, 2024

The JNI PR "Add xxhash64 support for nested types" was initially required by HyperLogLogPlusPlus (HLLPP).

This PR is used to:

  • verify the correctness of the xxhash64 results for nested types for HLLPP
  • additionally, add support for nested types

This PR contains:

  • Support for nested types in xxhash64
  • A fallback when the stack depth required by a nested type is greater than 8 (refer to how to calculate stack depth); a sketch of this check is shown below
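
As a rough illustration of that check, here is a minimal, self-contained sketch in Scala. It folds in the direct MapType rule discussed in the review comments below (the PR itself flattens maps to list-of-struct first, which yields the same depth), and the object name, helper names, and message text are illustrative rather than the exact code in this PR; the real check uses XxHash64Utils.computeMaxStackSize and Hash.MAX_STACK_DEPTH as quoted later in the conversation.

  import org.apache.spark.sql.types._

  object XxHash64FallbackSketch {
    // Mirrors Hash.MAX_STACK_DEPTH: this PR falls back when the required depth exceeds 8.
    val MaxStackDepth = 8

    // Depth rules from this PR: primitives count as 1, a struct adds 1 over its deepest
    // field, a list of structs adds 1, any other list adds nothing, and a map counts
    // as 2 (it is stored as a list of key/value structs in cuDF).
    def computeMaxStackSize(dt: DataType): Int = dt match {
      case mt: MapType =>
        2 + math.max(computeMaxStackSize(mt.keyType), computeMaxStackSize(mt.valueType))
      case ArrayType(c: StructType, _) => 1 + computeMaxStackSize(c)
      case ArrayType(c, _) => computeMaxStackSize(c)
      case st: StructType => 1 + st.fields.map(f => computeMaxStackSize(f.dataType)).max
      case _ => 1
    }

    // Returns a reason string when the inputs are too deeply nested for the GPU kernel,
    // in the spirit of the willNotWorkOnGpu call quoted later in this conversation.
    def checkFallback(inputTypes: Seq[DataType]): Option[String] = {
      val maxDepth = inputTypes.map(computeMaxStackSize).max
      if (maxDepth > MaxStackDepth) {
        Some(s"The data type requires a stack size of $maxDepth, " +
          s"which exceeds the limit of $MaxStackDepth")
      } else {
        None
      }
    }
  }

For example, checkFallback(Seq(IntegerType)) returns None, while a map nested four levels deep requires a depth of 9 under these rules and therefore returns a reason, so the expression stays on the CPU.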

Signed-off-by: Chong Gao [email protected]

@res-life res-life requested review from ttnghia and ustcfy December 11, 2024 07:55
@res-life res-life changed the title Xxhash64 supports nested types [Do not review]Xxhash64 supports nested types Dec 11, 2024
@res-life
Copy link
Collaborator Author

After compiling and installing the JNI with the patch NVIDIA/spark-rapids-jni#2575, all the cases for nested types passed.
@ustcfy @ttnghia we can merge the JNI PR, because this PR has verified its correctness.

@res-life res-life force-pushed the xxhash64-support-nested-type branch from 725c25a to 5bf3c78 on December 11, 2024 08:39
ttnghia previously approved these changes Dec 11, 2024
@res-life res-life changed the title [Do not review]Xxhash64 supports nested types Xxhash64 supports nested types Dec 18, 2024
@res-life res-life changed the title Xxhash64 supports nested types Xxhash64 supports nested types [databricks] Dec 18, 2024
@res-life res-life marked this pull request as ready for review December 18, 2024 07:42
@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

@res-life
Collaborator Author

In cuDF, a map is a list of structs.
Yan Feng mentioned a better approach to check the max stack depth without the flatMap conversion below:

  /**
   * Convert map to list of struct.
   * Note: does not support UserDefinedType and other irregular types.
   * Does not retain nullability or other metadata.
   */
  private def flatMap(inputType: DataType): DataType = {
    inputType match {
      case mapType: MapType =>
        ArrayType(StructType(Array(
          StructField("key", flatMap(mapType.keyType)),
          StructField("value", flatMap(mapType.valueType))
        )))
      case arrayType: ArrayType => ArrayType(flatMap(arrayType.elementType))
      case structType: StructType =>
        StructType(structType.map(f => StructField(f.name, flatMap(f.dataType))).toArray)
      case nullType: NullType => nullType
      case atomicType: AtomicType => atomicType
      case other => throw new RuntimeException(s"Unsupported type: $other")
    }
  }

Yan Feng will file a follow-up PR for this improvement.

case ArrayType(c: DataType, _) => computeMaxStackSizeForFlatten(c)
case st: StructType =>
  1 + st.map(f => computeMaxStackSizeForFlatten(f.dataType)).max
case _ => 1 // primitive types
Collaborator

Perhaps we don't need the flatMap function and can directly compute the depth of the map type using

case mt: MapType => 
    2 + math.max(computeMaxStackSizeForFlatten(mt.keyType), computeMaxStackSizeForFlatten(mt.valueType))
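
For reference, a small self-contained sketch showing why the direct MapType rule matches the flatten-then-compute route. The flatMap body follows the comment above (with the unsupported-type cases collapsed for brevity), depthOfFlattened follows the cases quoted in this thread, and the object name and example type are only illustrative:

  import org.apache.spark.sql.types._

  object MapDepthSketch {
    // Flatten maps into list-of-struct, as in the flatMap shown earlier in this conversation.
    def flatMap(inputType: DataType): DataType = inputType match {
      case mapType: MapType =>
        ArrayType(StructType(Array(
          StructField("key", flatMap(mapType.keyType)),
          StructField("value", flatMap(mapType.valueType)))))
      case arrayType: ArrayType => ArrayType(flatMap(arrayType.elementType))
      case structType: StructType =>
        StructType(structType.fields.map(f => StructField(f.name, flatMap(f.dataType))))
      case other => other // primitives, NullType, etc.
    }

    // Depth of an already-flattened type, per the cases quoted in this thread.
    def depthOfFlattened(dt: DataType): Int = dt match {
      case ArrayType(c: StructType, _) => 1 + depthOfFlattened(c)
      case ArrayType(c, _) => depthOfFlattened(c)
      case st: StructType => 1 + st.fields.map(f => depthOfFlattened(f.dataType)).max
      case _ => 1
    }

    def main(args: Array[String]): Unit = {
      val m = MapType(StringType,
        ArrayType(StructType(Seq(StructField("a", IntegerType), StructField("b", LongType)))))
      // Flattening first: the map becomes list<struct<key, value>>, which is where the "2 +" comes from.
      val viaFlatten = depthOfFlattened(flatMap(m))
      // Proposed direct rule: 2 + max(depth(key), depth(value)), with no type rewrite.
      val direct = 2 + math.max(depthOfFlattened(flatMap(m.keyType)), depthOfFlattened(flatMap(m.valueType)))
      println(s"$viaFlatten == $direct") // prints "5 == 5"
    }
  }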

Collaborator

+1 for not rewriting the data type to check the depth.

Collaborator Author

Thanks, let's post a follow-up PR for this improvement.

Collaborator

@res-life why? It takes 5 minutes, and I want the fallback message updated too.

Collaborator Author
@res-life res-life Dec 19, 2024

Updated. commit

@res-life res-life requested a review from ttnghia December 18, 2024 13:54
val maxDepth = a.children.map(
  c => XxHash64Utils.computeMaxStackSize(c.dataType)).max
if (maxDepth > Hash.MAX_STACK_DEPTH) {
  willNotWorkOnGpu(s"The data type requires a stack size of $maxDepth, " +
Collaborator

This is not very clear because Maps count as a depth of 2. Users who see this will be confused, because they will try to add things up manually and it will not work out. At a minimum we need to mention that Maps count as a depth of 2 here.
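
Purely as an illustration of that point (the wording actually used was updated in the commit linked in the reply below, which is not shown here), the message could spell the rule out roughly like this:

  // Hypothetical wording only; maxDepth and the limit come from the check quoted above.
  def fallbackMessage(maxDepth: Int, limit: Int): String =
    s"The data type requires a stack size of $maxDepth (counting a map as a depth of 2), " +
      s"which exceeds the GPU limit of $limit"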

Collaborator Author
@res-life res-life Dec 19, 2024

Updated. commit

@res-life
Collaborator Author

build

Comment on lines 125 to 126
case ArrayType(c: StructType, _) => 1 + computeMaxStackSize(c)
case ArrayType(c: DataType, _) => computeMaxStackSize(c)
Collaborator

Can this be combined like this?

Suggested change
- case ArrayType(c: StructType, _) => 1 + computeMaxStackSize(c)
- case ArrayType(c: DataType, _) => computeMaxStackSize(c)
+ case ArrayType(c, _) => computeMaxStackSize(c)

Collaborator Author

No. In JNI, list(other type) is optimized to use less stack depth: it only needs depthOf(other).
But for list(struct) it should be 1 + depthOf(struct). For example, by these rules array(array(int)) only needs a depth of 1, while array(struct(int)) needs 3, so the two cases cannot be merged.

Collaborator

NIT:

Suggested change
- case ArrayType(c: StructType, _) => 1 + computeMaxStackSize(c)
- case ArrayType(c: DataType, _) => computeMaxStackSize(c)
+ case ArrayType(c: StructType, _) => 1 + computeMaxStackSize(c)
+ case ArrayType(c, _) => computeMaxStackSize(c)

ustcfy previously approved these changes Dec 19, 2024
integration_tests/src/main/python/hashing_test.py: 3 outdated review comments (resolved)
@res-life
Collaborator Author

build

@res-life res-life merged commit adb89aa into NVIDIA:branch-25.02 Dec 19, 2024
49 of 50 checks passed
@res-life res-life deleted the xxhash64-support-nested-type branch December 19, 2024 23:40