Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add perf test for non-UTC operators #10078

Merged
merged 5 commits into from
Dec 27, 2023

Conversation

res-life
Copy link
Collaborator

closes #10076

  • Add perf test framework for non-UTC operators
  • Add an example perf test for from_utc_timestamp

Signed-off-by: Chong Gao [email protected]

@res-life res-life marked this pull request as draft December 19, 2023 11:42
Copy link
Collaborator

@winningsix winningsix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for micro-benchmark only, right?

@res-life
Copy link
Collaborator Author

This is for micro-benchmark only, right?

No, it's a end to end benchmark testing:
For example, test from_utc_timestamp

createDF(spark).select(f.from_utc_timestamp(f.col("c_timestamp"), zone))

@sameerz sameerz added performance A performance related task/issue test Only impacts tests labels Dec 23, 2023
@res-life res-life marked this pull request as ready for review December 25, 2023 08:31
Copy link
Collaborator

@thirtiseven thirtiseven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please share the results of this perf test?

Thread.sleep(5L)

def perfTest(spark: SparkSession, zone: String): DataFrame = {
spark.read.parquet(path).select(functions.max( // use max to reduce the result data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spark.read.parquet(path).cache().select(...) Perhaps add cache here to eliminate the overhead of the Scan ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the test file generated is small 2.5M, so the second time running the OS will cache the file.
I tried to generate a file that is about 500M and use df.cache.
The previous result (2.5M) is:

CPU
from_utc_timestamp,Cpu,Iran,733
from_utc_timestamp,Cpu,Iran,781

GPU
from_utc_timestamp,Gpu,Iran,131
from_utc_timestamp,Gpu,Iran,114

The new result(500M + df.cache) is:

CPU:
from_utc_timestamp,Cpu,Iran,670
from_utc_timestamp,Cpu,Iran,665

GPU:
from_utc_timestamp,Gpu,Iran,334
from_utc_timestamp,Gpu,Iran,326

The event log shows GPU decode time is 115 ms.

So using highly compressed file can reduce the scan/decode time.

val year2000 = Instant.parse("2000-01-01T00:00:00Z").getEpochSecond * 1000L * 1000L
val year2030 = Instant.parse("2030-01-01T00:00:00Z").getEpochSecond * 1000L * 1000L

private val stringsForCast = Array(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we put every zone IDs into this pool by loading them from TimeZoneDB ? Either, appends more zones into this pool.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we can specify the time zones:

-DTZs=Iran,Asia/Shanghai -DenableTimeZonePerf=true

@res-life
Copy link
Collaborator Author

build

1 similar comment
@res-life
Copy link
Collaborator Author

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits.


private val zones = timeZoneStrings.split(",")

private val path = "/tmp/tmp_TimeZonePerfSuite"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we add a UUID or something similar here. Generally it is not a good idea to have any totally hard coded paths, just to avoid issues with multiple tests running at the same time.

class TimeZonePerfSuite extends SparkQueryCompareTestSuite with BeforeAndAfterAll {
// perf test is disabled by default since it's a long running time in UT.
private val enablePerfTest = java.lang.Boolean.getBoolean("enableTimeZonePerf")
private val timeZoneStrings = System.getProperty("TZs", "Iran")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how I feel about hidden properties to enable/disable tests. At a minimum these need to be documented somewhere, otherwise this is just lost.

@res-life res-life merged commit 6abe148 into NVIDIA:branch-24.02 Dec 27, 2023
38 of 39 checks passed
@res-life res-life deleted the perf-for-time-zone branch December 27, 2023 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add performance test framework for non-UTC time zone features.
6 participants