-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add perf test for non-UTC operators #10078
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for micro-benchmark only, right?
No, it's a end to end benchmark testing:
|
Signed-off-by: Chong Gao <[email protected]>
60c61ca
to
ffbd87e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please share the results of this perf test?
tests/src/test/scala/com/nvidia/spark/rapids/timezone/TimeZonePerfSuite.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/timezone/TimeZonePerfSuite.scala
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/timezone/TimeZonePerfSuite.scala
Outdated
Show resolved
Hide resolved
Thread.sleep(5L) | ||
|
||
def perfTest(spark: SparkSession, zone: String): DataFrame = { | ||
spark.read.parquet(path).select(functions.max( // use max to reduce the result data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark.read.parquet(path).cache().select(...)
Perhaps add cache here to eliminate the overhead of the Scan ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the test file generated is small 2.5M, so the second time running the OS will cache the file.
I tried to generate a file that is about 500M and use df.cache.
The previous result (2.5M) is:
CPU
from_utc_timestamp,Cpu,Iran,733
from_utc_timestamp,Cpu,Iran,781
GPU
from_utc_timestamp,Gpu,Iran,131
from_utc_timestamp,Gpu,Iran,114
The new result(500M + df.cache) is:
CPU:
from_utc_timestamp,Cpu,Iran,670
from_utc_timestamp,Cpu,Iran,665
GPU:
from_utc_timestamp,Gpu,Iran,334
from_utc_timestamp,Gpu,Iran,326
The event log shows GPU decode time is 115 ms.
So using highly compressed file can reduce the scan/decode time.
val year2000 = Instant.parse("2000-01-01T00:00:00Z").getEpochSecond * 1000L * 1000L | ||
val year2030 = Instant.parse("2030-01-01T00:00:00Z").getEpochSecond * 1000L * 1000L | ||
|
||
private val stringsForCast = Array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we put every zone IDs into this pool by loading them from TimeZoneDB
? Either, appends more zones into this pool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we can specify the time zones:
-DTZs=Iran,Asia/Shanghai -DenableTimeZonePerf=true
build |
1 similar comment
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nits.
|
||
private val zones = timeZoneStrings.split(",") | ||
|
||
private val path = "/tmp/tmp_TimeZonePerfSuite" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we add a UUID or something similar here. Generally it is not a good idea to have any totally hard coded paths, just to avoid issues with multiple tests running at the same time.
class TimeZonePerfSuite extends SparkQueryCompareTestSuite with BeforeAndAfterAll { | ||
// perf test is disabled by default since it's a long running time in UT. | ||
private val enablePerfTest = java.lang.Boolean.getBoolean("enableTimeZonePerf") | ||
private val timeZoneStrings = System.getProperty("TZs", "Iran") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how I feel about hidden properties to enable/disable tests. At a minimum these need to be documented somewhere, otherwise this is just lost.
closes #10076
Signed-off-by: Chong Gao [email protected]