Configurable null values #76
base: master
Conversation
To avoid partial replacements such as MONTANA -> MONTA, where a null marker like "NA" is substituted inside a longer value instead of being matched against the whole token.
+1. This will be very helpful.
@petro-rudenko Since this can easily be done with a transformation, I prefer leaving it as it is rather than adding yet another option to spark-csv.
Spark-csv accepts a sqlContext and a path to files, so a transformation is only possible by saving to a file first, which is not efficient for big files. Also, the replacement is done on a token basis (after the CSV parser has parsed the data). If CSV parsing were done on the client side, there would be no need to use spark-csv at all.
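The token-level replacement described above can be sketched in plain Scala (the `nullMarkers` set and the helper names here are illustrative, not spark-csv API):

```scala
// Illustrative sketch, not spark-csv API: replace configurable null
// markers with the empty string at the token level, i.e. after the CSV
// parser has already split the row into fields.
object NullMarkerNormalizer {
  // Assumed set of markers; configurable in a real implementation.
  val nullMarkers: Set[String] = Set("NA", "N/A", "null")

  // Whole-token comparison avoids partial replacements such as
  // MONTANA -> MONTA, which substring substitution would cause.
  def normalizeToken(token: String): String =
    if (nullMarkers.contains(token)) "" else token

  def normalizeRow(tokens: Seq[String]): Seq[String] =
    tokens.map(normalizeToken)
}
```

Because the comparison happens per token, values that merely contain a marker as a substring are left untouched.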
case _: DateType => Date.valueOf(datum)
case _: StringType => datum
case _ => throw new RuntimeException(s"Unsupported type: ${castType.typeName}")
if (datum.isEmpty && castType != StringType) {
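A self-contained sketch of how the empty-string check above might be generalized to a configurable marker (the simplified stand-in types and the `nullValue` parameter name are assumptions for illustration, not the actual spark-csv code):

```scala
// Simplified stand-ins for Spark's DataType hierarchy, so the sketch
// runs without a Spark dependency.
sealed trait SimpleType
case object StrType extends SimpleType
case object IntType extends SimpleType

// Assumed `nullValue` parameter: any non-string field equal to the
// configured marker is treated as null instead of being cast.
def castTo(datum: String, castType: SimpleType, nullValue: String = ""): Any =
  if (datum == nullValue && castType != StrType) null
  else castType match {
    case IntType => datum.toInt
    case StrType => datum
  }
```

With the default `nullValue = ""` this behaves like the check in the diff; passing `"NA"` would make R-style missing values cast to null as well.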
It'd be nice if this was another option, i.e. in my application we have decided to standardize on parsing empty string fields as nulls rather than empty strings.
I was working on the same and several other options; see https://github.com/databricks/spark-csv/pull/94/files
Please look at pull request #113.
One reason to want this over client-side processing is that user-provided schemata already have to state which columns are nullable. Seeing as a user-provided schema can tell us whether a column is nullable or not, it might be nice if we could also say what the null values will actually look like in the data.
+1. I have some data that was generated using R, and in this case nulls are encoded as "NA". Currently I am running another job that converts "NA" to "", but it would be nice if there were an option to specify how null values are encoded. All CSV parsers I know of have such an option.
There are datasets where each column has its own marker for missing values, while spark-csv assumes only the empty string marks a missing value. To avoid additional data transformation and saving on the user's side, it would be great to be able to specify a set of null markers and have the library replace them with empty strings.
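The per-column markers described above could be sketched like this (the `Map`-based configuration keyed by column index is hypothetical, not a spark-csv feature):

```scala
// Hypothetical per-column null markers, keyed by column index; matching
// tokens are replaced with "" before the type-casting logic sees them.
val markersByColumn: Map[Int, Set[String]] =
  Map(0 -> Set("NA"), 1 -> Set("-", "?"))

def cleanRow(tokens: Seq[String]): Seq[String] =
  tokens.zipWithIndex.map { case (tok, i) =>
    if (markersByColumn.getOrElse(i, Set.empty).contains(tok)) "" else tok
  }
```

Columns without an entry in the map fall back to no replacement, so existing behavior is preserved for unconfigured columns.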