parsing options and serializing arrays #113

mohitjaggi · 2015-07-30T23:14:50Z

Several parsing options are added. Because there are many of them, they are organized into classes. A text-based API for configuring the options is also provided.
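To make the idea of class-organized options with a text-based front end concrete, here is a minimal sketch in plain Scala. It is not the code from this PR: every class, field, and key name (`LineOptions`, `NumberOptions`, `delimiter`, `nanValue`, and so on) is a hypothetical stand-in, and the real option set may look quite different.

```scala
object OptionsSketch {
  // All class, field, and key names here are hypothetical, chosen for illustration only.
  final case class LineOptions(delimiter: Char = ',', quote: Char = '"')
  final case class NumberOptions(nanValue: String = "NaN", parseNumbers: Boolean = true)
  final case class ParseOptions(
      line: LineOptions = LineOptions(),
      numbers: NumberOptions = NumberOptions())

  // Build the typed option classes from plain string key/value pairs.
  def fromText(opts: Map[String, String]): ParseOptions = {
    val known = Set("delimiter", "quote", "nanValue", "parseNumbers")
    val unknown = (opts.keySet -- known).mkString(", ")
    require(unknown.isEmpty, "unknown option(s): " + unknown)

    // Read a single-character option, falling back to the default when absent.
    def char(key: String, default: Char): Char = opts.get(key) match {
      case Some(s) if s.length == 1 => s.head
      case Some(s) =>
        throw new IllegalArgumentException(key + " must be one character, got '" + s + "'")
      case None => default
    }

    val defaults = ParseOptions()
    ParseOptions(
      LineOptions(
        char("delimiter", defaults.line.delimiter),
        char("quote", defaults.line.quote)),
      NumberOptions(
        opts.getOrElse("nanValue", defaults.numbers.nanValue),
        opts.get("parseNumbers").map(_.toBoolean).getOrElse(defaults.numbers.parseNumbers)))
  }
}
```

With this shape, a caller could write `OptionsSketch.fromText(Map("delimiter" -> "|"))` and get typed options back, while unknown keys fail fast instead of being silently ignored.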

Another feature is the ability to serialize a column of arrays. The array is "unnested" into separate columns, and the column names to use are supplied by the user. This is useful for writing out CSV after transforms that expand the number of columns, e.g. one-hot encoding a category. This can be improved later; for example, a sparse vector from MLlib could replace the array.
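The unnesting described above can be sketched without Spark, using plain Scala collections. This is only an illustration of the idea, not the PR's implementation: `unnestRow`, `unnestHeader`, and `toCsvLine` are hypothetical helpers, rows are modeled as `Seq[Any]`, and CSV quoting/escaping is deliberately left out.

```scala
object UnnestSketch {
  // Replace the array cell at arrayIndex with one cell per array element.
  def unnestRow(row: Seq[Any], arrayIndex: Int, width: Int): Seq[Any] = {
    val elems: Seq[Any] = row(arrayIndex) match {
      case xs: Seq[_] => xs
      case other =>
        throw new IllegalArgumentException("expected an array column, got: " + other)
    }
    require(elems.length == width, "array length must match the number of supplied names")
    row.take(arrayIndex) ++ elems ++ row.drop(arrayIndex + 1)
  }

  // Replace the array column's name with the user-supplied column names.
  def unnestHeader(header: Seq[String], arrayIndex: Int, names: Seq[String]): Seq[String] =
    header.take(arrayIndex) ++ names ++ header.drop(arrayIndex + 1)

  // Naive CSV rendering: no quoting or escaping, for illustration only.
  def toCsvLine(cells: Seq[Any]): String = cells.map(_.toString).mkString(",")

  def main(args: Array[String]): Unit = {
    // A one-hot encoded "color" column stored as an array per row.
    val header = Seq("id", "color")
    val names = Seq("color_red", "color_green", "color_blue")
    val rows: Seq[Seq[Any]] = Seq(
      Seq[Any](1, Seq(1.0, 0.0, 0.0)),
      Seq[Any](2, Seq(0.0, 0.0, 1.0)))

    println(toCsvLine(unnestHeader(header, 1, names)))
    rows.foreach(r => println(toCsvLine(unnestRow(r, 1, names.length))))
  }
}
```

Requiring the array length to match the number of supplied names keeps every output row the same width as the header, which is what a CSV reader downstream would expect.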

falaki · 2015-07-30T23:23:02Z

build.sbt

@@ -1,20 +1,20 @@
 name := "spark-csv"

-version := "1.1.0"
+version := "1.1.0-SNAPSHOT"


Why do you need this?

falaki · 2015-07-30T23:37:49Z

Please rebase against master so that the tests can run. Also, please revert the changes to build.sbt.


codecov-io · 2015-08-01T07:17:13Z

Current coverage is `84.79%`

Merging #113 into master will decrease coverage by 0.22% as of 516c8a0

@@            master    #113   diff @@
======================================
  Files           10      11     +1
  Stmts          407     526   +119
  Branches       125     150    +25
  Methods          0       0       
======================================
+ Hit            346     446   +100
  Partial          0       0       
- Missed          61      80    +19


mohitjaggi · 2015-08-05T00:02:54Z

@falaki should be ready to merge now.

falaki · 2015-08-05T02:11:09Z

@mohitjaggi This is fairly large. I am about to publish a release with schema inference and all the recent improvements, and then I will review this.

falaki · 2015-08-13T17:14:45Z

@mohitjaggi this is packing too much into a single PR. Would you please split it? Please first submit one PR for the number parsing improvements and another for arrays. For arrays within CSV, it would be good to post an issue and gather some feedback from the community first.

mohitjaggi · 2015-08-13T17:48:38Z

will do


mohitjaggi · 2015-08-13T18:40:26Z

see #124

HyukjinKwon · 2016-04-28T13:37:53Z

@mohitjaggi Wouldn't it make sense to close this PR, given that #124 is a subset of it and you are willing to open more PRs for the rest?

Mohit Jaggi added 10 commits July 2, 2015 11:11

c09c36f  single header option to write csv
de59f04  print error line contents
8079e8b  incorporate into bigdf submodule
213aa35  more parsing options for univocity-based parser
faef4bc  "provided" dependencies
e424401  unnest array
c9a6592  fix npe when no array/sparse column exists
d553da5  parse exception handling, unit tests
5d3c906  scala style fixes
beab9c2  easier interface for options


Mohit Jaggi added 4 commits July 31, 2015 00:16

bdb8a8b  Merge branch 'master' of https://github.com/databricks/spark-csv
df6914e  fix unit test breakage
72a91fc  add options to ddl
d62dbc9  rebase again

mohitjaggi mentioned this pull request Aug 1, 2015

Configurable null values #76

Open

Mohit Jaggi added 3 commits July 31, 2015 23:19

be8ab4d  trying to debug travis test
e414e07  suspect partitions out of order
dcb1df1  revert headerPerPart option

This works everywhere except on Travis! I suspect the partitions are somehow being reordered. Leaving this for another time.

df3f7b9  workaround for scoverage as specified in their readme

JoshRosen added the stale / awaiting update label Sep 12, 2015

1438a41  moving up to spark 1.4.1
