parsing options and serializing arrays #113

Open
wants to merge 19 commits into
base: master

mohitjaggi
Contributor

This PR adds several parsing options. Because there are many of them, they are organized into classes, and a text-based API is provided to configure them.
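As a rough illustration of what a text-based options API grouped into classes could look like, here is a minimal sketch in plain Scala. The class name `NumberParsingOptions`, the option keys `nanStrings` and `emptyAsNull`, and the `key=value;key=value` syntax are all assumptions made for this example; they are not taken from the PR's code.

```scala
// Hypothetical sketch: none of these names or keys come from the PR itself.
final case class NumberParsingOptions(
    nanStrings: Set[String] = Set("NaN"), // strings to treat as NaN
    emptyAsNull: Boolean = true)          // whether empty fields become null

object NumberParsingOptions {
  // Build options from "key=value" pairs separated by ';',
  // for example "nanStrings=NA,NaN;emptyAsNull=false".
  def fromText(text: String): NumberParsingOptions = {
    val pairs = text.split(';').map(_.trim).filter(_.nonEmpty).map { kv =>
      kv.split("=", 2) match {
        case Array(k, v) => k.trim -> v.trim
        case _ =>
          throw new IllegalArgumentException(s"Expected key=value, got '$kv'")
      }
    }
    // Start from the defaults and override one field per recognized key.
    pairs.foldLeft(NumberParsingOptions()) {
      case (opts, ("nanStrings", v)) =>
        opts.copy(nanStrings = v.split(',').map(_.trim).toSet)
      case (opts, ("emptyAsNull", v)) =>
        opts.copy(emptyAsNull = v.toBoolean)
      case (_, (k, _)) =>
        throw new IllegalArgumentException(s"Unknown option '$k'")
    }
  }
}
```

Keeping each group of options in its own case class with defaults means a caller only has to spell out the settings that differ, and unknown keys fail loudly instead of being silently ignored.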

Another feature is the ability to serialize a column of arrays. Each array is "unnested" into separate columns whose names are supplied by the user. This is useful for writing out CSV after transforms that expand the number of columns, e.g. one-hot encoding a category. This can be improved later, e.g. a sparse vector from MLlib could replace the array.
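To make the "unnesting" idea concrete, here is a small sketch using plain Scala collections rather than Spark DataFrames. The `Row` type, the `toCsv` helper, and the column names are hypothetical and chosen only for illustration; the sketch also skips CSV quoting and escaping.

```scala
// Hypothetical sketch in plain Scala; the PR itself targets Spark DataFrames.
object UnnestArrays {
  // A row with one scalar field plus one array-valued field
  // (for example, a one-hot encoding of a category).
  final case class Row(id: String, encoded: Seq[Double])

  // Flatten every row into CSV cells, writing one cell per array element
  // under the user-supplied column names. No quoting or escaping is done.
  def toCsv(rows: Seq[Row], arrayColumnNames: Seq[String]): String = {
    val header = ("id" +: arrayColumnNames).mkString(",")
    val lines = rows.map { row =>
      // Every array must have exactly one element per supplied column name.
      require(
        row.encoded.length == arrayColumnNames.length,
        s"Row ${row.id} has ${row.encoded.length} elements, " +
          s"expected ${arrayColumnNames.length}")
      (row.id +: row.encoded.map(_.toString)).mkString(",")
    }
    (header +: lines).mkString("\n")
  }
}
```

For instance, calling `UnnestArrays.toCsv` with two rows and the names `Seq("is_red", "is_blue")` would produce a three-column CSV: `id` followed by one column per array element.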

@@ -1,20 +1,20 @@
 name := "spark-csv"
 
-version := "1.1.0"
+version := "1.1.0-SNAPSHOT"
Member

Why do you need this?

@falaki
Member

falaki commented Jul 30, 2015

Please update it to rebase against master for tests to run. Also please revert changes to build.sbt.

Mohit Jaggi added 3 commits July 31, 2015 23:19
this works everywhere but travis! the partitions are re-ordered somehow
methinks. leaving this for another time
@codecov-io

Current coverage is 84.79%

Merging #113 into master will decrease coverage by 0.22% as of 516c8a0

@@            master    #113   diff @@
======================================
  Files           10      11     +1
  Stmts          407     526   +119
  Branches       125     150    +25
  Methods          0       0       
======================================
+ Hit            346     446   +100
  Partial          0       0       
- Missed          61      80    +19


@mohitjaggi
Contributor Author

@falaki should be ready to merge now.

@falaki
Member

falaki commented Aug 5, 2015

@mohitjaggi This is fairly large. I am about to publish a release with schema inference and all the recent improvements, and then I will review this.

@falaki
Member

falaki commented Aug 13, 2015

@mohitjaggi This is packing too much into a single PR. Would you please split it? Please first submit one PR for the number parsing improvements and another for arrays. For arrays within CSV, it would be good to post an issue and gather some feedback from the community first.

@mohitjaggi
Contributor Author

will do

@mohitjaggi
Contributor Author

see #124

@HyukjinKwon
Member

HyukjinKwon commented Apr 28, 2016

@mohitjaggi Wouldn't it make sense to close this if #124 is a subset of this and you are willing to open more PRs for the rest?

5 participants