[REVIEW] cottoncandy for Journal of Open Source Software #63
Thanks for the comments! I'll address all your other comments in detail this weekend.
Response to Review
Thanks for the detailed review! I've made all the requested changes and provided answers to the questions raised (see the joss_review branch). I'd like to highlight one point. Our main contribution with cottoncandy is to provide convenience functions to use cloud storage from within a pythonic environment (e.g. Jupyter) without saving or reading data to disk. This functionality is not provided by the recommended libraries (e.g. rclone, joblib) as far as we are aware. cottoncandy allows the user to seamlessly download and upload to a cloud-based storage solution from within a Jupyter session -- without accessing the user's disk. While cottoncandy currently implements only two back-ends, additional back-ends can be added. The AWS S3 API is by far the most widely used cloud-storage interface and can be used outside of AWS (e.g. with CEPH). We hope to incorporate support for a Google Cloud Storage back-end in the near future. In all, we hope to have addressed all the issues raised in the review. Best,
Responses by point
Repository
Fixed. I made a release candidate (0.1.0rc; https://github.com/gallantlab/cottoncandy/releases/tag/0.1.0rc). Once the review is approved, I will push the 0.1.0 release. I could also make a 0.1.0 release now and then a 0.1.1 if there are further changes needed. Let me know what you think.
Clarification. The main developers are myself (AN) and JSG. The following list details the specific significant contributions made by the authors.
It is generally difficult to draw sharp authorship lines with software. With respect to the publication of cottoncandy, I believe the list of co-authors reflects all the users who have made significant contributions. The rest of the contributions have been very beneficial(!) but more limited in scope.
Manuscript
The Google Drive back-end support is not experimental. Apologies for the confusion. It was merged to master over a year ago (Aug 2, 2017). The note on the README was put there when it was first merged. It is now removed.
This is a good point. The first sentence is a high-level description of the software. Details about the specific cloud-based object stores that are currently supported are listed in the "Long description" section. We wish to include support for Azure and Google Cloud Storage in the near future. I'd like to maintain the generality of the first sentence to reflect the fact that cottoncandy is not specific to any one storage solution. Ideally, once we implement the Google Cloud Storage (or Azure) back-end this point will not be contentious (#64).
To clarify, CEPH is not AWS. CEPH is an open source storage solution with many capabilities (https://ceph.com/). CEPH can be configured to provide access to stored data using an S3 API (http://docs.ceph.com/docs/mimic/radosgw/). This is what is meant by "S3 API-enabled CEPH".
User interaction
Fixed. Sure thing! I've pushed cottoncandy to pip (https://pypi.org/project/cottoncandy/). 0.1.0rc is currently there and I can submit a 0.1.0 once all the changes pertaining to the review are approved. Installation instructions now contain pip as well as the previous method.
Clarification. There are situations where it is appropriate to install packages system wide (with sudo or root). When managing shared computing resources, it can be impractical to allow multiple virtual environments per user, since this would require a lot of disk space. There are also cases where it can be useful to run a bleeding-edge package (e.g. master) in production across multiple machines (with shared storage), which can be achieved with a system-wide install. However, I agree that these use cases are not widely adopted. I have therefore also provided the more conventional pip-based installation instructions.
Clarification. The configuration file is much more flexible than a MANIFEST.in. The configuration file allows the user to specify the default values for some variables in the software. The configuration file is loaded every time the software is imported, not only during installation.
Fixed. Yes, good point! I have added a more visible warning statement on the README. I can also set this to False by default if the risk/convenience trade-off of this default setting is deemed too high.
Details. Is the software being installed on macOS? If so, the configuration file will be stored under
Clarification. Good point! I haven't encountered this problem under Ubuntu (i.e. even under sudo installation, the configuration is for the user not root). I'm not sure how other *NIX systems handle this. I imagine this can be an issue for some OSs. At any rate, using
Fixed. I've added a note on the README to inform the user that the keys will be stored in a configuration file. The keys can also be written into the configuration file by the user. Furthermore, the keys and bucket can be specified directly in the function call.
Clarification. The configuration file is created upon installation. The default configuration file (https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/defaults.cfg) is copied to the user directory (see the "Configuration file" subsection on the README). If the user keys are found upon installation, the keys will also be included in the configuration file. The default configuration file contains the default values for a number of options (https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/defaults.cfg). The configuration file is read every time cottoncandy is used (upon import).
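To illustrate the general pattern (a sketch only, not cottoncandy's actual code; the paths mirror the README's description of the configuration file):

# Sketch of the pattern described above, not cottoncandy's implementation:
# copy a bundled defaults file to the user's config directory if it is
# missing, then parse it with configparser whenever the package is loaded.
import os
import shutil
from configparser import ConfigParser

package_dir = os.path.dirname(os.path.abspath(__file__))
defaults_path = os.path.join(package_dir, 'defaults.cfg')               # shipped with the package
user_config = os.path.expanduser('~/.config/cottoncandy/options.cfg')

if not os.path.exists(user_config):
    os.makedirs(os.path.dirname(user_config), exist_ok=True)
    shutil.copyfile(defaults_path, user_config)

config = ConfigParser()
config.read(user_config)   # re-read every time the package is loaded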
This is tricky because our system is closed-off. However, the tests can also be run using the Google Drive back-end. Setting up Google Drive can be a pain, but we've included instructions to do so (https://github.com/gallantlab/cottoncandy/blob/master/google_drive_setup_instructions.md). Note that AFAIR Google Drive throttles upload/download speeds to 1Mb. If the Google Drive or Travis options do not suffice, please let me know and we can think of other solutions.
Fixed. Apologies for that. I wanted to communicate to users that they may incur costs if they ran the tests themselves (in the order of a couple of cents). I changed this to a less alarming warning (https://github.com/gallantlab/cottoncandy/blob/joss_review/cottoncandy/tests/test_roundtrip.py#L37).
Documentation
Fixed. Sorry about that! This is now reflected in the documentation.
API Documentation
Ah, yes! Apologies. I haven't gotten around to making them prettier and uploading the logo! Hopefully soon =)
Statement of Need
cottoncandy has been used in production in our lab for more than 3 years (on GitHub for 2). We have stored approximately 181TB of data using cottoncandy. Similarly, the Google Drive back-end has been used in production for the last year for data backups.
Details. Just to clarify: cottoncandy doesn't dump/load numpy array data in the disk storage sense. It streams to and from the object store into memory directly. This also applies to other data formats (e.g. pickle, JSON).
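To make the distinction concrete, this is roughly what streaming through memory looks like at the boto3 level, with an in-memory buffer standing in for a file on disk (illustrative only; the bucket and object names are placeholders, and this is not cottoncandy's internal code):

# Illustrative only: stream a numpy array to and from S3 via an in-memory
# buffer, so nothing is written to the local filesystem.
import io
import boto3
import numpy as np

s3 = boto3.client('s3')
arr = np.random.randn(100, 100)

buf = io.BytesIO()
np.save(buf, arr)        # serialize into memory
buf.seek(0)
s3.upload_fileobj(buf, 'my-bucket', 'arrays/example.npy')

out = io.BytesIO()
s3.download_fileobj('my-bucket', 'arrays/example.npy', out)
out.seek(0)
restored = np.load(out)  # back in memory, no temporary file involved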
Response. I agree that this is a problem. There is a lot of overhead in using these libraries that is not immediately apparent. cottoncandy actually provides a solution to these problems.
I didn't know about this. Looks like a great package! Alas, from what I can see, rclone depends on disk storage: "Rclone is a command line program to sync files and directories to and from".
Community Guidelines
Fixed. I added a "Contributing" section to the README.
Reviewer Response
I will start the second part of my review - for reference I have cloned this branch and installed cottoncandy as follows:
Review Metadata
Reviewer Response
I am going to push you further on this - to either provide additional "main contribution" points, evidence that cottoncandy does this better than existing options, or just mention that they exist (and then write about the other options in your paper and how they compare). I can offer to help, because I've used some of the other Python software, and notably they allow for writing of objects (as blobs) to storage without needing to interact with the local filesystem. For example, if you look in the Google Cloud Storage getting started tutorial, you see use of the library cloudstorage. This library of course has:
from cloudstorage.drivers.<driver> import *
This logically isn't needed, because the operation before uploading a "blob" to storage is already reading in the file. E.g. you read in as binary data, and then stream from the file, directly as a stream, or a string object. If you look at the various library implementations, they are very general (any binary / blob), so maybe the cottoncandy "spin" would be to better develop the functions to be for data science objects? I don't know if this is a good idea (it might be terrible) but if the other libraries do a good job of loading / unloading general blobs from Python (could be dict or numpy array, we don't care) I would ask what they don't do well. They don't do well at capturing metadata about the data frame, such as the environment it was generated in, versions of things, or maybe even metadata about the table / headers (if pandas). Cottoncandy would add a layer of richness (and novelty, utility) if it offered these kinds of things beyond just moving around blobs. There is notably functionality that (possibly) is provided via the above (or the underlying libraries) that cottoncandy doesn't provide. For example, exponential backoff is normally done with the retrying library, and if the other solutions don't offer it, cottoncandy might. I am not sure about the current state of this, so I posted an issue so we both can find out. I'm curious - if cottoncandy doesn't have it, how often was it that you had issues with uploading due to network, etc.? Other than the individual clients, another notable wrapper akin to cottoncandy is Apache libcloud. If you do a Google search, a few similar ones pop up too. Anyway - I don't mean to say you need to invent a completely new thing, but to properly review the alternatives, the features that are notably different, and (the goal) give the rationale for why I as a user would choose cottoncandy over the other options. From my general review, I don't see that cottoncandy offers me anything more than what is available, and maybe it is limited in not having exponential backoff and a broader range of client options.
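For reference, exponential backoff is only a few lines whether it comes from the retrying package or is hand-rolled; here is a minimal hand-rolled sketch (the upload function and the delays are placeholders):

# Minimal hand-rolled exponential backoff around a flaky transfer call.
# upload_blob is a placeholder for whatever client call performs the upload.
import time
import random

def with_backoff(upload_blob, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return upload_blob()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # wait 1s, 2s, 4s, ... plus a little jitter before retrying
            time.sleep(base_delay * (2 ** attempt) + random.random())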
If adding a backend comes down to supporting the underlying modules for it, isn't this what the cloudstorage module is already doing? Could you use this as a base and add the "syntactical sugar" functions to it to provide metrics about the dataframes, etc. for the user?
Responses By Point
In research computing we use a module system where it's (still) not a system python, but one installed custom. It does require root given the location, but arguably the documentation should be written with the user in mind that would be installing either to python on a shared resource (where he/she doesn't have root) or encouraged to be working reproducibly with something like a virtual environment, in which case root still wouldn't be used. I can see arguments for both, but I see a stronger case for encouraging the user not to use it.
Ah, ok sounds good :)
Installation
Okay, here is some more digging into installation and configuration issues. First, an edge case of having installed with python setup.py install, and then trying with pip:
$ pip install cottoncandy
Requirement already satisfied: cottoncandy in /home/vanessa/anaconda3/lib/python3.6/site-packages (0.1)
Note that the install above is from the earlier python setup.py install.
So I removed it manually, and then did a clean install with pip (note that this is pip from anaconda). The configuration file still wasn't there! I also tried:
pip install --user cottoncandy
It's still not there. What am I doing wrong? After I can find this file I can look over this point again:
I could test out Drive once we have more discussion on the above. I'd like to hold off final testing until we are sure it's good to go! Contributing looks good, and my other issues are addressed! Looking forward to hearing more!
Thanks for the comments!
Oh goodness, no need to apologize! We are working on this as a team :)
Tada! Update - it is correct that the config is created on first import! I just did a fresh install (no config) and then imported, exited, and then...
$ ls /home/vanessa/.config/cottoncandy/options.cfg
/home/vanessa/.config/cottoncandy/options.cfg
This probably makes sense, because if an admin is installing globally for users, this would show up (for the individual user) on import. So let's make sure this is stated clearly in the instructions. Also a question - is there a need for an "admin config" sort of deal, and then a user config to customize further? For example, the admin cares about upload / download / encryption stuffs, but the user (mostly) cares about credentials, I would guess. The next question would be priority - should the admin's defaults take preference? This is (sort of) the approach that Singularity takes - the user has total freedom to set binds and similar, but the admin gets to control how that works.
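To picture what I mean, here is a rough configparser sketch of "parse the admin config first, then the user config" (the paths and the allow_user_override option are hypothetical, not existing cottoncandy settings):

# Hypothetical sketch of admin-first, user-second config parsing.
import os
from configparser import ConfigParser

admin = ConfigParser()
admin.read('/etc/cottoncandy/defaults.cfg')                          # system-wide values
user = ConfigParser()
user.read(os.path.expanduser('~/.config/cottoncandy/options.cfg'))   # per-user values

# an admin-controlled switch deciding whether users may override defaults
allow_override = admin.getboolean('admin', 'allow_user_override', fallback=True)

settings = ConfigParser()
settings.read_dict(admin)       # start from the admin defaults
if allow_override:
    settings.read_dict(user)    # user values win only where the admin permits it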
See scottwernervt/cloudstorage#15 (comment) about exponential backoff - it looks like boto3 might implement it, so let's take note that whatever libraries are used in cottoncandy, the effort is taken to have exponential backoff (and maybe, if possible, multiprocessing for downloads?) Also - what about atomicity? I should have mentioned these things earlier, just remembering now :)
Excellent! Glad the pip install issue was resolved.
re configuration options for admins: Yes, I can make that happen! We've had this kind of setup before. A system-wide install can have a system-wide defaults file.
re download issues: Yup, downloads can fail when the network is saturated (or, if using AWS, whenever AWS is acting up).
re multipart download: Yes! Sadly, this has been on my to-do list for 2 years =( I just haven't had the bandwidth to implement it (#11).
re atomicity: Not sure I know what this means exactly. Are you referring to checks on when objects are uploaded? At the moment, this is all left to the back-end API. For S3, the object is not updated unless the upload is successful. For Google Drive, I imagine the same, but I haven't tested this.
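(Re the multipart point: for reference, boto3's transfer layer can already parallelize large downloads. A rough sketch of what that could look like, with placeholder bucket/key names -- this is not what cottoncandy currently does.)

# Rough sketch: concurrent (multipart) download into memory via boto3's
# TransferConfig; not cottoncandy's current implementation.
import io
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
transfer_cfg = TransferConfig(multipart_threshold=8 * 1024 * 1024,  # split objects larger than 8 MB
                              max_concurrency=8)                    # fetch parts in parallel

buf = io.BytesIO()
s3.download_fileobj('my-bucket', 'big/object.npy', buf, Config=transfer_cfg)
buf.seek(0)
data = buf.read()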
Awesome! Let's make sure this is nicely documented so an admin and user know:
An admin can specify system-wide values by modifying the defaults.cfg file from the cloned repo and then performing a sudo python setup.py install kind of thing. This is along the lines of how numpy handles custom builds (https://github.com/numpy/numpy/blob/master/site.cfg.example). The difference is that in this setup the user's values get priority, not the admin's. In this case, the user can still overwrite defaults =S. How do you implement that with Singularity? I think in the admin file it's easiest to have booleans that determine whether to allow (or disallow) user control of a variable or resource. E.g., see here. You can also have filters that limit what they are able to select from, and then when you load up the libraries just parse the admin config first, determine the logic, then the user config to finish up figuring out settings.
okay, I'm satisfied with this example! Let's make sure it's easy for a user to find if he/she wants to do the same.
Maybe there's an open source developer that could contribute to the library after this publication? wink
This is related to downloading - making sure two processes don't try and write to the same file object at once. It would go with the previous point, so no need to address it (yet).
OK, these points have been addressed (see 0.1.0rc3, also on pip):
config file defaults
multipart download
atomicity
cloudstorage
The primary goal of cottoncandy is to provide convenience functions to stream data to and from memory in a typical data science workflow (e.g. easy MPU uploads, compression, search). The current text in the manuscript and the README examples reflect the fact that cottoncandy is useful for the scientific data analysis side. The goal is to simplify the use of cloud-storage solutions in data science workflows. We have provided capabilities for using different back-ends, but that is not the emphasis. Is the last sentence the one you have an issue with? I've added a mention of these alternative libraries to the manuscript.
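To make the intended workflow concrete, here is a minimal sketch in the spirit of the README examples (the bucket name and credentials are placeholders):

# Sketch of the disk-free workflow, following the interface shown in the
# cottoncandy README. Bucket name and credentials are placeholders.
import numpy as np
import cottoncandy as cc

cci = cc.get_interface('my_bucket_name',
                       ACCESS_KEY='FAKEACCESSKEY',
                       SECRET_KEY='FAKESECRETKEY',
                       endpoint_url='https://s3.amazonaws.com')

arr = np.random.randn(100, 100)
cci.upload_raw_array('experiment/responses', arr)           # streamed from memory
arr_back = cci.download_raw_array('experiment/responses')   # streamed back into memory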
These changes look great! I'm at the doctor now but I would like to give one more look over and test when I get home, and likely sign off after that. Overall really great work on the fixes!
config
omg I love this!
defaults
I think this is good for now, because it's more likely that a single user is going to use it vs. having it installed at the system level (by an admin) and then used across users. If/when the time comes for this, it will be natural for someone to open an issue and you can address it then. Given this, the atomicity issue is also not likely to be a problem, so I'm good :) I'm going to look over the compiled pdf again, and will continue in the main thread!
The review is finished, and the paper accepted! Congrats! Closing.
Here I will write my review for openjournals/joss-reviews#890. While the suggestion is to open multiple issues, I will try to address my points here.
Repository
As a user, my first interaction with the repository is at https://github.com/gallantlab/cottoncandy
change requested 🌠
For your versioning, please use the semver convention. Instead of 0.01, for example, it would be 0.0.1 or 0.1.0 (the format is MAJOR.MINOR.PATCH). Note that this should correspond with the release here (and no v, alas!)
check requested ✔️
For authors, I only see 4 on the paper, but I see 11 contributors here. The authors list likely corresponds to the list here, but could you please clarify who is who, and the contribution (we want to make sure someone with significant contributions is not forgotten!)
Manuscript
Based on the many choices available, I would have expected more than AWS and then EXPERIMENTAL Google Drive. I think this should be made more true to that?
S3 API-enabled CEPH is included in the list, but it's still AWS, correct? There is just then AWS and Google Drive (experimental?)
This line is fantastic :)
User Interaction
Installation
As a user, I first go to the repository and want to know how to install the software. Even before reading the instructions I would expect (given python) to do:
pip install cottoncandy
and the library doesn't seem to be provided. I see that the instructions are to install from Github. A few suggestions / questions!
question / suggestion ❓
Do the authors intend to provide the package on pypi, and if not, why? Either way, could the authors also offer an install example using pip from Github?
I proceeded to (start to) install following instructions to clone:
change requested
It's generally not good practice to be installing to system python with sudo, or at least it's something I did when new to Python (and eventually never again!) Could the authors please explain the need for system python over:
For my review, I have an anaconda environment used as default, so I am able to do
python setup.py install
detail requested :magnify:
Could you explain the logic of your setup.py, using the defaults.cfg and configparser? I am mostly not familiar with this approach - it looks different than including data with a MANIFEST.in or other? Note the warning:
$ python setup.py install
/home/vanessa/anaconda3/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'include_package_data'
  warnings.warn(msg)
When you mention:
I'm guessing that the install tries to sniff these from the environment. It might be good to add this detail so the user isn't alarmed about scripts sneaking around in places they should not be. :)
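Something as simple as the following, spelled out in the docs, would reassure the user about what gets "sniffed" (the environment variable names here are the standard AWS ones, shown for illustration only):

# Illustration of reading credentials from the environment at configure time;
# the variable names are the standard AWS ones.
import os

access_key = os.environ.get('AWS_ACCESS_KEY_ID', '')
secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY', '')
if access_key and secret_key:
    print('Found AWS credentials in the environment; they will be written to options.cfg')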
bug with file generation?
Note that from this statement I would expect, after install, to find a file at ~/.config/cottoncandy/options.cfg. I don't see a file (TAB does not complete!) Should there be one?
cat ~/.config/cottoncandy/options.cfg
Permissions
Depending on how the user installed it (sudo vs. not) if this file is to be generated, it could be problematic for the effective user home. For example:
It's also generally nice to know the extent to which a secret is being saved. For example, it's important to me to know, if I don't have the keys exported (but define them later), whether they are written to that config file. I may not want this to be the case. The next line then suggests that I have another option (instead):
change request :delta:
Could you please clarify for the user when the file is generated, where it is generated, and where the configuration values come from?
Usage
AWS
Let's store some stuffs! I'd like to ask if someone from your team could share a testing AWS IAM credential for me to use? I don't have or use AWS much, and I would need to personally pay for it. I can assure you it will only be used for this test, and if you want, I think that you can limit the usage and scope.
When I am able to test the storage, then I can finish the review! Are there other backends (not requiring payment) that I might test?
Others?
I was expecting, based on the title, to have storage clients beyond s3. I'm also worried about the big EXPERIMENTAL (hence why I'm not testing this) in caps - could you give detail as to why these are experimental? It would be good to have them production ready on the publication of this paper. It's also slightly alarming to see things like:
Documentation
Questions for you!
API Documentation
This isn't essential, but I do wish the documentation pages were as beautifully done as the README, in terms of design, color, etc. A user that is introduced to the software through the portal should get a professional feel. Otherwise, the functions are here and readthedocs is good!
Statement of Need
I definitely think there is a need for a unified interface to interact with storage from python, but I would like to see a little more development done to provide (minimally) production clients for AWS, Google Drive, and perhaps one more widely used storage backend (my first thoughts were Google Storage and Azure as well). The statement of need in the manuscript is compelling, but the software itself isn't quite ready to match it. But it's okay! This can be developed to get cottoncandy in tip top shape.
I think this is a really cool tool for a lab (and the name is fantastic!), but it requires wrapping other APIs to interact with services, and the current implementation doesn't provide additional functionality other than dumping and loading a numpy array. For a user that is using AWS or Google storage endpoints - given that these Python libraries (boto, google-storage, google-drive-python, etc.) change remarkably quickly - I would advise them to use the libraries directly to save the files as they are generated and avoid the additional layer. The "competitor" for this would be something like rclone. Did the authors consider providing a python wrapper to that?
Community Guidelines
Missing guidelines
Could the authors please add: