Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] cottoncandy for Journal of Open Source software #63

Closed
vsoch opened this issue Aug 15, 2018 · 13 comments
Closed

[REVIEW] cottoncandy for Journal of Open Source software #63

vsoch opened this issue Aug 15, 2018 · 13 comments

Comments

@vsoch
Copy link

vsoch commented Aug 15, 2018

Here I will write my review for openjournals/joss-reviews#890. While the suggestion is for multiple issues, I will try addressing my issues here.

Repository

As a user, my first interaction with the repository is at https://github.com/gallantlab/cottoncandy

  • I am happy to see the source code, and the instructions for installation front and center. The logo is quite fantastic :)
  • The LICENSE is also present
  • The version string is found in the setup.py.

change requested 🌠

For your versioning, please use semvar convention. Instead of 0.01 for example, it would be 0.0.1 or 0.1.0 (format is MAJOR.MINOR.PATCH). Note that this should correspond with the release here (and no v, alas!)

check requested ✔️

For authors, I only see 4 on the paper, but I see 11 contributors here. The authors list likely corresponds to the list here, but could you please clarify who is who, and the contribution (we want to make sure someone with significant contribution is not forgotten!)

Manuscript

is a python scientific library for storing and accessing numpy array data from cloud-based object stores.

Based on the many choices available, I would have expected more than AWS and then EXPERIMENTAL Google Drive. I think this should be made more true to that?

cottoncandy is a bucket full of syntactic sugar that facilitates the use of cloud storage in typical data science workflows.

  • S3 API-enabled CEPH is included in the list, but it's still AWS correct? There is just then AWS and Google Drive (experimental?)

This line is fantastic :)

User Interaction

Installation

As a user, I first go to the repository and want to know how to install the software. Even before reading the instructions I would expect
(given python) to do:

pip install cottoncandy

and the library doesn't seem to be provided. I see that the instructions are to install from Github. A few suggestions / questions!

question / suggestion

Do the authors intended to provide the package on pypi, and if not, why? Either way, could the authors also offer an install example
using pip from Github?

I proceeded to (start to) install following instructions to clone:

change requested

$ git clone https://github.com/gallantlab/cottoncandy.git
$ cd cottoncandy
$ sudo python setup.py install

It's generally not good practice to be installing to system python with sudo, or at least it's something I did when new to Python (and eventually never again!) Could the authors please explain the need for system python over:

# assuming a local or virtual environment with python
python setup.py install

# or assuming a system python and install as a user (common on shared resources)
python setup.py install --user

For my review, I have an anaconda environment used as default, so I am able to do python setup.py install

detail requested :magnify:

Could you explain the logic of your setup.py, using the defaults.cfg and configparser? I am mostly not familiar with this
approach - it looks different than including data with a MANIFEST.in or other? Note the warning:

$ python setup.py install
/home/vanessa/anaconda3/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'include_package_data'
  warnings.warn(msg)

When you mention:

Upon installation cottoncandy will try to find your AWS keys and store them in this file

I'm guessing that the install tries to sniff these from the environment. It might be good to add this detail so the user isn't alarmed about scripts sneaking around in places they should not be. :)

bug with file generation?

Note that I would expect from this statement that after install, to find a file at ~/.config/cottoncandy/options.cfg. I don't see a file (TAB does not complete!) Should there be one?

cat ~/.config/cottoncandy/options.cfg

Permissions

Depending on how the user installed it (sudo vs. not) if this file is to be generated, it could be problematic for the effective user home. For example:

# /root is sudo's home
$ sudo python -c "import pwd; import os; userhome = pwd.getpwuid(os.getuid())[5]; print(userhome)"
/root

# but not my home!
$ python -c "import pwd; import os; userhome = pwd.getpwuid(os.getuid())[5]; print(userhome)"
/home/vanessa

Upon installation cottoncandy will try to find your AWS keys from the environment and store them in this file. If you do not have them exported, you will be able to define them at a later step.

It's also generally nice to know the extent to which a secret is being saved. For example, it's important to me to know if I don't have them exported (but define them later) if they are written to that config file. I may not want this to be the case. The next line then suggests that
I have another option (instead):

Setup the connection (endpoint, access and secret keys can be specified in the configuration file instead)::

change request :delta:

Could you please clarify when the file is generated, where it is generated, where the configuration values come from, and when for the user?

Usage

AWS

Let's store some stuffs! I'd like to ask if someone from your team could share a testing AWS IAM credential for me to use? I don't
have or use AWS much, and I would need to personally pay for it. I can assure you it will only be used for this test, and if
you want, I think that you can limit the usage and scope.

When I am able to test the storage, then I can finish the review! Are there other backends (not requiring payment) that I
might test?

Others?

I was expecting based on the title to have storage clients beyond s3. I'm also worried about the big EXPERIMENTAL (hence why I'm not testing this) in caps, could you give detail as to why these are experimental? It would be good to have them as production ready on the publish of this paper. It's also slightly alarming to see things like:

    ##############################
    # USE THIS AT YOUR OWN RISK.
    ##############################

Documentatione

Questions for you!

  • The version here is also specified differently than the tagged release (there is a v) and not using semvar. Would the authors be able to be consistent between the places?
  • Based on the upload functions it looks like you are converting to bytes (with String.io) and then uploading the blog. Then to download, you are reversing the process.
    • How is this going to work between versions of numpy?
    • What is the benefit of doing this for the user, versus allowing them to save (with joblib, pickle, other) their own object to upload?
    • what about different versions and formats of numpy arrays (e.g., pandas or other?)
    • What about versions of Python? (Python 2 vs. Python 3?)

Api Documentation

This isn't essential, but I do with the documentation pages were as beautifully done as the README, in terms of design, color, etc. A user that is introduced to the software through the portal should get a professional feel. Otherwise, the functions are here and readthedocs is good!

Statement of Need

I definitely there is need for a unified interface to interact with storage, and from python, but I would like to see a little more development
done to provide (minimally) production clients for AWS, Google Drive, and perhaps one more largely used storage backend (my first thoughts
were Google Storage and Azure as well). The statement of need in the manuscript is compelling, but the software itself isn't quite ready
to match it. But it's okay! This can be developed to get cottoncandy in tip top shape.

I think this is a really cool tool for a lab (and the name is fantastic!) but in that it requires wrapping other APIs to interact with services, but the current implementation doesn't provide additional functionality other than dumping and loading a numpy array. For a user that is using AWS or Google storage endpoints, in that these Python libraries (boto, google-storage, google-drive-python etc.) change remarkably quickly, I would advise them to use the libraries directly to save the files as they are generated and avoid the additional layer. The "competitor" for this would be something like rclone. Did the authors consider providing a python wrapper to that?

Community Guidelines

Missing guidelines

Could the authors please add:

  • a CONTRIBUTING.md and CONTRIBUTORS.md, if appropriate
  • a reference / some discussion of how to contribute in the README
  • instructions for how to report an issue in the README
  • (optional) issue and PR templates, if the above is complicated.
@anwarnunez
Copy link
Contributor

anwarnunez commented Aug 15, 2018

Thanks for the comments!
I created a new branch: https://github.com/gallantlab/cottoncandy/tree/joss_review
New release candidate tag: https://zenodo.org/record/1345777
The release candidate is now on pip, too: https://pypi.org/project/cottoncandy/

I'll address all your other comments in detail this weekend.
Thanks!
Anwar

@anwarnunez
Copy link
Contributor

anwarnunez commented Aug 20, 2018

Response to Review

openjournals/joss-reviews#890

Thanks for the detailed review! I've made all the requested changes and provided answers to the questions raised (see branch joss_review; https://github.com/gallantlab/cottoncandy/tree/joss_review). The requested changes have definitely made the software better!

I'd like to highlight one point. Our main contribution with cottoncandy is to provide convenience functions to use cloud-storage from within a pythonic environment (e.g. Jupyter) without saving or reading data to disk. This functionality is not provided by libraries recommended (e.g. rcloud, joblib) as far as we are aware. cottoncandy allows the user to seamlessly download and upload to a cloud-based storage solution from within a Jupyter session -- without accessing the user disk.

While cottoncandy currently implements only two back-ends additional back-ends can be added. The AWS S3 API is by far the most widely used cloud-storage solution and can be used outside of AWS (i.e. CEPH). We hope to incorporate support for a Google Cloud Storage back-end in the near future.

In all, we hope to have addressed all the issues raised in the review.

Best,
Anwar

Responses by point

Respository

For your versioning, please use semvar convention. Instead of 0.01 for example, it would be 0.0.1 or 0.1.0 (format is MAJOR.MINOR.PATCH). Note that this should correspond with the release here (and no v, alas!)

Fixed. I made a release candidate (0.1.0rc; https://github.com/gallantlab/cottoncandy/releases/tag/0.1.0rc). Once the review is approved, I will push the 0.1.0 release. I could also make a 0.1.0 release now and then a 0.1.1 if there are further changes needed. Let me know what you think.

For authors, I only see 4 on the paper, but I see 11 contributors here. The authors list likely corresponds to the list here, but could you please clarify who is who, and the contribution (we want to make sure someone with significant contribution is not forgotten!)

Clarification. The main developers are myself (AN) and JSG. The following list details the specific significant contributions made by the authors.

  • AN: Wrote and designed cottoncandy.
  • JSG: Helped design cottoncandy.
  • TZ: Implemented Google Drive back-end.
  • JLG: Funding.

It is generally difficult to draw sharp authorship lines with software. With respect to the publication of cottoncandy, I believe the list of co-authors reflects all the users who have made significant contributions. The rest of the contributions have been very beneficial(!) but more limited in scope.

Manuscript

Based on the many choices available, I would have expected more than AWS and then EXPERIMENTAL Google Drive.

The Google Drive back-end support is not experimental. Apologies for the confusion. It was merged to master over a year ago (Aug 2, 2017). The note on the README was put there when it was first merged. It is now removed.

I think this should be made more true to that?

This is a good point. The first sentence is a high-level description of the software. Details about the specific cloud-based object stores that are currently supported are listed in the "Long description" section. We wish to include support for Azure and Google Cloud Storage in the near future. I'd like to maintain the generality of the first sentence to reflect the fact that cottoncandy is not specific to any one storage solution. Ideally, once we implement the Google Cloud Storage (or Azure) back-end this point will not be contentious (#64).

S3 API-enabled CEPH is included in the list, but it's still AWS correct? There is just then AWS and Google Drive (experimental?)

To clarify, CEPH is not AWS. CEPH is an open source storage solution with many capabilities (https://ceph.com/). CEPH can be configured to provide access to stored data using an S3 API (http://docs.ceph.com/docs/mimic/radosgw/). This is what is meant by "S3 API-enabled CEPH".
However, the point raised is well-taken: Only S3 and Google Drive back-ends are provided.

User interaction

Do the authors intended to provide the package on pypi, and if not, why? Either way, could the authors also offer an install example

Fixed. Sure thing! I've pushed cottoncandy to pip (https://pypi.org/project/cottoncandy/). 0.1.0rc is currently there and I can submit a 0.1.0 once all the changes pertaining to the review are approved. Installation instructions now contain pip as well as the previous method.

It's generally not good practice to be installing to system python with sudo, or at least it's something I did when new to Python (and eventually never again!) Could the authors please explain the need for system python over:

Clarification. There are situations where it is appropriate to install packages system wide (with sudo or root). When managing shared computing resources, it can be impractical to allow multiple virtual environments per user since this will require a lot of disk space. There are also cases where it can be useful to use a bleeding edge package (e.g. master) in production across multiple machine (with shared storage), which can be achieved with sudo python setup.py develop.

However, I agree that these use-cases are not widely adopted. I have therefore also provided the more conventional pip instructions as suggested.

Could you explain the logic of your setup.py, using the defaults.cfg and configparser? I am mostly not familiar with this

Clarification. The configuration file is much more flexible than a MANIFEST.ni. The configuration file allows the user to specify the default values for some variables in the software. The configuration file is loaded every time the software is imported, not only during installation.
For example, the user can specify the threshold at which multi-part uploads are used (100MB by default; https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/defaults.cfg#L18), or the default permissions to use for uploaded objects (https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/defaults.cfg#L9).

I'm guessing that the install tries to sniff these from the environment. It might be good to add this detail so the user isn't alarmed about scripts sneaking around in places they should not be. :)

Fixed. Yes, good point! I have added a more visible warning statement on the README. I can also set this to False by default if the risk/convenience trade-off of this default setting is deemed to high.

I don't see a file (TAB does not complete!) Should there be one?

Details. Is the software being installed using MAC OS? If so, the configuration file will be stored under ~/Library/Application Support/cottoncandy/options.cfg. I've added more details to the README (see "Configuration file" subsection). We also took this opportunity to make sure that cottoncandy works on an anaconda environment under MAC OS.

Depending on how the user installed it (sudo vs. not) if this file is to be generated, it could be problematic for the effective user home.

Clarification. Good point! I haven't encountered this problem under Ubuntu (i.e. even under sudo installation, the configuration is for the user not root). I'm not sure how other *NIX systems handle this. I imagine this can be an issue for some OSs. At any rate, using pip to install cottoncandy fixes this potential issue as you suggested.

It's also generally nice to know the extent to which a secret is being saved. For example, it's important to me to know if I don't have them exported (but define them later) if they are written to that config file. I may not want this to be the case.

Fixed. I've added a note on the README to inform the user that the keys will be stored in a configuration file. The keys can also be written into the configuration file by the user. Furthermore, the keys and bucket can be specified directly in the function cottoncandy.get_interface(). Again, I can change this default to False if the risk/convenience trade-off of this default setting is deemed too high.

Could you please clarify when the file is generated, where it is generated, where the configuration values come from, and when for the user?

Clarification. The configuration file is created upon installation. The default configuration file (https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/defaults.cfg) is copied to the user directory (see the "Configuration file" subsection on the README). If the user keys are found upon installation, the keys will also be included in the configuration file. The default configuration file contains the default values for a number of options (https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/defaults.cfg). The configuration file is read every time cottoncandy is used (upon import cottoncandy). This assures that the latest user-defined options are used.

I'd like to ask if someone from your team could share a testing AWS IAM credential for me to use?

This is tricky because our system is closed-off. However, the tests can also be run using the Google Drive back-end. Setting up Google Drive can be a pain, but we'be included instructions to do so (https://github.com/gallantlab/cottoncandy/blob/master/google_drive_setup_instructions.md). Note that AFAIR Google Drive throttles upload/download speeds to 1Mb.
Currently, automatic testing is set up with Travis CI (https://travis-ci.org/gallantlab/cottoncandy/branches).

If the Google Drive or Travis options do not suffice, please let me know and we can think of other solutions.

I was expecting based on the title to have storage clients beyond s3. I'm also worried about the big EXPERIMENTAL (hence why I'm not testing this) in caps, could you give detail as to why these are experimental? It would be good to have them as production ready on the publish of this paper. It's also slightly alarming to see things like:

Fixed. Apologies for that. I wanted to communicate to users that they may incur costs if they ran the tests themselves (in the order of a couple of cents). I changed this to a less alarming warning (https://github.com/gallantlab/cottoncandy/blob/joss_review/cottoncandy/tests/test_roundtrip.py#L37).

Documentation

The version here is also specified differently than the tagged release (there is a v) and not using semvar. Would the authors be able to be consistent between the places?

Fixed. Sorry about that! This is now reflected in the joss_review branch: fd2e6e0

Based on the upload functions it looks like you are converting to bytes (with String.io) and then uploading the blog. Then to download, you are reversing the process.
[i] How is this going to work between versions of numpy?
[ii] What is the benefit of doing this for the user, versus allowing them to save (with joblib, pickle, other) their own object to upload?
[iii] what about different versions and formats of numpy arrays (e.g., pandas or other?)
[iv] What about versions of Python? (Python 2 vs. Python 3?)

    1. cottoncandy works across multiple versions of numpy. This is because the numpy array format is stable and does not change across numpy versions (https://github.com/numpy/numpy/blob/master/numpy/lib/format.py). Basically, the array data is stored as bytes with some header information. We are currently separating the header information and storing it in the object metadata with the function cottoncandy.upload_raw_array(). The function cottoncandy.upload_npy_array() uploads the array data to an object with the header in the byte stream.
    1. The benefit is huge and this is the main reason to use cottoncandy. cottoncandy eliminates the cost of constantly downloading cloud data to disk and then reading from disk into a Python session. This is not a big issue when using cloud storage as a backup or intermittently. In situations where the cloud storage is the only solution available, the overhead of constantly saving and reading from disk to and from Python sessions becomes prohibitively expensive. (re pickle: cottoncandy provides functionality to stream pickle objects (https://github.com/gallantlab/cottoncandy/blob/joss_review/cottoncandy/interfaces.py#L446)).
    1. There are only two versions of numpy array formats. Version 2.0 is fully compatible with 1.0 and vice versa except for very large structured arrays (https://github.com/numpy/numpy/blob/master/numpy/lib/format.py#L136). Structured arrays are not supported by cottoncandy, so this is not an issue. However, we do hope to incorporate them soon (handle recarrays #24). When that occurs, a simple flag can easily handle this case.
      re pandas: We do not currently support pandas.
    1. cottoncandy is fully python2 and python3 compatible (Travis build: https://travis-ci.org/gallantlab/cottoncandy)

API Documentation

This isn't essential, but I do with the documentation pages were as beautifully done as the README, in terms of design, color, etc. A user that is introduced to the software through the portal should get a professional feel. Otherwise, the functions are here and readthedocs is good!

Ah, yes! Apologies. I haven't gotten around to making them prettier and uploading the logo! Hopefully soon =)

Statement of Need

I definitely there is need for a unified interface to interact with storage, and from python, but I would like to see a little more development done to provide (minimally) production clients for AWS, Google Drive, and perhaps one more largely used storage backend (my first thoughts were Google Storage and Azure as well). The statement of need in the manuscript is compelling, but the software itself isn't quite ready to match it. But it's okay! This can be developed to get cottoncandy in tip top shape.

cottoncandy has been used in production in our lab for more than 3 years (on Github for 2). We have stored approximately 181TB of data using cottoncandy. Similarly, the Google Drive has been used in production for the last year for conducting data backups.
But the point is well taken. We do wish to include support for more back-ends. Google Storage actually has very similar semantics to S3 (https://cloud.google.com/storage/docs/boto-plugin#uploading-objects). Supporting that back-end should be straightforward. I have created an issue for this (#64).

I think this is a really cool tool for a lab (and the name is fantastic!) but in that it requires wrapping other APIs to interact with services, but the current implementation doesn't provide additional functionality other than dumping and loading a numpy array.

Details. Just to clarify: cottoncandy doesn't dump/load numpy array data in the disk storage sense. It streams to and from the object store into memory directly. This also applies to other data formats (e.g. pickle, JSON).

For a user that is using AWS or Google storage endpoints, in that these Python libraries (boto, google-storage, google-drive-python etc.) change remarkably quickly, I would advise them to use the libraries directly to save the files as they are generated and avoid the additional layer.

Response. I agree with the problem. There is a lot of overhead in using these libraries that are not immediately apparent. cottoncandy actually provides a solution to these problems.
For example, there is a lot of boiler plate code that needs to be written when uploading large data with boto to AWS. This is handled using a multi-part uploads (MPU). cottoncandy automatically handles the MPU for the user (https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/s3client.py#L383). cottoncandy also handles client-side streaming compression and decompression, which decreases storage and bandwidth requirements. This is implemented to require no more memory than the size required for the buffer (https://github.com/gallantlab/cottoncandy/blob/joss_review/cottoncandy/utils.py#L429).

The "competitor" for this would be something like rclone. Did the authors consider providing a python wrapper to that?

I didn't know about this. Looks like a great package! Alas, from what I can see, rclone depends on disk storage: "Rclone is a command line program to sync files and directories to and from".
This would also involve reading/saving from disk and then to the object store.

Community Guidelines

Fixed. I added a "Contributing" section to the README.

@vsoch
Copy link
Author

vsoch commented Aug 20, 2018

Reviewer Response

I will start the second part of my review - for reference I have cloned this branch and installed cottoncandy as follows:

$ git clone -b joss_review https://github.com/gallantlab/cottoncandy
$ cd cottoncandy
$ python setup.py install

Review Metadata

Reviewer Response

  • I too agree that the software is much better, solid work @anwarnunez!

I'd like to highlight one point. Our main contribution with cottoncandy is to provide convenience functions to use cloud-storage from within a pythonic environment (e.g. Jupyter) without saving or reading data to disk. This functionality is not provided by libraries recommended (e.g. rcloud, joblib) as far as we are aware. cottoncandy allows the user to seamlessly download and upload to a cloud-based storage solution from within a Jupyter session -- without accessing the user disk.

I am going to push you further on this - to either provide additional "main contribution" points, evidence that cottoncandy does this better than existing options, or just mentioning that they exist (and then writing about the other options in your paper and how they compare). I can offer to help, because I've used some of the other Python software, and notably they allow for writing of objects (as blobs) to storage without needing to interact with the local filesystem. For example, if you look in the Google Cloud Storage getting started tutorial, you see use of the library cloudstorage. This library of course has:

convenience functions to use cloud storage

from cloudstorage.drivers.<driver> import *

without reading data to disk

This logically isn't needed, because the operation before uploading a "blob" to storage is already reading in the file. E.g. you read in as binary data, and then stream from the file, directly as a stream, or a string object. If you look at the various library implementations, they are very general (any binary / blob), so maybe the cottoncandy "spin" would be to better develop the functions to be for data science objects? I don't know if this is a good idea (it might be terrible) but if the other libraries do a good job of loading / unloading general blobs from Python (could be dict or numpy array, we don't care) I would ask what they don't do well. They don't do well to capture metadata about the data frame, such as the environment it was generated in, versions of things, or maybe even metadata about the table / headers (if pandas). Cottoncandy would add a layer of richness (and novely, utility) if it offered these kinds of things beyond just moving around blobs.

There are notably functionality that (possibly) is provided via the above (or the underlying libraries) that cotton candy doesn't provide. For example, exponential backoff is normally done with the retrying library, and if the other solutions don't offer it, cottoncandy might. I am not sure about the current state of this, so I posted an issue so we both can find out. I'm curious - if cottoncandy doesn't have it, how often was it that you had issues with uploading due to network, etc.?

Other than the individual clients, another notable wrapper akin to cotton candy is Apache libcloud. If you do a Google search a few similar pop up too. Anyway - I don't mean to say you need to invent a completely new thing, but to properly review the alternatives, the features that are notably different, and (the goal) is to give the rationale that I as a user would choose cottoncandy over the other options. From my general review, I don't see that cottoncandy offers me anything more than what is available, and maybe is limited in not having exponential backoff and a broader arrange of client options.

While cottoncandy currently implements only two back-ends additional back-ends can be added. The AWS S3 API is by far the most widely used cloud-storage solution and can be used outside of AWS (i.e. CEPH). We hope to incorporate support for a Google Cloud Storage back-end in the near future.

If adding a backend comes down to supporting the underlying modules for it, isn't this what the cloudstorage module is already doing? Could you use this as a base and add the "syntactical sugar" functions to it to provide metrics about the dataframes, etc. for the user?

Responses By Point

  • Versioning Looks fantastic! Thank you for fixing this up. I am in agreement about having the 0.1.0rc and then 0.1.0 when the first release is published (or whichever marginal increment from that winds up being the chosen one :P).
  • Authors Thank you for the clarification, and I definitely understand this is a challenging point! I don't know if @whedon and @arfon have a good place for this in mind, but lots of journals have some kind of "who did what" summary, maybe JoSS could have a suggested way to do this too? I had originally been biased to say "Well, the record of contribution is in Github commits!" but on the other hand, the humanly generated list is meaningful to say "These are the contributors we assert to have made significant contribution to warrant being on the paper, and here is why."
  • Thank you for the clarification on CEPH and the configuration files!
  • https://pypi.org/project/cottoncandy/ great!

Clarification. There are situations where it is appropriate to install packages system wide (with sudo or root). When managing shared computing resources

In research computing we use a module system where it's (still) not a system python, but one installed custom. It does require root given the location, but arguably the documentation should be written with the user in mind that would be installing either to python on a shared resources (where he/she doesn't have root) or encouraged to be working reproducibly with something like a virtual environment, in which case root still wouldn't be used. I can see arguments for both, but I see stronger case for encouraging the user to not use it.

However, I agree that these use-cases are not widely adopted. I have therefore also provided the more conventional pip instructions as suggested.

Ah, ok sounds good :)

Installation

Okay, here is some more digging into installation and configuration issues. First an edge case of having installed with python setup.py install, and then trying with pip:

$ pip install cottoncandy
Requirement already satisfied: cottoncandy in /home/vanessa/anaconda3/lib/python3.6/site-packages (0.1)

Note that the install above is with the joss_review branch and I cannot find the configuration file in the expected location. and Clarification I'm not on MacOS - I'm running from Ubuntu 16.04. I'm using Anaconda, however. Should I look elsewhere? Next I tried
uninstalling (with pip)

vanessa@vanessa-ThinkPad-T460s:~/Documents/Dropbox/Research/Reviews/2018/joss/cottoncandy-gallantlab$ pip uninstall cottoncandy
Cannot uninstall 'cottoncandy'. It is a distutils installed project and thus we cannot accurately determine...

So I removed manually. And then did a clean install with pip (note that this is pip from anaconda). The configuration file still wasn't there!
So I tried pip with the --user directive:

pip install --user cottoncandy

It's still not there. What am I doing wrong? After I can find this file I can look over this point again:

It's also generally nice to know the extent to which a secret is being saved. For example, it's important to me to know if I don't have them exported (but define them later) if they are written to that config file. I may not want this to be the case.
Fixed. I've added a note on the README to inform the user that the keys will be stored in a configuration file. The keys can also be written into the configuration file by the user. Furthermore, the keys and bucket can be specified directly in the function cottoncandy.get_interface(). Again, I can change this default to False if the risk/convenience trade-off of this default setting is deemed too high.

I could test out Drive once we have more discussion on the above. I'd like to hold off final testing until we are sure it's good to go!

Contributing looks good, and my other issues are addressed! Looking forward to hearing more!

@anwarnunez
Copy link
Contributor

Thanks for the comments!
I'll reply in more detail later in the week.
For now, I just wanted to apologize for the conda install problems.
I've made an issue (#65) and will get this sorted ASAP.

@vsoch
Copy link
Author

vsoch commented Aug 20, 2018

Oh goodness, no need to apologize! We are working on this as a team :)

@vsoch
Copy link
Author

vsoch commented Aug 21, 2018

Tada! Update - it is correct that the config is created on first import! I just did a fresh install (no config) and then imported, exited, and then...

$ ls /home/vanessa/.config/cottoncandy/options.cfg
  /home/vanessa/.config/cottoncandy/options.cfg

This probably makes sense, because if an admin is installing globally for users, this would show up (for the individual user) on import. So maybe let's make this stated clearly in the instructions, and also a question - is there a need for an "admin config" sort of deal, and then a user config to customize further? For example, the admin cares about upload / download / encryption stuffs, but the user (mostly) cares about credentials I would guess. The next question would be priority - should the admin's defaults take preference? This is (sort of) an approach that Singularity takes - the user has total decision to set binds, and similar, but the admin gets to control how that works.

@vsoch
Copy link
Author

vsoch commented Aug 21, 2018

See scottwernervt/cloudstorage#15 (comment) about exponential backoff - it looks like boto3 might implement it, so let's take note that whatever libraries are used in cottoncandy, the effort is taken to have exponential backoff (and maybe if possible, multiprocessing for downloads?) Also- what about atomicity? I should have mentioned these things earlier, just remembering now :)

@anwarnunez
Copy link
Contributor

anwarnunez commented Aug 21, 2018

Excellent! Glad the pip install issue was resolved.

re configuration options for admins: Yes, I can make that happen! We've had this kinda setup before. A system-wide install can have a options.cfg file in the site/dist-packages directory. These values are copied to the user options.cfg upon first import. The user can thereafter modify these values from their ~/.config/cottoncandy/options.cfg. If a value in the user options does not exist, it is read from the system-wide options.cfg (e.g. encryption keys).
An admin can specify system-wide values by modifying the defaults.cfg file from the cloned repo and then performing a sudo python setup.py install kind of thing. This is along the lines of how numpy handles custom builds (https://github.com/numpy/numpy/blob/master/site.cfg.example). The difference is that in this setup the user's values get priority, not the admin. In this case, the user can still overwrite defaults =S. How do you implement that with Singularity?

re download issues: Yup, downloads can fail when the network is saturated (or if using AWS, whenever AWS is acting up). boto allows the user to specify specify timeouts and retries (It appears this is implemented with exponential delays: https://github.com/boto/botocore/blob/develop/botocore/data/_retry.json#L87 ).
One can already combine these options with cottoncandy:

import cottoncandy as cc                                                                   
from botocore.client import Config
config = Config(connect_timeout=60, read_timeout=60, retries=dict(max_attempts=10))
cci = cc.get_interface('mybucket', config=config)                                             

re multipart download: Yes! Sadly, this has been on my to-do list for 2 years =( I just haven't had the bandwidth to implement it (#11).

re atomicity: Not sure I know what this means exactly. Are you referring to checks on when objects are uploaded? At the moment, this is all left to the back-end API. For S3, the object is not updated unless the upload is successful. For Google Drive, I imagine the same but I haven't tested this.

@vsoch
Copy link
Author

vsoch commented Aug 21, 2018

re configuration options for admins: Yes, I can make that happen! We've had this kinda setup before. A system-wide install can have a options.cfg file in the site/dist-packages directory. These values are copied to the user options.cfg upon first import. The user can thereafter modify these values from their ~/.config/cottoncandy/options.cfg. If a value in the user options does not exist, it is read from the system-wide options.cfg (e.g. encryption keys).

Awesome! Let's make sure this is nicely documented so an admin and user know:

  1. what happens with respect to writing files
  2. when it happens
  3. where they are! :)

An admin can specify system-wide values by modifying the defaults.cfg file from the cloned repo and then performing a sudo python setup.py install kind of thing. This is along the lines of how numpy handles custom builds (https://github.com/numpy/numpy/blob/master/site.cfg.example). The difference is that in this setup the user's values get priority, not the admin. In this case, the user can still overwrite defaults =S. How do you implement that with Singularity?

I think in the admin file it's easiest to have booleans that determine to allow (or disallow) user control of a variable or resource. E.g., see here. You can also have filters that limit what they are able to select from, and then when you load up the libraries just parse the admin config first, determine the logic, then the user config to finish up figuring out settings.

re download issues: Yup, downloads can fail when the network is saturated (or if using AWS, whenever AWS is acting up). boto allows the user to specify specify timeouts and retries (It appears this is implemented with exponential delays: https://github.com/boto/botocore/blob/develop/botocore/data/_retry.json#L87 ). One can already combine these options with cottoncandy:

import cottoncandy as cc
from botocore.client import Config
config = Config(connect_timeout=60, read_timeout=60, retries=dict(max_attempts=10))

cci = cc.get_interface('mybucket', config=config)

okay, I'm satisfied with this example! Let's make sure it's easy for a user to find if he/she wants to do the same.

re multipart download: Yes! Sadly, this has been on my to-do list for 2 years =( I just haven't had the bandwidth to implement it (#11).

Maybe there's an open source developer that could contribute to the library after this publication? wink

re atomicity: Not sure I know what this means exactly. Are you referring to checks on when objects are uploaded? At the moment, this is all left to the back-end API. For S3, the object is not updated unless the upload is successful. For Google Drive, I imagine the same but I haven't tested this.

This is related to downloading - making sure two processes don't try and write to the same file object at once. It would go with the previous point, so don't need to address (yet).

@anwarnunez
Copy link
Contributor

OK these points have been addressed (see 0.1.0rc3 also on pip)

config file
The README has info on when configuration file is written and where it is.
I added a message to inform the user about the config file upon first install (https://github.com/gallantlab/cottoncandy/blob/joss_review/cottoncandy/options.py#L85)

defaults
I implemented the changes we talked about regarding admin options.
However, w.r.t to the preference of system vs. user defaults, I haven't made those changes. Unfortunately, implementing something like that will require more modifications than I have bandwidth for at the moment (#66).

multipart download
That'd be great! =)

atomicity
There are no checks to ensure the same file is not overwritten by multiple processes. This is left to the user.

cloudstorage
I checked out the package. It looks very promising. I don't think it existed when we started working on cottoncandy. We definitely can implement support for this to access cloud stores. Or more generally, we could offload all the streaming to it. I'll keep an eye on. However, I'd have to do a lot more testing to make sure refactoring cottoncandy to rely on cloudstorage is worth the cost.

The primary goal of cottoncandy is to provide convenience functions to stream data to and from memory in a typical data science workflow (e.g. easy MPU uploads, compression, search). The cloudstorage package is nice in that it provides a unified interface to the streaming part, which is very useful. However, as far as I can tell it does not provide browsing or convenience functions for typical objects encountered in data science. cloudstorage appears more like a more general unifying API for cloudstorage that relies on boto3, google.cloud, etc. This is great! In the future we can think of incorporating cloudstorage to solve the API problem. Currently, cottoncandy only provides those capabilities for two back-ends whereas cloudstorage provides four.

The current text in the manuscript and the README examples reflect the fact that cottoncandy is useful for the scientific data analysis side. The goal is to simplify the use of cloud-storage solutions in data science workflows. We have provided capabilities for using different back-ends but that is not the emphasis. Is the last sentence the one you have an issue with? I've added a mention of cloudstorage to the last paragraph.

@vsoch
Copy link
Author

vsoch commented Aug 23, 2018

These changes look great! I’m at the doctor now but I would like to give one more look over and test when I get home, and likely sign off after that. Overall really great work on the fixes!

@vsoch
Copy link
Author

vsoch commented Aug 23, 2018

config

omg I love this!

In [1]: import cottoncandy

############################################################
Hi! Looks like this is your first time using cottoncandy.

Your cottoncandy configuration file will be stored in:
/home/vanessa/.config/cottoncandy/options.cfg

You can store your S3 or Google Drive credentials there,
and many other options.

Thanks for using cottoncandy!

defaults

I think this is good for now, because it's more likely that a single user is going to use it vs. having it installed at the system level (by an admin) and then used across users. If/when the time comes for this, it will be natural for someone to open an issue and you can address then. Given this, the atomicity issue is also not likely an issue, so I'm good :)

I'm going to look over the compiled pdf again, and will continue in the main thread!

@vsoch
Copy link
Author

vsoch commented Aug 24, 2018

The review is finished, and the paper accepted! Congrats! Closing.

@vsoch vsoch closed this as completed Aug 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants