
Using Storage

William Silversmith edited this page May 12, 2017 · 9 revisions

Storage is an interface that we use to abstract away various filesystems and cloud providers. You give it a provider layer path, and then you can download or upload files relative to that path.
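As a rough illustration of what "relative to that path" means, the sketch below splits a layer path into its provider, bucket, and prefix, then joins a file name onto it. The helper names `split_layer_path` and `resolve` are hypothetical and are not part of Storage; how Storage parses paths internally is not shown on this page.

```python
from urllib.parse import urlsplit


def split_layer_path(layer_path):
    """Split a layer path into (provider, bucket, prefix).

    Hypothetical helper for illustration only, not Storage's own parser.
    """
    parts = urlsplit(layer_path)
    return parts.scheme, parts.netloc, parts.path


def resolve(layer_path, file_path):
    """Join a file name onto a layer path with exactly one slash between them."""
    return layer_path.rstrip("/") + "/" + file_path.lstrip("/")


gs_parts = split_layer_path("gs://neuroglancer/removeme/read_write")
file_parts = split_layer_path("file:///tmp/removeme/read_write")
info_url = resolve("gs://neuroglancer/removeme/read_write", "info")
```

For a file:// path the bucket component is empty and the whole location ends up in the prefix.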

Storage provides (Python) multithreading to accelerate uploads and downloads over HTTP/1 connections. You can set the number of threads to use; 0 threads means everything runs on the main program thread. Using too many threads (somewhere between 64 and 128 on the test machine described below) causes a crash.
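To see why threads help with network-bound work, here is a minimal sketch that uses the standard library's thread pool. It is not Storage's implementation: `fake_fetch` is a hypothetical stand-in that sleeps to mimic request latency, and `fetch_all` only mirrors the convention that 0 threads means running on the calling thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(name, latency=0.02):
    """Hypothetical stand-in for one network request; sleeps to mimic latency."""
    time.sleep(latency)
    return name, b"payload"


def fetch_all(names, n_threads=0):
    """Fetch every name; n_threads=0 runs serially on the calling thread."""
    if n_threads == 0:
        return [fake_fetch(n) for n in names]
    # The pool's threads are joined when the with block exits.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(fake_fetch, names))


def timed(names, n_threads):
    """Return (elapsed seconds, results) for one fetch_all call."""
    start = time.perf_counter()
    results = fetch_all(names, n_threads)
    return time.perf_counter() - start, results


names = ["info"] * 20
serial_time, serial_results = timed(names, 0)
threaded_time, threaded_results = timed(names, 8)
```

Because each simulated request spends its time waiting rather than computing, the threaded run overlaps the waits and finishes sooner even though Python threads share a single core.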

Safety

By default, a storage instance spawns with a number of threads. Unfortunately, the __del__ method of the storage object is called inconsistently (probably because of the timing of garbage collection), so you should clean up explicitly once you are finished with a storage object. There are three methods for cleaning up, one of which is preferred:

  1. with statement
     with Storage(...) as stor:
         stor.put_file(...)
  2. storage.kill_threads()
     storage = Storage(...)
     files = storage.get_files(...)
     storage.kill_threads()
  3. No threads (no cleanup necessary)
     storage = Storage(..., n_threads=0)
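The with statement is dependable because cleanup runs when the block exits, even if an exception is raised, rather than whenever the garbage collector gets around to it. The sketch below shows that pattern on a hypothetical `ThreadedResource` class; it is an assumption-laden stand-in for an object that owns worker threads, not Storage's actual code.

```python
import queue
import threading


class ThreadedResource:
    """Hypothetical stand-in for an object that owns worker threads."""

    def __init__(self, n_threads=4):
        self._tasks = queue.Queue()
        self._threads = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(n_threads)
        ]
        for t in self._threads:
            t.start()

    def _work(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: stop this worker
                return
            task()

    def kill_threads(self):
        """Send one stop sentinel per worker, then wait for all to exit."""
        for _ in self._threads:
            self._tasks.put(None)
        for t in self._threads:
            t.join()

    def alive_threads(self):
        return sum(t.is_alive() for t in self._threads)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.kill_threads()
        return False  # do not swallow exceptions from the block


with ThreadedResource(n_threads=4) as res:
    started = res.alive_threads()
remaining = res.alive_threads()

# Cleanup also happens when the block raises.
try:
    with ThreadedResource(n_threads=2) as failing:
        raise ValueError("simulated failure")
except ValueError:
    pass
remaining_after_error = failing.alive_threads()
```

Calling `kill_threads()` by hand gives the same result, but only if every code path, including error paths, reaches the call.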

get_files Download Performance

We tested get_files on a 2014 dual-core MacBook Pro (2.4 GHz, SSD) over a decent wireless connection. Note that Python threads only use a single core.

The version tested was commit 26b3606240ca66d7dbe6def33aab4dba7bb316be

Service  Threads  Time (sec)
file     0        0.0036
file     2        0.0039
file     4        0.0037
file     8        0.0053
file     16       0.0045
file     32       0.0058
file     64       0.0070
gs       0        27.8455
gs       1        10.5758
gs       2        4.9513
gs       4        2.5868
gs       8        1.4941
gs       16       0.9418
gs       32       0.7500
gs       64       0.6997
S3       0        10.0914
S3       1        1.6661
S3       2        0.9482
S3       4        0.6604
S3       8        0.5300
S3       16       0.2337
S3       32       0.2419
S3       64       0.4772
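One way to read the table is to find the fastest thread count for each service and compare it with the 0-thread run. The short script below does that arithmetic on the timings listed above; the function names are illustrative only.

```python
# Timings copied from the table above: service -> {threads: seconds}.
timings = {
    "file": {0: 0.0036, 2: 0.0039, 4: 0.0037, 8: 0.0053,
             16: 0.0045, 32: 0.0058, 64: 0.0070},
    "gs": {0: 27.8455, 1: 10.5758, 2: 4.9513, 4: 2.5868,
           8: 1.4941, 16: 0.9418, 32: 0.7500, 64: 0.6997},
    "S3": {0: 10.0914, 1: 1.6661, 2: 0.9482, 4: 0.6604,
           8: 0.5300, 16: 0.2337, 32: 0.2419, 64: 0.4772},
}


def best_thread_count(service):
    """Thread count with the lowest measured time for a service."""
    runs = timings[service]
    return min(runs, key=runs.get)


def speedup_over_main_thread(service):
    """Ratio of the 0-thread time to the fastest measured time."""
    runs = timings[service]
    return runs[0] / runs[best_thread_count(service)]


best = {service: best_thread_count(service) for service in timings}
speedups = {service: speedup_over_main_thread(service) for service in timings}

for service in timings:
    print(service, best[service], round(speedups[service], 1))
```

On these numbers, local files gain nothing from threads, gs is fastest at 64 threads (roughly 40 times faster than the 0-thread run), and S3 peaks at 16 threads (roughly 43 times faster) before slowing down again at 64.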

The code used to generate the tests is listed below. The command to run the test is:

py.test -s -v python/test/test_storage.py

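import time

# Storage is assumed to be imported from the project's storage module;
# the import path is not shown on this page.


def test_performance():

    def run(url, num_threads):
        s = Storage(url, n_threads=num_threads)
        content = 'some_string'
        s.put_file('info', content, compress=False)
        s.wait_until_queue_empty()  # make sure the upload finished before timing

        # Time 50 downloads of the same file.
        start = time.time()
        s.get_files(['info' for _ in range(50)])
        end = time.time()

        s._kill_threads()  # clean up the worker threads

        return end - start

    urls = [
        "file:///tmp/removeme/read_write",
        "gs://neuroglancer/removeme/read_write",
        "s3://neuroglancer/removeme/read_write"
    ]

    for url in urls:
        # 0 (main thread only), then 1, 2, 4, ..., 64 threads
        n_threads = [0] + [2 ** i for i in range(0, 7)]
        for num in n_threads:
            delta = run(url, num)
            print(url, num, delta)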