Using Storage

Storage is an interface that abstracts away various filesystems and cloud providers. You give it a layer path that names the provider (for example `file://`, `gs://`, or `s3://`), and then you can download or upload files relative to that path.

Storage provides (Python) multithreading to accelerate uploads and downloads over HTTP/1 connections. You can set the number of threads with `n_threads`; 0 means everything runs on the main program thread. Using too many threads (somewhere between 64 and 128 on the test machine described below) will crash the program.
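
To illustrate why more threads help when each request is dominated by latency, here is a minimal sketch that does not use Storage at all. It simulates 20 requests of 10 ms each with `time.sleep` and runs them through the standard library's `ThreadPoolExecutor`. The names `fake_download`, `fetch_all`, and `timed`, and the latency figure, are hypothetical; Storage's internals may work differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(name, latency=0.01):
    """Stand-in for one network request: sleep, then return fake content."""
    time.sleep(latency)
    return name, b"content of " + name.encode()

def fetch_all(names, n_threads):
    """Fetch every name; n_threads=0 runs everything on the calling thread."""
    if n_threads == 0:
        return [fake_download(n) for n in names]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() returns results in the same order as the input names.
        return list(pool.map(fake_download, names))

def timed(names, n_threads):
    """Return (results, elapsed seconds) for one fetch_all call."""
    start = time.perf_counter()
    results = fetch_all(names, n_threads)
    return results, time.perf_counter() - start

names = ["file_%d" % i for i in range(20)]
serial_results, serial_time = timed(names, 0)
threaded_results, threaded_time = timed(names, 8)
print("0 threads: %.3fs, 8 threads: %.3fs" % (serial_time, threaded_time))
```

The threaded run waits on several simulated requests at once, which is the same effect the thread count has on real high-latency downloads.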

Thread Disposal

By default, a Storage instance spawns a number of threads. Unfortunately, the `__del__` method of the Storage object is called inconsistently (probably because of the timing of garbage collection), so you should clean up explicitly when you are finished with a Storage object. There are three ways to do this:

with statement (preferred)

with Storage(...) as stor:
   stor.put_file(...)

storage.kill_threads()

storage = Storage(...)
files = storage.get_files(...)
storage.kill_threads()

No threads (no cleanup necessary)

storage = Storage(..., n_threads=0)
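
The with statement is the most reliable of the three because `__exit__` runs as soon as the block ends, even if the body raises, while `__del__` runs only when the object happens to be collected. The sketch below shows the pattern on a hypothetical `WorkerPool` class. It is not Storage's actual implementation; the class, its `submit` method, and the sentinel-based shutdown are assumptions made for illustration.

```python
import queue
import threading

class WorkerPool:
    """Hypothetical stand-in for an object that owns worker threads."""

    def __init__(self, n_threads=4):
        self._tasks = queue.Queue()
        self._threads = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(n_threads)
        ]
        for t in self._threads:
            t.start()

    def _work(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: stop this worker
                return
            task()

    def submit(self, task):
        self._tasks.put(task)

    def kill_threads(self):
        """Send one stop sentinel per worker, then wait for all to exit."""
        for _ in self._threads:
            self._tasks.put(None)
        for t in self._threads:
            t.join()
        self._threads = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.kill_threads()  # runs even if the body raised
        return False         # do not swallow exceptions

results = []
with WorkerPool(n_threads=4) as pool:
    workers = list(pool._threads)  # keep references to check them afterwards
    for i in range(10):
        pool.submit(lambda i=i: results.append(i * i))

# The sentinels are queued after the tasks, so every task has run by now.
print(sorted(results))
print(any(t.is_alive() for t in workers))
```

Because cleanup is tied to leaving the block rather than to garbage collection, no worker thread outlives the with statement.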

`get_files` Download Performance

We tested `get_files` on a 2014 MacBook Pro (dual-core 2.4 GHz, SSD) over a decent wireless connection. Each run times 50 downloads of the same small, uncompressed file.

The version tested was commit 26b3606240ca66d7dbe6def33aab4dba7bb316be

Service	Threads	Time (sec)
file	0	0.0036
file	2	0.0039
file	4	0.0037
file	8	0.0053
file	16	0.0045
file	32	0.0058
file	64	0.0070
gs	0	27.8455
gs	1	10.5758
gs	2	4.9513
gs	4	2.5868
gs	8	1.4941
gs	16	0.9418
gs	32	0.7500
gs	64	0.6997
S3	0	10.0914
S3	1	1.6661
S3	2	0.9482
S3	4	0.6604
S3	8	0.5300
S3	16	0.2337
S3	32	0.2419
S3	64	0.4772

Chart: https://drive.google.com/file/d/0B1ZqbPNMA3DaTlNxTExIeFJxdVk/view?usp=sharing
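
To make the table easier to read, the following sketch computes, for each service, the fastest thread count and its speedup over 0 threads. The numbers are copied from the table above; the `timings` dictionary and the `summarize` helper are hypothetical names introduced here. Each figure comes from a single timed run, so treat the ratios as indicative only.

```python
# Times in seconds, copied from the table: service -> {threads: time}
timings = {
    "file": {0: 0.0036, 2: 0.0039, 4: 0.0037, 8: 0.0053,
             16: 0.0045, 32: 0.0058, 64: 0.0070},
    "gs": {0: 27.8455, 1: 10.5758, 2: 4.9513, 4: 2.5868,
           8: 1.4941, 16: 0.9418, 32: 0.7500, 64: 0.6997},
    "S3": {0: 10.0914, 1: 1.6661, 2: 0.9482, 4: 0.6604,
           8: 0.5300, 16: 0.2337, 32: 0.2419, 64: 0.4772},
}

def summarize(by_threads):
    """Return (fastest thread count, speedup relative to 0 threads)."""
    baseline = by_threads[0]
    best_threads = min(by_threads, key=by_threads.get)
    return best_threads, baseline / by_threads[best_threads]

for service, by_threads in timings.items():
    best, speedup = summarize(by_threads)
    print("%-4s fastest at %2d threads, %.1fx faster than 0 threads"
          % (service, best, speedup))
```

In these measurements local files gain nothing from threading, gs keeps improving up to 64 threads, and S3 is fastest at 16 threads and slows down again at 64.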

The code that produced these results is listed below. The command to run the test is:

py.test -s -v python/test/test_storage.py

import time

# Storage must also be imported from this project's package before running.

def test_performance():

    def run(url, num_threads):
        s = Storage(url, n_threads=num_threads)
        content = 'some_string'
        s.put_file('info', content, compress=False)
        s.wait_until_queue_empty()  # make sure the upload has finished

        # Time 50 downloads of the same small file.
        start = time.time()
        s.get_files([ 'info' for _ in range(50) ])
        end = time.time()

        s._kill_threads()

        return end - start


    urls = [
        "file:///tmp/removeme/read_write",
        "gs://neuroglancer/removeme/read_write",
        "s3://neuroglancer/removeme/read_write"
    ]


    for url in urls:
        # 0 threads, then 1, 2, 4, ..., 64
        n_threads = [ 0 ] + [ 2 ** i for i in range(0, 7) ]
        for num in n_threads:
            delta = run(url, num)
            print(url, num, delta)
