
OSError: [Errno 24] Too many open files when RandomForestRegressor has 140 estimators #22

Open · ollieglass opened this issue Nov 7, 2016 · 4 comments

@ollieglass (Contributor)

Here's a loop that fits and compiles trees, stepping up the number of estimators each time:

from sklearn import datasets, ensemble
import compiledtrees

data = datasets.load_boston()
X, y = data.data, data.target

for i in range(20, 250, 20):
    print(i)

    model = ensemble.RandomForestRegressor(n_jobs=4, n_estimators=i)
    model.fit(X, y)

    model = compiledtrees.CompiledRegressionPredictor(model)

    h = model.predict(X)

It crashes on 140:

$ python test_script.py 
20
40
60
80
100
120
140
Traceback (most recent call last):
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/joblib/_parallel_backends.py", line 344, in __call__
    return self.func(*args, **kwargs)
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/compiledtrees/code_gen.py", line 173, in _compile
    _call([CXX_COMPILER, cpp_f, "-c", "-fPIC", "-o", o_f.name, "-O3", "-pipe"])
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/compiledtrees/code_gen.py", line 179, in _call
    shell=True, stdout=DEVNULL, stderr=DEVNULL)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 576, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 557, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 1454, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files

This is on macOS.

I haven't looked into workarounds - perhaps I can increase the number of files that can be open at once. But if there's a way to limit the number of open files inside the library, that would probably be better.

@ollieglass (Contributor, Author)

I had a look at code_gen.py. Perhaps the CodeGenerator class could build up a string instead of opening and writing to a file. When the .file method is called, it could write the string to a file, close it and return the name; see the sketch below.
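Something like this, roughly (just a sketch of the idea; the write/file names here are made up, not the actual code_gen.py API):

import io
import tempfile

class CodeGenerator(object):
    def __init__(self):
        self._buffer = io.StringIO()  # accumulate source in memory
        self._indent = 0

    def write(self, line):
        self._buffer.write("  " * self._indent + line + "\n")

    def file(self):
        # Only now touch the filesystem: dump the buffer, close the
        # handle straight away, and return just the file name.
        f = tempfile.NamedTemporaryFile(
            prefix='compiledtrees_', suffix='.cpp', delete=False)
        f.write(self._buffer.getvalue().encode('utf-8'))
        f.close()
        return f.name

That way no file descriptor stays open between code generation and compilation.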

@mwojcikowski (Collaborator) commented Nov 8, 2016

On Linux and macOS you can simply issue ulimit -n 2048. By design, compiling the trees consumes 2 * n_trees + 2 open files, so at 140 estimators that is 2 * 140 + 2 = 282 files, which is over the default macOS soft limit of 256.

On Windows there is no way to raise the limit globally, but there is an in-process workaround, which you have to include in your script:

import platform

if platform.system() == 'Windows':
    import win32file
    win32file._setmaxstdio(2048)
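On Linux and macOS the same thing can also be done from inside the script with the standard resource module, if you'd rather not touch the shell configuration (a rough sketch; it assumes the hard limit allows 2048):

import resource

# Raise this process's soft open-file limit up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < 2048:
    resource.setrlimit(resource.RLIMIT_NOFILE, (2048, hard))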

I used to write a single .cpp file, but that didn't work for large forests, especially if you have lots of data and allow for full growth. In my example this translated to 500 .cpp files of over 100 MB each (50 GB+ of RAM). Keeping all those files in StringIOs would probably work, although the .o files would still be there, so we would only go down to n_trees + 2 open files (assuming we successfully close/delete the source files after compiling them to .o).

To sum up: I don't regard this as an issue, and working around it would probably cost a lot of RAM in return, which ultimately is a deal-breaker (at least for me).

@ollieglass (Contributor, Author) commented Nov 8, 2016

I see what you mean. I've fixed the problem for myself; as you say, it isn't hard.

I am concerned that users could be put off by this. How about an informative error for them, like this?

import sys
import tempfile

class CodeGenerator(object):
    def __init__(self):
        try:
            self._file = tempfile.NamedTemporaryFile(
                prefix='compiledtrees_', suffix='.cpp', delete=True)
        except OSError as e:
            if e.errno == 24:  # EMFILE: too many open files
                print("Too many open files. Increase the limit to 2 * n_trees + 2 "
                      "(unix / mac: ulimit -n [limit], windows: http://bit.ly/2fAKnz0)",
                      file=sys.stderr)
            raise

        self._indent = 0

edit: added if

@mwojcikowski (Collaborator)

That might be a good solution if e.errno == 24 across platforms. If I remember correctly, on Windows I got some kind of "Permission Denied" errors instead, which were terrible to debug...

I fear, though, that we might catch some false positives.

Also, a unit test for that would be useful (see the hints above on changing the limits on all platforms); a sketch below.
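Something along these lines, perhaps (a Unix-only sketch using pytest and resource; the import path for CodeGenerator is a guess):

import resource

import pytest

from compiledtrees.code_gen import CodeGenerator  # assumed path

def test_emfile_gives_informative_error():
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Drop the soft limit so low that opening any new file
        # fails with EMFILE (errno 24).
        resource.setrlimit(resource.RLIMIT_NOFILE, (4, hard))
        with pytest.raises(OSError):
            [CodeGenerator() for _ in range(10)]
    finally:
        # Always restore the original limits.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))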
