Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make caching location optional. #546

Open
PengWenChen opened this issue Dec 22, 2023 · 9 comments
Open

Make caching location optional. #546

PengWenChen opened this issue Dec 22, 2023 · 9 comments

Comments

@PengWenChen
Copy link

Thanks for your great work!
Could this saving cache path be optional instead of always writes into /tmp/streaming ?

self._filelock_root = os.path.join(os.path.sep, 'tmp', 'streaming')

@knighton
Copy link
Contributor

Ah, you have a point there.

Would it be enough to use the official temp root of your operating system (say, os.path.join(tempfile.gettempdir(), 'streaming') IIRC?)

If not, what's your use case so we may better understand?

@PengWenChen
Copy link
Author

PengWenChen commented Dec 25, 2023

Hi @knighton. Thanks for your reply!
My working environment cannot access as root account.
If one of my partner runs your package first and somehow generates some cache files, such as /tmp/streaming/000000_barrier_filelock or /000000_locals, he or she cannot modified the read and write permissions of these generated files.
Thus, others who also want to run the same scripts could not access those files written under root directory.
Then PermissionError happens here.
Also, writing files under root directory is not allowed in my working space :(.

I have tried to modified self._filelock_root to my local directory under my account.
However, the process will stuck at the start of training.
So, I would like to ask if I can modify self._filelock_root. If so, what else should I modify? Thank you.

@PengWenChen
Copy link
Author

PengWenChen commented Dec 25, 2023

Hi @knighton.
I found another root path here:
https://github.com/mosaicml/streaming/blob/main/streaming/base/stream.py#L166

After modifying both self._filelock_root in dataset.py and root in stream.py, the scripts can successfully executed!
But I still want to confirm the correctness with you. Is there anything else I need to change? Thank you.

@PengWenChen
Copy link
Author

Hi there!
The sharedmemory seems not be cleaned up successfully by the first or other users and then the error occurs: Permission denied: '/00000_locals'.

I also found another closed issue: #429 is the same issue I met here. But it seems not being solved.

@knighton
Copy link
Contributor

After modifying both self._filelock_root in dataset.py and root in stream.py, the scripts can successfully executed!
But I still want to confirm the correctness with you. Is there anything else I need to change? Thank you.

I think you've fully gotten it (which is further supported by it running successfully), and also you are perfectly safe in making those changes you mentioned. Probably wise to have asked, as StreamingDataset has grown a bit complicated...

As you probably noticed, the temp root path in stream.py only comes into play when you do not provide local argument, letting it randomly generate a local for you.

The sharedmemory seems not be cleaned up successfully by the first or other users and then the error occurs: Permission denied: '/00000_locals'.

Leftover shared memory between runs that you have to clean up yourself is unfortunately something that happens when Python processes using shared memory die badly. We have gradually improved on this front over time, but it's far from solved completely. In the meantime, you can manually clear any stale shared memory objects by calling this method: https://github.com/mosaicml/streaming/blob/main/streaming/base/util.py#L169

If you meant you are already calling that method and encountering that permissions problem, that is all relating to the fact that Streaming was originally built to be run on ephemeral training jobs as root. On startup, StreamingDataset replicas do some registration and safety checks to prevent different runs clobbering the same dirs. Unfortunately, this was built for a world where everyone can write files to shared memory and these files can be read back by anyone.

As the original author of that difficult piece of nonsense, let me tell you that combined with patching filelock root, I think if you simply disable the checks and are just very careful about concurrent training jobs (and zombie processes thereof), you will be able to run this as non-root just fine.

@knighton
Copy link
Contributor

Specifically, you could remove this bit of code:

streams_local = [os.path.abspath(os.path.join(x.local, x.split)) for x in streams]
streams_remote = [
os.path.join(x.remote, x.split) if x.remote is not None else None for x in streams
]
self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
world)

And replace it with something like

self._shm_prefix_int = 8  # I'm feeling lucky

If on Mac OSX, shmem paths need to be quite short. I believe Linux and the like are not so limited.

@PengWenChen
Copy link
Author

Hi @Skylion007. Thanks for your reply.
I haven't tried this to force destroy the leftover shared memory. If I encounter the same problem in the future, I will give it a try.
https://github.com/mosaicml/streaming/blob/main/streaming/base/util.py#L169

Now I can execute without shared memory permission denied issue that mentioned above by changing the constant names directly.. (for example: Add some ID to all the constants, like localsPW.)
Though its a very aggressive approach but it works...
If there are any risks with this approach, please let me know!
https://github.com/mosaicml/streaming/blob/main/streaming/base/constant.py

And thanks for the advice to remove code. I will try it.

@knighton
Copy link
Contributor

Is your /tmp or equivalent directory world-readable and writeable? I'm thinking if we switched to files for registration and ensured 777 perms, it should be fine cross-user?

If there are any risks with this approach, please let me know!

You are clear.

@PengWenChen
Copy link
Author

For these path modification: Yes. My path is readable and writeable, and I can change the permission by chmod either.
#546 (comment)

As for /000000_locals issue, I failed to change the saving directory of shm (build from Python Multiprocessing), so I changed the saving name instead -> Change all the constant name by adding some ID that refers to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants