multiple users on same system encounter permissions errors #429

Open
acutkosky opened this issue Sep 11, 2023 · 8 comments · May be fixed by #430

@acutkosky
Contributor

Environment

  • OS: [Ubuntu 22.04]
  • Hardware (GPU, or instance type): no gpu

To reproduce

Steps to reproduce the behavior:

  1. Create a StreamingDataset
  2. Log in as a different user
  3. Create a StreamingDataset
  4. Encounter a permissions error when trying to write to /tmp/streaming

A similar issue related to shared memory that only happens if your process crashes:

  1. Create a StreamingDataset
  2. Kill the process in an unclean way so that Python doesn't have a chance to clean up shared memory (e.g. kill -9)
  3. Delete /tmp/streaming to prevent the first issue.
  4. Log in as a different user
  5. Create a StreamingDataset
  6. Encounter a permissions error on a shared memory resource (e.g. PermissionError: [Errno 13] Permission denied: '/000000_locals')

You can verify this with the following script:

from streaming import MDSWriter
from streaming import StreamingDataset
import tempfile
### generate a sample dataset...
# directory for a sample dataset
out_root = tempfile.TemporaryDirectory().name
local_root = tempfile.TemporaryDirectory().name
print("out_root: ",out_root)
# A dictionary of input fields to an Encoder/Decoder type
columns = {
    'anumber': 'int'
}

# some sample data
samples = [
    {
        'anumber': 123123
    }
    for _ in range(2)
]

# write out the sample data
print("writing data...")
with MDSWriter(out=out_root, columns=columns) as out:
    for sample in samples:
        out.write(sample)
### finished generating the sample dataset

print("creating dataset object...")
remote_dir = out_root
local_dir = local_root

dataset = StreamingDataset(remote=remote_dir, local=local_dir, split=None)
for x in enumerate(dataset):
    print(x)

input("wait for input (so you have a chance to kill the process in an unclean way)")

Expected behavior

No errors.

Additional context

StreamingDataset initialization creates the directory /tmp/streaming if it does not exist yet, so the first user owns that directory.
Subsequent users on the same system are then locked out unless the first user manually chmods the directory or the system cleans up /tmp/streaming.
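
To make the mechanism concrete, here is a minimal sketch (POSIX only, and not the library's own code) of how a directory created with os.makedirs ends up writable only by its creator. It assumes a umask of 0o022, which is a common default, and uses a throwaway directory rather than the real /tmp/streaming.

```python
import os
import stat
import tempfile

base = tempfile.mkdtemp()
old_umask = os.umask(0o022)  # assumed umask, set explicitly for the demo
try:
    target = os.path.join(base, "streaming")
    # The requested mode defaults to 0o777, but the umask is applied on top of it.
    os.makedirs(target)
    mode = stat.S_IMODE(os.stat(target).st_mode)
finally:
    os.umask(old_umask)  # restore the caller's umask

# True only if group or other users may create files in the directory.
others_can_write = bool(mode & (stat.S_IWGRP | stat.S_IWOTH))
print(oct(mode), others_can_write)
```

With that umask, the group and other write bits are cleared, so a second user cannot create files under the directory.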

A similar issue can happen with the SharedMemory objects in /dev/shm.
Using the clean_stale_shared_memory function doesn't fix this because it encounters the same permissions error.
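
The shared memory side can be inspected the same way. The sketch below is Linux-oriented and assumes POSIX shared memory is exposed under /dev/shm. It creates a segment with Python's multiprocessing.shared_memory and reads back the permission bits of the backing file. The helper name shm_file_mode is hypothetical.

```python
import os
import stat
from multiprocessing import shared_memory


def shm_file_mode(name: str):
    """Return the permission bits of /dev/shm/<name>, or None if that path does not exist."""
    path = os.path.join("/dev/shm", name.lstrip("/"))
    if not os.path.exists(path):
        return None  # e.g. a platform that does not expose segments under /dev/shm
    return stat.S_IMODE(os.stat(path).st_mode)


shm = shared_memory.SharedMemory(create=True, size=64)
try:
    mode = shm_file_mode(shm.name)
    print(shm.name, None if mode is None else oct(mode))
finally:
    shm.close()
    shm.unlink()  # remove the segment so nothing is left behind
```

As far as I know, CPython creates these segments with owner-only permissions, which would explain why a stale segment left by one user cannot be opened, or cleaned up, by another.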

@acutkosky acutkosky added the bug Something isn't working label Sep 11, 2023
@acutkosky acutkosky linked a pull request Sep 11, 2023 that will close this issue
@snarayan21
Collaborator

Hey! So we looked into this and weren't able to reproduce the first behavior, but we were able to reproduce the second (PermissionError: [Errno 13] Permission denied: '/000000_locals'). This happens because we need to access each existing SharedMemory file to check for potential collisions between the local directory names of different StreamingDatasets here. Without this check, multiple StreamingDatasets could point to the same local directories, messing up the samples.
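
As an illustration of where the error surfaces, the following sketch probes a shared memory name and reports whether it is unused, attachable, or owned by someone else. It is only a guess at the general shape of such a check, not the library's actual collision logic, and probe_shm is a hypothetical name.

```python
from multiprocessing import shared_memory


def probe_shm(name: str) -> str:
    """Classify a shared memory name as 'free', 'accessible', or 'denied'."""
    try:
        # Attach only; never create. On Python versions before 3.13 this also
        # registers the segment with the resource tracker, as far as I know.
        shm = shared_memory.SharedMemory(name=name, create=False)
    except FileNotFoundError:
        return "free"  # no segment with this name exists
    except PermissionError:
        return "denied"  # exists, but was created by a user we cannot read from
    shm.close()
    return "accessible"
```

A second user hitting a segment left behind by the first user would land in the "denied" branch, which matches the Errno 13 reported above.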

For this case, we would recommend making sure each user creating StreamingDatasets has the same permissions, or updating user permissions to make sure that /tmp/streaming and the SharedMemory files are accessible. Thanks for identifying the issue and submitting this PR!

@acutkosky
Contributor Author

Thanks for looking into it!

I'm confused how the first issue didn't replicate - my understanding of the issue was that this os.makedirs call made it so that /tmp/streaming is only writeable by the creating user. Was it perhaps the case for you that /tmp/streaming/ was already present and globally writeable rather than being created for the first time? This would explain the difference since the shared memory is cleaned up when the process exits normally and so would actually have been created by the first process. In my scenario /tmp/streaming did not exist yet.

For my system (an academic cluster), it's not easy to ensure that /tmp/streaming is accessible since I don't have any special privileges. Right now I've told my students to just chmod /tmp/streaming as part of every job they launch in case they are the first to be scheduled on some node, but if any other group starts using streaming then we'll have problems again.
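
The per-job chmod workaround could look roughly like the sketch below. The helper name open_up_dir and the sticky, world-writable mode 0o1777 (the mode /tmp itself typically uses) are assumptions for illustration. It only succeeds for the user who owns the directory, which is exactly the limitation described above.

```python
import os
import stat


def open_up_dir(path: str, mode: int = 0o1777) -> bool:
    """Create `path` if needed and try to set `mode`; return False if we do not own it."""
    os.makedirs(path, exist_ok=True)
    try:
        # chmod overrides whatever the umask left behind at creation time.
        os.chmod(path, mode)
    except PermissionError:
        return False  # another user created the directory first
    return True
```

A job script would call open_up_dir("/tmp/streaming") before constructing any StreamingDataset and treat a False return as a sign that another user got there first.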

@snarayan21
Collaborator

I just double-checked, and the /tmp/streaming directory was not present when creating the first streaming dataset. Even when creating equally permissioned users, and not killing the process in an unclean way (it's still running when the second streaming dataset is created), I can only reproduce the Permission denied: '/000000_locals' error. And that's due to the local directory checking that's needed before initializing every streaming dataset. If possible, could you send over the stack trace with the error for /tmp/streaming? Would love to take a look.

@karan6181
Collaborator

Closing the issue due to inactivity. Please feel free to re-open if you think this is still an issue.

@Skylion007
Contributor

Skylion007 commented Jan 22, 2024

@karan6181 Reopening this issue because I ran into the same problem on my university cluster. There is also an easy solution: instead of hardcoding /tmp/streaming, can we have it respect the TMPDIR env var so each user can set a custom location?
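
A minimal sketch of that suggestion, assuming a hypothetical helper named streaming_root (this is not the library's API, and the example path /scratch/alice is made up):

```python
import os


def streaming_root(environ=None) -> str:
    """Return $TMPDIR/streaming, falling back to /tmp/streaming when TMPDIR is unset or empty."""
    env = os.environ if environ is None else environ
    base = env.get("TMPDIR") or "/tmp"
    return os.path.join(base, "streaming")


print(streaming_root({"TMPDIR": "/scratch/alice"}))
print(streaming_root({}))
```

The standard library's tempfile.gettempdir() performs a similar lookup (it also consults TEMP and TMP), so it may be a reasonable alternative to reading the variable by hand.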

@Skylion007 Skylion007 reopened this Jan 22, 2024
@knighton
Contributor

#570 is merged.

@Oktai15

Oktai15 commented Mar 22, 2024

@knighton is it already in a release? I still have this problem (PermissionError: [Errno 13] Permission denied: '/000000_locals')

@snarayan21
Collaborator

@Oktai15 are you still seeing this with the latest version of streaming?
