Error when trying to upload file with custom name containing slashes #336

Open
maxnoe opened this issue Mar 27, 2024 · 10 comments

@maxnoe
Contributor

maxnoe commented Mar 27, 2024

Description

Searching around a bit, I found a couple of issues related to LFNs containing slashes, but these seem to be resolved, so I was expecting this to be supported now.

However, in a simple test setup, like the integration tests, this does not work.

Steps to reproduce

In the docker based test setup, run:

[root@f5fb292ddad1 rucio]# head -c5M /dev/urandom > test1.dat
[root@f5fb292ddad1 rucio]# rucio upload test1.dat --scope test --rse XRD1 --name 'foo/bar/baz.dat' 
2024-03-27 13:02:25,608	INFO	Preparing upload for file test1.dat
2024-03-27 13:02:25,641	ERROR	Provided object does not match schema.
Details: Problem validating dids: 'foo/bar/baz.dat' does not match '^[A-Za-z0-9][A-Za-z0-9\\.\\-\\_]{1,250}$'

Failed validating 'pattern' in schema['items']['properties']['name']:
    {'description': 'Data Identifier name',
     'pattern': '^[A-Za-z0-9][A-Za-z0-9\\.\\-\\_]{1,250}$',
     'type': 'string'}

On instance[0]['name']:
    'foo/bar/baz.dat'

Rucio Version

Current master, also tested with 33.6 and 34.0

Additional Information

No response

@maxnoe maxnoe added the bug Something isn't working label Mar 27, 2024
@panta-123

panta-123 commented Apr 1, 2024

It looks like you are using generic rucio schema: https://github.com/rucio/rucio/blob/master/lib/rucio/common/schema/generic.py#L59

The pattern '^[A-Za-z0-9][A-Za-z0-9\\.\\-\\_]{1,250}$' doesn't allow forward slashes (/).
You need to use a schema that allows /.
For example, the cms and belleii schemas allow forward slashes. Each experiment can define its own schema and other policies.

cms schema: https://github.com/rucio/rucio/blob/master/lib/rucio/common/schema/cms.py#L55-L63
belle ii schema: https://github.com/rucio/rucio/blob/master/lib/rucio/common/schema/belleii.py

You can enable a schema in the configuration as follows:

[policy]
schema = cms

You can add your own schema using a policy package, as described in: https://rucio.cern.ch/documentation/operator/policy_packages/
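
For illustration, the rejection can be reproduced outside Rucio with Python's `re` module. This is only a sketch: the first pattern is the one quoted in the error message, while `PERMISSIVE_NAME_PATTERN` is a hypothetical variant that adds `/` to the allowed characters. It is not copied from the cms or belleii schemas and is not a recommendation for a production policy package.

```python
import re

# Pattern quoted in the error message from the generic schema.
GENERIC_NAME_PATTERN = r"^[A-Za-z0-9][A-Za-z0-9\.\-\_]{1,250}$"

# HYPOTHETICAL pattern for illustration only: same as above, but with '/'
# added to the allowed characters. Not taken from any Rucio schema.
PERMISSIVE_NAME_PATTERN = r"^[A-Za-z0-9/][A-Za-z0-9\.\-\_/]{1,250}$"


def name_is_valid(pattern: str, name: str) -> bool:
    """Return True if the DID name matches the given schema pattern."""
    return re.match(pattern, name) is not None


for candidate in ("baz.dat", "foo/bar/baz.dat"):
    print(
        candidate,
        "generic:", name_is_valid(GENERIC_NAME_PATTERN, candidate),
        "permissive:", name_is_valid(PERMISSIVE_NAME_PATTERN, candidate),
    )
```

The name without slashes passes both patterns, while 'foo/bar/baz.dat' only passes the hypothetical permissive one.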

@maxnoe
Contributor Author

maxnoe commented Apr 2, 2024

What is the reason the default schema is so restrictive?

I think for a service like Rucio, it would make more sense to have the defaults be permissive and allow specific restrictions to be applied via policies.

It seems a bit weird to me to have a custom policy to allow more filenames.

One reason to do so would be unit tests: as those issues show, there were problems with filenames that Rucio allows in principle but not in its default configuration.

@maxnoe
Contributor Author

maxnoe commented Apr 2, 2024

The fact that two large experiments need to allow this also clearly shows that it is a common use case.

@maxnoe
Contributor Author

maxnoe commented Jun 17, 2024

In the meantime, we created our policy plugin allowing slashes in the NAME regex.

However, uploading fails with the error:

>       assert upload_client.upload([upload_spec]) == 0

tests/bdms/test_basic_rucio_functionality.py:72: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.9/site-packages/rucio/client/uploadclient.py:211: in upload
    self._register_file(file, registered_dataset_dids, ignore_availability=ignore_availability, activity=activity)
/usr/local/lib/python3.9/site-packages/rucio/client/uploadclient.py:425: in _register_file
    meta = self.client.get_metadata(file_scope, file_name)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <rucio.client.client.Client object at 0x7fe1be8f6bb0>, scope = 't_20240617_142502_a76c', name = '/ctao.dpps.test/t_20240617_142502_a76c/file_aiv_basic', plugin = 'DID_COLUMN'

    def get_metadata(self, scope, name, plugin='DID_COLUMN'):
        """
        Get data identifier metadata
    
        :param scope: The scope name.
        :param name: The data identifier name.
        :param plugin: Backend Metadata plugin the Rucio server should use to query data.
        """
        path = '/'.join([self.DIDS_BASEURL, quote_plus(scope), quote_plus(name), 'meta'])
        url = build_url(choice(self.list_hosts), path=path)
        payload = {}
        payload['plugin'] = plugin
        r = self._send_request(url, type_='GET', params=payload)
        if r.status_code == codes.ok:
            meta = self._load_json_data(r)
            return next(meta)
        else:
            exc_cls, exc_msg = self._get_exception(headers=r.headers, status_code=r.status_code, data=r.content)
>           raise exc_cls(exc_msg)
E           rucio.common.exception.RucioException: An unknown exception occurred.
E           Details: no error information passed (http status code: 404)

/usr/local/lib/python3.9/site-packages/rucio/client/didclient.py:427: RucioException
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB post_mortem >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /usr/local/lib/python3.9/site-packages/rucio/client/didclient.py(427)get_metadata()
-> raise exc_cls(exc_msg)
(Pdb) url
'https://rucio-server/dids/t_20240617_142502_a76c/%2Fctao.dpps.test%2Ft_20240617_142502_a76c%2Ffile_aiv_basic/meta'
(Pdb) 

Seeing that the rucio server container has the option "RUCIO_HTTPD_ENCODED_SLASHES", I tried enabling that and now it works.
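
The URL in the debugger output can be reproduced with the standard library alone, which may help explain the 404. The sketch below mirrors the `quote_plus` calls visible in the traceback; the `"dids"` base path is read off the printed URL rather than from the client source. Every `/` in the name becomes `%2F`, and Apache httpd is documented to answer 404 for paths containing encoded slashes unless `AllowEncodedSlashes` is enabled. That `RUCIO_HTTPD_ENCODED_SLASHES` controls this directive is an assumption, although it is consistent with the behaviour reported above.

```python
from urllib.parse import quote_plus, unquote

# Values taken from the traceback above.
scope = "t_20240617_142502_a76c"
name = "/ctao.dpps.test/t_20240617_142502_a76c/file_aiv_basic"

# Same construction as in get_metadata(): each component is encoded
# separately, so the slashes inside the name are turned into %2F.
path = "/".join(["dids", quote_plus(scope), quote_plus(name), "meta"])
print(path)

# Decoding the name component restores the original LFN with its slashes.
encoded_name = quote_plus(name)
print(unquote(encoded_name) == name)
```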

@maxnoe
Contributor Author

maxnoe commented Jun 17, 2024

While uploading files now works, uploading datasets fails with a strange error:

test_scope = 't_20240617_154325_bed7', tmp_path = PosixPath('/tmp/pytest-of-user/pytest-7/test_replication0')

    @pytest.mark.usefixtures("_voms_proxy")
    def test_replication(test_scope, tmp_path):
        name = "transfer_test"
        # dataset_lfn = name
        dataset_lfn = f"/ctao.dpps.test/{test_scope}/{name}"
        file_lfn = f"/ctao.dpps.test/{test_scope}/{name}.dat"
    
        path = tmp_path / f"{name}.dat"
        path.write_text("I am a test for replication rules.")
    
    
        main_rse = "STORAGE-1"
        replica_rse = "STORAGE-2"
    
        client = Client()
        upload_client = UploadClient()
        did_client = DIDClient()
        rule_client = RuleClient()
        replica_client = ReplicaClient()
    
        upload_spec = {
            "path": path,
            "rse": main_rse,
            "did_scope": test_scope,
            "did_name": file_lfn,
        }
        # 0 = success
        assert upload_client.upload([upload_spec]) == 0
>       assert did_client.add_dataset(scope=test_scope, name=dataset_lfn)

tests/bdms/test_basic_rucio_functionality.py:127: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.9/site-packages/rucio/client/didclient.py:144: in add_dataset
    return self.add_did(scope=scope, name=name, did_type='DATASET',
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <rucio.client.didclient.DIDClient object at 0x7f4701e05310>, scope = 't_20240617_154325_bed7', name = '/ctao.dpps.test/t_20240617_154325_bed7/transfer_test', did_type = 'DATASET', statuses = None
meta = None, rules = None, lifetime = None, dids = None, rse = None

    def add_did(self, scope, name, did_type, statuses=None, meta=None, rules=None, lifetime=None, dids=None, rse=None):
        """
        Add data identifier for a dataset or container.
    
        :param scope: The scope name.
        :param name: The data identifier name.
        :param did_type: The data identifier type (file|dataset|container).
        :param statuses: Dictionary with statuses, e.g.g {'monotonic':True}.
        :param meta: Meta-data associated with the data identifier is represented using key/value pairs in a dictionary.
        :param rules: Replication rules associated with the data identifier. A list of dictionaries, e.g., [{'copies': 2, 'rse_expression': 'TIERS1'}, ].
        :param lifetime: DID's lifetime (in seconds).
        :param dids: The content.
        :param rse: The RSE name when registering replicas.
        """
        path = '/'.join([self.DIDS_BASEURL, quote_plus(scope), quote_plus(name)])
        url = build_url(choice(self.list_hosts), path=path)
        # Build json
        data = {'type': did_type}
        if statuses:
            data['statuses'] = statuses
        if meta:
            data['meta'] = meta
        if rules:
            data['rules'] = rules
        if lifetime:
            data['lifetime'] = lifetime
        if dids:
            data['dids'] = dids
        if rse:
            data['rse'] = rse
        r = self._send_request(url, type_='POST', data=render_json(**data))
        if r.status_code == codes.created:
            return True
        else:
            exc_cls, exc_msg = self._get_exception(headers=r.headers, status_code=r.status_code, data=r.content)
>           raise exc_cls(exc_msg)
E           rucio.common.exception.InvalidObject: Provided object does not match schema.
E           Details: Problem validating did: 't_20240617_154325_bed7/ctao.dpps.test/t_20240617_154325_bed7' does not match '^[a-zA-Z_\\-.0-9]{1,25}$'
E           
E           Failed validating 'pattern' in schema['properties']['scope']:
E               {'description': 'Scope name',
E                'pattern': '^[a-zA-Z_\\-.0-9]{1,25}$',
E                'type': 'string'}
E           
E           On instance['scope']:
E               't_20240617_154325_bed7/ctao.dpps.test/t_20240617_154325_bed7'

/usr/local/lib/python3.9/site-packages/rucio/client/didclient.py:116: InvalidObject

Although extract_scope is configured on both client and server to be the one from the policy package, something seems to go wrong for the dataset.

@maxnoe
Contributor Author

maxnoe commented Jun 17, 2024

Here is a simpler test using the CLI:

[user@5646f0a48745 dpps-docker-compose]$ rucio add-dataset /ctao.dpps.test/test/foo
2024-06-17 15:45:17,215	ERROR	Provided object does not match schema.
Details: Problem validating did: 'test/ctao.dpps.test/test' does not match '^[a-zA-Z_\\-.0-9]{1,25}$'

Failed validating 'pattern' in schema['properties']['scope']:
    {'description': 'Scope name',
     'pattern': '^[a-zA-Z_\\-.0-9]{1,25}$',
     'type': 'string'}

On instance['scope']:
    'test/ctao.dpps.test/test'

@maxnoe
Contributor Author

maxnoe commented Jun 17, 2024

It seems the reason is that the server uses SCOPE_NAME_REGEXP from the schema instead of the extract_scope function...

Why are there two implementations of this?

@maxnoe
Contributor Author

maxnoe commented Jun 20, 2024

With the help of @cserf, I figured out that SCOPE_NAME_REGEXP is needed because the DID client submits requests to the URL {scope}/{lfn}, which in our case is now something like:

foo_scope/ctao.dpps.test/foo_scope/path/to/data

Taking the SCOPE_NAME_REGEXP from the belleii schema also worked for us:

SCOPE_NAME_REGEXP = "/([^/]*)(?=/)(.*)"
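
To illustrate why this pattern helps, here is a minimal sketch of how a `{scope}/{lfn}` URL part could be split back into scope and name. Several things are assumptions and not confirmed by this thread: `split_scope_name` is a hypothetical helper (not a Rucio function), the server is assumed to prepend a `/` before matching, and `GREEDY_SCOPE_NAME_REGEXP` is a stand-in for a greedy default pattern. The belleii pattern and the scope pattern are the ones quoted above.

```python
import re

# Pattern quoted in this thread (taken from the belleii schema).
BELLEII_SCOPE_NAME_REGEXP = r"/([^/]*)(?=/)(.*)"
# ASSUMPTION: a greedy two-group pattern used as a stand-in for the default.
GREEDY_SCOPE_NAME_REGEXP = r"/(.*)/(.*)"
# Scope pattern quoted in the InvalidObject error message.
SCOPE_PATTERN = r"^[a-zA-Z_\-.0-9]{1,25}$"


def split_scope_name(pattern: str, url_part: str) -> tuple[str, str]:
    """Hypothetical helper: split '{scope}/{lfn}' into (scope, name)."""
    match = re.match(pattern, "/" + url_part)
    if match is None:
        raise ValueError(f"cannot split {url_part!r} with {pattern!r}")
    return match.group(1), match.group(2)


# Scope 'test' followed by the LFN '/ctao.dpps.test/test/foo' from the CLI example.
url_part = "test" + "/ctao.dpps.test/test/foo"

greedy_scope, greedy_name = split_scope_name(GREEDY_SCOPE_NAME_REGEXP, url_part)
strict_scope, strict_name = split_scope_name(BELLEII_SCOPE_NAME_REGEXP, url_part)

for label, scope, name in (
    ("greedy", greedy_scope, greedy_name),
    ("belleii", strict_scope, strict_name),
):
    valid = re.match(SCOPE_PATTERN, scope) is not None
    print(f"{label}: scope={scope!r} name={name!r} scope_valid={valid}")
```

Under these assumptions, the greedy pattern puts everything up to the last slash into the scope, which yields the same invalid scope string as in the CLI error above, while the belleii pattern stops the scope at the first slash and leaves the rest, including the leading `/`, as the name.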

@maxnoe
Contributor Author

maxnoe commented Jun 20, 2024

So I guess this is about documentation mainly now.

Since this is mostly the configuration needed to get Rucio and DIRAC to work together, it's not really obvious whether such documentation should live in the DIRAC or the Rucio docs. Could the required Rucio configuration go here and link to the required DIRAC configuration there?

@bari12
Member

bari12 commented Jun 26, 2024

I would add it to the Rucio docs and then link to the important parts of the DIRAC docs. @maxnoe, if you could open a PR for this, it would be very helpful. I will move the issue to the documentation repo.

@bari12 bari12 transferred this issue from rucio/rucio Jun 26, 2024
@voetberg voetberg self-assigned this Dec 2, 2024