Add prefix to data file path #7

Open
Redmar-van-den-Berg opened this issue Nov 30, 2021 · 3 comments

@Redmar-van-den-Berg

I have a project where input filenames cannot be derived from sample attributes, but are determined by the outside provider. I would still like to use PEP to simplify my sample list.

Is there a way to simply add a prefix to the path for a data file?
For example, I would like to have project-id/external_id_L001.fastq.gz in my sample configuration, and then expand it into /current/path/for/data/files/project-id/external_id_L001.fastq.gz. This seems like a simple enough use case. Is this possible using PEP?

@nsheff
Contributor

nsheff commented Nov 30, 2021

Yes, this is very easy to do, and a perfect use case for PEP. The best way to do this would be:

  1. Set a column in your sample table named external_id.
  2. Set up a derived attribute to populate your path, with source: /current/path/for/data/files/project-id/{external_id}_{lane}.fastq.gz

For an example you could look at the example_derive example in this repository. Your config file could look something like this:

sample_modifiers:
  derive:
    attributes: [file_path]
    sources:
      external_data: /current/path/for/data/files/project-id/{external_id}_L001.fastq.gz

and your sample table would be:


sample_name,protocol,organism,time,file_path,external_id
pig_0h,RRBS,pig,0,external_data,xowyn3
pig_1h,RRBS,pig,1,external_data,vosish2
frog_0h,RRBS,frog,0,external_data,vi2on3
frog_1h,RRBS,frog,1,external_data,290n34
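
To make the substitution concrete, here is a minimal Python sketch of what the derive step does conceptually with the config and table above. It is not peppy's implementation: the helper name `derive_attribute` and the inlined data are assumptions made for illustration only.

```python
import csv
import io

# First two rows of the sample table above, inlined for illustration.
SAMPLE_TABLE = """\
sample_name,protocol,organism,time,file_path,external_id
pig_0h,RRBS,pig,0,external_data,xowyn3
pig_1h,RRBS,pig,1,external_data,vosish2
"""

# Mirrors sample_modifiers.derive.sources in the config above.
SOURCES = {
    "external_data": "/current/path/for/data/files/project-id/{external_id}_L001.fastq.gz",
}


def derive_attribute(sample, attribute, sources):
    """Replace sample[attribute] (a source key) with the filled-in template.

    Hypothetical helper: it only imitates the idea behind the derive modifier.
    """
    key = sample[attribute]
    if key not in sources:
        return sample  # not a known source key, so leave the value alone
    derived = dict(sample)
    # Placeholders such as {external_id} are filled from the sample's own columns.
    derived[attribute] = sources[key].format(**sample)
    return derived


samples = [
    derive_attribute(row, "file_path", SOURCES)
    for row in csv.DictReader(io.StringIO(SAMPLE_TABLE))
]
for s in samples:
    print(s["sample_name"], s["file_path"])
```

In this sketch, the `file_path` of `pig_0h` becomes `/current/path/for/data/files/project-id/xowyn3_L001.fastq.gz`, while the `external_id` column is left untouched.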

If you would rather keep more of the path in the sample table, you can also do something like this:

sample_modifiers:
  derive:
    attributes: [file_path]
    sources:
      external_data: /current/path/for/data/files/{external_local_path}

and your sample table would be:

sample_name,protocol,organism,time,file_path,external_local_path
pig_0h,RRBS,pig,0,external_data,project-id/ext1_L001.fastq.gz
pig_1h,RRBS,pig,1,external_data,project-id/ext2_L001.fastq.gz
frog_0h,RRBS,frog,0,external_data,project-id/ext3_L001.fastq.gz
frog_1h,RRBS,frog,1,external_data,project-id/ext4_L001.fastq.gz

@Redmar-van-den-Berg
Author

Thanks, I was hoping to be able to do this in one go, without adding any new 'dummy' columns to the sample table. But I guess having a file_path and local_path that together make the full path also works.

Would it be possible to add the functionality to derive values from themselves? I don't think anything in the docs disallows something like file_path: /path/to/stuff/{file_path}, and it would make PEPs a bit cleaner.

@nsheff
Contributor

nsheff commented Nov 30, 2021

> Thanks, I was hoping to be able to do this in one go, without adding any new 'dummy' columns to the sample table.

Actually, sure, you can also do that by just using the append modifier, and then just derive that same attribute. Like this:

sample_modifiers:
  append:
    file_path: external_data
  derive:
    attributes: [file_path]
    sources:
      external_data: /current/path/for/data/files/{external_local_path}

This works if your samples all have the same data path. You'd have to use an imply modifier if you want it to vary across samples.
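
The order matters here: the constant is attached first, and only then is it expanded. Below is a rough Python sketch of that flow, plus the imply variant for per-sample sources. It is not peppy's code. The helpers `append`, `imply` and `derive` only imitate the modifiers, the "do not overwrite existing attributes" behaviour of `append` is an assumption, and the `pig_data`/`frog_data` source keys and `/data/...` paths are invented for the example.

```python
def append(sample, constants):
    """Add constant attributes that the sample does not already have (assumed behaviour)."""
    out = dict(sample)
    for name, value in constants.items():
        out.setdefault(name, value)
    return out


def imply(sample, rules):
    """Set attributes when every condition of a rule matches the sample."""
    out = dict(sample)
    for rule in rules:
        if all(out.get(k) == v for k, v in rule["if"].items()):
            out.update(rule["then"])
    return out


def derive(sample, attributes, sources):
    """Swap a source key for its template, filled from the sample's attributes."""
    out = dict(sample)
    for attr in attributes:
        template = sources.get(out.get(attr))
        if template is not None:
            out[attr] = template.format(**out)
    return out


# Case 1: every sample shares one data path, so append + derive is enough.
pig = {"sample_name": "pig_0h", "organism": "pig",
       "external_local_path": "project-id/ext1_L001.fastq.gz"}
shared_sources = {"external_data": "/current/path/for/data/files/{external_local_path}"}
same_path = derive(append(pig, {"file_path": "external_data"}), ["file_path"], shared_sources)
print(same_path["file_path"])

# Case 2: the source depends on another column, so imply picks the key per sample.
rules = [
    {"if": {"organism": "pig"}, "then": {"file_path": "pig_data"}},
    {"if": {"organism": "frog"}, "then": {"file_path": "frog_data"}},
]
per_organism_sources = {  # hypothetical locations
    "pig_data": "/data/pig/{external_local_path}",
    "frog_data": "/data/frog/{external_local_path}",
}
frog = {"sample_name": "frog_0h", "organism": "frog",
        "external_local_path": "project-id/ext3_L001.fastq.gz"}
varied = derive(imply(frog, rules), ["file_path"], per_organism_sources)
print(varied["file_path"])
```

In both cases the sample table itself needs no `file_path` column; only `external_local_path` comes from the provider's naming.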

> Would it be possible to add the functionality to derive values from themselves?

Interesting. I don't even know if this would work -- it may work. I have never tried it. Did you try it? It does seem convenient but I'm not sure I like the overloading of the attribute like that... it's nice to have access to the original file_path variable on your sample object if you need it for something.

I think the above method of combining an append with derive satisfies the same use case... it's admittedly not quite as simple as what you propose, but it has the advantage of keeping the original value intact, so I think that might be preferable...
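
The trade-off can be illustrated in plain Python. Whether peppy accepts a self-referential template is not established in this thread, so the sketch below makes no claim about the library; the `fill` helper and both templates are hypothetical.

```python
TEMPLATE_SELF = "/path/to/stuff/{file_path}"                # proposed self-derivation
TEMPLATE_SEPARATE = "/path/to/stuff/{external_local_path}"  # append + derive style


def fill(sample, template):
    """Write the filled-in template into file_path, leaving other attributes alone."""
    out = dict(sample)
    out["file_path"] = template.format(**sample)
    return out


relative = "project-id/ext1_L001.fastq.gz"

# Self-derivation: the relative path is replaced by the full path.
overwritten = fill({"sample_name": "pig_0h", "file_path": relative}, TEMPLATE_SELF)

# Separate column: the relative path is still available afterwards.
kept = fill({"sample_name": "pig_0h", "external_local_path": relative}, TEMPLATE_SEPARATE)

print(overwritten["file_path"])
print(relative in overwritten.values())         # original value no longer present
print(kept["external_local_path"] == relative)  # original value still present
```

Both variants produce the same full path; they differ only in whether the provider's relative path remains accessible on the sample afterwards.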
