Y24-300 - Addition of anonymous ids for public data sharing - existing data #1896

Open
2 tasks
KatyTaylor opened this issue Sep 5, 2024 · 10 comments · May be fixed by sanger/sequencescape#4594
Labels: Enhancement (New feature or request), RVI (RVI project), Size: M (Medium - medium effort & risk), Value: 4 (Value to the institute is high)

Comments

@KatyTaylor
Contributor

KatyTaylor commented Sep 5, 2024

User story
As a member of GSU, I would like RVI samples already imported into Sequencescape to have an anonymous id added to them, so that when they are shared in public databases, the id does not give any unnecessary information.

Who are the primary contacts for this story
Anna G, Adrianne L

Who is the nominated tester for UAT
Anna G, Ya-Lin H

Acceptance criteria
To be considered successful the solution must allow:

  • Samples previously imported into Sequencescape for RVI should have an id created and inserted into the public name (?) field
  • Id is created using Baracoda (RVI prefix)

References
This story has a non-blocking relationship with:

Additional context
The data for these already imported samples has not yet been released.

Original request from Anna: "GSU would like to apply additional level of security in relation to sample IDs. This will need to be applied to future and all retrospective samples within RVI." (was split into 3 stories).

N.B. Is it worth checking before doing this story if the Sanger Sample Id would be sufficient? Is this used for existing samples in other studies? We discussed using the accession number but think this is not appropriate because it is specific to EBI and this will potentially be released to other databases as well.

@psd-issuer (bot) changed the title from "Addition of anonymous ids for public data sharing - existing data" to "Y24-300 - Addition of anonymous ids for public data sharing - existing data" on Sep 5, 2024
@KatyTaylor added the On Hold, RVI, and Enhancement labels on Sep 5, 2024
@TWJW-SANGER

I know we discussed using the WSI prefix for the anonymous ID; could it be the RVI one instead? It's still anonymous enough not to reveal location.

@TWJW-SANGER

TWJW-SANGER commented Oct 1, 2024

GSU agree that the RVI-prefixed Sanger Sample ID can be used.

@SujitDey2022

Team discussion, 13 Nov 2024

Assumptions:

  1. Run a script to update the IDs
  2. This applies only to RVI samples, not Heron samples
  3. The public name field is used only for RVI, not for Heron

How do we identify the samples/Study that need to be updated?

@SujitDey2022 SujitDey2022 added the Size: M Medium - medium effort & risk label Nov 13, 2024
@dasunpubudumal dasunpubudumal self-assigned this Nov 22, 2024
@dasunpubudumal
Contributor

dasunpubudumal commented Nov 22, 2024

@KatyTaylor There's an attribute in the sample_metadata table called sample_public_name, which is where the manifest's PUBLIC NAME value is stored. Is this the target for the new ID?

@dasunpubudumal dasunpubudumal removed their assignment Dec 9, 2024
@BenTopping
Contributor

BenTopping commented Dec 9, 2024

Outstanding questions:

  • How do we determine which samples are RVI samples? Are they all under particular studies?
  • If we update the sanger_sample_id values, that will break all RVI manifests (they will have the old ids). Does this matter?
  • Should this be done after the manifest story so that we don't get intermediate data created after this data patch and before the manifest patch?

@TWJW-SANGER

TWJW-SANGER commented Dec 10, 2024

Quick clarifications.

  • How do we determine which samples are RVI samples? Are they all under particular studies?
    Yes, they will be identified by belonging to the "RVI Program - Bait Capture" study.

  • If we update the sanger_sample_id values, that will break all RVI manifests (they will have the old ids). Does this matter?
    We are not changing the sanger_sample_id values.
    We would like to set the sample's "Public Name" field to be identical to the sample's "Sanger Sample ID" field for these samples.

  • Should this be done after the manifest story so that we don't get intermediate data created after this data patch and before the manifest patch?
    Good point. Yes.
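
A minimal sketch of the rule described above, written in Python for illustration rather than as Sequencescape code. The `Sample` record and `patch_public_names` function are hypothetical, and the sample ids are made up; only the study name and the two field names come from this thread.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

RVI_STUDY_NAME = "RVI Program - Bait Capture"  # study name given in this thread


@dataclass
class Sample:
    """Hypothetical stand-in for a sample and its metadata (not the real model)."""
    sanger_sample_id: str
    study_name: str
    sample_public_name: Optional[str] = None


def patch_public_names(samples: Iterable[Sample]) -> int:
    """Copy sanger_sample_id into sample_public_name for RVI samples only.

    sanger_sample_id is never modified. Returns the number of samples changed.
    """
    changed = 0
    for sample in samples:
        if sample.study_name != RVI_STUDY_NAME:
            continue  # samples from other studies (e.g. Heron) are left untouched
        if sample.sample_public_name != sample.sanger_sample_id:
            sample.sample_public_name = sample.sanger_sample_id
            changed += 1
    return changed


# Made-up example data.
samples = [
    Sample("RVI123", RVI_STUDY_NAME, "supplier-name-1"),
    Sample("RVI_456", RVI_STUDY_NAME),
    Sample("HERON1", "Some other study", "keep-me"),
]
print(patch_public_names(samples))  # prints 2
```

Running the function a second time over the same samples changes nothing, so a patch written this way could be re-run safely if it had to be repeated after the manifest story.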

@neilsycamore

If the sample has been accessioned, then whatever identifier is added to the 'name' and then added as the public name would potentially (if data is pushed to the ENA) expose the 'new id', if the public name is displayed as the title of the accessioned sample.

@TWJW-SANGER

@neilsycamore The issue they are trying to address is that they don't have an anonymous ID they can use for publishing data. Their sample supplier uses ids that are identifiable and they automatically transferred these over to SciOps / SequenceScape.

The accessioning logic, I believe, uses the public name if present and then falls back to the supplier id?

So, by setting an anonymous id in the Public Name field they can safely publish. We thought about allocating a new id with baracoda but it seems that in this case the SequenceScape name is sufficiently anonymous.

Does that help explain things? Are there any concerns with this approach?
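
The fallback described in this comment can be sketched as follows. This only mirrors the behaviour as it is believed to work here; it has not been checked against the Sequencescape accessioning code, and `published_name` is a hypothetical helper.

```python
from typing import Optional


def published_name(public_name: Optional[str], supplier_name: Optional[str]) -> Optional[str]:
    """Return the name that would be exposed on publication.

    Assumption from this thread: the public name wins when it is set,
    otherwise the (possibly identifiable) supplier name is used.
    """
    if public_name and public_name.strip():
        return public_name
    return supplier_name


# Without a public name, the supplier's identifiable id would be used.
print(published_name(None, "identifiable-supplier-id"))
# With an anonymous public name set, that is used instead.
print(published_name("RVI123", "identifiable-supplier-id"))
```

Under that assumption, filling in the public name for every RVI sample means the supplier id is never reached.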

@sabrine33 sabrine33 self-assigned this Dec 16, 2024
@sabrine33 sabrine33 added the Value: 4 Value to the institute is high label Dec 16, 2024
@sabrine33
Contributor

sabrine33 commented Dec 16, 2024

@TWJW-SANGER, @KatyTaylor I’ve been looking into this, and I noticed that some samples belonging to the RVI Program - Bait Capture study have their sanger_sample_id prefixed with "RandD_" (e.g., RandD_RVIxxxxxx).

I’m unsure how to handle the public names for these samples. Should we remove the "RandD_" prefix to align with the acceptance criteria, or should we leave them as they are for now?

Out of 5055 records, only 23 have their sanger_sample_id prefixed with "RandD_," which I found a bit confusing. Additionally, there doesn’t seem to be much consistency in the use of the RVI prefix—some samples are prefixed with "RVI," while others use "RVI_."
I’m not sure if this is relevant, but I thought it was worth mentioning. I also looked at the date column to try and make sense of it, but that didn’t provide any clear insights.
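
For anyone wanting to reproduce this kind of breakdown, a rough sketch of how the prefixes could be tallied. The ids below are made up, and `prefix_of` and `count_prefixes` are hypothetical helpers, not part of Sequencescape.

```python
from collections import Counter
from typing import Iterable

# Checked in order, so "RandD_" and "RVI_" are matched before the shorter "RVI".
PREFIXES = ("RandD_", "RVI_", "RVI")


def prefix_of(sanger_sample_id: str) -> str:
    """Return the first known prefix the id starts with, or "other"."""
    for prefix in PREFIXES:
        if sanger_sample_id.startswith(prefix):
            return prefix
    return "other"


def count_prefixes(ids: Iterable[str]) -> Counter:
    """Count how many ids fall under each prefix."""
    return Counter(prefix_of(sample_id) for sample_id in ids)


# Made-up ids covering the three patterns mentioned above.
ids = ["RVI000001", "RVI_000002", "RandD_RVI000003", "RVI000004"]
for prefix, count in count_prefixes(ids).items():
    print(prefix, count)
```

The order of `PREFIXES` matters: testing for "RVI" first would also swallow every "RVI_" id.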

@TWJW-SANGER

Good spot.

My thoughts below:

  • I don't think it would be a good idea to change the sanger_sample_id, as this is likely linked to in various ways in different systems.
  • I suspect that the R&D samples will not be part of the data release, so it won't matter what their public name is, as it will never be exported.
  • "RVI" vs "RVI_" is irritating, but it still satisfies the requirement to be anonymous, and that inconsistency will already be present in other systems that feed into SequenceScape.

I suggest that we continue as planned but I will send an email to Adrianne and Ya-Lin to cover these points and CC you both in.
