Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not Understanding Upload Code #151

Open
gbdubs opened this issue Jan 20, 2024 · 7 comments
Open

Not Understanding Upload Code #151

gbdubs opened this issue Jan 20, 2024 · 7 comments
Assignees
Labels
documentation Discussion or improvements to documentation

Comments

@gbdubs
Copy link
Contributor

gbdubs commented Jan 20, 2024

app/cmd/runner/main.go

Lines 468 to 484 in b743bc9

dirEntries, err := os.ReadDir(reportDir)
if err != nil {
return fmt.Errorf("failed to read report directory: %w", err)
}
var artifacts []*task.AnalysisArtifact
for _, dirEntry := range dirEntries {
if !dirEntry.IsDir() {
continue
}
dirPath := filepath.Join(reportDir, dirEntry.Name())
tmp, err := h.uploadDirectory(ctx, dirPath, h.reportContainer)
if err != nil {
return fmt.Errorf("failed to upload report directory: %w", err)
}
artifacts = tmp
}

Three questions about this code - there are some fundamental things I'm not understanding:
1 - Won't it miss files that directly live in the reportDir, if any exist?
2 - Why do we list dirEntries when we already do a walk of the dirs in uploadDirectory? I.e. couldn't this section of code just be replaced with a single call to uploadDirectory (which would fix 1 if that's correct)
3 - Doesn't this code only return the artifacts from the last dirEntry, overwriting anything prior to the last?

@bcspragu Would you mind helping me understand this?

@bcspragu
Copy link
Collaborator

1 - Won't it miss files that directly live in the reportDir, if any exist?

That would indeed be the case, this is exploiting the fact that we know how the RMI scripts work (see here), and that they put all the report artifacts in a portfolio_id directory:

https://github.com/AlexAxthelm/stub_PACTA_containers_for_app/blob/190763cba67bafdf666f56eb34693da728fea21b/report_runner/create_report.R#L52-L66

One could argue this is part of the API contract.

2 - Why do we list dirEntries when we already do a walk of the dirs in uploadDirectory? I.e. couldn't this section of code just be replaced with a single call to uploadDirectory (which would fix 1 if that's correct)

Similar answer to above - We expect a certain structure, so it's two steps:

  1. List directories (e.g. look for created report directories)
  2. For each directory that we found, upload the assets

We intentionally don't upload things like /mnt/reports/<random file>, because that's not something we expect to be there.

3 - Doesn't this code only return the artifacts from the last dirEntry, overwriting anything prior to the last?

Indeed, we currently assume there's only one report (e.g. artifacts = tmp), but I don't know if that's an invariant we plan on holding (probably?). I'm not sure if I thought about this all that much when I wrote it, should probably be made more explicit/opinionated (e.g. only one report dir, or N and we preserve them all correctly)

@gbdubs
Copy link
Contributor Author

gbdubs commented Jan 20, 2024

OK this makes sense - I think we need to talk with the RMI team about this - because we already have situations where we're running processes for multiple logical portfolios (PortfolioGroups, Initiatives), but I don't know whether the outputs there are 1:1 with inputs or Many:1 (which was my assumption).

@hodie
Copy link

hodie commented Jan 22, 2024

@AlexAxthelm @cjyetman to chime in!

@cjyetman
Copy link
Member

Not sure I totally understand what we're talking about here, but... if we're talking about the directory where all the output files are for the interactive report, I would think it would be ideal for any file that ends up in there be transferred to wherever it needs to go to be accessible to the user when viewing their report, because it's not necessarily a fixed set of files, e.g. some initiatives require certain files for their reports that are not required for "GENERAL" reports, and I can foresee wanting to add a new plot or something that might require new files and it would be ideal that doing so would not require changing the server code. Am I talking about the same thing?

@AlexAxthelm
Copy link

AlexAxthelm commented Jan 24, 2024

they put all the report artifacts in a portfolio_id directory

This is an artifact of how we've been doing things on our previous system. I don't think we're married to this, if it would make life easier on your side. On the previous system, the files were organized in a user-level directory that get's mounted to the docker container, so the portfolio_id directory in reports is necessary to distinguish reports for different portfolios.

Simplified current schematic

User123
├── Portfolio
│   ├── 1.csv
│   └── 2.csv
├── Results
|   ├── 1
|   |   └── results.rds
|   └── 2
|       └── results.rds
└── Report
    ├── 1
    |   ├── foo.json
    |   └── index.html
    └── 2
        ├── foo.json
        └── index.html

If it would be better to have something different, I'm totally open to that. In fact somethin like this would align well with what we're thinking on our side for how we want to change the directory structure (cc @cjyetman @jdhoffa @hodie ):

Possible schematic

portfolio_1
├── 1.csv
├── results.rds
└── Report
    ├── foo.json
    └── index.html
portfolio_2
├── 2.csv
├── results.rds
└── Report
    ├── foo.json
    └── index.html

The only issue I'm seeing is in how this relates to #106, and if we need ot have a unique ID of some kind on the report directory, but if that's not a problem from your side, then it probably isn't from ours either.

@AlexAxthelm
Copy link

AlexAxthelm commented Jan 24, 2024

but I don't know whether the outputs there are 1:1 with inputs or Many:1 (which was my assumption).

The outputs right now are 1:1 (1 portfolio to 1 report) but that's because the current system does the grouping before it hits the PACTA code, and gives it to us as a single file. I think we're comfortable with the idea of doing a many:one mapping, along the lines of:

portfolio_group_xyz
├── 1.csv
├── 2.csv
├── group_definitions.json
├── results.rds
└── Report
    ├── foo.json
    └── index.html

where group_definitions.json is as described in #82

@gbdubs
Copy link
Contributor Author

gbdubs commented Feb 16, 2024

@bcspragu given the above, I think we probably want to change this code inline with Alex's suggestion - confirming that matches your expectations, since you've been doing the coupling w/ the docker image?

@gbdubs gbdubs added the documentation Discussion or improvements to documentation label Feb 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Discussion or improvements to documentation
Projects
None yet
Development

No branches or pull requests

5 participants