Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate portfolio processing into runner #45

Merged
merged 1 commit into from
Oct 30, 2023
Merged

Conversation

bcspragu
Copy link
Collaborator

This PR integrates the runner with the PACTA portfolio processing code, integrating with Azure blob storage to read + write portfolios and reports.

It creates a minimal set of pactasrv APIs and frontend pages to test out the process of 1) uploading portfolios and 2) running the PACTA processing on them.

I've tested the process_portfolio piece against local Docker (i.e. dockertask), but I haven't tested the create_report flow yet in it's current incarnation. This PR is already large enough that I'll add the relevant async task stuff in a follow-up PR.

This PR integrates the runner with the PACTA portfolio processing code, integrating with Azure blob storage to read + write portfolios and reports.

I've tested the `process_portfolio` piece against local Docker (i.e. `dockertask`), but I haven't tested the `create_report` flow yet in it's current incarnation. This PR is already large enough that I'll add the relevant async task stuff in a follow-up PR.
@bcspragu bcspragu requested a review from gbdubs October 30, 2023 15:33
@bcspragu bcspragu merged commit 0aa5fcb into main Oct 30, 2023
2 checks passed
@bcspragu bcspragu deleted the brandon/runner-blobs branch October 30, 2023 15:50
Copy link
Contributor

@gbdubs gbdubs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truly excellent PR. Thank you for writing this, it feels like the PACTA work is really coming together nicely.

cmd/runner/main.go Show resolved Hide resolved
cmd/runner/main.go Show resolved Hide resolved
cmd/runner/main.go Show resolved Hide resolved
return nil
}

func (h *handler) processPortfolio(ctx context.Context, req *task.ProcessPortfolioRequest) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very clear - thank you!

}

func (h *handler) processPortfolio(ctx context.Context, req *task.ProcessPortfolioRequest) error {
// Load the portfolio from blob storage, place it in /mnt/raw_portfolios, where
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/portfolio/portfolios/(many places)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually a question I had for you about how we want to architect async task running. I was originally thinking that the async runner would get a "portfolio group ID" or whatever, but that means it needs to have database access, to translate from "portfolio group X" to "list of assets in portfolios"

I'd like it to be as dumb as possible because it's a large Docker image with code from N different teams in N different languages, so ideally we could sculpt the interface to be "here are some files in blob store, do stuff to them and write them to this other blob store", which minimizes the permissions that we need to give the runner.

My question here is what the end-user experience/data model looks like and if that is a reasonable/achievable goal. Are there reasons we can't process portfolios within a portfolio group individually at this stage? Let's chat if it's not a simple answer.

paths = append(paths, strings.TrimSpace(line[idx+17:]))
}

// NOTE: This code could benefit from some concurrency, but I'm opting not to prematurely optimize.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thought here as above, maybe less important, since the cardinality is O(report outputs) not O(portfolio inputs).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, my knowledge of what the actual data steps look like is fuzzy, and I don't know what the discrete units of compute are


// Load the processed portfolio from blob storage, place it in /mnt/
// processed_portfolios, where the `create_report.R` script expects it
// to be.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably just missing context that you have - is a portfolio report always generated off of exactly one processed portfolio? I thought it (like above) could be multiple?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question! I actually don't know, and I expect this code will require a bunch of tweaks once Alex's stub interface gets reconciled with the actual code interface

cmd/runner/main.go Show resolved Hide resolved
bcspragu added a commit that referenced this pull request Nov 7, 2023
Specifically, add the `task_id` to log lines in the runner so we can figure out how long tasks are taking from the log output.
bcspragu added a commit that referenced this pull request Nov 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants