Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Providing General Purpose Dependency Analyzer #29

Open
pykenny opened this issue Oct 19, 2021 · 4 comments
Open

Providing General Purpose Dependency Analyzer #29

pykenny opened this issue Oct 19, 2021 · 4 comments

Comments

@pykenny
Copy link

pykenny commented Oct 19, 2021

Hi team, I'm trying to analyze table dependencies in a set of BigQuery queries, but I haven't found any lightweight and robust solution written in Python yet:

  • Google released their parser framework but seems that it's almost not runnable on common laptop
  • Didn't find any decent parser that works on all my queries (all verified as valid syntax in BigQuery console) 😿

Currently I just only need to pick out all the tables that appears in the query with no BigQuery view involved, which I think the approach proposed in this package (regex-matching potential candidates) looks promising for solving this type of question. Right now this package requires BigQuery access to retrieve query defined in the dataset with no offline support. Any possibility to provide module that extracts potential source dependency (table and view names), given a string that represents some BigQuery SELECT statement?

@pykenny pykenny changed the title Providing General Purpose Dependency Map Providing General Purpose Dependency Analyzer Oct 19, 2021
@christippett
Copy link
Contributor

I actually looked into Google’s zetasql library for this project and came to a similar conclusion you did, it’s very heavyweight if all you want is to extract entities from some SQL.

Obviously the regex we’re using is coupled to the inner workings of this project, but there’s no reason it couldn’t work as a standalone module. If I get some time over the next few days I’ll see if I can extract it into a separate package that’ll work how you want it. I’ve been wanting to revisit the regex anyway since I don’t think it covers all possible permutations for formatting a table/view.

@pykenny
Copy link
Author

pykenny commented Oct 29, 2021

Sounds great! Right at this moment I'm simply applying bqva.analyzer.find_query_objects on query strings to gather table information, then review dependencies between these queries manually.

Would be nice if it's possible to generate dependency graph automatically given a group of BQ queries, while users can opt in to verify views/tables under their project, or simply running the analysis offline.

@pykenny
Copy link
Author

pykenny commented Oct 29, 2021

As for my case tables were specified in format `[dataset].[table]` in queries (though `[dataset]`.`[table]` seems to be more preferred), so the current regex pattern is supposed to catch all the occurrences correctly.

@christippett
Copy link
Contributor

@pykenny, FYI I've recently had to divulge myself of my involvement with Servian's open source projects, but @TWinsnes or @polleyg will happily take your request on board and prioritise accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants