add meta-parser for json output #26
base: dev
Thank you for this PR. Your work is very interesting, and I am glad you found this project of some help. This code does not change the core parsing, so I suggest it stand as its own module. Do you have a test suite that goes along with it? It would be interesting to see whether it breaks with later versions of this parser. Going through this PR, I get the sense it pulls in features that others have asked for, although I am not yet certain what all the functions do.
Reformatted example for easier reading:
hi prime:
@LenaJava I am not certain this code will be included in this parser.
I am keeping this PR alive because it might be very useful: @RemilYoucef obviously needed it, you requested something similar, and I have noticed my other projects performing this same type of work. I want to come back here from time to time and think about how to proceed.
Here is a project that appears to pull metadata from the parse results: https://github.com/macbre/sql-metadata
This is a useful PR, at least as a reference. One thing we are quite interested in is taking a SQL query and determining the tables referenced within it, as part of a larger project to build a data catalog (think: someone could click on a table and then see all the types of queries that touch that table). We'd also need a way to fingerprint similar queries, but that is another matter (if you know of something that can fingerprint easily in Python, that would be useful; right now we use the Percona Golang fingerprinting tool).
@avaitla Thank you for mentioning "... way to fingerprint ...". I had some vague sense that even a partial PR can provide value, and you articulated it.
Overview
Dear Lahnakoski,
As part of my PhD thesis, I was very pleased to use your available, open-source SQL parser to preprocess our SQL queries so that they can be mined and used properly by our data mining algorithms. Our work has been accepted at a pioneering conference in Software Engineering, the International Conference on Automated Software Engineering (ASE, CORE A*). However, our article uses and references your archived repository.
Since your parser provides the SQL syntax tree as JSON, our extension consists of adding a meta-parser for that JSON output. The syntax tree is mined using a depth-first strategy to identify, for each query clause, its associated attributes. Moreover, we handle nested queries, and we replace each alias (temporary table name) in every attribute with the underlying table name, to differentiate between attributes that have the same name but belong to different tables or clauses. In this way, our tabular representation can be used for machine learning or data mining techniques. For more details, please refer to our repository.
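To make the idea concrete, here is a minimal sketch of a depth-first walk with alias replacement. It is not the code from this PR: the function names (`attributes_by_clause`, `walk`, `scan_from`, `resolve`) are hypothetical, and the input is a hand-written dict in roughly the shape the parser returns for a simple SELECT, which is an assumption rather than a guarantee. Joins and correlated subqueries are not handled, and attributes from a nested query are merged into the same clause lists as the outer query.

```python
from collections import defaultdict


def resolve(column, aliases):
    """Replace a leading alias ("o.id") with its table name ("orders.id")."""
    prefix, dot, rest = column.partition(".")
    if dot and prefix in aliases:
        return aliases[prefix] + "." + rest
    return column


def scan_from(from_clause, aliases, out):
    """Record table names and alias -> table mappings (joins are skipped)."""
    items = from_clause if isinstance(from_clause, list) else [from_clause]
    for item in items:
        if isinstance(item, str):
            out["from"].append(item)
        elif isinstance(item, dict):
            source = item.get("value")
            if isinstance(source, str):
                out["from"].append(source)
                if "name" in item:
                    aliases[item["name"]] = source
            elif isinstance(source, dict):
                # Subquery in FROM: recurse into it as a nested query.
                attributes_by_clause(source, out)


def walk(node, clause, aliases, out):
    """Depth-first walk that collects column names under the given clause."""
    if isinstance(node, str):
        out[clause].append(resolve(node, aliases))
    elif isinstance(node, list):
        for child in node:
            walk(child, clause, aliases, out)
    elif isinstance(node, dict):
        if "select" in node or "from" in node:
            attributes_by_clause(node, out)  # nested query
        elif "literal" in node:
            return  # assumed shape of string literals; not a column
        else:
            for key, child in node.items():
                if key != "name":  # "name" is an output alias, not a column
                    walk(child, clause, aliases, out)
    # Numbers, booleans and None are ignored.


def attributes_by_clause(query, out=None):
    """Return {clause: [attribute, ...]} for a parsed query dict."""
    out = defaultdict(list) if out is None else out
    aliases = {}
    scan_from(query.get("from", []), aliases, out)
    for clause, body in query.items():
        if clause != "from":
            walk(body, clause, aliases, out)
    return out


# Hand-written tree for:
#   SELECT o.id, c.name FROM orders AS o, customers AS c
#   WHERE o.customer_id = c.id AND o.total > 100
query = {
    "select": [{"value": "o.id"}, {"value": "c.name"}],
    "from": [{"value": "orders", "name": "o"},
             {"value": "customers", "name": "c"}],
    "where": {"and": [{"eq": ["o.customer_id", "c.id"]},
                      {"gt": ["o.total", 100]}]},
}
result = attributes_by_clause(query)
for clause, attributes in result.items():
    print(clause, attributes)
```

Each clause ends up with a flat list of table-qualified attributes, which is the kind of tabular input a mining algorithm can consume; the `from` list also answers the simpler question of which tables a query touches.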
Extensions
The new file ./mo-sql-parsing/mo_sql_parsing/json_parser.py is added; it contains the extended functions of the meta-parser. The meta-parser is called via the function parse_json(sql) in ./mo-sql-parsing/mo_sql_parsing/__init__.py.
Example
In what follows, we show through an example the difference between the outputs of the parse(sql) and parse_json(sql) functions. Let's consider the following query, recorded from one of the running database servers of our company, Infologic:
The resulting output of parse is:
However, the output of parse_json is: