add meta-parser for json output #26

Open
wants to merge 1 commit into
base: dev

@RemilYoucef RemilYoucef commented Aug 24, 2021

Overview

Dear Lahnakoski,

As part of my PhD thesis, I was very pleased to use your available, open-source SQL parser to preprocess our SQL queries so that our data mining algorithms can mine them properly. Our work has been accepted at a leading software engineering conference, the International Conference on Automated Software Engineering (ASE, CORE A*). However, our article used and referenced your archived repository.

Since your parser provides the SQL syntax tree as JSON, our extension adds a meta-parser for that JSON output. The meta-parser mines the syntax tree with a depth-first strategy to identify, for each query clause, its associated attributes. Moreover, we handle nested queries, and we replace aliases (temporary table names) in each attribute to differentiate between attributes that have the same name but belong to different tables or clauses. In this way, our tabular representation can be used with machine learning or data mining techniques. For more details, please refer to our repository.

Extensions

The new file ./mo-sql-parsing/mo_sql_parsing/json_parser.py contains the functions of the meta-parser. The meta-parser is called via the function parse_json(sql), exposed in ./mo-sql-parsing/mo_sql_parsing/__init__.py.

Example

In what follows, an example shows the difference between the output of the parse(sql) and parse_json(sql) functions. Consider the following query, recorded from one of the running database servers of our company, Infologic:

from mo_sql_parsing import parse, parse_json

query = 'select a.uex.ik from fr.infologic.stocks.cumuls.modele.lotulcumul as a join a.prod as p where a.uex.flagfictif = p1 and a.uex.ik in (select temp.ik from fr.infologic.global.outils.modele.tabletemp as temp) and a.dossierinfo.dosres = p3 group by a.uex.ik having ( count(distinct a.prod.ik ) = p2) and ( sum ( a.qteuelemappro ) = max(p.nbusurembal))'
parse(query)

The resulting output of parse is:

{'select': {'value': 'a.uex.ik'},
 'from': [{'value': 'fr.infologic.stocks.cumuls.modele.lotulcumul',
   'name': 'a'},
  {'join': {'name': 'p', 'value': 'a.prod'}}],
 'where': {'and': [{'eq': ['a.uex.flagfictif', 'p1']},
   {'in': ['a.uex.ik',
     {'select': {'value': 'temp.ik'},
      'from': {'value': 'fr.infologic.global.outils.modele.tabletemp',
       'name': 'temp'}}]},
   {'eq': ['a.dossierinfo.dosres', 'p3']}]},
 'groupby': {'value': 'a.uex.ik'},
 'having': {'and': [{'eq': [{'count': {'distinct': 'a.prod.ik'}}, 'p2']},
   {'eq': [{'sum': 'a.qteuelemappro'}, {'max': 'p.nbusurembal'}]}]}}

However, the output of parse_json is:

{'tables_from': ['fr.infologic.stocks.cumuls.modele.lotulcumul', 'fr.infologic.global.outils.modele.tabletemp'],
 'tables_join': ['fr.infologic.stocks.cumuls.modele.lotulcumul.prod'],
 'projections': ['fr.infologic.stocks.cumuls.modele.lotulcumul.uex.ik', 'fr.infologic.global.outils.modele.tabletemp.ik'],
 'attributes_where': ['fr.infologic.stocks.cumuls.modele.lotulcumul.uex.flagfictif', 
  'fr.infologic.stocks.cumuls.modele.lotulcumul.uex.ik', 'fr.infologic.global.outils.modele.tabletemp.ik', 
   'fr.infologic.stocks.cumuls.modele.lotulcumul.dossierinfo.dosres'],
 'attributes_groupby': ['fr.infologic.stocks.cumuls.modele.lotulcumul.uex.ik'],
 'attributes_orderby': [],
 'attributes_having': ['fr.infologic.stocks.cumuls.modele.lotulcumul.prod.ik',
  'fr.infologic.stocks.cumuls.modele.lotulcumul.qteuelemappro',
  'fr.infologic.stocks.cumuls.modele.lotulcumul.prod.nbusurembal'],
 'functions': ['count', 'sum', 'max']}
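
The alias replacement visible in this output can be illustrated with a small sketch. It is not the code in json_parser.py; the helper names `collect_aliases`, `expand`, and `attributes` are hypothetical, and the rule "keep only dotted names whose first segment is a known alias" is an assumption made here to skip parameters such as p2. The sketch walks the having clause of the parse result above depth first and does not handle the separate alias scope of nested queries.

```python
T = "fr.infologic.stocks.cumuls.modele.lotulcumul"

# FROM and HAVING clauses copied from the parse(query) result above.
from_clause = [
    {"value": T, "name": "a"},
    {"join": {"name": "p", "value": "a.prod"}},
]
having = {"and": [
    {"eq": [{"count": {"distinct": "a.prod.ik"}}, "p2"]},
    {"eq": [{"sum": "a.qteuelemappro"}, {"max": "p.nbusurembal"}]},
]}


def collect_aliases(clause):
    """Map each alias declared in a FROM clause to the name it stands for."""
    entries = clause if isinstance(clause, list) else [clause]
    aliases = {}
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        # A join entry wraps its target under a key such as "join";
        # matching any key that ends with "join" is an assumption.
        targets = [entry] + [v for k, v in entry.items() if k.endswith("join")]
        for target in targets:
            if (isinstance(target, dict)
                    and isinstance(target.get("name"), str)
                    and isinstance(target.get("value"), str)):
                aliases[target["name"]] = target["value"]
    return aliases


def expand(name, aliases):
    """Replace a leading alias by its target, repeating for chained aliases."""
    seen = set()
    head, _, rest = name.partition(".")
    while head in aliases and head not in seen:
        seen.add(head)  # guards against an alias that refers to itself
        name = aliases[head] + ("." + rest if rest else "")
        head, _, rest = name.partition(".")
    return name


def attributes(node, aliases):
    """Depth-first walk yielding alias-qualified names in expanded form."""
    if isinstance(node, str):
        if "." in node and node.partition(".")[0] in aliases:
            yield expand(node, aliases)
    elif isinstance(node, dict):
        for child in node.values():
            yield from attributes(child, aliases)
    elif isinstance(node, list):
        for child in node:
            yield from attributes(child, aliases)


aliases = collect_aliases(from_clause)
for name in attributes(having, aliases):
    print(name)
```

Because p is declared as a.prod and a as the lotulcumul table, p.nbusurembal is expanded in two steps, which gives the same three names listed under attributes_having above.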

@RemilYoucef RemilYoucef changed the title add meta-parser for json outputs add meta-parser for json output Aug 24, 2021
@klahnakoski
Owner

Thank you for this PR. Your work is very interesting, and I am glad you found this project of some help.

This code does not change the core parsing, so I suggest it stand as its own module. Would you happen to have a test suite that goes along with it? It would be interesting to see if it breaks with later versions of this parser.

Going through this PR, I get the sense it is pulling in features that others have expressed a desire for, although I am not yet certain what all the functions do.

@klahnakoski
Owner

Reformatted example for easier reading:

{
    "select": {"value": "a.uex.ik"},
    "from": [
        {"value": "fr.infologic.stocks.cumuls.modele.lotulcumul", "name": "a"},
        {"join": {"name": "p", "value": "a.prod"}},
    ],
    "where": {"and": [
        {"eq": ["a.uex.flagfictif", "p1"]},
        {"in": [
            "a.uex.ik",
            {
                "select": {"value": "temp.ik"},
                "from": {
                    "value": "fr.infologic.global.outils.modele.tabletemp",
                    "name": "temp",
                },
            },
        ]},
        {"eq": ["a.dossierinfo.dosres", "p3"]},
    ]},
    "groupby": {"value": "a.uex.ik"},
    "having": {"and": [
        {"eq": [{"count": {"distinct": "a.prod.ik"}}, "p2"]},
        {"eq": [{"sum": "a.qteuelemappro"}, {"max": "p.nbusurembal"}]},
    ]},
}

{
    "tables_from": [
        "fr.infologic.stocks.cumuls.modele.lotulcumul",
        "fr.infologic.global.outils.modele.tabletemp",
    ],
    "tables_join": ["fr.infologic.stocks.cumuls.modele.lotulcumul.prod"],
    "projections": [
        "fr.infologic.stocks.cumuls.modele.lotulcumul.uex.ik",
        "fr.infologic.global.outils.modele.tabletemp.ik",
    ],
    "attributes_where": [
        "fr.infologic.stocks.cumuls.modele.lotulcumul.uex.flagfictif",
        "fr.infologic.stocks.cumuls.modele.lotulcumul.uex.ik",
        "fr.infologic.global.outils.modele.tabletemp.ik",
        "fr.infologic.stocks.cumuls.modele.lotulcumul.dossierinfo.dosres",
    ],
    "attributes_groupby": ["fr.infologic.stocks.cumuls.modele.lotulcumul.uex.ik"],
    "attributes_orderby": [],
    "attributes_having": [
        "fr.infologic.stocks.cumuls.modele.lotulcumul.prod.ik",
        "fr.infologic.stocks.cumuls.modele.lotulcumul.qteuelemappro",
        "fr.infologic.stocks.cumuls.modele.lotulcumul.prod.nbusurembal",
    ],
    "functions": ["count", "sum", "max"],
}

@lenahi

lenahi commented Dec 11, 2021

hi prime:
I just want to know when the parse_json function is planned to be delivered.
Thanks.

@klahnakoski
Owner

@LenaJava I am not certain this code will be included in this parser.

  1. It requires no changes to the parser; it could be in its own project with a dependency on this one
  2. It has no tests - I need tests to ensure I do not break it when I make changes
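
One possible shape for such tests is a golden comparison built from the example in this PR's description. This is only a sketch: `golden_failures` is a hypothetical helper, and it takes the function under test as an argument because parse_json exists only on this PR's branch.

```python
T = "fr.infologic.stocks.cumuls.modele.lotulcumul"
TEMP = "fr.infologic.global.outils.modele.tabletemp"

# Query and expected result copied from the example in the PR description.
QUERY = (
    "select a.uex.ik from fr.infologic.stocks.cumuls.modele.lotulcumul as a "
    "join a.prod as p where a.uex.flagfictif = p1 and a.uex.ik in "
    "(select temp.ik from fr.infologic.global.outils.modele.tabletemp as temp) "
    "and a.dossierinfo.dosres = p3 group by a.uex.ik "
    "having ( count(distinct a.prod.ik ) = p2) "
    "and ( sum ( a.qteuelemappro ) = max(p.nbusurembal))"
)
EXPECTED = {
    "tables_from": [T, TEMP],
    "tables_join": [T + ".prod"],
    "projections": [T + ".uex.ik", TEMP + ".ik"],
    "attributes_where": [
        T + ".uex.flagfictif",
        T + ".uex.ik",
        TEMP + ".ik",
        T + ".dossierinfo.dosres",
    ],
    "attributes_groupby": [T + ".uex.ik"],
    "attributes_orderby": [],
    "attributes_having": [
        T + ".prod.ik",
        T + ".qteuelemappro",
        T + ".prod.nbusurembal",
    ],
    "functions": ["count", "sum", "max"],
}
GOLDEN_CASES = [(QUERY, EXPECTED)]

_MISSING = object()  # distinguishes an absent key from a key holding None


def golden_failures(parse_json, cases=GOLDEN_CASES):
    """Return (sql, differing_keys) for every case whose output changed."""
    failures = []
    for sql, expected in cases:
        actual = parse_json(sql)
        if not isinstance(actual, dict):
            failures.append((sql, ["<result is not a dict>"]))
            continue
        differing = sorted(
            key for key in expected.keys() | actual.keys()
            if actual.get(key, _MISSING) != expected.get(key, _MISSING)
        )
        if differing:
            failures.append((sql, differing))
    return failures


# On the PR branch this would presumably be called as:
#   from mo_sql_parsing import parse_json
#   assert golden_failures(parse_json) == []
```

Reporting the differing keys rather than a bare pass or fail makes it easier to see which clause a later parser change affected.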

I am keeping this PR alive because it might be very useful: @RemilYoucef obviously needed it. You requested something similar, and I noticed my other projects performing this same type of work. I want to come back here from time to time and think about how to proceed.

@klahnakoski
Owner

Here is a project that appears to pull metadata from the parse results: https://github.com/macbre/sql-metadata

@avaitla

avaitla commented Mar 26, 2022

This is a useful PR, at least as a reference. One thing we are quite interested in is taking a SQL query and determining the tables referenced within it, as part of a larger project to build a data catalog (think: someone could click on a table and then see all the types of queries that touch that table). We would also need a way to fingerprint similar queries, but that is another matter (if you know of something that can fingerprint easily in Python, that would be useful; right now we use the Percona Golang fingerprinting tool).
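
The table lookup described here could be sketched directly on top of parse results shaped like the one in the PR description. The names `tables_in` and `build_index` are hypothetical, the second tree is hand-written for illustration rather than real parser output, and names are reported exactly as written in the query, so a join target such as a.prod is not alias-expanded.

```python
from collections import defaultdict

T = "fr.infologic.stocks.cumuls.modele.lotulcumul"
TEMP = "fr.infologic.global.outils.modele.tabletemp"

# q1 is trimmed from the parse result in the PR description;
# q2 is a hand-written tree in the same shape (an assumption).
parsed_queries = {
    "q1": {
        "select": {"value": "a.uex.ik"},
        "from": [
            {"value": T, "name": "a"},
            {"join": {"name": "p", "value": "a.prod"}},
        ],
        "where": {"in": [
            "a.uex.ik",
            {"select": {"value": "temp.ik"},
             "from": {"value": TEMP, "name": "temp"}},
        ]},
    },
    "q2": {"select": "*", "from": TEMP},
}


def _from_entries(clause):
    """Yield the table names written directly in one FROM clause."""
    entries = clause if isinstance(clause, list) else [clause]
    for entry in entries:
        if isinstance(entry, str):
            yield entry
        elif isinstance(entry, dict):
            # Treating any key that ends with "join" as a join is an assumption.
            targets = [entry] + [v for k, v in entry.items() if k.endswith("join")]
            for target in targets:
                if isinstance(target, str):
                    yield target
                elif isinstance(target, dict) and isinstance(target.get("value"), str):
                    yield target["value"]


def tables_in(node):
    """Depth-first walk over a parse result, subqueries included."""
    if isinstance(node, list):
        for child in node:
            yield from tables_in(child)
    elif isinstance(node, dict):
        for key, child in node.items():
            if key == "from":
                yield from _from_entries(child)
            yield from tables_in(child)  # keep descending to reach subqueries


def build_index(queries):
    """Map each table name to the ids of the queries that mention it."""
    index = defaultdict(set)
    for query_id, tree in queries.items():
        for table in tables_in(tree):
            index[table].add(query_id)
    return dict(index)


index = build_index(parsed_queries)
print(sorted(index[TEMP]))
```

An inverted index like this answers "which queries touch this table" with a single dictionary lookup; fingerprinting similar queries is a separate problem that this sketch does not address.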

@klahnakoski
Owner

@avaitla Thank you for mentioning "... way to fingerprint ...". I had some vague sense that even a partial PR can provide value, and you articulated it.


4 participants