IndexError occurred when running python run.py --run Balsa_TPCH --local #2

Open
Blondig opened this issue Jun 9, 2022 · 0 comments


Blondig commented Jun 9, 2022

I am interested in your code and am trying to run it with TPC-H. I wrote a subclass of Balsa_JOBRandSplit and changed p as follows.

p.db = 'tpchload'
p.sim_checkpoint = None
p.query_dir = 'queries/myTpchTest'
p.query_glob = ['*.sql']
p.test_query_glob = TPCH_TEST_QUERIES

The PostgreSQL version and conda environment are the ones recommended in README.md. When I run python run.py --run Balsa_TPCH --local, it fails with the following traceback.

Traceback (most recent call last):
  File "run.py", line 2155, in <module>
    app.run(Main)
  File "/home/xxx/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/xxx/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run.py", line 2150, in Main
    agent = BalsaAgent(p)
  File "run.py", line 754, in __init__
    self.exp, self.exp_val = self._MakeExperienceBuffer()
  File "run.py", line 809, in _MakeExperienceBuffer
    wi = self.GetOrTrainSim().training_workload_info
  File "run.py", line 1160, in GetOrTrainSim
    self.sim = TrainSim(p, self.loggers)
  File "run.py", line 379, in TrainSim
    sim.CollectSimulationData()
  File "/home/xxx/balsa/sim.py", line 728, in CollectSimulationData
    self.search.Run(query_node, query_node.info['sql_str'])
  File "/home/xxx/balsa/balsa/search.py", line 245, in Run
    dp_tables)
  File "/home/xxx/balsa/balsa/search.py", line 317, in _dp_bushy_search_space
    return list(dp_tables[num_rels].values())[0][1], dp_tables
IndexError: list index out of range

I use only three queries in query_dir. One of them looks like this:

select
  supp_nation,
  cust_nation,
  l_year,
  sum(volume) as revenue
from
  (
    select
      n1.n_name as supp_nation,
      n2.n_name as cust_nation,
      extract(year from l_shipdate) as l_year,
      l_extendedprice * (1 - l_discount) as volume
    from
      supplier,
      lineitem,
      orders,
      customer,
      nation n1,
      nation n2
    where
      s_suppkey = l_suppkey
      and o_orderkey = l_orderkey
      and c_custkey = o_custkey
      and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey
      and (
        (n1.n_name = 'VIETNAM' and n2.n_name = 'UNITED KINGDOM')
        or (n1.n_name = 'UNITED KINGDOM' and n2.n_name = 'VIETNAM')
      )
      and l_shipdate between date '1995-01-01' and date '1996-12-31'
  ) as shipping
group by
  supp_nation,
  cust_nation,
  l_year
order by
  supp_nation,
  cust_nation,
  l_year;

Then I added print(join_graph) after line 257 of balsa/balsa/search.py, which is

r = r_tup[1]

and it prints "Graph with 0 nodes and 0 edges". So it seems I do not get a correct join graph at line 224 of balsa/balsa/search.py, which is

join_graph, all_join_conds = query_node.GetOrParseSql()
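To make the link between the empty join graph and the IndexError concrete, here is a minimal sketch of the failing expression from the traceback. The names dp_tables and num_rels are taken from the traceback line; the defaultdict layout and the value num_rels = 0 are my assumptions for illustration, not the actual data structures in search.py.

```python
import collections


def best_plan(dp_tables, num_rels):
    # Same shape as the expression at search.py line 317 in the traceback:
    # take the first entry of the final DP level and return its plan.
    return list(dp_tables[num_rels].values())[0][1]


# Assumption: with an empty join graph no relations are enumerated,
# so the final DP level contains no entries at all.
dp_tables = collections.defaultdict(dict)
num_rels = 0

try:
    best_plan(dp_tables, num_rels)
    error_message = None
except IndexError as e:
    # list(...) of an empty dict's values is [], so [0] fails.
    error_message = str(e)

print(error_message)
```

If this is what happens, the IndexError is only a symptom, and the real problem is that the parser returns no tables and no join conditions.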

I then checked the definition of GetOrParseSql(self) in balsa/balsa/util/plans_lib.py and printed graph and join_conds. The graph is "Graph with 0 nodes and 0 edges" and join_conds is []. Next I checked simple_sql_parser in balsa/balsa/util/simple_sql_parser.py and printed join_conds right after

join_conds = join_cond_pat.findall(sql)

The sql is one of the queries in my query_dir, but join_conds is still []. Looking at the regular expression, I suspect it cannot handle conditions such as c_custkey = o_custkey in my queries, because the pattern contains literal dots (that is, it seems to expect table.column references).
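To illustrate the suspected mismatch, here is a minimal sketch. The pattern below is my assumption of the general shape (alias.column = alias.column); it is not copied from simple_sql_parser.py, and the variable names are hypothetical.

```python
import re

# Assumed pattern shape: both sides of "=" must be written as
# table_or_alias.column. Illustration only, not the project's regex.
qualified_join_pat = re.compile(
    r"""
    (\w+)\.(\w+)   # left side: table_or_alias.column
    \s*=\s*        # equals sign with optional whitespace
    (\w+)\.(\w+)   # right side: table_or_alias.column
    """,
    re.VERBOSE,
)

# Join conditions as written in my TPC-H query (no table qualifiers).
unqualified = "where s_suppkey = l_suppkey and c_custkey = o_custkey"
# Only one side is qualified, as in s_nationkey = n1.n_nationkey.
mixed = "where s_nationkey = n1.n_nationkey"
# The same kind of condition with both sides qualified.
qualified = "where supplier.s_suppkey = lineitem.l_suppkey"

unqualified_matches = qualified_join_pat.findall(unqualified)
mixed_matches = qualified_join_pat.findall(mixed)
qualified_matches = qualified_join_pat.findall(qualified)

print(unqualified_matches)
print(mixed_matches)
print(qualified_matches)
```

Under this assumption, only the fully qualified condition is found. That would explain why join_conds stays empty for TPC-H queries, which usually reference columns without a table prefix.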
The paper uses TPC-H as one of its benchmarks. Could you please give me some hints on the parser problem above, or add some code for TPC-H? Many thanks in advance.
Another point of confusion: when I ran the above command for the first time with

p.query_glob = ['test1.sql', 'test2.sql', 'test3.sql']
p.test_query_glob = ['test1.sql']

it shows

3 train queries: ['test1', 'test2', 'test3']
0 test queries: []
wandb: (1) Create a W&B account

even though test_query_glob in the BalsaAgent params is ['test1.sql']. I am also curious why the baseline PG performance has to be measured by running all test and training queries before training. I look forward to your reply!
