
Where to find exact dataset names for BigQuery public datasets used in Spider 2.0 #29

Open
antonio-veezoo opened this issue Dec 2, 2024 · 6 comments

Comments

@antonio-veezoo

Hi,
Is there a complete list of the datasets used in spider-2.0-lite that are available as public data in BigQuery? Neither the strings in the jsonl files nor the directories in the resource folder seem to be exhaustive. For reference, I am using the bigquery-public-data project in BigQuery, so I should have access to all publicly available datasets there.

Let me know if I'm missing something. Happy to provide more info if needed.

Thanks!

@antonio-veezoo antonio-veezoo changed the title How to find exact dataset names for BigQuery public datasets Where to find exact dataset names for BigQuery public datasets used in Spider 2.0 Dec 2, 2024
@lfy79001
Collaborator

lfy79001 commented Dec 2, 2024

Hi,

For example, if you want to check the census_bureau_acs_1 database, you can check the census_bureau_acs_1 schema doc to preview which datasets are used for this database.
Thanks!

@antonio-veezoo
Author

Ok, that makes sense, thanks for the super quick answer! What I'm trying to get at, though, is the difference between the folders in the spider2-lite/resource/databases/bigquery directory and the db field in the spider2-lite/spider2-lite.jsonl file. Basically, I want to recreate the DB instances on which you run the queries, at least for BigQuery, but I don't see how to do it.

@lfy79001
Collaborator

lfy79001 commented Dec 3, 2024

The db field in the spider2-lite/spider2-lite.jsonl file represents a specific database, with each database containing datasets (schemas) from BigQuery. The file organization is displayed in spider2-lite/resource/databases/bigquery.

In the BigQuery warehouse, the hierarchy is as follows: a project contains multiple datasets (schemas), and each dataset contains multiple tables. Based on the requirements, we organize specific datasets into a single database and name it accordingly in the spider2-lite.jsonl file.
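To make the hierarchy concrete, here is a minimal sketch of how a Spider 2.0-lite database name could map onto BigQuery datasets and fully qualified table names. The dataset-to-database mapping and the table name below are hypothetical placeholders for illustration, not the actual Spider 2.0 grouping:

```python
# Sketch of the BigQuery hierarchy described above:
# project -> datasets (schemas) -> tables. A Spider 2.0-lite "db" groups
# one or more datasets under a single name. The mapping below is a
# made-up example, not the real spider2-lite grouping.
PROJECT = "bigquery-public-data"

# Hypothetical: which BigQuery datasets one Spider database bundles together.
spider_db_to_datasets = {
    "census_bureau_acs_1": ["census_bureau_acs", "geo_us_boundaries"],
}

def qualified_table(spider_db, dataset, table, mapping=spider_db_to_datasets):
    """Build a fully qualified BigQuery table name, checking that the
    dataset actually belongs to the given Spider database."""
    if dataset not in mapping.get(spider_db, []):
        raise KeyError(f"{dataset} is not part of {spider_db}")
    return f"{PROJECT}.{dataset}.{table}"

print(qualified_table("census_bureau_acs_1", "census_bureau_acs", "state_2018_1yr"))
# -> bigquery-public-data.census_bureau_acs.state_2018_1yr
```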

You can see a clearer database structure in Snowflake (spider2-snow).

@antonio-veezoo
Author

antonio-veezoo commented Dec 3, 2024

Ok, with spider2-snow.jsonl it all checks out. The db_id field corresponds to a folder in "spider2-snow/resource/databases", and to run the queries I just use the instances pre-set in the accounts you provided on Snowflake. Straightforward, thanks!

import json
import os

list_dirs = {d.lower() for d in os.listdir("spider2-snow/resource/databases")}

with open("spider2-snow/spider2-snow.jsonl") as f:
    data = [json.loads(line) for line in f]

dbs_from_json = {line["db_id"].lower() for line in data}
print(len(list_dirs))  # -> 152
print(len(dbs_from_json))  # -> 111
print(len(dbs_from_json - list_dirs))  # -> 0

But on BigQuery, assuming that the questions from BigQuery DBs have an instance_id that contains "bq", I get this. Why?

import json
import os

list_dirs = {d.lower() for d in os.listdir("spider2-lite/resource/databases/bigquery")}

with open("spider2-lite/spider2-lite.jsonl") as f:
    data = [json.loads(line) for line in f]
data = [entry for entry in data if "bq" in entry["instance_id"]]

dbs_from_json = {entry["db"].lower() for entry in data}
print(len(list_dirs))  # -> 74
print(len(dbs_from_json))  # -> 76
print(len(dbs_from_json - list_dirs))  # -> 31

-------------- Edited -------------

My assumption about the filter condition was wrong (it should match the prefix, not any substring); all good, thanks again!

import json
import os

list_dirs = {d.lower() for d in os.listdir("spider2-lite/resource/databases/bigquery")}

with open("spider2-lite/spider2-lite.jsonl") as f:
    data = [json.loads(line) for line in f]
data = [entry for entry in data if entry["instance_id"].startswith("bq")]

dbs_from_json = {entry["db"].lower() for entry in data}
print(len(list_dirs))  # -> 74
print(len(dbs_from_json))  # -> 43
print(len(dbs_from_json - list_dirs))  # -> 0

@antonio-veezoo
Author

One last thing: has Snowflake made any commitment on how long it is going to host the data?

@lfy79001
Collaborator

lfy79001 commented Dec 3, 2024

id.startswith("bq") or id.startswith("ga") -> bigquery example
id.startswith("sf") -> snowflake example
id.startswith("local") -> sqlite example

As for spider2-snow, all examples are Snowflake and start with sf.
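The prefix rules above can be sketched as a small helper function (the example instance IDs in the calls are made up for illustration):

```python
def source_of(instance_id):
    """Classify a Spider 2.0 example by its instance_id prefix,
    following the bq/ga -> BigQuery, sf -> Snowflake,
    local -> SQLite mapping described above."""
    if instance_id.startswith(("bq", "ga")):
        return "bigquery"
    if instance_id.startswith("sf"):
        return "snowflake"
    if instance_id.startswith("local"):
        return "sqlite"
    return "unknown"

# Hypothetical instance IDs, for illustration only.
print(source_of("bq001"))     # -> bigquery
print(source_of("ga010"))     # -> bigquery
print(source_of("sf123"))     # -> snowflake
print(source_of("local042"))  # -> sqlite
```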

Snowflake is expected to be hosted for a long time; this is a close collaboration.🤔
