Where to find exact dataset names for BigQuery public datasets used in Spider 2.0 #29
Comments
Hi, For example, if you want to check
Ok, that makes sense, thanks for the super quick answer! But what I'm trying to get at though is what is the difference between the folders in the
The In the BigQuery warehouse, the hierarchy is as follows: a You can see a clearer database structure in Snowflake (
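As general background on the hierarchy mentioned above: BigQuery addresses a table by three levels, project, then dataset, then table, written as `project.dataset.table`. The sketch below only illustrates that naming scheme. The helper `parse_table_ref` and the type `TableRef` are hypothetical names, the table name in the usage line is purely illustrative, and the code assumes none of the three parts contains a dot.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """The three levels of a fully qualified BigQuery table name."""
    project: str
    dataset: str
    table: str


def parse_table_ref(full_name: str) -> TableRef:
    """Split 'project.dataset.table' into its three levels.

    Assumption: no part contains a dot. Surrounding backticks, as used
    in SQL, are stripped before splitting.
    """
    parts = full_name.strip("`").split(".")
    if len(parts) != 3:
        raise ValueError(f"expected project.dataset.table, got {full_name!r}")
    return TableRef(*parts)
```

For example, `parse_table_ref("bigquery-public-data.noaa_gsod.gsod2020").dataset` gives the dataset level of that name.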
Ok, with

import os
import json

list_dirs = set([dir.lower() for dir in os.listdir("spider2-snow/resource/databases")])
with open("spider2-snow/spider2-snow.jsonl") as f:
    data = [json.loads(line) for line in f]
dbs_from_json = set([line["db_id"].lower() for line in data])
print(len(list_dirs))  # -> 152
print(len(dbs_from_json))  # -> 111
print(len((dbs_from_json - list_dirs)))  # -> 0

But on BigQuery, assuming that the questions from BigQuery DBs have an instance id that contains "bq", I get this. Why?

import os
import json

list_dirs = set([dir.lower() for dir in os.listdir("spider2-lite/resource/databases/bigquery")])
with open("spider2-lite/spider2-lite.jsonl") as f:
    data = [json.loads(line) for line in f if "bq" in json.loads(line)["instance_id"]]
dbs_from_json = set([line["db"].lower() for line in data])
print(len(list_dirs))  # -> 74
print(len(dbs_from_json))  # -> 76
print(len((dbs_from_json - list_dirs)))  # -> 31

-------------- Edited -------------

The assumption on the condition was wrong, all good, thanks again!

import os
import json

list_dirs = set([dir.lower() for dir in os.listdir("spider2-lite/resource/databases/bigquery")])
with open("spider2-lite/spider2-lite.jsonl") as f:
    data = [json.loads(line) for line in f if "bq" == json.loads(line)["instance_id"][:2]]
dbs_from_json = set([line["db"].lower() for line in data])
print(len(list_dirs))  # -> 74
print(len(dbs_from_json))  # -> 43
print(len((dbs_from_json - list_dirs)))  # -> 0
One last thing, has Snowflake made any commitments on how long it is going to host the data for?
id.startswith("bq") or id.startswith("ga") -> BigQuery example
As for spider2-snow, all examples are Snowflake, starts with
Snowflake is expected to be hosted for a long time; this is a close collaboration. 🤔
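The prefix rule in the reply above can be combined with the directory check from earlier in the thread. This is a minimal sketch, not the maintainers' tooling: the helper names `is_bigquery_example` and `missing_databases` are hypothetical, and the `instance_id` and `db` keys are assumed from the spider2-lite snippets posted above.

```python
import json
import os


def is_bigquery_example(instance_id: str) -> bool:
    """Prefix rule quoted above: ids starting with "bq" or "ga" are BigQuery examples."""
    return instance_id.startswith(("bq", "ga"))


def missing_databases(jsonl_path: str, db_dir: str) -> set:
    """Return db names used by BigQuery examples that have no folder in db_dir.

    Names are compared case-insensitively, as in the snippets above.
    """
    folders = {name.lower() for name in os.listdir(db_dir)}
    with open(jsonl_path) as f:
        # Skip blank lines so a trailing newline does not break json.loads.
        examples = [json.loads(line) for line in f if line.strip()]
    used = {
        ex["db"].lower()
        for ex in examples
        if is_bigquery_example(ex["instance_id"])
    }
    return used - folders
```

With the paths used earlier, the call would be `missing_databases("spider2-lite/spider2-lite.jsonl", "spider2-lite/resource/databases/bigquery")`. An empty set means every BigQuery example's database has a matching folder.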
Hi,
Is there a complete list of the datasets used in spider-2.0-lite that are available as public data in BigQuery? Neither the strings in the jsonl files nor the directories in the resource folder seem to be exhaustive. For reference, I am using the bigquery-public-data project in BigQuery, so I should have access to all publicly available datasets there.
Let me know if I'm missing something. Happy to provide more info if needed.
Thanks!
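One way to see the exact dataset names available in a project is to ask BigQuery directly. The sketch below assumes the `google-cloud-bigquery` package is installed and that default credentials with a billing project are configured. The helper names `list_dataset_ids` and `main` are hypothetical. It lists only the `bigquery-public-data` project; whether every Spider 2.0 database lives in that project is not stated in this thread.

```python
def list_dataset_ids(client, project="bigquery-public-data"):
    """Return the sorted dataset ids that `client` can list in `project`.

    `client` is expected to be a google.cloud.bigquery.Client, whose
    list_datasets(project=...) yields items with a `dataset_id` attribute.
    """
    return sorted(item.dataset_id for item in client.list_datasets(project=project))


def main():
    # Imported here so the helper above can be used without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumption: default credentials are set up
    for dataset_id in list_dataset_ids(client):
        print(dataset_id)
```

Calling `main()` prints one dataset id per line, which can then be compared against the `db` values in the jsonl file and the folder names under the resource directory.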