ebmdatalab/datalab-pandas

Datalab-pandas

This is a very early-stage library to simplify working with pandas for common EBMDataLab operations.

Usage

Install the package as you usually would, e.g.

pip install ebmdatalab

To upgrade to the latest version, use

pip install ebmdatalab --upgrade

Convenience for caching/storing BigQuery data as CSV

This will save the results of the SQL query as a CSV, and when it's run again, as long as the SQL hasn't changed, load that CSV rather than querying BigQuery again. Use a .zip, .gz, .bz2 or .xz file extension to store the cache in a compressed format, or .csv for uncompressed:

from ebmdatalab import bq

sql = "SELECT * FROM ebmdatalab.hscic.bnf"
df = bq.cached_read(sql, csv_path='bnf_codes.zip')  # add `use_cache=False` to override
df.head()
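
The caching behaviour described above can be pictured with a small standard-library sketch. This is not the library's implementation: `cached_rows`, `run_query` and the `.sha256` sidecar file are hypothetical names chosen for illustration, and a stand-in function replaces the BigQuery call so the example runs offline.

```python
import csv
import hashlib
import os
import tempfile


def cached_rows(sql, csv_path, run_query, use_cache=True):
    """Return query rows, reusing csv_path if the SQL is unchanged.

    Hypothetical helper: run_query is any callable that takes SQL and
    returns a list of dicts. The SQL fingerprint is kept in a sidecar
    file next to the CSV (an assumption made for this sketch).
    """
    fingerprint = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    fingerprint_path = csv_path + ".sha256"

    # Reuse the CSV only if it was written for exactly this SQL text.
    if use_cache and os.path.exists(csv_path) and os.path.exists(fingerprint_path):
        with open(fingerprint_path) as f:
            if f.read().strip() == fingerprint:
                with open(csv_path, newline="") as cached:
                    return list(csv.DictReader(cached))

    # Cache miss (or use_cache=False): run the query and store the result.
    rows = run_query(sql)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]) if rows else [])
        writer.writeheader()
        writer.writerows(rows)
    with open(fingerprint_path, "w") as f:
        f.write(fingerprint)
    return rows


calls = []


def fake_query(sql):
    """Stand-in for a BigQuery call; records each SQL string it receives."""
    calls.append(sql)
    return [{"id": "1", "label": "example"}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bnf_codes.csv")
    sql = "SELECT * FROM ebmdatalab.hscic.bnf"
    first = cached_rows(sql, path, fake_query)                # runs the query
    second = cached_rows(sql, path, fake_query)               # served from the CSV
    third = cached_rows(sql + " LIMIT 1", path, fake_query)   # SQL changed, runs again
    query_count = len(calls)
```

Keying the cache on the SQL text means that editing the query refreshes the CSV automatically, while a change to the underlying table does not; for that case the README's `use_cache=False` option forces a fresh read.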

To access data in BigQuery, users need the credentials of a service account with the "BigQuery Guest" and "BigQuery Dataset Guest" roles. You can create one here.

  • Click "+ CREATE SERVICE ACCOUNT", choose a name for the account (e.g. "Rich Croker service account"), and add the "BigQuery Guest" and "BigQuery Dataset Guest" roles.
  • Select the newly created account, click "KEYS", "ADD KEY", "Create new key", "JSON", and "CREATE". This will download the credentials.

If using this library through a notebook derived from the datalab-jupyter image, the credentials file should be renamed to bq-service-account.json and moved to the repo root. Otherwise, EBMDATALAB_BQ_CREDENTIALS_PATH should be set to the path of the credentials file.
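
Outside the datalab-jupyter image, the environment variable can also be set from Python. The sketch below is only an illustration: `use_credentials` is a hypothetical helper, the key file is a throwaway stand-in, and only the variable name comes from this README. It is safest to set the variable before the first call into `bq`, since the README does not say when the library reads it.

```python
import json
import os
import tempfile


def use_credentials(path):
    """Check that path holds parseable JSON, then export its location.

    Hypothetical helper; only the environment variable name is taken
    from the README.
    """
    with open(path) as f:
        json.load(f)  # raises ValueError if the file is not valid JSON
    os.environ["EBMDATALAB_BQ_CREDENTIALS_PATH"] = os.path.abspath(path)
    return os.environ["EBMDATALAB_BQ_CREDENTIALS_PATH"]


# Demonstration with a throwaway file standing in for a downloaded key.
with tempfile.TemporaryDirectory() as tmp:
    key_path = os.path.join(tmp, "bq-service-account.json")
    with open(key_path, "w") as f:
        json.dump({"type": "service_account"}, f)
    exported = use_credentials(key_path)
```

Setting the variable this way affects only the current Python process; to make it persistent, set it in the shell or the system environment instead.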

Other functions

See the examples/ directory for:

  • Logistic regression
  • CCG maps
  • Deciles charts

Development

This project uses flit for bundling and publishing. Publish thus:

flit publish

To install a package locally for development, install with a symlink so you can test changes without reinstalling the module:

flit install --symlink

Updating the package in Windows

1. Clone repository to local drive e.g. through GitHub desktop.

2. Open Anaconda command prompt by right-clicking and selecting Run as administrator.

  • Type pip list to check that ebmdatalab is installed
    • If it is, uninstall using pip uninstall ebmdatalab
  • Install flit: check whether it is already installed using e.g. flit help; if not, type pip install flit
  • Change directory to the location of the repo, e.g. >cd C:\Users\hcurtis\Documents\GitHub\datalab-pandas
  • Install symlink: flit install --symlink

3. Make changes to the package .py files as required

  • Edit code e.g. via Jupyter notebook
  • If you think this change should be incorporated into its own release, open __init__.py and increase the version number

4. Test the changes

  • To avoid having to restart the kernel every time you make a change, add the following commands in your notebook to tell the kernel to update its reference to the package (e.g. for charts):
import importlib
from ebmdatalab import charts
importlib.reload(charts)
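
To see what the reload step does, here is a self-contained sketch that uses a throwaway module in place of `ebmdatalab.charts`. The module name `demo_charts` and its `title` function are hypothetical and exist only for this demonstration.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the demo free of stale .pyc files

# Write a tiny module to a temporary directory and make it importable.
tmp = tempfile.mkdtemp()
module_path = os.path.join(tmp, "demo_charts.py")
with open(module_path, "w") as f:
    f.write("def title():\n    return 'old'\n")

sys.path.insert(0, tmp)
importlib.invalidate_caches()
import demo_charts

before = demo_charts.title()

# Simulate editing the package source while the kernel keeps running.
with open(module_path, "w") as f:
    f.write("def title():\n    return 'new title'\n")

# Without this call the kernel would keep using the old definition.
importlib.reload(demo_charts)
after = demo_charts.title()
```

Note that `importlib.reload` re-executes the module and updates the module object in place. Names previously bound with `from module import name` still point at the old objects, so access functions through the module (as in `charts.some_function`) or re-import them after reloading.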

5. Push changes

  • Open GitHub desktop and you should see your changed files.
  • Create a branch rather than committing to master.
  • Describe and commit changes.
  • Make a pull request to merge branch with master (select from dropdown menu under Branch).
  • This will take you to GitHub where you need to click Create pull request
