Refactoring and resolution of performance issues #9

Open: mproffitt wants to merge 2 commits into master

mproffitt

This pull request implements a new API and resolves ISSUE-7 - Severe Performance Degradation when working with large data-sets.

A full breakdown of the changes provided is available at https://github.com/mproffitt/py-upset/blob/feature/ISSUE-7-Severe-Performance-Degradation/docs/WhatChanged-Version2.md with a discussion on performance towards the bottom.

Synopsis of changes

  • New resources module
  • New methods module
  • pyupset.__init__ now exposes only the plot() function, the visualisation.UpsetPlot, resources.FilterConfig and resources.DataExtractor classes, and the resources.SortMethods Enum
  • New API structure
  • New FilterConfig, GraphStore, Colours, GridSpecStore and ExtractedData classes extending an Immutable type (once set, attributes cannot be changed)
  • ExtractedData class is comparable
  • DataExtractor class moved to resources
  • DataExtractor now works on a merge table rather than generated indexes
  • Improved API Documentation
  • Added Tests for core functionality
  • Improved lint checks
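
To illustrate what "extending an Immutable type" and "comparable" could mean in practice, here is a minimal sketch. It is not the code from this pull request: the `Immutable` body, the `ExtractedDataSketch` name, and its `in_sets` and `size` attributes are all hypothetical, and ordering by intersection size is an assumption made only for the example.

```python
from functools import total_ordering


class Immutable:
    """Hypothetical base class: each attribute may be set once, then never rebound."""

    def __setattr__(self, name, value):
        if name in self.__dict__:
            raise AttributeError(f"{name!r} is already set and cannot be changed")
        super().__setattr__(name, value)

    def __delattr__(self, name):
        raise AttributeError(f"{name!r} cannot be deleted")


@total_ordering
class ExtractedDataSketch(Immutable):
    """Illustrative stand-in for ExtractedData (attribute names are assumptions)."""

    def __init__(self, in_sets, size):
        self.in_sets = frozenset(in_sets)  # names of the sets in this intersection
        self.size = size                   # number of rows in the intersection

    def _key(self):
        # Order by size first, then by set names so equal sizes still compare consistently.
        return (self.size, tuple(sorted(self.in_sets)))

    def __eq__(self, other):
        if not isinstance(other, ExtractedDataSketch):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, ExtractedDataSketch):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())


small = ExtractedDataSketch(["A", "B"], 3)
large = ExtractedDataSketch(["A"], 5)
ordered = sorted([large, small])  # comparable objects can be sorted directly
```

Assigning `small.size = 10` after construction raises `AttributeError` in this sketch, which is the "once set, cannot be changed" behaviour the list describes.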

Full write-up of changes can be found in docs/WhatChanged-Version2.md

Differences in output.
    * The histogram plot shows slightly different values from
      the original library. There are two possible reasons:
        1. An issue with selecting the results into ExtractedData
           objects.
        2. The original library plotted incorrect results, potentially
           counting NaN as a value. That would have produced larger
           datasets than the rework, which explicitly deletes NaN
           values.
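
The second possibility can be sketched without any dependencies. The example below is not the library's code: it uses plain dicts in place of a DataFrame merge table, and the helper names (`build_merge_table`, `is_missing`, `column_size`) and the sample sets are hypothetical. It only shows how counting NaN placeholders in a merged table inflates a set's size.

```python
import math


def build_merge_table(named_sets):
    """Return one row per distinct key, holding NaN where a set lacks that key."""
    keys = sorted(set().union(*named_sets.values()))
    return [
        {name: (key if key in members else math.nan)
         for name, members in named_sets.items()}
        for key in keys
    ]


def is_missing(value):
    """True only for float NaN placeholders, never for real keys."""
    return isinstance(value, float) and math.isnan(value)


def column_size(table, name, drop_nan=True):
    """Count the entries in one column, optionally discarding NaN placeholders."""
    values = [row[name] for row in table]
    if drop_nan:
        values = [v for v in values if not is_missing(v)]
    return len(values)


# Hypothetical input: two sets that share two of their three members.
named_sets = {
    "set_a": {"g1", "g2", "g3"},
    "set_b": {"g2", "g3", "g4"},
}
table = build_merge_table(named_sets)

naive_size = column_size(table, "set_b", drop_nan=False)  # 4: NaN row counted
clean_size = column_size(table, "set_b")                   # 3: NaN row dropped
```

If the original library counted rows the way `naive_size` does, its bars would be taller than those produced by a version that drops NaN first, which would match the difference described above. This does not rule out the first possibility.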
mproffitt force-pushed the feature/ISSUE-7-Severe-Performance-Degradation branch from c01fa0e to 97aefda on November 22, 2016 13:32
    * Added a reset method to change the index on small dataframes
    * Frames are now merged on a copy, and the column names on the original frame are reset after the merge

Successfully merging this pull request may close these issues.

Severe performance degredation when working with large datasets