Pandas out_flavor for better ctable performance #176
See also FrancescAlted's comment on this issue in #66.
@FrancescAlted Some food for thought for when you return to bcolz: I implemented a quick hack to show the possible performance gains (>2x on my machine) from using a column-major output flavor with ctable: #184. The implementation is clearly non-ideal, but it shows what I was after: returning pandas DataFrames can be faster than returning numpy structured arrays. Relative timings of
By introducing an abstraction layer for the creation of the "result array" and its data access, one can minimize the impact on core bcolz code. This would also allow users to hook in and implement their own out_flavors. The implementation of the abstraction layer in the PR is probably sub-optimal: it currently penalises numpy results with only a few rows. I suspect three possible reasons for this:
That should be easy enough to solve with some profiling and possibly a different abstraction-layer architecture. Eventually, further performance gains could result. As a side-benefit, the abstraction layer would make implementing categoricals (#66) easier. Timing code:
Closing this in favor of #187, which introduces an efficient abstraction layer allowing users to provide their own pandas out_flavor implementation.
In this issue I want to make the case for extending the effect of out_flavor to __getitem__() (and related functions) and for introducing a pandas out_flavor. While I appreciate the rationale for limiting bcolz to the numpy data model, I believe the possible performance improvements with pandas merit consideration. I would be very interested to know whether this has any chance of inclusion in bcolz, since the effort would be non-trivial for a clean implementation.
Executive Summary
- Column assignment to numpy structured arrays is a bottleneck for ctable's __getitem__() and the like: Numpy bottleneck with ctable #174
- Proposal: extend out_flavor so that users can choose between numpy or pandas output
Evidence of column assignment bottleneck
Note: for the sake of simplicity, when I use the term "column" in relation to numpy structured arrays, I refer to the fields of the (single-column) structured array that bcolz uses to store its output.
Numpy structured arrays (as used by bcolz) are inherently row-major. It is impossible to change this as far as I can see. This means that column assignment is fairly slow:
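The timing code from the original report was not preserved on this page; a minimal sketch of the effect (array size and dtype are my own choice, not from the original) could look like this:

```python
import numpy as np

n = 1_000_000
# a structured array of the kind bcolz returns: records are stored
# row by row, so each field ("column") is strided across the buffer
sa = np.empty(n, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
data = np.arange(n, dtype='f8')

# writing one column touches every 24-byte record: a strided write
sa['a'] = data

# the field view is not a contiguous block of memory
assert sa['a'].strides == (24,)
assert not sa['a'].flags['C_CONTIGUOUS']
```

The strided layout is why a single-column write cannot be a simple en-bloc memory copy.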
Significant speedup (factor 6.5!) can be achieved by moving to a column-major memory layout for the numpy array:
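The column-major comparison can be illustrated with a homogeneous-dtype 2D array (sizes and names are mine): a Fortran-ordered array keeps each column contiguous, so a column write becomes a single en-bloc copy:

```python
import numpy as np

n = 1_000_000
data = np.arange(n, dtype='f8')

row_major = np.empty((n, 3), order='C')   # rows contiguous
col_major = np.empty((n, 3), order='F')   # columns contiguous

row_major[:, 0] = data    # strided write, one element per row
col_major[:, 0] = data    # contiguous write, memcpy-like

# only the Fortran-ordered array exposes a contiguous column view
assert not row_major[:, 0].flags['C_CONTIGUOUS']
assert col_major[:, 0].flags['C_CONTIGUOUS']
```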
Why Pandas DataFrames would help
While moving to column-major numpy arrays would be the ideal solution, this is obviously not an option: they require the entire array to have a homogeneous dtype.
Pandas DataFrames, however, support columns of different dtypes and by design store data in column-major order. (As I remember, Wes McKinney chose this because most of his data analysis happened along columns, though I cannot find the reference. That said, I think the following article, in which he explains his reasons for not choosing numpy structured arrays, is interesting: Wes McKinney: A Roadmap for Rich Scientific Data Structures in Python.)
In addition, the choice of column-major ordering permits size-mutability: columns can be added without copying the existing data. This fits well with the ctable column-store philosophy. With numpy structured arrays, the entire array has to be copied (almost entry by entry, as far as I can see) to add a new column.
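The size-mutability point can be sketched as follows (example mine; numpy.lib.recfunctions.append_fields is the standard helper for adding a field to a structured array):

```python
import numpy as np
import numpy.lib.recfunctions as rfn
import pandas as pd

n = 1_000
df = pd.DataFrame({'a': np.arange(n), 'b': np.ones(n)})
# a new column is simply attached; the existing column arrays
# need not be rewritten
df['c'] = np.zeros(n)

sa = np.empty(n, dtype=[('a', 'i8'), ('b', 'f8')])
# for a structured array, adding a field means allocating a whole
# new array and copying every existing field into it
sa2 = rfn.append_fields(sa, 'c', np.zeros(n), usemask=False)
assert set(sa2.dtype.names) == {'a', 'b', 'c'}
```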
Due to these advantages, column-major ordering is used by many well-known dedicated number-crunching environments, among them Fortran, MATLAB and R.
Note: Making a pandas DataFrame from an already existing numpy structured array returned by ctable is not an option either: again, this effectively requires element-by-element copying of the data.
Evidence of performance improvements with Pandas
Instantiation of pandas DataFrames is admittedly an issue with smaller databases. Leaving this issue aside for the moment, one can see that assignment to an already instantiated DataFrame is much faster:
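A sketch of the assignment being compared (shapes, dtypes and column names are my own choices):

```python
import numpy as np
import pandas as pd

n = 1_000_000
data = np.arange(n, dtype='f8')

# pre-instantiated DataFrame with the desired row count and columns
df = pd.DataFrame(np.empty((n, 3)), columns=['a', 'b', 'c'])

# each column lives in column-major storage, so this assignment is
# an en-bloc copy rather than a per-record strided write
df['a'] = data
assert (df['a'].to_numpy() == data).all()
```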
Instantiation of the DataFrame will probably be an issue. My gut reaction is that it should be possible to instantiate (and cache) an empty DataFrame with the correct structure and then, for each __getitem__() call, shallow-copy this template and assign new data arrays. This should help with the instantiation overhead, since the column makeup of a ctable instance usually does not change often during a program.

Down the road
In the long run it might be worth getting rid of the memory copies altogether for the DataFrame out_flavor and decompressing the chunks directly into the arrays backing the DataFrame. This would likely lead to further performance improvements (though smaller ones, as the remaining memory copies would already be efficient en-bloc copies).