- Potentially BREAKING behavior: `innerJoin` now has another parameter, `commonColumns`, which controls how columns that are common to both dataframes are handled. Previously, we wrongly assumed that the data in all common columns must match exactly, in which case it didn’t matter which column to take. However, if they did not match, the data for the common columns (including the one we joined by) was corrupted and left at default initialization from the first mismatch in a common column. There are now 3 different options:
  - `ccLeft`: keep the data of the left input
  - `ccRename` (default): rename the common columns to `*_left` and `*_right`
  - `ccDrop`: drop common columns that are not the joined one

  We choose `ccRename` as the default, because it keeps all information. It is a breaking change however, because the common columns now have a different name. But imo it’s better to make people aware that this change happened than to give them wrong data. Feel free to contradict me on this.
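A minimal sketch of the three modes, assuming `commonColumns` is passed as a named argument after `by` and that `toDf` accepts a table constructor of literal sequences. The column names `id` and `val` are made up for illustration.

```nim
import datamancer

# Two inputs that share the join key `id` and a second common column `val`,
# whose data deliberately differs in one row.
let left  = toDf({"id": @[1, 2, 3], "val": @[10, 20, 30]})
let right = toDf({"id": @[1, 2, 3], "val": @[10, 99, 30]})

# Default (`ccRename`): both versions of `val` survive under new names.
let renamed = innerJoin(left, right, by = "id")
# `ccLeft`: keep only the left input's data for the common column.
let keepLeft = innerJoin(left, right, by = "id", commonColumns = ccLeft)
# `ccDrop`: drop every common column except the join key.
let dropped = innerJoin(left, right, by = "id", commonColumns = ccDrop)

echo renamed.getKeys()
echo keepLeft.getKeys()
echo dropped.getKeys()
```

Code that previously read `val` from the join result has to either switch to the renamed columns or request `ccLeft` explicitly.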
- prepare Datamancer for `-d:nimPreviewHashFarm`
- fix an issue in `arrange` that could cause calls with more than 2 columns to sort incorrectly from the 3rd column on.
- remove `toHashSet` from Datamancer, as it is now part of Arraymancer directly
- fix issue #65, where insufficient identifier hygiene in formula macros could cause CT errors if a user variable with the same name as a column appeared and accented quoting was used
- allow a title for the page in `showBrowser`
- add option to `exclude` columns when using `fromH5` serialization
Add experimental support for the JS backend!
Not everything works, but the most important parts are there. The basic idea is that we have a non-arraymancer data storage for columns, simply a wrapper around `seq[T]`.
Some basic features related to IO are currently not supported. And on older Nim versions (older than current devel), it may be required to be more explicit in `f{}` formulas in terms of type hints.
- remove our `len` for `Tensor` due to its addition in arraymancer (update arraymancer dep to `>= 0.7.28`)
- improve the formula logic to detect procedures with a `{.error: "".}` pragma and ignore them for type matching (necessary because arraymancer now forbids e.g. `+` for `Tensor + Scalar`)
Note: If you are on Nim 1.6 or 2.0 (but not after) and you are using a formula with `max(c"foo")` where `foo` is a column name, you will have to replace that by `max(col("foo"))`. Due to the new addition of a `max(varargs[Tensor]): Tensor` procedure in arraymancer, for some reason older Nim versions bind the symbol `max` fully to that `varargs` version. That breaks previous formulas.
For related reasons, in formulas using `max` (or similarly `min`) it may also be necessary to supply type hints, even if they were not required previously. E.g. replace:
```nim
f{ max(idx("a"), 2) }
```
by
```nim
f{int: max(idx("a"), 2) }
```
to indicate the column `"a"` should be treated as integers.
Given that this is fixed on Nim devel, I won’t attempt a hacky workaround. Please just update your Nim version or your formulas. :(
- add `shuffle` to shuffle a DF, either using the stdlib global RNG or a given RNG
- add `extend` helpers for `seq/Tensor/Column` to add a single element to any collection and return the extended version
- throw custom `OSError` if CSV file cannot be read
- add `%~` for `(string, T)` to construct a `VObject` Value
- fix `spread` implementation if more keys are present
- correctly handle quoted fields in CSV files, fixes issue #58
- add `allowLineBreaks` option to also allow for line breaks in quoted fields. It is opt-in (despite being commonly useful for files with quotes), because counting the lines is otherwise a useful sanity check that the parsing worked successfully.
- [IO] fix for space separated files with quoted fields as columns
- handle DF construction from an empty seq or tensor
- hotfix for `assignStack` fix of previous version
- handle empty datasets in `assignStack`
- add overload for `innerJoin` for more than two arguments
- fix potential segfault when calling `pretty` on `nil` DF
- allow `count` to take multiple arguments to group by and count
- add `equal` procedure for Columns and DataFrames for comparison of values
- add optional `serialize` submodule to serialize to / from HDF5 files
- allow tilde `~` in paths for `readCsv`
- add `rows` iterator to get rows of DF as single row DF
- add `[]` with single int index to get single row of DF
- add `dropNaN` helper to remove rows of DF that contain NaNs
- minor fixes for regressions appearing in newer nim (duplicate case & generic type resolution issue)
- [io] add `emphStrNumber` arg to `writeCsv`, `pretty` to disable highlighting of strings that look like numbers via explicit `"`
- do not lift control flow out of loop bodies even if they do not touch DF columns / indices
- improve error message for `add` of two DataFrames, prints the mismatching columns
- fix Column conversion to string column from `char` input
- add `item` to retrieve the single element of a `Column` and `Tensor`
- replace all `doAssert` calls that really should be exceptions by exceptions so that `--panics:on` is saner. Fixes issue #51
- use `contains` from the `macrocache` module if available, PR #49
- add `toOrgTable` to convert a DF into an Org style table
- fixes a reference semantic bug when slicing a DF containing a constant column, e.g. by using `head` or `tail` (or manually)
- small improvements to non generic generics (make them usable for `filter` if used with `dfFn` for example)
- memory optimize construction of `groupMap` in `group_by` call
- `toDf` now degensyms identifiers so the name of the column is the same as the variable name also in templates (PR #45 by @hugogranstrom)
- implement lifting out nodes in formulas that perform calculations on full columns to avoid the extreme performance penalty of writing something like `` f{"foo" ~ `x` + sum(col("y"))} `` where previously the calculation of the sum would have been performed inline in the loop. Now these are lifted out of the loop body and assigned to a single `let` variable that is injected instead
- add `maxLines` option to CSV parser, useful to only read the first N lines
- fix extraction of `idx/col` references in `nnkCall` arguments
- fix type determination for procedures with default arguments of type `bool`
- refactor `toHtml` out of `showBrowser` to access raw generated HTML table
- fix regression for `bool` columns, by handling `ntyBool` in `gencase`. Fixes the ggplotnim CI, as noticed by Vindaar/ggplotnim#151.
- fix similar issue to `gather` with `innerJoin` by also binding `items` for sets & calling `items` directly
- fix reference semantics bugs for constant columns for `filter` calls
- fix issue with `gather` in files that do not import `sets` by binding the `items` iterator in the scope
- fix issues with `toDf` if only single argument given (with and without string names) not setting DF length / raising CT error
- add `to` option to convert type of data frame
- some more internal macro logic cleanup
- define `ScalarLike` concept to match types that can be converted to `float` via `.float` for e.g. units to have a `%~` Value conversion for them
- minor cleanup of macro logic
- rename `genColumn` to `defColumn`
- `toDf` now supports assignment of generic arguments as well, as long as the column types required have been generated already.
- `defColumn` now generates all combinations of the given types
- fixes some issues with `unionType` getting confused
- makes `toColumn` work correctly with `array`
- fix regression in `ggplotnim` formula due to badly determined result type. Only use resulting type of `Preface` if type acceptable (e.g. not generic)
- fix `toColumn` for single element `Column` construction
- keep conversions from other number types to `int` and `float` after all in the context of `toDf` and `toColumn`.
MAJOR, POSSIBLY BREAKING: Add experimental support for “non-generic generic `Columns`”.
See the bottom for a list of known breaking changes.
What does that mean?
First of all the `DataFrame` type is now an alias to `DataTable[Column]`. `DataTable` is a new name for a generic version of `DataFrame` to avoid breaking changes when making `DataFrame` generic. Current code should just continue to work fine.
The existing `ColumnKind` enum now has an additional member called `colGeneric`. This value is used in other variants of a `Column`-like type, defined by a `ColumnLike` concept. Essentially, these types are equivalent to `Column`, but contain additional fields in the `colGeneric` branch. For example consider an extended `ColumnLike` type that can also store `KiloGram` and `Meter` units (from `unchained`):
```nim
type
  ColumnKiloGram|Meter = ref object
    len*: int
    case kind*: ColKind
    of colFloat:
      fCol*: Tensor[float]
    of colInt:
      iCol*: Tensor[int]
    of colBool:
      bCol*: Tensor[bool]
    of colString:
      sCol*: Tensor[string]
    of colObject:
      oCol*: Tensor[Value]
    of colConstant:
      cCol*: Value
    of colNone:
      nil
    # up to here the same type as `Column`
    of colGeneric:
      # depending on the instance, the generic stores `KiloGram` or `Meter` data
      case gkKind: GenericKiloGram|MeterKind # an auto generated enum for generic types
      of gkKiloGram:
        gKiloGram: Tensor[KiloGram]
      of gkMeter:
        gMeter: Tensor[Meter]
```
This generalizes to any number of generics.
Such a new `Column` type is generated using the `genColumn` macro:
```nim
genColumn(KiloGram, Meter)
```
to generate the above.
After generating the new type, it can be accessed using:
```nim
colType(KiloGram, Meter) # <- returns the type
```
To construct a `DataTable` of this type, you can do:
```nim
let df = colType(KiloGram, Meter).newDataTable() # or `newDataTable(colType(KiloGram, Meter))` of course
```
Further, an existing `DataTable` can be extended by a column of a new type using:
```nim
let df = newDataFrame() # construct an old school data frame
# ... put in some data
let dfKg = df.extendDataFrame("foo", # <- column name
                              @[1.kg, 2.kg]) # <- fill with kilo gram data
```
If the `ColumnKiloGram` type has been generated before using `genColumn(KiloGram)`, this will return a `DataTable[KiloGram]` containing the old data of `df` as well as a new column called `"foo"` of type `KiloGram`.
`mutate` also works with formulas that access generic types or generate columns of new generic types. There are certain limitations currently though. In some cases the formula may need to be aware of the type of the `DataTable` it acts on. For this there is a new macro, `dfFn`, which wraps around a regular `f{}` macro and receives the `DataTable` it should act on:
```nim
genColumn(KiloGram, KiloGram²)
let dfKg2 = dfKg.mutate(dfFn(dfKg, f{KiloGram -> KiloGram²: "kg2" ~ `kg` * `kg`}))
```
As this is a bit annoying, there is a `mutate2` (the name is consciously stupid, as a proper name still hasn’t been chosen) that does this automatically:
```nim
genColumn(KiloGram, KiloGram²)
let dfKg2 = dfKg.mutate2(f{KiloGram -> KiloGram²: "kg2" ~ `kg` * `kg`})
```
Columns of course only have to be generated once.
Note: one thing to keep in mind when dealing with multiple columns of different types (as this surely will come up more now): the `idx` and `col` helpers in formulas support explicit type annotations for individual columns:
```nim
f{float -> Meter: "foo" ~ `x` * idx(`y`, Meter)}
# where `x` will be read as `float` and `y` as `Meter`!
```
Many things are likely to break… :)
See the playground/non_generic_generics.nim for a few examples for usage.
The release is a bit less refined than I would have liked, but as the code is (as far as I can tell) not breaking existing code and mostly working, I want to merge it now, to test it properly in real usage and fix things along the way. Otherwise it will be on ice forever.
The commit that contains the added code is squashed as the development code is ultra messy. Check out the `nonGenericGenerics` branch (or PR) or the `cleanUpCommitsForRebase` branch (or PR) for the full history.
Known breaking changes and issues:
- assigning data of types that can be converted to `int` or `float` (e.g. `int8`) to a DF does not auto convert them anymore. This was always a helper to store them, but in the future once this feature is more refined, it’ll be better to store them as is
- `colGeneric` is a new enum field for `ColumnKind` and thus has to be handled in code dealing with the enum manually
- remove outdated warning about failed type deduction in formulas
This release gets rid of all hints during compile time, afaict.
- remove unused imports
- make sure variables follow same naming
- remove dead code
- add `styles:usage` to `nim.cfg`
- BREAKING: change semantics of assignment formula (using `<-`) in the context of `mutate`. Previously, using such formulas in a `mutate` (or `transmute`) call would end up renaming a column from RHS to LHS. However, this was never clearly communicated & was a bit unclear. In particular it made it impossible to generate a constant column in a `mutate` call, which seems much more useful to me. To rename a column, simply use the `rename` procedure as before. Note that a `f{"bar" <- "foo"}` formula is required in that case.
- raise an exception in `rename` if a formula of different kind than `fkAssign` is given
- change default printing width of columns in a DF. Make them a bit wider to accommodate float columns printed in exp notation.
- another quick release to help with some windows line ending CSV files
  - adds a `lineBreak` and `eat` option to `readCsv` to help with certain windows style line ending CSV files in which otherwise we might miscount the number of lines
- hotfix release fixing an issue with `readCsv`.
  - if a file contained columns that do not allow us to determine types, fixes an issue in which parsing of them failed, due to a missing reset of `col`
  - add a `maxGuesses` argument to `readCsv` to stop guessing types after this many rows (set to ‘object’ columns in that case)
  - fix a small issue in which we always entered the `skipLines` loop, even if we didn’t have to skip any lines
- add support for reading CSV files from http and https URLs.
- do not ignore `skipInitialSpace` and `quote` readCsv arguments.
- replace an assertion by a proper check in `summarize` if user hands a non reducing formula to it
- replace usages of `seqsToDf` in the docs
- BREAKING: in `readCsv` the `colNames` argument, if any are given, now implies we skip the parsing of the header completely. If there is a header in the file that is to be ignored, `colNames` must be combined with `skipLines`! See also the updated docstring.
- possibly breaking: when parsing CSV files with space / tab separators, spacing at the end of the lines does not cause issues anymore (they previously caused us to count them as real columns, meaning possible crashes due to number of column mismatches). This can be breaking for a user, but in that case they relied on unspecified behavior. Empty columns at the beginning or ending in the file are a bit crazy for space based seps. However, we might add a `skipInitialSpace` equivalent for this in the future.
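A small sketch of the `colNames` change, assuming `colNames` takes a `seq[string]` and `skipLines` an integer count. The file name and column names are made up for illustration.

```nim
import datamancer

# Hypothetical file whose header line `x,y` we want to replace by our own names.
writeFile("example.csv", "x,y\n1,2\n3,4\n")

# With `colNames` given, the header is no longer parsed, so the existing
# header line has to be skipped explicitly via `skipLines`.
let df = readCsv("example.csv", colNames = @["a", "b"], skipLines = 1)
echo df
```

Without `skipLines = 1`, the `x,y` line would be read as the first data row.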
- `select` now respects the order of the given columns, i.e. the columns in the resulting DF are in the order of the given columns
- add `relocate` to change the column order of one or more keys
- add experimental operation to access column at index `i` using `df[[i]]` syntax
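The `select` ordering can be checked with a short sketch like the following, assuming `toDf` accepts a table constructor of literal sequences. The column names are made up for illustration.

```nim
import datamancer

let df = toDf({"a": @[1, 2], "b": @[3, 4], "c": @[5, 6]})
# Ask for the columns in a different order than they were constructed in.
let sel = df.select("c", "a")
# The keys of the result follow the order of the `select` arguments.
echo sel.getKeys()
```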
- fix CSV parsing for files with fully empty columns
- allow printing of columns of kind `colNone`
- add filename as title to `showBrowser` calls
- fix regression when calling `arrange` with purely column references to constant columns
- constant `DataFrame` columns have seen improvements. Before, most operations on them would convert them to a non-constant column, often forced to convert to an object column. Now, most operations (that make sense) are supported on constants themselves and if a non-constant conversion is required, it aims to use the type corresponding to the underlying `Value` kind of the constant. That way conversions of constants to full columns should now lead to native (float, int, string, bool) tensors (unless an operation with another native, incompatible type is performed)
- some bugs were fixed that could cause reference semantics of dataframes to shine through when using `filter`
- BREAKING: the `toValueKind` procedure now takes a `Column` instead of a `ColumnKind`. This is to be able to handle the constant to full conversion properly. Note: A deprecated variant of the former version is still around!
- add `filterToIdx`, which takes a DF and a sequence / tensor of integers. The procedure will keep only those rows of the DF whose indices are part of the seq/Tensor
- slight performance improvements for the parsing of CSV files (larger for string heavy files) by avoiding an unnecessary `newString` call (yeah, `setLen` resizes for you if needed…)
- allow more valid Nim code inside of `f{}` formulas, e.g. if expressions and block statements
- fix type determinations in `f{}` formulas, if a procedure with default parameters, but no explicit type information is given.
- certain expressions in `f{}` formulas (for example `isNaN(idx("foo"))`) could produce unintended CT errors and work now (sorry, had to add a `when compiles` check :( ).
- experimental support for “full formulas” as I call them that allow to have more control over variables in the scope of the formula:
  ```nim
  formula:
    preface:
      foo in df["Foo", float]
      bar in baz(df["Bar", int])
    loop:
      bar^2.float + foo
  ```
  allows for custom variable names inside of the context and (more importantly) to perform a full column operation (e.g. `baz`) on a column before the loop and use the elements of that operation inside of the loop. Note that this is not for reducing operations on columns (i.e. `mean(df["Bar", float])`)! It is still planned to lift reducing operations out of the loop body, but that is still pending.
- SEMI-BREAKING: add preliminary support for reducing formulas that require a `for` loop. This (currently) allows for `res += <formula>` like statements inside of a loop instead of just `res = <formula>` where in the latter the formula must produce a scalar by itself (i.e. does not allow element wise access to columns). Now a formula that accesses a single element via `idx(...)` will produce a loop with an accumulation. Note: to make use of this feature you must use the full formula syntax, as otherwise the default value of `res` is unclear.
  ```nim
  formula:
    preface:
      var res = 1.0
      Bidx in df["B", float]
    loop:
      res *= Bidx * 1.5
  ```
- add `lag`, `lead` procedures that take a `Tensor/Column` and return a new `Tensor/Column` that is shifted forward / backward N elements (the left overs are zeroed by default, but adjustable using the `fill` argument)
- the `showBrowser` helper to view a `DataFrame` in the browser now adds an additional “index” column
- improve performance of `groups` iterator (particularly in cases where the DF is already sorted / the sorting is cheap)
- fix type deduction issues in formulas using dot expressions for certain cases
- add convenience comparison operators for `Value` elements of a column with regular types within a `f{}` formula (they are emitted as templates into the closure scope to avoid having them available in all scopes). Use the `convenienceValueComparisons` template to emit them to a local scope if desired outside formula scopes.
- make sure to only import and export `arraymancer/tensor` submodule
- fix CSV parsing wrt. empty fields (treated as NaN) and explicit NaN & Inf values
- fix CSV parsing of files with extraneous newlines
- fix CSV parsing with missing values at the end of a line (becomes `NaN`)
- fix CSV parsing of empty fields if missing in first row and element is not float
- add more parsing tests
- add basic implementation of `spread` (inverse of `gather`; similar to dplyr `pivot_wider`). The current implementation is rather basic and performance may be suboptimal for very large data frames.
- add `null` helper to create a `VNull Value`
- significantly improve the docs of the `dataframe.nim` module.
- fixes an issue where unique column reference names were combined into the same column due to a bad name generation algorithm
- significantly improves performance in applications in which allocation of memory is a bottleneck (tensors were zero initialized).
- disable formula output at CT by default. Compile with `-d:echoFormulas` to see the output.
- remove CT warnings for unrelated stuff (node kinds)
- avoid some object conversions in column operations (ref #11)
- add `[]=` overloads for columns for slice assignments
- significantly improve performance of `mutate/transmute` operations for grouped dataframes (O(150,000) groups in < 0.5 s possible now)
- fixes #12 by avoiding hashing of columns. Some performance regression in `innerJoin`, `setDiff` (~2x slower in bad cases).
- allow assignment of constants in `seqsToDf`
- allow assignment of scalars to DF as column directly
- add filename argument to `showBrowser`
- make `compileFormulaImpl` actually typed to make formulas work correctly inside of generics (ref `ggplotnim` Vindaar/ggplotnim#116)
- change internal macro type logic to use strings
- fix slicing of constant columns
- fully qualify `Value` on scalar formula construction
- fix formulas (and type deduction) for certain use cases involving `nnkBracketExpr` that are not references to columns
- improve type deduction capabilities for infix nodes
- add overload for `drop` that doesn’t just work on a mutable data frame
- fix reference semantics issues if DF is modified and visible in result (only data is shared, but columns should be respected)
- `arrange` now also takes a `varargs[string]` instead of a `seq`. While there is still a bug of not properly being able to use varargs, at least an array is possible (and hopefully at some point proper varargs).
- CSV parser is more robust, can handle unnamed columns
- explicit types in `idx`, `col` column reference finally works (e.g. `idx("foo", float)` accesses the column “foo” as a float tensor overwriting type deductions and type hints)
- allow `nnkMacroDef` in `findType`
- add development notes and ideas about rewrite of formula macro in `notes/formula_dev_notes.org`
- initial version of Datamancer based on `ggplotnim` data frame with major formula macro rewrite