Chore/broswerfs deprecated issue 127 #93

Draft
wants to merge 65 commits into
base: master
20f1fdf
Update license, package.json, and add basic notice to the Readme (#3)
wilwade Apr 27, 2021
5d03d1b
Bloom r us (#2)
enddynayn May 5, 2021
5547ad4
GitHub Actions NPM Release (#5)
wilwade May 6, 2021
c887f10
Setup package.json to support the dist directory instead of just incl…
wilwade May 10, 2021
7d3e6dc
Build before publishing (#7)
wilwade May 10, 2021
ce24b50
chore(README): update to include bloom filter (#9)
enddynayn May 11, 2021
4099f37
use xxhash-wasm (#8)
shannonwells May 12, 2021
2ccc4eb
Update package org from unfinishedlabs -> dsnp (#12)
wilwade May 18, 2021
a8ddc96
Chore/pre release 0.1.0 (#13)
wilwade Jun 7, 2021
a4f425a
Don't publish types for now (#14)
wilwade Jun 7, 2021
ba60652
feat(fetch): add browser/node http client (#15)
enddynayn Jun 18, 2021
b35942b
Parquetjs browser support (#17)
shannonwells Jun 30, 2021
f605856
fix bug caused by side-effect of change in xxhasher. regen package-lo…
shannonwells Jul 1, 2021
ac8ebb2
update BSON due to vulnerability, minor fixes (#20)
shannonwells Jul 1, 2021
81e38e4
add cjs and ejm builds (#21)
acruikshank Jul 13, 2021
98871a2
Clean up dependencies and remove lodash (#22)
wilwade Aug 17, 2021
44d9f3c
Added test cases for types and fixed bigint precision (#24)
waylandli Oct 21, 2021
a82975b
Chore/typescript conversion/27 (#33)
waylandli Nov 30, 2021
7c7bbb9
updated typescript and esbuild. Fixed typescript issue (#35)
waylandli Dec 2, 2021
6fb5fcd
Convert lib/schema.js to typescript 1/3 (#37)
waylandli Feb 22, 2022
f1bee25
Converting shred.js to typescript 2/3 (#40)
waylandli Feb 24, 2022
14bbb97
Convertinging util.js to typescript 3/3 (#39)
waylandli Feb 25, 2022
ce1f98b
Reader conversion (#44)
waylandli Mar 16, 2022
eb029bd
[Issue #41] - Slimmed down FixedTFramedTransport class (#46)
dopatraman Mar 16, 2022
d3b2089
Removed redundant reference to callback (#48)
dopatraman Apr 12, 2022
a1bc22e
fix for typed array BYTE_ARRAY (#23) (#49)
aramikm Apr 13, 2022
4bd382e
removed force32 (#51)
aramikm Apr 19, 2022
91ac192
Removed ts-ignored lines (#50)
dopatraman Apr 19, 2022
8f30383
Add test coverage for Buffer Reader (#52)
dopatraman Apr 22, 2022
b4ca349
Support async iteration in the reader (#55)
wilwade Apr 26, 2022
33c80fe
Export the types into the package (#54)
wilwade Apr 26, 2022
0b027a3
Removed async declaration (#59)
dopatraman Apr 29, 2022
5a94d60
Chore/ts writer take2 (#60)
mehtaishita May 10, 2022
555af72
Upstream bug fixes for RLE encoding (#61)
dopatraman May 10, 2022
cadf486
Performance improvements using a cursor instead of a shift (#57)
wilwade May 10, 2022
2f150a0
Update packages (#63)
wilwade May 10, 2022
b8bfc07
Remove baseUrl config option (#65)
wilwade May 11, 2022
c86f490
Chore/fix entry file (#66)
wilwade May 11, 2022
d382d0f
Types Cleanup (#67)
wilwade May 11, 2022
73cf70d
Better handling of millis and micros (#68)
wilwade May 12, 2022
a2cd4ff
Bug/browser types bug #70 (#71)
wilwade Jun 22, 2022
a62db08
Fix typo in comment (#74)
JasonYeMSFT Dec 13, 2022
a011a2e
Feature - collect and report multiple field errors (#75)
dgaudet Mar 8, 2023
2c733b5
Add ability to read decimal columns (#79)
dgaudet Apr 26, 2023
b5698e4
Feature: Parquet Schema from JSON Schema (#82) with
wilwade May 25, 2023
17cb5ed
update to node 16, FIX HASHER BUG (#84)
shannonwells May 25, 2023
fa1865b
Ensure Buffer objects are returned by compression functions (#88)
JasonYeMSFT Jun 21, 2023
19f3ffa
Decimal Writer Support (#90)
wilwade Jun 23, 2023
43732c5
Add null pages and boundary order (Fixes #92) (#94)
wilwade Jul 13, 2023
ac5257d
Feature: Timestamp support for JSONSchema generated schemas (#95)
noxify Jul 25, 2023
c07e7e8
Add support to byte array decimal fields (#97)
YECHUNAN Aug 14, 2023
bda4e3f
upgrade to nodejs 18.18.2 (#101)
shannonwells Nov 17, 2023
19707ef
Support bloom filters in lists/nested columns (#105)
shannonwells Nov 28, 2023
e0f1ebd
Update docs and add simple test for readme encoding examples (#107)
wilwade Jan 9, 2024
0a42955
Use the Double primative for JSON Schema "number" type (#111)
mpotter Jan 11, 2024
6fdb9da
Update index.ts to support RLE_DICTIONARY (#112)
saritvakrat Jan 18, 2024
8d34ac1
Reference Tests and Breaking Change: Optional nullable fields are now…
wilwade Jan 19, 2024
2622ff1
allow typeLength to come from opts.column when decoding FIXED_LEN_BYT…
j4ys0n Jan 19, 2024
117e5a5
Feat/support aws s3 v3 (#115)
shannonwells Jan 30, 2024
91fc71f
Update Deprecated Function Calls (#118)
wilwade Feb 10, 2024
3de7eea
fix: Fix incorrect primitive type detection (#122)
JasonYeMSFT Mar 13, 2024
5db9db4
Swap brotli dependency (#123)
wilwade Mar 14, 2024
2435994
Dependency updates (#128)
wilwade May 14, 2024
6396d26
Linting and Formatting (#133)
wilwade Jul 1, 2024
d6b64e0
chore(*): replace browserfs for zenfs
enddynayn Jul 1, 2024
Feature - collect and report multiple field errors (#75)
Problem
=======
This PR is intended to implement 2 enhancements to schema error
reporting.
* When a parquet schema includes an invalid type, encoding, or
compression, the current error does not indicate which column has the
problem
* When a parquet schema has multiple issues, the code currently fails on
the first, which makes finding and fixing all of them cumbersome

Solution
========
Modified the schema.ts and added tests to:
* Change error messages from the original `invalid parquet type:
UNKNOWN` to `Invalid parquet type: UNKNOWN, for Column: quantity`
* Keep track of schema errors as we loop through each column in the
schema, and at the end, if there are any errors, report them all as
below:
`Invalid parquet type: UNKNOWN, for Column: quantity`
`Invalid parquet type: UNKNOWN, for Column: value`

Change summary:
---------------
* adding tests and code to ensure multiple field errors are logged, as
well as indicating which column had the error
* also adding code to handle multiple encoding and compression schema
issues
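
The collect-then-report approach described above can be sketched in isolation. This is a minimal illustration, not the code in `lib/schema.ts`: the `ColumnOptions` interface, the allow-lists, the `PLAIN` default encoding, and the function names are all assumptions made for the example, and it throws an `Error` where the diff throws a plain string.

```typescript
// Hypothetical, simplified column options; not the library's actual types.
interface ColumnOptions {
  type?: string;
  encoding?: string;
  compression?: string;
}

// Illustrative allow-lists; the real library looks these up in its own tables.
const KNOWN_TYPES = new Set(["INT32", "INT64", "DOUBLE", "UTF8"]);
const KNOWN_ENCODINGS = new Set(["PLAIN", "RLE"]);
const KNOWN_COMPRESSION = new Set(["UNCOMPRESSED", "GZIP", "SNAPPY"]);

// Walk every column and record each problem instead of stopping at the first.
function collectSchemaErrors(schema: Record<string, ColumnOptions>): string[] {
  const errors: string[] = [];
  for (const [name, opts] of Object.entries(schema)) {
    if (!opts.type || !KNOWN_TYPES.has(opts.type)) {
      errors.push(`Invalid parquet type: ${opts.type || "missing type"}, for Column: ${name}`);
      continue; // without a valid type the remaining checks are skipped
    }
    const encoding = opts.encoding ?? "PLAIN"; // assumed default
    if (!KNOWN_ENCODINGS.has(encoding)) {
      errors.push(`Unsupported parquet encoding: ${encoding}, for Column: ${name}`);
    }
    const compression = opts.compression ?? "UNCOMPRESSED";
    if (!KNOWN_COMPRESSION.has(compression)) {
      errors.push(`Unsupported compression method: ${compression}, for Column: ${name}`);
    }
  }
  return errors;
}

// Report everything at once, one message per line.
function validateSchema(schema: Record<string, ColumnOptions>): void {
  const errors = collectSchemaErrors(schema);
  if (errors.length > 0) {
    throw new Error(errors.join("\n"));
  }
}
```

Returning the list from a separate function keeps the individual messages available to callers that want to inspect them, while `validateSchema` preserves the single-throw behaviour.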

Steps to Verify:
----------------
1. Download this [parquet
file](https://usaz02prismdevmlaas01.blob.core.windows.net/ml-job-config/dataSets/multiple-unsupported-columns.parquet?sv=2020-10-02&st=2023-01-09T15%3A28%3A09Z&se=2025-01-10T15%3A28%3A00Z&sr=b&sp=r&sig=GS0Skk93DCn5CnC64DbnIH2U7JhzHM2nnhq1U%2B2HwPs%3D)
2. Attempt to open this parquet file with this library: `const reader = await
parquet.ParquetReader.openFile(<path to parquet file>)`
3. You should receive errors for more than one column, each of which
includes the column name

---------

Co-authored-by: Wil Wade <wil.wade@unfinished.com>
dgaudet and wilwade authored Mar 8, 2023
commit a011a2e135bf7ce82538f6bd9cc342593e2eb637
17 changes: 14 additions & 3 deletions lib/schema.ts
@@ -81,6 +81,7 @@ function buildFields(schema: SchemaDefinition, rLevelParentMax?: number, dLevelP
   }

   let fieldList: Record<string, ParquetField> = {};
+  let fieldErrors: Array<string> = [];
   for (let name in schema) {
     const opts = schema[name];

@@ -129,9 +130,15 @@ function buildFields(schema: SchemaDefinition, rLevelParentMax?: number, dLevelP
       continue;
     }

+    let nameWithPath = (`${name}` || 'missing name')
+    if (path && path.length > 0) {
+      nameWithPath = `${path}.${nameWithPath}`
+    }
+
     const typeDef = opts.type ? parquet_types.PARQUET_LOGICAL_TYPES[opts.type] : undefined;
     if (!typeDef) {
-      throw 'invalid parquet type: ' + (opts.type || "missing type");
+      fieldErrors.push(`Invalid parquet type: ${(opts.type || "missing type")}, for Column: ${nameWithPath}`);
+      continue;
     }

/* field encoding */
@@ -140,15 +147,15 @@ function buildFields(schema: SchemaDefinition, rLevelParentMax?: number, dLevelP
     }

     if (!(opts.encoding in parquet_codec)) {
-      throw 'unsupported parquet encoding: ' + opts.encoding;
+      fieldErrors.push(`Unsupported parquet encoding: ${opts.encoding}, for Column: ${nameWithPath}`);
     }

     if (!opts.compression) {
       opts.compression = 'UNCOMPRESSED';
     }

     if (!(opts.compression in parquet_compression.PARQUET_COMPRESSION_METHODS)) {
-      throw 'unsupported compression method: ' + opts.compression;
+      fieldErrors.push(`Unsupported compression method: ${opts.compression}, for Column: ${nameWithPath}`);
     }

/* add to schema */
@@ -167,6 +174,10 @@ function buildFields(schema: SchemaDefinition, rLevelParentMax?: number, dLevelP
     };
   }

+  if (fieldErrors.length > 0) {
+    throw fieldErrors.reduce((accumulator, currentVal) => accumulator + '\n' + currentVal);
+  }
+
   return fieldList;
 }
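
Two details of the `lib/schema.ts` change can be examined in isolation. The sketch below assumes `path` is an array of strings, which the `path.length > 0` check is consistent with but the truncated signature does not confirm. Under that assumption, interpolating `path` into a template literal uses the array's default string conversion, which joins elements with commas. The function names are hypothetical, and `nameWithDots` is a possible alternative, not something this commit contains.

```typescript
// Mirrors the name-building logic from the diff, assuming `path` is string[].
function nameAsInDiff(name: string, path?: string[]): string {
  let nameWithPath = `${name}` || "missing name";
  if (path && path.length > 0) {
    // `${path}` calls Array.prototype.toString, i.e. path.join(",")
    nameWithPath = `${path}.${nameWithPath}`;
  }
  return nameWithPath;
}

// Possible alternative: join every path segment explicitly with dots.
function nameWithDots(name: string, path: string[] = []): string {
  return [...path, name || "missing name"].join(".");
}

// Mirrors the message assembly from the diff. For a non-empty array this
// produces the same string as errors.join("\n"); reduce without an initial
// value throws a TypeError on an empty array, which the diff avoids by
// checking fieldErrors.length first.
function joinAsInDiff(errors: string[]): string {
  return errors.reduce((accumulator, currentVal) => accumulator + "\n" + currentVal);
}
```

With a single parent the two naming functions agree (`stock.quantity`). With two parents, `nameAsInDiff("bin", ["stock", "warehouse"])` yields `stock,warehouse.bin`, while `nameWithDots` yields `stock.warehouse.bin`.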

57 changes: 57 additions & 0 deletions test/schema.js
@@ -467,4 +467,61 @@ describe('ParquetSchema', function() {
     }
   });

+  it('should indicate which column had an invalid type in a simple flat schema', function() {
+    assert.throws(() => {
+      new parquet.ParquetSchema({
+        quantity: {type: 'UNKNOWN'},
+      })
+    }, 'Invalid parquet type: UNKNOWN, for Column: quantity');
+  });
+
+  it('should indicate each column which has an invalid type in a simple flat schema', function() {
+    assert.throws(() => {
+      new parquet.ParquetSchema({
+        quantity: {type: 'UNKNOWN'},
+        value: {type: 'UNKNOWN'},
+      })
+    }, 'Invalid parquet type: UNKNOWN, for Column: quantity\nInvalid parquet type: UNKNOWN, for Column: value');
+  });
+
+  it('should indicate each column which has an invalid type when one is correct in a simple flat schema', function() {
+    assert.throws(() => {
+      new parquet.ParquetSchema({
+        quantity: {type: 'INT32'},
+        value: {type: 'UNKNOWN'},
+      })
+    }, 'Invalid parquet type: UNKNOWN, for Column: value');
+  });
+
+  it('should indicate each column which has an invalid type in a nested schema', function() {
+    assert.throws(() => {
+      new parquet.ParquetSchema({
+        name: { type: 'UTF8' },
+        stock: {
+          fields: {
+            quantity: { type: 'UNKNOWN' },
+            warehouse: { type: 'UNKNOWN' },
+          }
+        },
+        price: { type: 'UNKNOWN' },
+      })
+    }, 'Invalid parquet type: UNKNOWN, for Column: stock.quantity\nInvalid parquet type: UNKNOWN, for Column: stock.warehouse');
+  });
+
+  it('should indicate which column had an invalid encoding in a simple flat schema', function() {
+    assert.throws(() => {
+      new parquet.ParquetSchema({
+        quantity: {type: 'INT32', encoding: 'UNKNOWN'},
+      })
+    }, 'Unsupported parquet encoding: UNKNOWN, for Column: quantity');
+  });
+
+  it('should indicate which column had an invalid compression type in a simple flat schema', function() {
+    assert.throws(() => {
+      new parquet.ParquetSchema({
+        quantity: {type: 'INT32', compression: 'UNKNOWN'},
+      })
+    }, 'Unsupported compression method: UNKNOWN, for Column: quantity');
+  });
+
 });
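
The nested-schema test expects errors for `stock.quantity` and `stock.warehouse` but not for `price`, even though `price` also has an unknown type. One reading is that the recursive call for `stock.fields` throws its own collected errors before the outer loop reaches `price`. The model below illustrates that reading only; `SchemaNode`, `MODEL_TYPES`, and `buildModel` are hypothetical names, and the recursion shape is an assumption because the relevant lines are not part of the diff.

```typescript
// Simplified stand-in for a schema entry: either a leaf with a type,
// or a group with nested fields.
type SchemaNode = { type?: string; fields?: Record<string, SchemaNode> };

const MODEL_TYPES = new Set(["INT32", "UTF8"]);

function buildModel(schema: Record<string, SchemaNode>, path: string[] = []): void {
  const errors: string[] = [];
  for (const [name, opts] of Object.entries(schema)) {
    if (opts.fields) {
      // Each nesting level collects and throws independently, so a failing
      // group aborts the outer loop before later siblings are checked.
      buildModel(opts.fields, [...path, name]);
      continue;
    }
    if (!opts.type || !MODEL_TYPES.has(opts.type)) {
      const column = [...path, name].join(".");
      errors.push(`Invalid parquet type: ${opts.type || "missing type"}, for Column: ${column}`);
    }
  }
  if (errors.length > 0) {
    throw errors.join("\n"); // a plain string, as in the diff
  }
}
```

Under this model, errors are aggregated per nesting level rather than across the whole schema. Reporting `price` as well would require the nested call to return its errors to the parent instead of throwing.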