*: framework agnostic preprocessing #781

Merged
merged 31 commits into from
Oct 28, 2024

Conversation

tharvik
Collaborator

@tharvik tharvik commented Sep 17, 2024

on the road to ONNX land (closer with every PR).

preprocessing revamp

  • simplifying preprocessing by using flat functions rather than classes
    • allows changing the type with every function rather than using the way-too-generic tf.TensorContainer
    • for example, force users to normalize images before passing them to a model by having the latter ingest NormalizedImage
  • get rid of the {Image,Tabular,Text}Data classes by replacing them with correctly typed datasets
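
A minimal sketch of the idea, assuming hypothetical `RawImage` and `NormalizedImage` shapes (the real discojs types may differ): each flat function changes the type, and the model-facing function only accepts the output of `normalize`.

```typescript
// Hypothetical types for illustration; not the actual discojs definitions.
interface RawImage {
  width: number;
  height: number;
  data: Uint8Array; // pixel values in [0, 255]
}

// The brand makes this type impossible to confuse with RawImage,
// so callers cannot skip the normalization step.
interface NormalizedImage {
  readonly brand: "NormalizedImage";
  width: number;
  height: number;
  data: Float32Array; // pixel values in [0, 1]
}

// Flat preprocessing function: takes one type, returns another.
function normalize(image: RawImage): NormalizedImage {
  const data = Float32Array.from(image.data, (v) => v / 255);
  return { brand: "NormalizedImage", width: image.width, height: image.height, data };
}

// Stand-in for a model input: it only ingests normalized images.
function meanIntensity(image: NormalizedImage): number {
  if (image.data.length === 0) return 0;
  const sum = image.data.reduce((acc, v) => acc + v, 0);
  return sum / image.data.length;
}

const raw: RawImage = { width: 2, height: 1, data: new Uint8Array([0, 255]) };
const mean = meanIntensity(normalize(raw));
// meanIntensity(raw) is rejected by the compiler: RawImage is not a NormalizedImage
```

With classes wrapping a generic tensor container, forgetting to normalize would only show up at runtime, if at all; here it is a type error.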

dataset improvements

  • add Dataset.cached to avoid recomputation when iterating multiple times
    • is memory sensitive so can work with any dataset size (but not very useful if the materialized dataset is bigger than RAM)
  • add preprocessOnce option to Disco to avoid all recomputation by keeping a preprocessed version in memory (more hardcore version of Dataset.cached)
    • can crash the JS VM with OOM but yeah, let the user make the call
    • quite useful to get reproducible timing for tests
  • Dataset.batch constructs batches concurrently
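
To illustrate the two mechanisms, here is a rough sketch using plain async generators. The helper names `cached` and `batchConcurrently` are assumptions, not the actual `Dataset` methods, and this `cached` simply keeps every element in memory, so it is closer in spirit to `preprocessOnce` than to the memory-sensitive `Dataset.cached`.

```typescript
// Hypothetical helpers for illustration; not the discojs Dataset implementation.

// Wraps a dataset factory so that the second and later iterations replay
// stored elements instead of recomputing them.
function cached<T>(source: () => AsyncIterable<T>): () => AsyncIterable<T> {
  let cache: T[] | undefined;
  return async function* () {
    if (cache !== undefined) {
      for (const item of cache) yield item;
      return;
    }
    const seen: T[] = [];
    for await (const item of source()) {
      seen.push(item);
      yield item;
    }
    cache = seen; // only stored once a full pass has completed
  };
}

// Starts preprocessing every element of a batch without awaiting,
// then waits for the whole batch at once.
async function* batchConcurrently<T, U>(
  source: AsyncIterable<T>,
  size: number,
  preprocess: (item: T) => Promise<U>,
): AsyncGenerator<U[]> {
  let pending: Promise<U>[] = [];
  for await (const item of source) {
    pending.push(preprocess(item));
    if (pending.length === size) {
      yield await Promise.all(pending);
      pending = [];
    }
  }
  if (pending.length > 0) yield await Promise.all(pending); // last, smaller batch
}
```

Since JS is single-threaded, the concurrency only helps when `preprocess` actually waits on something (file reads, image decoding by the runtime); pure computation still runs one element at a time.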

type safety

  • make Model, Task, Trainer, … generic on their datatype ("image" | "tabular" | "text") to check soundness at compile time
    • ensure that when defining a Task<'tabular'>, it does have the required {input,output}Columns
    • ensure that a {Disco,Trainer,Validator}<'image'> doesn't get fed a Dataset<{Tabular,Text}> but only a Dataset<[Image, label: string]>
  • single TrainingInformation<'tabular'>.outputColumn instead of multiple columns
    • we in fact only supported a single output but it was an array
  • clear separation between the outside types ingested by Disco & Validator (Raw), and the internal ones (ModelEncoded, after preprocessing and before postprocessing)
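
The pattern can be sketched roughly as below. Apart from `inputColumns` and `outputColumn`, every type and field name is an assumption made for the example and is not taken from the discojs sources.

```typescript
// Hypothetical sketch of "generic on the datatype"; names are illustrative.
type DataType = "image" | "tabular" | "text";

interface RawImage {
  width: number;
  height: number;
  data: Uint8Array;
}

// Training information required for each datatype.
interface TrainingInformationByType {
  image: { labels: string[] };
  tabular: { inputColumns: string[]; outputColumn: string };
  text: { maxSequenceLength: number };
}

// Shape of one raw sample for each datatype.
interface RawByType {
  image: [image: RawImage, label: string];
  tabular: Record<string, string>;
  text: string;
}

// The datatype parameter selects the matching training information.
interface Task<D extends DataType> {
  id: string;
  dataType: D;
  trainingInformation: TrainingInformationByType[D];
}

// A consumer generic on D only accepts samples matching the task's datatype.
function describeTraining<D extends DataType>(
  task: Task<D>,
  dataset: ReadonlyArray<RawByType[D]>,
): string {
  return `${task.id} (${task.dataType}): ${dataset.length} samples`;
}

const titanic: Task<"tabular"> = {
  id: "titanic",
  dataType: "tabular",
  // omitting outputColumn here would be a compile-time error
  trainingInformation: { inputColumns: ["Age", "Sex"], outputColumn: "Survived" },
};
const rows: RawByType["tabular"][] = [{ Age: "22", Sex: "male", Survived: "0" }];

const summary = describeTraining(titanic, rows);
// describeTraining(titanic, ["some text"]) is rejected: a string is not a tabular row
```

Mapping the datatype through lookup interfaces keeps a single definition of `Task` while still letting the compiler reject mismatched combinations.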

misc

  • change server/tests/status to wait for changes with promises rather than timeouts, in order to avoid spurious timeouts on slower computers
  • reduce discojs/validator/ to a single file
  • bump deps
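
For the test change in the first item of this list, the general technique looks something like this: rather than sleeping for a fixed delay and hoping the status has changed, the test awaits a promise that resolves on the change itself. The `Status` class is a hypothetical stand-in, not the code under server/tests/status.

```typescript
// Hypothetical observable status holder, for illustration only.
class Status<T> {
  private current: T;
  private listeners: Array<(value: T) => void> = [];

  constructor(initial: T) {
    this.current = initial;
  }

  get(): T {
    return this.current;
  }

  set(value: T): void {
    this.current = value;
    // iterate over a copy, as listeners remove themselves when they fire
    for (const listener of this.listeners.slice()) listener(value);
  }

  // Resolves as soon as the status equals `expected`, however long that takes.
  waitFor(expected: T): Promise<void> {
    if (this.current === expected) return Promise.resolve();
    return new Promise<void>((resolve) => {
      const listener = (value: T): void => {
        if (value !== expected) return;
        this.listeners = this.listeners.filter((l) => l !== listener);
        resolve();
      };
      this.listeners.push(listener);
    });
  }
}

async function example(): Promise<string> {
  const status = new Status<string>("connecting");
  const ready = status.waitFor("training");
  // the change can land at any time; the test makes no guess about the delay
  void Promise.resolve().then(() => status.set("training"));
  await ready;
  return status.get();
}
```

A slow machine then only makes the test slower, it does not make it fail.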

sadly, there is a performance drawback to that: as the preprocessing computation is done solely in JS, which is fundamentally single-threaded (as python is), processing is slower than the tfjs-node accelerated version we have. I tried a few ways to work around it and push V8 to do magic but it didn't work. there is a way to make all that work (#758) but it would require platform-specific accelerators. and this PR was already a bit long so I'm pushing it to another iteration.

@tharvik tharvik force-pushed the 650-framework-agnostic-preprocessing-tharvik branch 3 times, most recently from a9d1035 to 0ba102d Compare September 24, 2024 10:25
@tharvik tharvik marked this pull request as ready for review September 24, 2024 10:35
@tharvik tharvik requested a review from JulienVig September 24, 2024 10:35
Collaborator

@JulienVig JulienVig left a comment

Huge amount of work! You seem to be pushing typescript to its limits haha

I have an error when trying to build the webapp:

Error: Cannot find module @rollup/rollup-darwin-arm64. npm has a bug related to optional dependencies (https://github.com/npm/cli/issues/4828). Please try `npm i` again after removing both package-lock.json and node_modules directory.
    at requireWithFriendlyError (/Users/vignoud/Disco/review/node_modules/rollup/dist/native.js:59:9)
    at Object.<anonymous> (/Users/vignoud/Disco/review/node_modules/rollup/dist/native.js:68:76)
    ... 3 lines matching cause stack trace ...
    at Module._load (node:internal/modules/cjs/loader:1023:12)
    at cjsLoader (node:internal/modules/esm/translators:356:17)
    at ModuleWrap.<anonymous> (node:internal/modules/esm/translators:305:7)
    at ModuleJob.run (node:internal/modules/esm/module_job:218:25)
    at async ModuleLoader.import (node:internal/modules/esm/loader:329:24) {
  [cause]: Error: Cannot find module '@rollup/rollup-darwin-arm64'
  Require stack:
  - /Users/vignoud/Disco/review/node_modules/rollup/dist/native.js
      at Module._resolveFilename (node:internal/modules/cjs/loader:1144:15)
      at Module._load (node:internal/modules/cjs/loader:985:27)
      at Module.require (node:internal/modules/cjs/loader:1235:19)
      at require (node:internal/modules/helpers:176:18)
      at requireWithFriendlyError (/Users/vignoud/Disco/review/node_modules/rollup/dist/native.js:41:10)
      at Object.<anonymous> (/Users/vignoud/Disco/review/node_modules/rollup/dist/native.js:68:76)
      at Module._compile (node:internal/modules/cjs/loader:1376:14)
      at Module._extensions..js (node:internal/modules/cjs/loader:1435:10)
      at Module.load (node:internal/modules/cjs/loader:1207:32)
      at Module._load (node:internal/modules/cjs/loader:1023:12) {
    code: 'MODULE_NOT_FOUND',
    requireStack: [
      '/Users/vignoud/Disco/review/node_modules/rollup/dist/native.js'
    ]
  }
}

I didn't manage to solve the issue even by re-creating the package-lock.json (it actually solves this error but then I can't build the server anymore because of new errors)

I couldn't review everything yet because of this but I already left a lot of questions (nothing blocking except the rollup error so far)
If you can't solve the rollup error without having access to a Mac we can organize a call next week if you want

server/tests/e2e/federated.spec.ts (outdated, resolved)
discojs/src/processing/image.ts (outdated, resolved)
discojs/src/processing/image.ts (resolved)
discojs/src/processing/image.ts (outdated, resolved)
discojs/src/processing/image.ts (resolved)
discojs/src/processing/index.ts (resolved)
discojs/src/models/gpt/index.ts (outdated, resolved)
discojs/src/validator.ts (outdated, resolved)
cli/src/args.ts (resolved)
webapp/cypress/support/e2e.ts (outdated, resolved)
@tharvik tharvik force-pushed the 650-framework-agnostic-preprocessing-tharvik branch 3 times, most recently from 5323bed to 88b20b3 Compare September 30, 2024 14:45
@tharvik tharvik force-pushed the 650-framework-agnostic-preprocessing-tharvik branch 3 times, most recently from 7d520b8 to 0ac6c29 Compare October 4, 2024 09:42
@tharvik
Collaborator Author

tharvik commented Oct 4, 2024

You seem to be pushing typescript to its limits haha

héhé, yeah, it's what I do in pretty much any programming language: pushing it to its limits then complaining about it (:

I couldn't review everything yet because of this but I already left a lot of questions

yeah, I think that the more we progress, the longer the reviews will be, as we both know more of the code and what can go wrong, and are getting opinionated on various components. I feel quite good about it in fact, that means we care more for disco 🪩

If you can't solve the rollup error without having access to a Mac we can organize a call next week if you want

fwiw, it was indeed a Mac-specific issue that only gets triggered there, and it will probably happen again whenever I update rollup on my side. well, now we know

@tharvik tharvik force-pushed the 650-framework-agnostic-preprocessing-tharvik branch from 9a611b7 to dbbf432 Compare October 4, 2024 18:46
@tharvik tharvik requested review from JulienVig and removed request for JulienVig October 21, 2024 07:57
@tharvik tharvik force-pushed the 650-framework-agnostic-preprocessing-tharvik branch 2 times, most recently from fe0e1c8 to cca6dc3 Compare October 23, 2024 12:43
@tharvik tharvik requested a review from JulienVig October 23, 2024 12:55
Collaborator

@JulienVig JulienVig left a comment

Amazing refactor!!

  • Mainly, I noticed that testing an image model always shows the images as red (while the accuracy is 100% in this case). That can be a simple UI mistake but maybe a deeper problem in the validation.
    Screenshot 2024-10-24 at 14 02 43

  • Not sure if this was introduced in this PR or the dark theme but when there is only one datapoint the chart now shows a bar instead of a dot
    Screenshot 2024-10-24 at 11 06 07

  • Also the x-axis labels are getting crowded
    Screenshot 2024-10-24 at 14 01 09

  • Testing with a language model shows a "Download CSV" button that doesn't work
    Screenshot 2024-10-24 at 11 08 20

discojs/src/types/data_format.ts (outdated, resolved)
discojs/src/types/data_format.ts (resolved)
discojs/src/dataset/dataset.ts (outdated, resolved)
discojs/src/models/tfjs.ts (resolved)
server/src/task_set.ts (resolved)
@tharvik tharvik force-pushed the 650-framework-agnostic-preprocessing-tharvik branch from cca6dc3 to 4f0bc55 Compare October 28, 2024 09:12
@tharvik
Collaborator Author

tharvik commented Oct 28, 2024

  • Mainly, I noticed that testing an image model always shows the images as red (while the accuracy is 100% in this case). That can be a simple UI mistake but maybe a deeper problem in the validation.

ouf, it was a simple UI mistake, I inverted red & green in #682

  • Not sure if this was introduced in this PR or the dark theme but when there is only one datapoint the chart now shows a bar instead of a dot
  • Also the x-axis labels are getting crowded

ha yes, good point, I reworked it to make it a bit nicer. the curve is now "cubic" instead of "smooth" which makes it less jittery at the expense of precision.

  • Testing with a language model shows a "Download CSV" button that doesn't work

right, it is not implemented. I wasn't sure what to show in it so I preferred to throw a TODO. now, do we even want to have this feature? I feel that it is not really useful (for testing that is, for prediction it is useful). we need a way to present testing results but CSV is too technical.

  • images: we already have a nice way to show the results
  • tabular: yes, not really nice but at least it is somewhat readable (colors would be nicer)
  • text: I can think of a way to present it: for every input line (styled as gray), show the next word in green/red

WDYT? (for a next iteration)

@tharvik tharvik merged commit a5735e8 into develop Oct 28, 2024
23 checks passed
@tharvik tharvik deleted the 650-framework-agnostic-preprocessing-tharvik branch October 28, 2024 12:00
@tharvik tharvik mentioned this pull request Nov 1, 2024
2 participants