Better way to manage dependencies? #57
Comments
I can't recommend NX enough, and getting started with it incrementally is faster than you may think. Its ecosystem lets you spin up new packages and dependencies quickly, breaking your architecture down into small, manageable pieces. I would be interested in contributing to this endeavour. I am looking to use EmbedJS for my application right now, migrating off a somewhat hacky use of Embedchain running with JsPyBridge. There are some functionalities I need from EmbedJS that I will have to add, and I am happy to contribute them.
Yes, I have been considering switching to NX primarily. My only concern is asking people to install even more packages, one per loader. On the other hand, this removes the need to install third-party packages separately (like those for vector databases), as those dependencies will be bundled into the corresponding published sub-package. I haven't used NX before (the last time I had to deal with Node monorepos, it was the era of Lerna), but let me take a first stab at it. You can review the PR and let me know if there are best practices that I missed. Also, do let me know which missing functionalities would be useful for you.
Sounds good. Make heavy use of generators for all the packages, as they make it dead simple to spin up new ones with all configs working. Also, the VS Code NX extension is phenomenal. I switched over to NX a few years ago after having used Lerna for a while. I discovered NX because Nrwl took over management of Lerna. It is excellent. As to functionality, there are a couple of things. Searching and querying via metadata filters is huge. I also tend to use this kind of abstraction a bit like a vector ORM, so being able to run a raw search and get all the documents back is huge, rather than going through the query method and having it use the internal LLM. Also, auto-detecting the file format when adding data would be huge. Embedchain in Python has some good code there which could easily be ported to TS using an LLM. Tracking the individual data sources and storing them would be great too. Embedchain keeps a running DB of the files and data sources so you can query for the files, or for the documents.
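To make the "vector ORM" request concrete, here is a minimal sketch of a raw search with metadata filters that returns the stored chunks directly and never calls an LLM. The `StoredChunk` shape and the `rawSearch` function are hypothetical names made up for illustration; they are not part of the EmbedJS API, and a real vector database would do the filtering and ranking itself.

```typescript
// Hypothetical shapes for illustration only; not EmbedJS's actual types.
type Metadata = Record<string, string | number | boolean>;

interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
  metadata: Metadata;
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Raw search: keep chunks whose metadata matches every filter entry exactly,
// rank them by similarity to the query embedding, and return the top K chunks.
function rawSearch(
  chunks: StoredChunk[],
  queryEmbedding: number[],
  filter: Metadata = {},
  topK = 3,
): StoredChunk[] {
  return chunks
    .filter((chunk) =>
      Object.entries(filter).every(([key, value]) => chunk.metadata[key] === value),
    )
    .map((chunk) => ({ chunk, score: cosine(chunk.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((scored) => scored.chunk);
}

// Toy data with hand-written two-dimensional "embeddings".
const sampleChunks: StoredChunk[] = [
  { id: 'a', text: 'Nx generators', embedding: [1, 0], metadata: { source: 'docs' } },
  { id: 'b', text: 'Lerna history', embedding: [0.9, 0.1], metadata: { source: 'blog' } },
  { id: 'c', text: 'Vector stores', embedding: [0, 1], metadata: { source: 'docs' } },
];

// Only chunks tagged source=docs are considered, so 'b' is excluded.
const docsHits = rawSearch(sampleChunks, [1, 0], { source: 'docs' }, 2);
```

The point of the sketch is the return type: the caller gets the matching chunks back and decides what to do with them, instead of receiving an LLM-generated answer.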
I came across NX from their takeover of Lerna as well. However, moving this lib to NX will take some time as I am not familiar with NX at the moment. I do think the monorepo route is the long-term direction to go in, but it will come up after some core functional items are added. Some of the functionalities you mention are aligned with the roadmap I have. I am currently adding support for auto-detecting files and URLs so they are loaded with the relevant loaders. So in principle, you should be able to pass a path, a URL, a JSON value (etc.) and the library should be able to load the content automatically by identifying the correct loader. You should also be able to pass a directory path and load all files in it. This is being added in PR #66 and I hope to get it out over the next few days after more testing. The next major feature to add is going to be searching via metadata. Could you tell me more about the last feature? Is that just a list of files / data sources that were added so far to the vector database?
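As a rough illustration of the auto-detection idea described above, the sketch below maps an input to a loader kind. It is not taken from PR #66: the `LoaderKind` names and the `detectLoaderKind` function are hypothetical, and the directory check is a trailing-slash heuristic where a real implementation would inspect the file system.

```typescript
// Hypothetical loader kinds; the real library may name or split these differently.
type LoaderKind = 'web' | 'json' | 'pdf' | 'text' | 'directory' | 'unknown';

// Decide which kind of loader an input should be routed to.
function detectLoaderKind(input: string | object): LoaderKind {
  // Anything that is not a string is treated as an already-parsed JSON value.
  if (typeof input !== 'string') return 'json';

  const value = input.trim();

  // URLs are checked first, so "https://host/file.pdf" is routed to 'web'.
  if (/^https?:\/\//i.test(value)) return 'web';

  // Assumption: a trailing slash marks a directory.
  if (value.endsWith('/')) return 'directory';

  // Take the extension from the last path segment only.
  const baseName = value.slice(value.lastIndexOf('/') + 1);
  const dot = baseName.lastIndexOf('.');
  const extension = dot >= 0 ? baseName.slice(dot + 1).toLowerCase() : '';

  switch (extension) {
    case 'json':
      return 'json';
    case 'pdf':
      return 'pdf';
    case 'txt':
    case 'md':
      return 'text';
    default:
      return 'unknown';
  }
}

const detectedKinds = [
  detectLoaderKind('https://example.com/page'),
  detectLoaderKind('./docs/'),
  detectLoaderKind('notes/report.PDF'),
  detectLoaderKind({ title: 'inline json' }),
];
```

Returning `'unknown'` rather than throwing leaves the decision to the caller, which may want to skip unsupported files when walking a directory.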
All for moving to a monorepo. (Bias declaration: my only experience with NX was on another project where people gave up on it.) Do what you know, as you are the primary maintainer. I'm currently working on a project which is heavily using EmbedJS and will be contributing back, so knowing how best to do this would be great, e.g. when we add loaders and embedders. Currently the biggest thing we have added is an abstraction of the whole conversations layer, so conversation history can also be stored in memory or in a database so that it persists. So far this has required changes across the initialiser and base-model, so this wouldn't be as easy to split.
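A conversations abstraction along these lines could look roughly like the sketch below. The `ConversationStore` interface and `InMemoryConversationStore` class are hypothetical names chosen for illustration and do not reflect the commenter's actual code; a database-backed store would most likely need asynchronous methods, which are left out here to keep the example short.

```typescript
// Hypothetical types for illustration; not the actual EmbedJS conversation layer.
interface ConversationMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Storage-agnostic contract: the model layer only talks to this interface.
interface ConversationStore {
  append(conversationId: string, message: ConversationMessage): void;
  history(conversationId: string): ConversationMessage[];
  clear(conversationId: string): void;
}

// Simplest implementation: history lives in a Map and is lost on restart.
class InMemoryConversationStore implements ConversationStore {
  private readonly conversations = new Map<string, ConversationMessage[]>();

  append(conversationId: string, message: ConversationMessage): void {
    const messages = this.conversations.get(conversationId) ?? [];
    messages.push(message);
    this.conversations.set(conversationId, messages);
  }

  history(conversationId: string): ConversationMessage[] {
    // Return a copy so callers cannot mutate the stored history.
    return [...(this.conversations.get(conversationId) ?? [])];
  }

  clear(conversationId: string): void {
    this.conversations.delete(conversationId);
  }
}

const conversationStore: ConversationStore = new InMemoryConversationStore();
conversationStore.append('demo', { role: 'user', content: 'What is Nx?' });
conversationStore.append('demo', { role: 'assistant', content: 'A monorepo build system.' });
```

Because callers depend only on the interface, a persistent implementation could be swapped in without touching the code that reads and writes history.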
Yes; I explored NX over the past few days. I am beginning to feel You can contribute in a number of ways -
The library has now switched over to NX. Closing this thread.
I have been considering this for a while now. The more loaders, LLMs and embedding models we add, the more important this question becomes. Currently the library already separates out the dependencies for the vector databases using a clever package.json config, leveraging peer and optional dependencies. This does not scale well, although it is not a problem today. At the same time, it is becoming important that we break out the various parts even more. I don't want a lot of dependencies being added in by default. It also makes vulnerabilities harder to address. Ultimately people will only use some of the choices, and they should only have the modules relevant to those.
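For readers unfamiliar with the optional-dependency approach, the sketch below shows one common way a library can guard the loading of a package that the user may not have installed. It is an illustration under stated assumptions, not the code EmbedJS uses: `loadOptionalDependency` and the package name `hypothetical-vector-db` are made up, and the loader callback stands in for whatever `require` or import call the library would perform.

```typescript
// Wrap the loading of an optional dependency so that a missing package
// produces an actionable error instead of a bare module-resolution failure.
// Hypothetical helper for illustration; not part of EmbedJS.
function loadOptionalDependency<T>(packageName: string, load: () => T): T {
  try {
    return load();
  } catch {
    throw new Error(
      `Optional dependency "${packageName}" is not installed. ` +
        `Install it with: npm install ${packageName}`,
    );
  }
}

// In a real library the callback would typically be () => require(packageName).
// Here a loader that always throws simulates a package that is not installed.
let missingDependencyMessage = '';
try {
  loadOptionalDependency('hypothetical-vector-db', () => {
    throw new Error('Cannot find module');
  });
} catch (error) {
  missingDependencyMessage = (error as Error).message;
}

// When the package is available, the loaded value is passed through unchanged.
const fakeClient = loadOptionalDependency('hypothetical-vector-db', () => ({
  name: 'fake-client',
}));
```

This keeps the default install small, but every optional integration needs its own guard and documentation, which is the scaling concern raised above.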
Here are our options -
1. Continue with the current package.json based solution. We will have more optional dependencies and we will address this with more detailed documentation.
2. Switch to a monorepo. A core package (embedJs) and several sub-packages (like embedjs-openai, embedjs-slack-loader, etc).
Clearly option 2 is the long-term direction but might be more work than necessary today.
What do you all think?