labels: experimental
simulate some of the optimizer characteristics inherent in the open source community in an ML training process
one of the reasons that open source development is so effective is that it creates opportunities for different people to contribute, each of whom is more or less concerned with different problem domains. let's model a developer as a distribution over the problem domains that interest them, analogous to modeling a document as a distribution over topics (LDA). dual to the developers, we model projects as distributions over problem domains. we can therefore characterize a problem domain by the projects and/or the developers associated with it. when some developer D contributes to some project P, they are informing that project with the priorities they learned from the problem domains that are their focus: a neighborhood around D in the problem domain manifold. projects spread through developer communities, so the next person to contribute is likely to have interests similar to the person who introduced them to the project. we simulate this "development homophily" effect by traversing the neighborhood.
graph construction
- cluster the data
- treat clusters as nodes
- construct edges using (?) process. for now, let's assume an Erdős-Rényi random graph
- pick some node ROOT at random to start the process.
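the construction steps above could be sketched roughly as follows. this is a minimal illustration, not a settled design: it assumes cluster labels already exist (any clustering method would do), and the names `build_cluster_graph`, `members`, `neighbors`, and the edge probability `p` are hypothetical.

```python
import random
from collections import defaultdict


def build_cluster_graph(labels, p, rng):
    """Group datum indices by cluster label and connect clusters at random.

    labels: one cluster id per datum (assumed to come from some clustering step).
    p: Erdos-Renyi edge probability (hypothetical parameter).
    """
    # Each cluster becomes one node holding the indices of its data objects.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    nodes = sorted(members)
    neighbors = {node: set() for node in nodes}
    # G(n, p): include each unordered pair of nodes independently with probability p.
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if rng.random() < p:
                neighbors[u].add(v)
                neighbors[v].add(u)
    return dict(members), neighbors


rng = random.Random(0)
labels = [i % 5 for i in range(50)]  # stand-in for real cluster assignments
members, neighbors = build_cluster_graph(labels, p=0.5, rng=rng)
root = rng.choice(sorted(neighbors))  # pick the starting ROOT uniformly at random
```

an Erdős-Rényi graph can leave some nodes isolated, so whatever traverses it needs a fallback for a ROOT with no neighbors.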
local sampling
- sample k nodes from the neighborhood of ROOT. call the union of data objects in this set of k+1 nodes our "batch community".
- sample a training batch of size n from the batch community.
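a sketch of the local sampling step, under a few assumptions the notes don't settle: sampling is uniform and without replacement, and if ROOT has fewer than k neighbors we just take all of them. `sample_batch` and the toy `members`/`neighbors` dictionaries are hypothetical names for illustration.

```python
import random


def sample_batch(root, neighbors, members, k, n, rng):
    """Draw a training batch from ROOT plus up to k of its neighbors."""
    candidates = sorted(neighbors[root])
    # ROOT may have fewer than k neighbors; take as many as exist (an assumption).
    chosen = rng.sample(candidates, min(k, len(candidates)))
    community_nodes = [root] + chosen
    # The "batch community" is the union of data objects in those nodes.
    community = [idx for node in community_nodes for idx in members[node]]
    # Sample without replacement, capped at the community size.
    batch = rng.sample(community, min(n, len(community)))
    return community_nodes, batch


# Toy graph: 4 cluster nodes, each holding 5 datum indices.
members = {c: list(range(5 * c, 5 * c + 5)) for c in range(4)}
neighbors = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
rng = random.Random(0)
nodes, batch = sample_batch(0, neighbors, members, k=2, n=8, rng=rng)
```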
graph traversal
- sample a neighbor from ROOT and use this as the new ROOT
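the traversal amounts to a random walk over the cluster graph. a minimal sketch, assuming uniform neighbor selection; the restart rule for an isolated ROOT is my own addition, and `walk` is a hypothetical name.

```python
import random


def walk(root, neighbors, steps, rng):
    """Yield the sequence of ROOT nodes visited by a simple random walk."""
    for _ in range(steps):
        yield root
        options = sorted(neighbors[root])
        if options:
            # Move to a uniformly chosen neighbor of the current ROOT.
            root = rng.choice(options)
        else:
            # Isolated node: restart at a random node (an assumption, not in the notes).
            root = rng.choice(sorted(neighbors))


# Toy graph: a path 0 - 1 - 2 plus an isolated node 3.
neighbors = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
path = list(walk(0, neighbors, steps=10, rng=random.Random(0)))
```

each visited ROOT would get one local sampling step, so consecutive batches come from overlapping neighborhoods.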
create an opportunity for a mechanism that lets the learning process structure the data space and explore it however it likes, so it can e.g. spend more time learning 'x' thing, or associate related exemplars for some feature it is close to learning.
- associate each datum with a learnable vector.
- use neighborhoods in the vector space of a given ROOT datum/batch as "batch communities".
- compute an outer gradient to update the learnable vector
one concern I have with the "learnable vector" idea is that it's unclear to me what the loss function would look like to learn the update procedure. Maybe RL would fit well here, where the model could learn its own search procedure for traversing the data space. Maybe reward could be a function of gradient magnitude? Favor search directions where the model learns more. How to balance with catastrophic forgetting?
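one very rough way to prototype the "reward as gradient magnitude" idea without a full RL setup is a bandit-style heuristic: keep a running estimate of gradient norm per node and pick the next ROOT by softmax over those estimates. everything here is an assumption made for illustration (the moving average, the `temperature` parameter, the function names), and it does not address catastrophic forgetting beyond the fact that a higher temperature keeps some probability mass on low-gradient nodes.

```python
import math
import random


def choose_next_root(candidates, grad_norm, temperature, rng):
    """Pick a candidate with probability softmax(grad_norm / temperature)."""
    scores = [grad_norm.get(c, 0.0) / temperature for c in candidates]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def update_estimate(grad_norm, node, observed, decay=0.9):
    """Exponential moving average of the gradient norms observed at a node."""
    old = grad_norm.get(node, observed)
    grad_norm[node] = decay * old + (1.0 - decay) * observed


# Hypothetical running estimates: node "a" has recently produced larger gradients.
grad_norm = {}
update_estimate(grad_norm, "a", 5.0)
update_estimate(grad_norm, "b", 0.0)
rng = random.Random(0)
picks = [choose_next_root(["a", "b"], grad_norm, 1.0, rng) for _ in range(1000)]
```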
a simple "learnable vector" would be to just use the network activations (possibly constrained to a particular layer). expanding on this, we could create different graphs based on similarities across different layers, e.g. each layer gets its own nearest-neighbors graph.
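a sketch of the per-layer nearest-neighbors idea, assuming activations have already been extracted as one vector per datum per layer and that cosine similarity is the measure (the notes don't specify one). `knn_graph` and the layer names are hypothetical, and the brute-force O(n²) loop only suits small datasets.

```python
import math


def knn_graph(activations, k):
    """For each datum, list the indices of its k most cosine-similar data."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    graph = []
    for i, a in enumerate(activations):
        # Score every other datum against datum i, excluding i itself.
        sims = [(cosine(a, b), j) for j, b in enumerate(activations) if j != i]
        sims.sort(reverse=True)  # most similar first
        graph.append([j for _, j in sims[:k]])
    return graph


# Toy activations for 4 data at two hypothetical layers.
layer_acts = {
    "layer_a": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    "layer_b": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]],
}
# One nearest-neighbors graph per layer.
layer_graphs = {name: knn_graph(acts, k=1) for name, acts in layer_acts.items()}
```

in this toy case the two layers disagree about which data are neighbors, which is the property that would make per-layer graphs worth traversing separately.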