- How can the data pipeline be more disciplined?
- Set up best practices
- Have one pipeline which is always running, and perhaps another script that sends messages to the main pipeline instead of relaunching it; this idea was regarding scheduling (see the sketch after this list).
- The pipeline is already implementation-agnostic, but how does that translate to a scalable system? Perhaps add another layer which is specific to the library on top of which a model is to be executed?
- Online and offline pipelines? The current system is designed for offline training.
- How can the deployment process be automated using the pipeline?
- Better version control for the models and data/metadata.
- Group a set of files as an experiment.
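A minimal sketch of the scheduling idea above, assuming an always-running pipeline loop and a separate sender; the in-process queue, command names, and experiment names are placeholders, and a real setup might use a socket or a message broker between two processes.

```python
import queue
import threading

messages = queue.Queue()


def pipeline_loop():
    """The always-running pipeline: wait for scheduling messages and act on them."""
    while True:
        msg = messages.get()  # blocks until a message arrives
        if msg["command"] == "stop":
            break
        if msg["command"] == "run":
            print("running experiment", msg["experiment"])


pipeline = threading.Thread(target=pipeline_loop)
pipeline.start()

# A separate scheduling script would send messages like these:
messages.put({"command": "run", "experiment": "exp_a.py"})
messages.put({"command": "stop"})
pipeline.join()
```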
Add a function that will be executed at the end of the loop, where I can add stuff like moving files, etc.
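A sketch of that end-of-loop hook, assuming the hook is passed in as a plain callable; the names `post_loop_hook` and `move_outputs` are hypothetical.

```python
def train_loop(epochs, post_loop_hook=None):
    """Training loop with a user-supplied function run at the end of the loop."""
    for epoch in range(epochs):
        pass  # training work goes here
    if post_loop_hook is not None:
        # e.g. moving files, exporting checkpoints, cleaning up temp dirs
        post_loop_hook(epochs=epochs)


def move_outputs(**kwargs):
    print("moving output files after", kwargs["epochs"], "epochs")


train_loop(3, post_loop_hook=move_outputs)
```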
Rethink how the training time is recorded to ensure a model that failed to train can be relaunched. Or is this even a good behaviour to have?
- Old versions will be mlflow_deleted. That is based on the assumption that the version is being overwritten.
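A sketch of the delete-on-overwrite idea, assuming each mlflow run carries a `version` tag naming the version it belongs to (that tag scheme is an assumption, not something mlflow provides by default).

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()


def delete_old_version_runs(experiment_id, version_name):
    """Mark all existing runs of a version as deleted before it is overwritten."""
    runs = client.search_runs(
        [experiment_id],
        filter_string=f"tags.version = '{version_name}'",
    )
    for run in runs:
        client.delete_run(run.info.run_id)
```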
The train and eval functions are expected to return a metric_container, which will be logged both by the normal logger and by mlflow.
- The training output will not be logged by mlflow. Only the eval outputs will be logged, since that is what we want to look at. If someone wants the train output, they'd have to log it during the run.
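A sketch of that split, assuming a simple container of metrics; `MetricContainer` and the two logging helpers are hypothetical names, not the real metric_container interface.

```python
import logging
from dataclasses import dataclass, field

import mlflow

log = logging.getLogger(__name__)


@dataclass
class MetricContainer:
    metrics: dict = field(default_factory=dict)


def log_train_metrics(container):
    # Training output goes to the normal logger only.
    log.info("train metrics: %s", container.metrics)


def log_eval_metrics(container, step=None):
    # Eval output goes to the normal logger and to mlflow.
    log.info("eval metrics: %s", container.metrics)
    mlflow.log_metrics(container.metrics, step=step)
```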
The UI will be launched alongside the pipeline. Also allow launching the UI separately through the mlpipeline.
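A sketch of launching the mlflow UI next to the pipeline as a side process; this only makes sense with a local file store, and the helper name and defaults here are placeholders.

```python
import subprocess


def launch_mlflow_ui(backend_store_uri="./mlruns", port=5000):
    """Start the mlflow UI alongside the pipeline (local store only)."""
    return subprocess.Popen(
        ["mlflow", "ui",
         "--backend-store-uri", backend_store_uri,
         "--port", str(port)],
    )


ui_process = launch_mlflow_ui()
# ... run the pipeline ...
# ui_process.terminate()  # when the pipeline shuts down
```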
- CLOSING NOTE [2019-07-15 Mon 18:38]
This makes no sense if the tracking uri is set to a remote server
- CLOSING NOTE [2019-07-15 Mon 18:39]
mlflow runs already have a runstatus
- CLOSING NOTE [2019-07-15 Mon 18:39]
- CLOSING NOTE [2019-07-12 Fri 16:53]
- When in this mode, it will execute the `export_model` method for all the experiments for all versions.
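A rough sketch of that export mode; the `Experiment` class here is only a stand-in for the mlpipeline experiment interface, which is not shown in these notes.

```python
class Experiment:
    """Stand-in for an mlpipeline experiment; the real interface may differ."""

    def __init__(self, name, versions):
        self.name = name
        self.versions = versions

    def export_model(self, version):
        print(f"exporting {self.name} / {version}")


def export_all(experiments):
    # Export mode: call export_model for every version of every experiment.
    for experiment in experiments:
        for version in experiment.versions:
            experiment.export_model(version)


export_all([Experiment("exp_a", ["v1", "v2"]), Experiment("exp_b", ["v1"])])
```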
- CLOSING NOTE [2019-03-27 Wed 14:45]
- CLOSING NOTE [2019-07-28 Sun 14:43]
- CLOSING NOTE [2019-07-28 Sun 14:44]
- The idea is to take away the need to comment and uncomment break statements
- Additionally, this can work on multiple levels: if you want to run a whole epoch, pass a setting; if not, it will break where it says so.
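A sketch of the idea above, assuming the decision is driven by a setting rather than a hard-coded break; the setting names are placeholders for whatever the pipeline already passes around.

```python
RUN_WHOLE_EPOCH = False
SHORT_RUN_STEPS = 10


def train_epoch(batches):
    for step, batch in enumerate(batches):
        pass  # training step goes here
        # Instead of commenting/uncommenting a break, consult the setting.
        if not RUN_WHOLE_EPOCH and step + 1 >= SHORT_RUN_STEPS:
            break


train_epoch(range(1000))
```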
- git is used not to track development, but to track the experiments.
- The steps
- Check out an experiment branch. Is this necessary?
- Before a run, stage everything, commit the repo, and store the hash (see the sketch after this list). This assumes the related files are all tracked.
- Is it good practice to stage everything? Or find a way to check all the files that are being loaded? I know I can track the files being imported; how about other files being accessed?
- One way is to provide an API to open files which will track the files by itself.
- Stack Overflow: check what files are open in Python
- Another option is to look into using mlflow's version control interface.
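A sketch of the stage-and-commit step plus the file-tracking open() idea, assuming the working directory is a git repository and mlflow is set up; the commit message, the `experiment_commit` tag name, and `tracked_open` are hypothetical.

```python
import subprocess

import mlflow


def snapshot_repo(repo_dir="."):
    """Stage everything, commit, and return the commit hash for this run."""
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    # --allow-empty so a run with no code changes still gets its own commit.
    subprocess.run(
        ["git", "commit", "--allow-empty", "-m", "experiment snapshot"],
        cwd=repo_dir, check=True,
    )
    return subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout.strip()


_accessed_files = []


def tracked_open(path, *args, **kwargs):
    # Wrapper around open() that records which files a run actually reads,
    # one answer to the "which other files are being accessed?" question.
    _accessed_files.append(path)
    return open(path, *args, **kwargs)


with mlflow.start_run():
    mlflow.set_tag("experiment_commit", snapshot_repo())
```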
- CLOSING NOTE [2019-07-28 Sun 14:44]