[CAY-1089, 1127, 1130] Introduce worker-side components of SyncSGD without backup worker #1131
This closes #1089, #1127, #1130.
Block diagram and sequence diagram of SyncSGD are in the following presentation:
https://docs.google.com/presentation/d/1ao_9D3qbyxilypLM7xohZbR2Ihin5N8Hd50JyrSVOmY/edit?usp=sharing
Two main changed policies for SyncSGD
- PSModelAccessor can push new models to the server when PushBarrier is unblocked.
- A worker can start its next mini-batch when MiniBatchBarrier is unblocked.
Specific changes in each file
1. Parameters
New parameter Synchronicity is added. Since its default value is async, the parameter server works asynchronously if no value is given on the command line.
2. syncmsg.avsc
Necessary messages for SyncSGD are defined in this file. Message names are quite explicit.
- Messages from worker to server: RequestPushPermissionMsg, MiniBatchFinishedMsg
- Messages from server to worker: PermitPushMsg, StartNextMiniBatchMsg, TerminateLearningMsg
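
As a quick reference, the five messages and their directions can be modelled as below. This is only an illustrative sketch: the real definitions are Avro records in syncmsg.avsc, whose fields are not shown in this description, and the Direction and SyncMsgType enums and the from() helper are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Demo entry point; every type in this file is illustrative, not generated Avro code. */
final class SyncMsgSketchDemo {
  public static void main(final String[] args) {
    for (final Direction d : Direction.values()) {
      System.out.println(d + ": " + SyncMsgType.from(d));
    }
  }
}

/** Hypothetical: which way a SyncSGD message travels. */
enum Direction { WORKER_TO_SERVER, SERVER_TO_WORKER }

/** Message names are taken from the description; the enum itself is an assumption. */
enum SyncMsgType {
  RequestPushPermissionMsg(Direction.WORKER_TO_SERVER),
  MiniBatchFinishedMsg(Direction.WORKER_TO_SERVER),
  PermitPushMsg(Direction.SERVER_TO_WORKER),
  StartNextMiniBatchMsg(Direction.SERVER_TO_WORKER),
  TerminateLearningMsg(Direction.SERVER_TO_WORKER);

  final Direction direction;

  SyncMsgType(final Direction direction) {
    this.direction = direction;
  }

  /** Returns the message types that travel in the given direction, in declaration order. */
  static List<SyncMsgType> from(final Direction direction) {
    final List<SyncMsgType> result = new ArrayList<>();
    for (final SyncMsgType type : values()) {
      if (type.direction == direction) {
        result.add(type);
      }
    }
    return result;
  }
}
```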
3. AsyncDolphinLauncher
Distinguishes the parameter server's model with the isAsync boolean value.
In the async model, NullPushBarrier and NullMiniBatchBarrier, which do nothing, are bound.
In the sync model, SyncPushBarrier and SyncMiniBatchBarrier are bound.
For communication, BatchManager is added as a client of CentCommConf.
4. AsyncWorkerTask
A StateMachine with three states, MINI_BATCH_RUNNING, WAITING_NEXT_MINI_BATCH, and MINI_BATCH_CLOSING, is added.
The two main changes are the following:
a. For loop for each epoch
In the async model, each worker finishes its own for loop independently when epochIdx == maxNumEpochs. In the sync model, a worker finishes its for loop when it receives TerminateLearningMsg from the driver. When the message is received, the learningFlag value is changed so that the for loop finishes.
b. MiniBatchBarrier
After trainer.runMiniBatch() finishes, workers are blocked by MiniBatchBarrier.
5. PSModelAccessor
Before a push operation, PushBarrier asks the driver whether it is OK to push.
6. ResettableCountDownLatch
A modified version of CountDownLatch, since CountDownLatch cannot be reset.
7. Driver-side components (including BatchManager, DriverSideSyncSGDMsgSender)
BatchManager manages workers' mini-batch life cycle. DriverSideSyncSGDMsgSender sends messages related to SyncSGD to workers. They are introduced in this PR to add BatchManager as a client of CentCommConf. Since they are driver-side components, they will be implemented in a later PR.
8. LearningState
This enum indicates the learning state of AsyncWorkerTask. If the state is ProgressLearning, the next mini-batch will be started. If the state is TerminateLearning, AsyncWorkerTask will finish its learning.
9. NullMiniBatchBarrier, NullPushBarrier
These are for the async model. These components do nothing (no blocking).
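
A minimal sketch of how such no-op barriers could fit into the worker loop is given below. The method names waitMiniBatchControlMsgFromDriver() and requestPushPermission() come from the descriptions of the sync barriers in this PR, but the interface shapes, the runEpochs() helper, and the one-mini-batch-per-epoch simplification are assumptions made for illustration, not the PR's actual code.

```java
/** Demo entry point; runEpochs is a hypothetical, simplified stand-in for the worker loop. */
final class NullBarrierSketchDemo {
  /** Runs up to maxNumEpochs iterations and returns how many mini-batches finished. */
  static int runEpochs(final int maxNumEpochs,
                       final MiniBatchBarrier miniBatchBarrier,
                       final PushBarrier pushBarrier) {
    int finishedMiniBatches = 0;
    for (int epochIdx = 0; epochIdx < maxNumEpochs; epochIdx++) {
      // Stand-in for trainer.runMiniBatch(): the push step first asks the PushBarrier.
      pushBarrier.requestPushPermission();
      finishedMiniBatches++;
      // After the mini-batch, the worker waits on the MiniBatchBarrier.
      if (miniBatchBarrier.waitMiniBatchControlMsgFromDriver() == LearningState.TerminateLearning) {
        break;
      }
    }
    return finishedMiniBatches;
  }

  public static void main(final String[] args) {
    // Async model: null barriers never block, so the loop ends at maxNumEpochs.
    System.out.println(runEpochs(3, new NullMiniBatchBarrier(), new NullPushBarrier())); // prints 3
    // A barrier that reports termination stops the loop after the first mini-batch.
    System.out.println(runEpochs(3, () -> LearningState.TerminateLearning, new NullPushBarrier())); // prints 1
  }
}

/** Mirrors the two states named in the LearningState section. */
enum LearningState { ProgressLearning, TerminateLearning }

/** Assumed interface shape; the real signature may differ. */
interface MiniBatchBarrier {
  LearningState waitMiniBatchControlMsgFromDriver();
}

/** Assumed interface shape; the real signature may differ. */
interface PushBarrier {
  void requestPushPermission();
}

final class NullMiniBatchBarrier implements MiniBatchBarrier {
  @Override
  public LearningState waitMiniBatchControlMsgFromDriver() {
    return LearningState.ProgressLearning; // returns immediately, never terminates learning
  }
}

final class NullPushBarrier implements PushBarrier {
  @Override
  public void requestPushPermission() {
    // Intentionally empty: pushes are always allowed in the async model.
  }
}
```

Binding no-op implementations keeps the worker code path identical in both models, so AsyncWorkerTask and PSModelAccessor need no branching on isAsync themselves.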
10. SyncMiniBatchBarrier
As mentioned in AsyncWorkerTask, this barrier blocks in waitMiniBatchControlMsgFromDriver().
There are two kinds of MiniBatchControlMsg: StartNextMiniBatchMsg and TerminateLearningMsg. Both messages count down miniBatchLatch, which unblocks the barrier. If StartNextMiniBatchMsg is received, this function returns ProgressLearning, which allows the next mini-batch to be started. If TerminateLearningMsg is received, this function returns TerminateLearning, which makes AsyncWorkerTask stop its learning.
11. SyncPushBarrier
As mentioned in PSModelAccessor, the push operation is blocked by pushBarrier.requestPushPermission().
If the worker is a slow worker, the push operation is blocked until StartNextMiniBatchMsg is received from the driver. However, because this PR implements SyncSGD without backup workers, this situation will not happen.
The thisRoundNum value is necessary to distinguish an up-to-date RequestPushPermissionMsg from an old RequestPushPermissionMsg.
12. WorkerSideSyncSGDMsgHandler
The following happens when each message is received.
- PermitPushMsg: syncPushBarrier is unblocked.
- StartNextMiniBatchMsg: update the thisRoundNum value of syncPushBarrier and reset its latch. Then unblock syncMiniBatchBarrier to start the next mini-batch.
- TerminateLearningMsg: update the learningState value of AsyncWorkerTask to TerminateLearning. Then unblock syncMiniBatchBarrier to terminate learning.
13. WorkerSideSyncSGDMsgSender
There are two kinds of messages sent from a worker to the driver: RequestPushPermissionMsg and MiniBatchFinishedMsg.
14. SyncPushBarrierTest
Driver-side components are not implemented yet. Therefore, a test class for SyncPushBarrier is necessary to check whether it works correctly. This test class tests the handlers for each message that a worker receives from the driver.
When PermitPushMsg is received, the handler should unblock syncPushBarrier by counting down its pushLatch. This is tested in testPermitPush().
When StartNextMiniBatchMsg is received, the handler should update thisRoundNum. This is tested in testStartNextMiniBatch().
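
The behaviour these two tests check can be sketched roughly as follows. This is not the PR's code: ResettableLatchSketch and PushBarrierSketch are hypothetical stand-ins for ResettableCountDownLatch and SyncPushBarrier, and passing the new round number into the StartNextMiniBatchMsg handler is an assumption.

```java
import java.util.concurrent.TimeUnit;

/** Demo entry point walking through the two handler effects described above. */
final class SyncPushBarrierSketchDemo {
  public static void main(final String[] args) {
    final PushBarrierSketch barrier = new PushBarrierSketch();
    barrier.onPermitPush();                              // PermitPushMsg arrives
    System.out.println(barrier.getPushLatchCount());     // prints 0: barrier unblocked
    barrier.onStartNextMiniBatch(1);                     // StartNextMiniBatchMsg arrives
    System.out.println(barrier.getThisRoundNum());       // prints 1: round updated
    System.out.println(barrier.getPushLatchCount());     // prints 1: latch re-armed
  }
}

/** Hypothetical stand-in for ResettableCountDownLatch: a latch that can be re-armed. */
final class ResettableLatchSketch {
  private final int initialCount;
  private int count;

  ResettableLatchSketch(final int initialCount) {
    this.initialCount = initialCount;
    this.count = initialCount;
  }

  synchronized void countDown() {
    if (count > 0 && --count == 0) {
      notifyAll(); // wake every thread blocked in await
    }
  }

  /** Blocks until the count reaches zero or the timeout elapses; true means released. */
  synchronized boolean await(final long timeout, final TimeUnit unit) throws InterruptedException {
    final long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (count > 0) {
      final long remaining = deadline - System.nanoTime();
      if (remaining <= 0) {
        return false;
      }
      TimeUnit.NANOSECONDS.timedWait(this, remaining);
    }
    return true;
  }

  /** Restores the initial count, which java.util.concurrent.CountDownLatch cannot do. */
  synchronized void reset() {
    count = initialCount;
  }

  synchronized int getCount() {
    return count;
  }
}

/** Hypothetical sketch of the worker-side state touched by the two handlers under test. */
final class PushBarrierSketch {
  private final ResettableLatchSketch pushLatch = new ResettableLatchSketch(1);
  private int thisRoundNum = 0;

  /** Effect of PermitPushMsg: count down pushLatch so a waiting push can proceed. */
  void onPermitPush() {
    pushLatch.countDown();
  }

  /** Effect of StartNextMiniBatchMsg: record the new round and re-arm the latch. */
  synchronized void onStartNextMiniBatch(final int nextRoundNum) {
    thisRoundNum = nextRoundNum;
    pushLatch.reset();
  }

  int getPushLatchCount() {
    return pushLatch.getCount();
  }

  synchronized int getThisRoundNum() {
    return thisRoundNum;
  }
}
```

Checking the latch count rather than blocking on await keeps such tests deterministic while the driver side does not exist yet.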