@helinwang
Contributor

No description provided.

@helinwang helinwang requested a review from typhoonzero August 23, 2017 21:30
c.ch = make(chan record, c.bufSize)
// FIXME: connection is created asynchronously in monitorMaster goroutine,
// ensure the connection is ready for use before calling c.addClient.
time.Sleep(time.Second)
Contributor Author

@helinwang helinwang Aug 23, 2017

Deleting because I think without this line the program should work (c.conn.Call will block until a connection is established). If deleting this line does not work, I would be happy to revert this line and send another PR to fix it.

Contributor

Connection creation seems to have already been moved to NewClient, so these lines should be removed.

}
}

func retry(f func() error, dur time.Duration, count int) error {
Contributor

dur is not used

Contributor Author

Thanks! Done.

if err != nil {
if count > 0 {
time.Sleep(dur)
return retry(f, dur, count-1)
Contributor

I think this should wait forever if etcd has not started; this is also what we do in the Paddle cloud startup scripts.

Contributor Author

@helinwang helinwang Aug 24, 2017

I see, that makes sense. I will change it to wait forever.

I got this error when trying Paddle cloud with fault tolerance on (log below). Anyway, it's better to handle this in the master code as well.

➜ paddle git:(develop) ✗ paddlecloud logs -n 1000 b
==========================b-trainer-brwn0==========================
time="2017-08-23T23:42:29Z" level=info msg="Waiting for ps desired registered ..."
... # a lot of same lines
time="2017-08-24T01:05:47Z" level=info msg="Waiting for ps desired registered ..."
==========================b-trainer-lbs8m==========================
label selector: paddle-job-master=b, desired: 1
label selector: paddle-job=b, desired: 1
Starting training job: /pfs/dlnel/home/helinwang@baidu.com/jobs/b, num_gradient_servers: 1, trainer_id: 0, version:
I0823 20:36:07.662369 34 Util.cpp:166] commandline: --num_gradient_servers=1 --ports_num_for_sparse=1 --use_gpu=0 --trainer_id=0 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164
I0823 20:36:07.668710 34 GradientMachine.cpp:85] Initing parameters..
I0823 20:36:07.668742 34 GradientMachine.cpp:92] Init parameters done.
panic: dial tcp 10.1.93.6:2379: getsockopt: connection refused

goroutine 17 [running, locked to thread]:
main.paddle_new_etcd_master_client(0x7faf728eebd4, 0x5, 0x40, 0x1c400000008)
	/paddle/build/go/src/github.com/PaddlePaddle/Paddle/go/master/c/client.go:79 +0x132
main._cgoexpwrap_f2aa1382a54e_paddle_new_etcd_master_client(0x7faf728eebd4, 0x5, 0x40, 0xd74167732b8bc700)
	command-line-arguments/_obj/_cgo_gotypes.go:100 +0x41
Aborted (core dumped)
job returned 134...setting pod return message...
===============================
termination log wroted...

Contributor Author

Thanks! Done. The master server now waits for etcd indefinitely.

Contributor

@typhoonzero typhoonzero left a comment

LGTM

@helinwang helinwang merged commit 26c473a into PaddlePaddle:develop Aug 24, 2017