- Notifications
You must be signed in to change notification settings - Fork 79
Run fit-a-line demo with fault tolerant mode #278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
docker/k8s_tools.py Outdated
| label_selector = "paddle-job-master=%s" % PADDLE_JOB_NAME | ||
| pod_list = fetch_pods_info(label_selector) | ||
| master_ips = [item[1] for item in pod_list] | ||
| return ",".join(master_ips) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why joining master ips, there should only be one I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just for compatibility with multiple master instance, and it's sure we support one master instance for now, i will fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yancey1989 I think we are not going to have multiple master instances. The master server is supposed to be the single control point.
demo/fit_a_line/train_ft.py Outdated
| import gzip | ||
| import glob | ||
| import recordio | ||
| import cPickle as pickle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe delete this line and import glob, import recordio, since they are no longer used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| # training | ||
| trainer.train( | ||
| reader=paddle.batch( | ||
| paddle.reader.shuffle(cloud_reader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use paddle.v2.reader.creator.recordio here? (it wraps cloud_reader, providing a uniform interface for local and cloud recordio file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I could not find paddle.v2.reader.creator.recordio, I remember we have it. Probably I got confused.
demo/fit_a_line/train_ft.py Outdated
| from paddle.v2.reader.creator import cloud_reader | ||
| import paddle.v2.dataset.uci_housing as uci_housing | ||
| | ||
| master_ip = os.getenv("MASTER_IP") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's better to do:
etcd_ip = os.getenv("ETCD_IP") etcd_endpoint = "http://" + etcd_ip + ":" + "2379" # ... export MASTER_IP=$(python /root/k8s_tools.py fetch_master_ips) export ETCD_IP="$MASTER_IP"We can later just change the implementation of export ETCD_IP="$MASTER_IP".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
| label_selector = "paddle-job-master=%s" % PADDLE_JOB_NAME | ||
| pod_list = fetch_pods_info(label_selector) | ||
| master_ips = [item[1] for item in pod_list] | ||
| return master_ips[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May return None when master is still not ready? Curious why need to remove the while True waiting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the code:https://github.com/PaddlePaddle/cloud/pull/278/files#diff-548cc24bbda04b9e89570c52ae5df9c7R48, we waiting for the master pod until the state became RUNNING.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++
Releated #147
Fixed #270