Skip to content

Conversation

@Yancey0623
Copy link
Collaborator

@Yancey0623 Yancey0623 commented Aug 8, 2017

Releated #147

Fixed #270

@Yancey0623 Yancey0623 requested a review from typhoonzero August 8, 2017 08:12
@Yancey0623 Yancey0623 changed the title Run fit-a-line demo with fault tolerant mode [WIP]Run fit-a-line demo with fault tolerant mode Aug 8, 2017
@Yancey0623 Yancey0623 changed the title [WIP]Run fit-a-line demo with fault tolerant mode Run fit-a-line demo with fault tolerant mode Aug 8, 2017
label_selector = "paddle-job-master=%s" % PADDLE_JOB_NAME
pod_list = fetch_pods_info(label_selector)
master_ips = [item[1] for item in pod_list]
return ",".join(master_ips)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why joining master ips, there should only be one I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just for compatibility with multiple master instance, and it's sure we support one master instance for now, i will fix this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yancey1989 I think we are not going to have multiple master instances. The master server is supposed to be the single control point.

import gzip
import glob
import recordio
import cPickle as pickle
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe delete this line and import glob, import recordio, since they are no longer used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# training
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(cloud_reader(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use paddle.v2.reader.creator.recordio here? (it wraps cloud_reader, providing a uniform interface for local and cloud recordio file).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I could not find paddle.v2.reader.creator.recordio, I remember we have it. Probably I got confused.

from paddle.v2.reader.creator import cloud_reader
import paddle.v2.dataset.uci_housing as uci_housing

master_ip = os.getenv("MASTER_IP")
Copy link
Collaborator

@helinwang helinwang Aug 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to do:

etcd_ip = os.getenv("ETCD_IP") etcd_endpoint = "http://" + etcd_ip + ":" + "2379" # ... export MASTER_IP=$(python /root/k8s_tools.py fetch_master_ips) export ETCD_IP="$MASTER_IP"

We can later just change the implementation of export ETCD_IP="$MASTER_IP".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

label_selector = "paddle-job-master=%s" % PADDLE_JOB_NAME
pod_list = fetch_pods_info(label_selector)
master_ips = [item[1] for item in pod_list]
return master_ips[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May return None when master is still not ready? Curious why need to remove the while True waiting?

Copy link
Collaborator Author

@Yancey0623 Yancey0623 Aug 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the code:https://github.com/PaddlePaddle/cloud/pull/278/files#diff-548cc24bbda04b9e89570c52ae5df9c7R48, we waiting for the master pod until the state became RUNNING.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Collaborator

@typhoonzero typhoonzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM++

@Yancey0623 Yancey0623 merged commit ccce04f into PaddlePaddle:develop Aug 9, 2017
@Yancey0623 Yancey0623 deleted the fitaline_ft branch August 9, 2017 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants