Session error when running distributed training #45
Description
Hi,
When I run distributed training following the guide at https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/docs/distributed_training.md,
I configure 1 ps and 2 workers. The ps works fine, but all the workers fail with this error:
tensorflow.python.framework.errors_impl.NotFoundError: No session factory registered for the given session options: {target: "10.150.144.48:1111" config: allow_soft_placement: true graph_options { optimizer_options { } }} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
The details of this error are as follows:
2017-06-25 06:41:26.914625: E tensorflow/core/common_runtime/session.cc:69] Not found: No session factory registered for the given session options: {target: "10.150.144.48:1111" config: allow_soft_placement: true graph_options { optimizer_options { } }} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
{u'cluster': {u'ps': [u'10.150.144.48:3333'], u'worker': [u'10.150.144.48:1111', u'10.150.144.48:2222']}, u'task': {u'index': 0, u'type': u'worker'}}
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 62, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/bin/t2t-trainer", line 58, in main
    schedule=FLAGS.schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 247, in run
    output_dir=FLAGS.output_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
    return _execute_schedule(experiment, schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 669, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model
    config=self._session_config
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 412, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 178, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1292, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 562, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: No session factory registered for the given session options: {target: "10.150.144.48:1111" config: allow_soft_placement: true graph_options { optimizer_options { } }} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables_1/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 62, in <module>\n tf.app.run()',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n _sys.exit(main(_sys.argv[:1] + flags_passthrough))',
 'File "/usr/local/bin/t2t-trainer", line 58, in main\n schedule=FLAGS.schedule)',
 'File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 247, in run\n output_dir=FLAGS.output_dir)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run\n return _execute_schedule(experiment, schedule)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule\n return task()',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n hooks=self._train_monitors + extra_hooks)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 669, in _call_train\n monitors=hooks)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n return func(*args, **kwargs)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n loss = self._train_model(input_fn=input_fn, hooks=hooks)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model\n config=self._session_config',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession\n stop_grace_period_secs=stop_grace_period_secs)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__\n stop_grace_period_secs=stop_grace_period_secs)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__\n self._sess = _RecoverableSession(self._coordinated_creator)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__\n _WrappedSession.__init__(self, self._create_session())',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session\n return self._sess_creator.create_session()',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session\n self.tf_sess = self._session_creator.create_session()',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session\n self._scaffold.finalize()',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 192, in finalize\n default_ready_for_local_init_op)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 254, in get_or_default\n op = default_constructor()',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 189, in default_ready_for_local_init_op\n variables.global_variables())',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n return _add_should_use_warning(fn(*args, **kwargs))',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n wrapped = TFShouldUseWarningWrapper(x)',
 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n stack = [s.strip() for s in traceback.format_stack()]']
==================================
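For reference, the cluster dictionary printed in the log can be rebuilt as a `TF_CONFIG` value roughly as follows. This is only a sketch of the setup for worker 0, assuming the printed dictionary is the parsed `TF_CONFIG` environment variable; the helper name `make_tf_config` is hypothetical and not part of tensor2tensor or TensorFlow.

```python
import json
import os


def make_tf_config(ps_hosts, worker_hosts, task_type, task_index):
    """Serialize a cluster/task description to a TF_CONFIG-style JSON string.

    Hypothetical helper for illustration only.
    """
    return json.dumps({
        "cluster": {"ps": ps_hosts, "worker": worker_hosts},
        "task": {"type": task_type, "index": task_index},
    })


# Addresses copied from the log: 1 ps and 2 workers on the same host.
ps_hosts = ["10.150.144.48:3333"]
worker_hosts = ["10.150.144.48:1111", "10.150.144.48:2222"]

# Configuration for the first worker (task index 0), as shown in the log.
tf_config = make_tf_config(ps_hosts, worker_hosts, "worker", 0)
os.environ["TF_CONFIG"] = tf_config
print(tf_config)
```

The second worker would use the same cluster section with task index 1, and the ps would use task type `ps` with index 0.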
The message lists {DIRECT_SESSION, GRPC_SESSION} as the registered factories, yet it seems none of them is used for these session options. Could you help me look into this problem?
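One thing that may be related: the target in the error is a bare `host:port` with no `grpc://` scheme. As far as I understand, TensorFlow's gRPC session only handles targets that start with `grpc://`, so this might be why no factory matches. Below is a minimal sketch of that check; the helper names are hypothetical, not TensorFlow or tensor2tensor APIs, and I have not confirmed that this is the cause.

```python
def has_grpc_scheme(target):
    """Return True if the session target starts with the grpc:// scheme."""
    return target.startswith("grpc://")


def with_grpc_scheme(host_port):
    """Prefix a bare host:port with grpc://; leave prefixed targets unchanged."""
    return host_port if has_grpc_scheme(host_port) else "grpc://" + host_port


# Target copied from the error message above.
target = "10.150.144.48:1111"

print(has_grpc_scheme(target))
print(with_grpc_scheme(target))
```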