@FeixLiu FeixLiu commented Jan 18, 2022

PR types

Others

PR changes

Others

Describe

Communication (comm) group initialization for the distributed model (DistModel) inference system.

Tested with the following code:

    import paddle.distributed.fleet as fleet
    import paddle
    from paddle.fluid import core

    paddle.enable_static()
    fleet.init(is_collective=True)

    config = core.DistModelConfig()
    config.model_dir = "./inference_model/rank_" + str(fleet.worker_index()) + "/step_0"
    config.place = 'GPU'
    config.device_id = fleet.worker_index()
    config.current_endpoint = "127.0.0.1:700" + str(fleet.worker_index())
    config.trainer_endpoints = ["127.0.0.1:7000", "127.0.0.1:7001", "127.0.0.1:7002", "127.0.0.1:7003",
                                "127.0.0.1:7004", "127.0.0.1:7005", "127.0.0.1:7006", "127.0.0.1:7007"]
    config.pp_degree = 2
    config.mp_degree = 4
    config.mp_ring_id = 0
    if fleet.worker_index() <= 3:
        config.pp_downstream_ring_id = 20
        config.pp_upstream_ring_id = -1
    if fleet.worker_index() >= 4:
        config.pp_downstream_ring_id = -1
        config.pp_upstream_ring_id = 20
    config.local_rank = fleet.worker_index()
    config.nranks = 8

    dist = core.DistModel(config)
    dist.init()

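For readers who want to see what each of the eight workers ends up with, the per-rank values in the test script can be reproduced without Paddle. The sketch below is only an illustration: `build_rank_settings` and its parameters are hypothetical names, not part of the PR or of Paddle, and it assumes the same two-stage layout as the script (ranks 0-3 in the first pipeline stage, ranks 4-7 in the second).

```python
def build_rank_settings(rank, mp_degree=4, base_port=7000, pp_ring_id=20):
    """Return the per-rank values used by the test script as a plain dict.

    Hypothetical helper: it mirrors the script's constants and assumes
    exactly two pipeline stages (pp_degree = 2).
    """
    pp_degree = 2
    nranks = mp_degree * pp_degree
    if not 0 <= rank < nranks:
        raise ValueError(f"rank must be in [0, {nranks}), got {rank}")

    # Integer arithmetic on the port instead of the script's string
    # concatenation ("127.0.0.1:700" + str(rank)); same result for ranks 0-7.
    endpoints = [f"127.0.0.1:{base_port + r}" for r in range(nranks)]

    first_stage = rank < mp_degree  # ranks 0..3 in the script
    return {
        "current_endpoint": endpoints[rank],
        "trainer_endpoints": endpoints,
        "pp_degree": pp_degree,
        "mp_degree": mp_degree,
        "mp_ring_id": 0,
        # The first stage only has a downstream neighbour, the last only an upstream one.
        "pp_downstream_ring_id": pp_ring_id if first_stage else -1,
        "pp_upstream_ring_id": -1 if first_stage else pp_ring_id,
        "local_rank": rank,
        "nranks": nranks,
    }


if __name__ == "__main__":
    for r in (0, 5):
        s = build_rank_settings(r)
        print(r, s["current_endpoint"], s["pp_upstream_ring_id"], s["pp_downstream_ring_id"])
```

Deriving the port numerically also keeps the endpoint valid if the rank count ever grows past ten, where the `"700" + str(rank)` concatenation would no longer produce a four-digit port in the same range.
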
@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

VLOG(3) << "Init comm group for mp.";
std::vector<std::string> peer_endpoints;
for (int64_t
idx = (config_.local_rank / config_.mp_degree) * config_.mp_degree,
Contributor

Remember CoordSys? It would be best to abstract this later.

Contributor Author

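I don't think that's really necessary. An inference network has at most two parallel dimensions, pp and mp, and building a coord sys just for those two feels like overkill. To be honest, the main reason is that I previously moved the C++ coord sys over to the Python side... and I don't want to move it back 😂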
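To make the index arithmetic in this thread concrete, here is a small Python sketch of the rank-to-coordinate mapping being discussed. Only the start index, `(local_rank / mp_degree) * mp_degree`, comes from the diff shown above; the loop's upper bound is cut off in the snippet, so the group size of `mp_degree`, the pipeline-peer pairing, and all function names here are assumptions for illustration.

```python
def rank_to_coord(rank, mp_degree):
    """Map a flat rank to (pp_idx, mp_idx), assuming mp is the fastest-varying axis."""
    return rank // mp_degree, rank % mp_degree


def mp_group_ranks(rank, mp_degree):
    """Ranks sharing a model-parallel group with `rank`.

    The start index matches the diff; the end bound (start + mp_degree)
    is an assumption because the loop condition is truncated there.
    """
    start = (rank // mp_degree) * mp_degree
    return list(range(start, start + mp_degree))


def pp_peer_ranks(rank, mp_degree, pp_degree):
    """Assumed pipeline neighbours: same mp_idx in the adjacent stage, else None."""
    pp_idx, mp_idx = rank_to_coord(rank, mp_degree)
    upstream = (pp_idx - 1) * mp_degree + mp_idx if pp_idx > 0 else None
    downstream = (pp_idx + 1) * mp_degree + mp_idx if pp_idx < pp_degree - 1 else None
    return upstream, downstream


if __name__ == "__main__":
    print(mp_group_ranks(5, 4))    # group containing rank 5
    print(pp_peer_ranks(1, 4, 2))  # (upstream, downstream) for rank 1
```

With only two dimensions, the whole "coordinate system" reduces to one integer division and one modulo, which is consistent with the author's point that a dedicated abstraction may not pay off here.
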

comm_init_block, config_.pp_downstream_ring_id);
}
}
framework::NaiveExecutor e(place_);
Contributor

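Actually, you don't need to run ops through an executor for this; calling the API directly would be enough. That said, this approach is fine too.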

Contributor Author

I think this way is more concise. If other ops are needed later, they can simply be added here.
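The trade-off in this thread can be shown with a toy model of the "append ops to a block, then run the block with an executor" pattern. This is not Paddle's API: `ToyBlock`, `ToyExecutor`, and the `init_ring` op are hypothetical stand-ins, and the PR's real op types are not shown in this conversation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOp:
    type: str
    attrs: dict


@dataclass
class ToyBlock:
    """Hypothetical stand-in for a program block: an ordered list of ops."""
    ops: list = field(default_factory=list)

    def append_op(self, type, **attrs):
        self.ops.append(ToyOp(type, attrs))


class ToyExecutor:
    """Hypothetical stand-in for an executor: runs each op's kernel in order."""

    def __init__(self, kernels):
        self.kernels = kernels

    def run(self, block):
        return [self.kernels[op.type](**op.attrs) for op in block.ops]


def build_comm_init_block(ring_ids):
    # One init op per ring; supporting a new kind of setup step later
    # only means appending another op here.
    block = ToyBlock()
    for ring_id in ring_ids:
        block.append_op("init_ring", ring_id=ring_id)
    return block


# Made-up kernel that just reports which ring it was asked to set up.
kernels = {"init_ring": lambda ring_id: f"ring {ring_id} ready"}
log = ToyExecutor(kernels).run(build_comm_init_block([0, 20]))
print(log)
```

Calling the kernels directly would remove the block and executor layers, which is the reviewer's suggestion; keeping them gives a single place to extend, which is the author's argument.
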

@wangxicoding wangxicoding merged commit 4c46eed into PaddlePaddle:develop Jan 18, 2022
@FeixLiu FeixLiu deleted the comm_init branch January 18, 2022 11:28