ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set

#68
by mahi22muki - opened

I am trying to run the script across 2 servers (each 4GPU*2) using mpirun with Horovod.
I am getting the error above. The rank, world size, and local size are not detected automatically, and MASTER_ADDR and MASTER_PORT are not picked up either.

Please help me resolve this error. I tried setting the variables manually:
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
I also tried several other approaches, but nothing worked.

Did you ever find a solution?
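
Not a confirmed fix for this repo, but here is a minimal sketch of one thing that may be worth checking. The env:// rendezvous reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment, and mpirun does not set those names. Assuming the launcher is Open MPI, each process instead gets OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_LOCAL_RANK, so they can be translated before the process group is initialized. Also note that with two servers, 'localhost' points every node at itself; MASTER_ADDR would need to be an address of the rank 0 host that both machines can reach. The helper name torch_env_from_mpi and the address 10.0.0.1 below are hypothetical placeholders.

```python
import os


def torch_env_from_mpi(environ, master_addr, master_port=29500):
    """Map Open MPI launcher variables to the names env:// looks for.

    `environ` is any mapping that holds the OMPI_COMM_WORLD_* variables
    (normally os.environ inside a process started by Open MPI's mpirun).
    """
    return {
        "RANK": environ["OMPI_COMM_WORLD_RANK"],
        "WORLD_SIZE": environ["OMPI_COMM_WORLD_SIZE"],
        # LOCAL_RANK is not required by env:// itself, but many training
        # scripts read it to choose the GPU on each node.
        "LOCAL_RANK": environ["OMPI_COMM_WORLD_LOCAL_RANK"],
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


# Simulated values for illustration only: global rank 5 of 8 processes,
# running as local rank 1 on the second server.
fake_mpi_env = {
    "OMPI_COMM_WORLD_RANK": "5",
    "OMPI_COMM_WORLD_SIZE": "8",
    "OMPI_COMM_WORLD_LOCAL_RANK": "1",
}

# 10.0.0.1 is a placeholder for the reachable address of the rank 0 host.
settings = torch_env_from_mpi(fake_mpi_env, master_addr="10.0.0.1")
print(settings)

# In the real script, before initializing the process group:
#   os.environ.update(torch_env_from_mpi(os.environ, master_addr="<rank 0 host>"))
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
```

If the OMPI_COMM_WORLD_* variables are missing inside the process, the script is probably not being started by Open MPI's mpirun (or a different MPI implementation is in use), and the variable names would differ.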
