Training a 530-billion parameter NLP model

Faisal Mateen
2 min read · Feb 9, 2022

In Q4 2021, Microsoft and Nvidia announced a whopping 530-billion-parameter NLP model called the Megatron-Turing Natural Language Generation model (MT-NLG). MT-NLG has roughly 3x the parameters of GPT-3 (175 billion).

You can see the NLP model evolution over the last few years in the picture below. The number of parameters has grown roughly 10x per year. It's clear that NLP model growth is outpacing Moore's Law (a 2x increase in compute performance every 18 months in an ideal scenario). It will be interesting to observe the trajectory of NLP models over the coming quarters and years.

NLP model evolution over the last few years
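As a back-of-the-envelope sketch of that gap (my own numbers, not from the announcement), compounding 10x per year against a doubling every 18 months looks like this:

```python
# Rough comparison: NLP model growth (assumed 10x/year)
# vs. Moore's Law (assumed 2x every 18 months).
model_growth_per_year = 10.0
moore_growth_per_year = 2.0 ** (12 / 18)  # ~1.59x per year

for years in range(1, 5):
    models = model_growth_per_year ** years
    moore = moore_growth_per_year ** years
    print(f"After {years} year(s): models x{models:,.0f}, Moore's Law x{moore:.1f}")
```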

The two companies collaborated to train MT-NLG on Nvidia's Selene supercomputer. Selene consists of 560 DGX A100 systems, each with eight NVIDIA A100 80GB Tensor Core GPUs (4,480 GPUs in total), for a total of 555,520 cores.

Selene's cost is on the order of 100 million USD. A DGX A100 system has a list price of 199,000 USD, so the 560 DGX A100 systems alone already exceed 100 million USD (560 × 199,000 ≈ 111 million USD), before counting anything else in the machine.
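A quick check of that figure, using only the numbers quoted above:

```python
# Back-of-the-envelope Selene cost estimate from the DGX A100 systems alone.
num_dgx_systems = 560          # DGX A100 systems in Selene
gpus_per_system = 8            # A100 80GB GPUs per DGX A100
dgx_list_price_usd = 199_000   # list price per DGX A100 system

total_gpus = num_dgx_systems * gpus_per_system
dgx_cost_usd = num_dgx_systems * dgx_list_price_usd

print(f"Total A100 GPUs: {total_gpus:,}")             # 4,480
print(f"DGX A100 cost: ${dgx_cost_usd / 1e6:.1f}M")   # ~$111.4M
```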

The researchers apparently did not share how long it took to train MT-NLG on Selene. However, according to this paper, it would take about 34 days to train a 175B-parameter GPT-3 model on 1,024 A100 GPUs.
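A rough way to sanity-check that 34-day figure is the simple estimate time ≈ 8TP / (nX), where T is the number of training tokens, P the parameter count, n the number of GPUs, and X the achieved per-GPU throughput. The sketch below assumes 300B training tokens and roughly 140 teraFLOP/s per A100; both are my assumptions for illustration, not numbers from the MT-NLG announcement.

```python
# Rough training-time estimate: time ≈ 8 * T * P / (n * X)
# Assumed values (for illustration, not from the MT-NLG announcement):
T = 300e9     # training tokens
P = 175e9     # parameters (GPT-3 scale)
n = 1024      # A100 GPUs
X = 140e12    # achieved FLOP/s per GPU (~140 teraFLOP/s)

seconds = 8 * T * P / (n * X)
days = seconds / 86_400
print(f"Estimated training time: {days:.0f} days")  # ~34 days
```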
