The Case for On-Prem AI Data Centers

AI has become, and will continue to be, a dominant technology for enterprises worldwide. Its potential to change business practices and improve decision-making across a wide range of industries has led to unprecedented demand for servers that can handle the training or inference phase of AI. The infrastructure needed for the training phase can be costly, and a high-end system (multiple CPUs and GPUs) may not always be the best choice. By implementing AI training within their own data centers, organizations can reduce costs while becoming more productive and flexible.

Cloud Benefits and Drawbacks

Many organizations are moving their workloads to a public cloud infrastructure, which, by definition, is shared by many clients. While a public cloud can scale to a very large size, very few training jobs require thousands of GPUs working concurrently. A benefit of using a public, shared cloud infrastructure is that a large number of high-end (read: expensive) servers may be available. Conversely, those servers may not be available when they are needed. In addition, the costs associated with data ingress and egress for large training models can be significant, especially if the training data needs to be imported from another public, shared cloud provider.

On-Prem for AI Training

Several reasons exist to consider and implement AI within an on-prem data center.

Cost – While the cost of acquiring servers with GPUs may be high, the longer-term cost can be lower than that of using a public, shared cloud. Cloud fees can be relatively high over time, especially for data movement. In addition, the cost of a high-end GPU server is incurred whether or not all of its CPUs and GPUs are used 100% of the available time, which is unlikely, so the server should be sized to the actual workload.
Performance – A range of CPU and GPU combinations is available, both in terms of the quantity of each and the performance. Understanding the enterprise's AI requirements is essential to choosing the number of CPUs (1, 2, 4, or 8) and their performance. The latest generation of CPUs ranges from 16 to 128 cores, with base clock rates approaching 4 GHz. A range of GPUs exists, from older generations to the latest releases, with up to thousands of cores. Multiple configurations, each optimized for a project's CPU and GPU requirements, can be implemented in a data center.
Retraining – While there are various methods to estimate the cost of training a model of a particular size with a given number of GPUs, many models need to be continuously retrained with new parameters. For inference accuracy, the model must be retrained with updated and more recent data, which can take as long as the original training, depending on the amount of new data to be used. In an on-prem data center, the systems can be used repeatedly, whereas in the public cloud, expenses can pile up with each iteration and retraining of the model.
Software – There are many software choices to consider when creating an efficient and effective AI training solution. A public, shared cloud provider may not offer all of the components needed, which may require additional setup and testing for each instance acquired in a public cloud infrastructure.
Data Location and Sovereignty – For many industries and geographies, there may be restrictions and requirements on where the data used for AI training must reside. An on-prem data center allows organizations to adhere to these regulations, whereas using a remote, public cloud data center may not be permitted.
Security – For many organizations, the security of both data and results is critical. In an on-prem data center, security teams can implement more stringent security policies regarding access to the systems or storage devices. When creating and using AI that needs access to internal processes and data, implementing AI in an on-prem data center is an obvious choice.
Compliance – When the data is subject to various regulations, creating a conformant on-prem data center may be ideal, compared to identifying a public cloud that adheres to these regulations.
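
The cost and retraining points above can be made concrete with a rough break-even estimate. The Python sketch below is only an illustration: every price, quantity, and name in it (the GPU-hour rate, the egress fee, the server purchase price, the monthly operating cost, and the number of retraining runs) is a hypothetical placeholder, not a quote from any cloud provider or hardware vendor. It compares the cumulative cloud spend for repeated training runs with the purchase and operating cost of an on-prem system and reports the first month in which the cloud total catches up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloudCosts:
    gpu_hour_rate: float   # price per GPU-hour (hypothetical value)
    egress_per_gb: float   # data-movement fee per GB (hypothetical value)


@dataclass
class OnPremCosts:
    purchase_price: float     # one-time server acquisition cost (hypothetical)
    monthly_operating: float  # power, cooling, and staff share (hypothetical)


def cloud_cost_per_run(cloud: CloudCosts, gpus: int, hours: float,
                       data_gb: float) -> float:
    """Cost of one training or retraining run in the cloud."""
    compute = cloud.gpu_hour_rate * gpus * hours
    data_movement = cloud.egress_per_gb * data_gb
    return compute + data_movement


def break_even_month(cloud: CloudCosts, on_prem: OnPremCosts,
                     runs_per_month: int, gpus: int, hours: float,
                     data_gb: float, horizon_months: int = 60) -> Optional[int]:
    """Return the first month in which cumulative cloud spend reaches the
    cumulative on-prem spend, or None if that never happens in the horizon."""
    per_run = cloud_cost_per_run(cloud, gpus, hours, data_gb)
    for month in range(1, horizon_months + 1):
        cloud_total = per_run * runs_per_month * month
        on_prem_total = (on_prem.purchase_price
                         + on_prem.monthly_operating * month)
        if cloud_total >= on_prem_total:
            return month
    return None


# Placeholder inputs chosen only to exercise the calculation.
cloud = CloudCosts(gpu_hour_rate=3.0, egress_per_gb=0.1)
on_prem = OnPremCosts(purchase_price=300_000, monthly_operating=5_000)

month = break_even_month(cloud, on_prem, runs_per_month=2,
                         gpus=8, hours=200, data_gb=10_000)
print("Break-even month:", month)
```

With frequent retraining the cloud total reaches the on-prem total within the planning horizon; with only occasional runs it may never do so, in which case the function returns None. Substituting an organization's own quotes and utilization figures is what makes the comparison meaningful.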

Trio of Supermicro AI GPU systems: 8U system, 4U system, 5U System

Summary

Implementing an effective and efficient on-prem AI-focused data center requires understanding the performance requirements of the workloads that matter most to the enterprise. An on-prem data center, when properly designed, can shorten the time to results for AI training and can deliver low-latency inference results and decisions tuned to the type of model. It can also be configured specifically for the needs of the enterprise at a low cost. Understanding the workloads, the amount of data, the fine-tuning of the AI workflow, and the in-house expertise with various software layers will help determine the best option for the organization.

Supermicro AI Infrastructure Solutions

Accelerate and simplify your AI deployment with AI-ready infrastructure solutions. Supermicro is a leading supplier of on-prem AI infrastructure, and its turn-key reference designs leverage the vast experience of building some of the world’s largest AI clusters. The solutions range from large-scale training clusters to intelligent edge inferencing.
