The rise of AI has brought significant complexity and miniaturization to data center hardware. Components in servers and switches are now smaller, denser, and structurally more intricate. As cloud services and AI become central to everyday life, hardware failures due to shock and vibration can have real-world consequences.
AI is also expanding into fields like edge computing, autonomous vehicles, and robotics. Hardware designed for data centers may soon power these technologies, making shock and vibration testing critical to their long-term reliability.
This series of blog posts focuses on the measurement and analysis of real-world random vibration data. It is part of Google's Open Source Random Vibration Test of Off-the-Shelf Data Center Hardware project, which includes:
Field data measurement, analysis, and lab replication methods
Advanced strain, motion, deformation, and pressure measurement techniques for data center hardware
High-speed microscopic motion analysis of key components
Fatigue analysis in the context of global transport and handling
The project was launched at the 2024 OCP Global Summit, and shared through the project's GitHub repository. The project aims to encourage future collaboration and make random vibration testing universally accessible and useful to the broader community.
In this post, we'll look at the following:
Example 3D Bar Diagram of Real-World Random Vibration Data, Amplitude vs. Frequency vs. Cycle
Figure 1.
Factors that contribute toward localized component-level stress and failure modes in a fully populated rack
Figure 2.
Current shock and vibration specifications (OCP, Dell) [1] give the impression that simple shock and vibration tests are all that’s needed to validate the reliability of data center hardware. To be sure, when everything about the product is well understood, these tests can be very useful validation tools.
What the specifications miss is the tremendous amount of engineering involved in the design of these products, and the latest industry trends that affect us all.
As the race to build AI data centers around the world begins, it is more important than ever to look inside the black boxes and understand what’s going on under the hood. This series of papers will focus on “Environment Conditions”. Specifically, we will outline the workflow around the measurement and analysis of random vibration data from Google’s US supply chain, and explain why it is important to capture accurate data using thoughtful measurement and analysis techniques.
Good data accurately reflect real-world environmental conditions; they allow us to set product requirements properly and design lab experiments correctly. With this paper, we hope to share good data with the greater industry, explain the methodology behind analyzing them properly, and continue to build the solid foundations that support future efforts in structural analysis and mechanical testing of data center hardware.
Current portable vibration data acquisition units pack remarkable sensor, battery, and storage capabilities into a package the size of a USB drive. The Endaq S5-E100D40 sensors used in our experiments have sufficient capacity for 11 to 30 hours of continuous recording at 3000 samples per second, per sensor [2].
Portable sensor instrumented in a cargo truck
Figure 3.
We conducted multiple field experiments in the US between 2018 and 2020, and around the world in 2021. The sensors were attached to the trailer floors of many cargo trucks before they began their journeys to data centers. Trailer floors provide the most reliable data for deriving the excitation of a shaker table, but other locations, such as the outside of a cardboard box or the inside of the product, can be informative as well.
More than 100 hours of data, measured up to 5000 samples per second, were captured by the end of 2021, giving us more than 2.0 x 10^9 data points to analyze, with more on the way.
Raw acceleration time history from multiple field measurements
Figure 4.
Power Spectral Density (PSD) plots are commonly used to summarize random vibration data. Fundamentally, random vibration is the sum of many sine waves with various distributions of amplitude and frequency. Each sine wave describes the motion of a particular part of the vehicle over time, such as the tires, suspension, and trailer, and their sum is what the packaged product experiences during shipment.
Tom Irvine describes two standard PSD generation techniques in his papers: Bandpass Filtering [3] and the Fast Fourier Transform [4]. Most vibration software uses the second method for speed and efficiency. Irvine uses Matlab’s FFT function in his Matlab version of the Vibrationdata GUI, with additional features to customize the calculations further.
When the data set is large (such as the ones collected in these experiments), it is divided into smaller segments for further analysis. When we summarize all PSD plots of the data segments into one graph, it looks something like the following figure:
Distribution of PSD plots of a large data set matching ASTM D4169-14
Figure 5.
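The segment-and-average workflow described above can be sketched with NumPy alone. This is a minimal stand-in for the FFT-based method used by most vibration software (a production analysis would use a vetted implementation such as SciPy's `welch`); the sample rate, segment length, and test signal are illustrative assumptions, not values from our experiments.

```python
import numpy as np

def psd_welch(x, fs, nperseg):
    """One-sided PSD via Welch's method: split the record into
    50%-overlapping Hann-windowed segments, FFT each segment, and
    average the scaled magnitude-squared spectra."""
    win = np.hanning(nperseg)
    scale = 1.0 / (fs * np.sum(win ** 2))   # window power normalization
    hop = nperseg // 2
    segments = [x[i:i + nperseg] for i in range(0, len(x) - nperseg + 1, hop)]
    psds = []
    for seg in segments:
        spectrum = np.fft.rfft(seg * win)
        p = scale * np.abs(spectrum) ** 2
        p[1:-1] *= 2.0                      # fold in negative frequencies
        psds.append(p)
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, np.mean(psds, axis=0)

# Illustrative check: a 1 G zero-to-peak, 10 Hz sine has a mean square of
# 0.5 G^2, so its PSD should peak at 10 Hz and integrate to ~0.5 G^2.
fs = 1000.0
t = np.arange(0, 10, 1 / fs)
test_signal = np.sin(2 * np.pi * 10 * t)
freqs, psd = psd_welch(test_signal, fs, nperseg=1000)
peak_freq = freqs[np.argmax(psd)]                    # 10.0 Hz
total_power = np.sum(psd) * (freqs[1] - freqs[0])    # ~0.5 G^2
```

Summarizing many such segment PSDs on one set of axes is what produces the distribution plot in Figure 5.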
In the plot above, data captured from short-haul trucks with leaf-spring suspension closely resemble ASTM D4169-14’s Truck Level 2 random vibration profile at 100%, which suggests the standard was probably generated from data captured during the transit of similar vehicles in similar road conditions.
Decades of research are available on how FFT and PSD can best be used to understand and replicate real-world random vibration environments. However, it was clear to our team early on that there are some fundamental shortcomings in techniques involving Grms, FFT, and PSD:
All of these assume “stationary” random vibration signals that can be approximated with a normal distribution. But reality is a lot more complicated and does not follow such a distribution. Factors such as road conditions, region of the world, weather, and vehicle condition all shape what gets transmitted into the trailer of the cargo truck.
This is important because electronic components, the tiny building blocks of AI data centers, are susceptible to fatigue damage during shock and vibration events. To fully characterize fatigue, you need to track stress cycles [6], which are strongly influenced by the exact amplitudes and cycle counts of the original acceleration time history.
Here is an example. Assume we measured a simple sine wave (1 G zero to peak, 10 Hz, 360 seconds) during a field experiment with ideal road and vehicle conditions. Afterward, the captured signal is transformed and published as a PSD profile using standard methods.
A test lab receives this profile and programs it into their shaker table. The resulting control signal in Figure 7 looks nothing like the original signal in Figure 6. We would not be testing our products correctly if we simply followed the standard procedure.
Simple sine wave, 1 G zero to peak, 10 Hz, 360 seconds
Figure 6.
Time history after the 1 G sine wave was transformed into a shaker table PSD profile
Figure 7.
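The mismatch can be reproduced numerically: a narrowband Gaussian signal scaled to the same Grms as the original sine, which is roughly what a shaker controller synthesizes from the PSD, has a very different peak distribution. The band edges, duration, and random seed below are illustrative assumptions, and the brick-wall FFT filter is only a sketch of a controller's band-limited drive.

```python
import numpy as np

fs = 1000.0                        # sample rate, Hz (illustrative)
t = np.arange(0, 10, 1 / fs)       # 10 s excerpt instead of the full 360 s

# Original field signal: 1 G zero-to-peak sine at 10 Hz -> Grms = 1/sqrt(2).
sine = np.sin(2 * np.pi * 10 * t)
sine_grms = np.sqrt(np.mean(sine ** 2))   # ~0.707 G

# Shaker-style drive: Gaussian noise band-limited around 10 Hz (simple FFT
# brick-wall mask), rescaled so it has exactly the same Grms as the sine.
rng = np.random.default_rng(42)
noise = rng.standard_normal(len(t))
spectrum = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(len(t), d=1 / fs)
spectrum[(freqs < 8) | (freqs > 12)] = 0  # keep roughly the 8-12 Hz band
shaker = np.fft.irfft(spectrum, n=len(t))
shaker *= sine_grms / np.sqrt(np.mean(shaker ** 2))

# Same Grms, but the Gaussian drive's crest factor far exceeds the sine's
# sqrt(2): individual excursions well above 1 G appear in the time history.
sine_peak = np.max(np.abs(sine))          # 1.0 G
shaker_peak = np.max(np.abs(shaker))      # noticeably larger than 1 G
```

Matching the PSD (and therefore the Grms) says nothing about matching the amplitude distribution, which is exactly the discrepancy between Figures 6 and 7.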
Again, the actual environment may very well be perfectly described by the amplitude and frequency distributions assumed in these standard techniques, or it may not. We need better tools to help us look underneath the data and easily explain the results to the greater industry.
At the end of the day, design choices will be made depending on environment, component, chassis, rack, and packaging factors. But we need something beyond Grms and PSD to quantify random vibration data sets. Rainflow Counting is one such tool. ASTM E1049 describes various cycle-counting techniques [7]. Our team settled on Rainflow Counting because it is easily accessible in Irvine’s software and easy to understand [8]. It is easy to import measured data from field experiments into Vibrationdata’s software and get useful information right away.
Explanation of Rainflow Counting in ASTM E1049
Figure 8.
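For readers who want to see the mechanics, the cycle-counting procedure can be sketched in plain Python. This is a minimal three-point rainflow implementation with half-cycle handling, in the spirit of ASTM E1049; in practice we used Irvine's Vibrationdata tools, and a vetted library would be the right choice for real analysis.

```python
def turning_points(series):
    """Reduce a signal to its alternating sequence of peaks and valleys."""
    tp = [series[0]]
    for value in series[1:]:
        if value == tp[-1]:
            continue
        if len(tp) >= 2 and (tp[-1] - tp[-2]) * (value - tp[-1]) > 0:
            tp[-1] = value          # still rising/falling: extend the run
        else:
            tp.append(value)        # direction changed: new turning point
    return tp

def rainflow(series):
    """Three-point rainflow counting; returns {range: cycle_count}."""
    stack, counts = [], {}
    for point in turning_points(series):
        stack.append(point)
        while len(stack) >= 3:
            x = abs(stack[-1] - stack[-2])   # most recent range
            y = abs(stack[-2] - stack[-3])   # previous range
            if x < y:
                break
            if len(stack) == 3:
                # y still contains the starting point: count a half cycle
                counts[y] = counts.get(y, 0.0) + 0.5
                stack.pop(0)
            else:
                counts[y] = counts.get(y, 0.0) + 1.0
                last = stack.pop()           # remove the two points forming y
                stack.pop()
                stack.pop()
                stack.append(last)
    for a, b in zip(stack, stack[1:]):       # leftovers count as half cycles
        counts[abs(b - a)] = counts.get(abs(b - a), 0.0) + 0.5
    return counts

# The worked example sequence from ASTM E1049:
counts = rainflow([-2, 1, -3, 5, -1, 3, -4, 4, -2])
# -> {3: 0.5, 4: 1.5, 8: 1.0, 9: 0.5, 6: 0.5}
```

Each entry pairs a peak-to-valley range with how many full and half cycles occurred at that range, which is exactly the amplitude-vs-cycle-count bookkeeping that fatigue analysis needs.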
As of this writing, we have stopped using Rainflow Counting in the analysis of acceleration time histories, but the technique is still used heavily in the analysis of strain and displacement data during lab experiments.
After using Bandpass Filtering to isolate the data into specific frequency bands, any of several peak detection algorithms can be used to track individual peaks and valleys, which helps us better understand the composition and patterns hidden within the original acceleration time history. The peaks and valleys for each frequency band can then be organized and plotted as histograms to make their distribution easier to visualize.
Sample Acceleration Time History during Random Vibration Profile, Control Channel Bandpass Filtered, 15 Hz to 20 Hz
Figure 9.
Sample Acceleration Time History during Random Vibration Profile, Control Channel Bandpass Filtered, 15 Hz to 20 Hz, with Peak Detection
Figure 10.
Sample Acceleration Time History during Random Vibration Profile, Control Channel Bandpass Filtered, 15 Hz to 20 Hz, Acceleration Cycle Histogram
Figure 11.
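A minimal version of this filter-then-count workflow can be sketched as follows, with a synthetic two-tone signal standing in for real field data. The brick-wall FFT filter and neighbor-comparison peak detector are deliberate simplifications (a production pipeline would use proper IIR/FIR filters and a library detector such as SciPy's `find_peaks`), and the frequencies and durations are illustrative assumptions.

```python
import numpy as np

def bandpass_fft(x, fs, f_lo, f_hi):
    """Crude brick-wall bandpass: zero all FFT bins outside [f_lo, f_hi]."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0
    return np.fft.irfft(spectrum, n=len(x))

def local_peaks(x):
    """Indices of simple local maxima (sample higher than both neighbors)."""
    return np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1

# Synthetic stand-in for a field recording: 17 Hz and 50 Hz tones.
fs = 1000.0
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 17 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)

# Isolate the 15-20 Hz band, then track its peaks.
band = bandpass_fft(signal, fs, 15, 20)
peaks = local_peaks(band)        # one peak per 17 Hz cycle over 2 s -> 34
amplitudes = band[peaks]

# Histogram of peak amplitudes for this band (a Figure 11-style summary).
hist, edges = np.histogram(amplitudes, bins=10, range=(0.0, 1.5))
```

Repeating this for each frequency band reduces the raw time history to per-band peak counts and amplitude distributions, which is the input for the tables and 3D plots that follow.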
Our team has wondered since the beginning what happens if our products have to stay on the road for one more day, or travel on terrains that are extra bumpy. We’ve had to answer questions such as: “Why can’t we lower the requirements just a little bit so that products can pass testing and be released faster?”
These questions reflect the industry’s reliance on generic test standards and generic shock and vibration validations. How can we justify testing products for only minutes or hours, when actual units are shipped around the world for days to weeks? The simple answer is, “We can’t.”
When we take actual data from the environment, look deeper into the numbers, and try to better understand complex environments and billions of data points, we shift our perspective from that of execution of simple test plans to one that applies rigorous engineering methods to interesting and complex real world problems.
Bandpass Filtering and Peak Detection are two of many tools capable of looking inside the complexity of real world random vibration conditions, and reducing billions of data points into plots and tables like the one below. They provide useful information and insights into the original conditions, and enable product designers and mechanical test engineers to make informed decisions during product development and mechanical structural testing of complex data center hardware.
Example of amplitude vs. frequency vs. cycle count of real world field data vs. ASTM profile on a shaker table
Figure 12.
The example table shows the data sets broken down into amplitudes and cycle counts for a few frequency bands. The top half, in blue, represents numbers captured during one of Google’s field experiments; the bottom half represents numbers measured during a lab experiment using ASTM D4169-14 Truck and Air profiles.
3D Bar Plots
A 3D representation of the data can be constructed by first plotting amplitude vs. cycle count as histograms, then arranging those histograms as 3D bar diagrams along the frequency axis (amplitude on the x axis, frequency on the y axis, and cycle count on the z axis). The result can be seen in the following 3D visualizations:
Example of field data separated into specific frequency bins using Bandpass filtering
Figure 13.
Example of a 3D Bar Diagram of Amplitude vs. Frequency vs. Cycle, Google US Field Measurement, 2019
Figure 14.
Example of a 3D Bar Diagram of Amplitude vs. Frequency vs. Cycle, ASTM D4169-14 Truck Profile, Level 2
Figure 15.
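The binning behind these 3D bar diagrams can be sketched as follows, again with a synthetic two-tone signal standing in for field data. The bands, bin edges, and tone amplitudes are illustrative assumptions; the resulting counts matrix maps directly onto a 3D bar plot (for example Matplotlib's `Axes3D.bar3d`, with amplitude-bin centers on x, band centers on y, and counts on z).

```python
import numpy as np

def bandpass_fft(x, fs, f_lo, f_hi):
    """Brick-wall bandpass via FFT masking (a sketch; real pipelines
    would use proper IIR/FIR filters)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0
    return np.fft.irfft(spectrum, n=len(x))

def peak_amplitudes(x):
    """Amplitudes of simple local maxima."""
    idx = np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1
    return x[idx]

# Synthetic stand-in for field data: 17 Hz at 1 G plus 40 Hz at 0.5 G.
fs = 1000.0
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 17 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

bands = [(15, 20), (35, 45)]           # frequency bins (illustrative)
amp_edges = np.linspace(0.0, 1.5, 16)  # amplitude bins

# counts[i, j] = number of peaks in band i whose amplitude falls in
# amplitude bin j; stacked along the band axis, this matrix is exactly
# the data behind an amplitude vs. frequency vs. cycle 3D bar diagram.
counts = np.zeros((len(bands), len(amp_edges) - 1))
for i, (f_lo, f_hi) in enumerate(bands):
    amps = peak_amplitudes(bandpass_fft(signal, fs, f_lo, f_hi))
    counts[i], _ = np.histogram(amps, bins=amp_edges)
```

Over one second, the 17 Hz band contributes 17 cycles near 1 G and the 35-45 Hz band contributes 40 cycles near 0.5 G, so the two ridges of the 3D plot fall in different amplitude bins with different heights.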
The chosen examples illustrate the difference between real-world conditions and industry-standard profiles, which, as we can see, is quite striking. The goal of this track of papers is to look underneath the conventional method of analyzing random vibration data and see what’s hidden, which, as it turns out, is quite a lot. Doing so provides us with more tools and techniques to understand how things really work, to evaluate how data center hardware is actually tested in lab environments, and to see what really happens inside electronics hardware in the real world.
Most readers are not expected to do their own field measurements to support their day-to-day work. In fact, it would be simpler if a basic but modernized set of measured field vibration data were available for everyone to use, just like current industry standards. But should the need to perform field measurements arise, simple tools with clear explanations of how they work should also be widely available, so they can be examined, scrutinized, and used freely by anyone, in whatever way serves their goals best.
Whatever form that takes, the hows and whys of how the data are captured, analyzed, and made useful are more important than ever as AI hardware finds its way into every facet of our society.
We will continue this track by diving into specific details of past field experiments, and specific examples and data sets as part of Google’s Open Source Random Vibration Testing Project’s regular release schedule.
With these data, we can begin to answer the following questions:
Q: What happens if a product is shipped on a truck one more day?
A: The product will be subjected to N additional stress cycles over the longer duration.
Q: Can we lower the test profile by 1% to pass the test?
A: If we do so, we will have missed M stress cycles at the 99.999th percentile, which account for x% of the product’s fatigue life.
Q: How should we design a product’s packaging?
A: We should examine the field data, and choose a packaging design that will do the most to mitigate mechanical damage in components and sub-systems we care about, for conditions present in the supply chain.
(This series of papers is part of Google’s Open Source Random Vibration Test of Off-the-Shelf Data Center Hardware project, available and distributed through the project’s GitHub repository.)