How Vibration Data Can be Leveraged for Anomaly Detection and Fault Diagnosis
In my previous blog post, Preprocessing Vibration Data for Machine Learning, we walked through the preprocessing steps required to prepare vibration data for use in a Machine Learning (ML) model. We also covered how to analyze the data by visualizing it, examining key statistical parameters, and understanding the distribution of each variable in the dataset. Finally, we discussed how the Fast Fourier Transform (FFT) can help extract meaningful insights from vibration time series data, including how to identify peaks in the FFT and what those peaks can tell us.
In this post, we’ll explore how vibration data can be leveraged for anomaly detection and fault diagnosis to identify potential issues in vibrating machinery. This approach is crucial for;
- Minimizing maintenance time and costs
- Preventing progressive damage, and
- Reducing unplanned downtime.
Though any high-performance vibration data logger will suffice, in this instance I will be using our in-house enDAQ data logger to acquire vibration data from a small test rig designed to simulate various fault conditions on a rotating belt system with two pulleys.
We will cover the following essential topics in this guide:
Introduction to Anomaly detection and ML-based methods
What is an Anomaly?
In the context of machine condition monitoring, an anomaly refers to any data pattern or behavior that deviates significantly from the norm. These deviations may not always indicate a failure, but they often serve as early warning signs of potential issues. In vibration analysis, anomalies can manifest as unexpected spikes, irregular frequency patterns, or abnormal amplitude changes in the signal.
What are Faults in Vibration Machinery?
Faults are specific types of anomalies that indicate a mechanical or structural problem in a vibrating machine. Common faults include:
- Imbalance – Uneven mass distribution in rotating components.
- Misalignment – Shaft or component alignment errors.
- Bearing defects – Wear or damage in ball or roller bearings.
- Looseness – Mechanical parts not firmly fastened.
- Gear mesh issues – Irregularities in gear engagement.
Detecting these faults early is critical for preventing damage escalation and avoiding costly downtime.
Common Machine Learning Methods for Anomaly and Fault Detection
ML Method |
Description |
Use Cases in Vibration Analysis |
K-Means Clustering |
An unsupervised method that groups similar data points based on distance metrics. |
Identifying clusters of normal vs. abnormal behavior. |
Support Vector Machines (SVM) |
Supervised method that finds the optimal boundary between classes. |
Classifying normal and faulty machine states. |
Random Forest |
Ensemble learning method using multiple decision trees for classification. |
Fault classification using labeled vibration data. |
Autoencoders |
Neural networks that reconstruct input data; high reconstruction error indicates anomaly. |
Unsupervised anomaly detection in time series. |
Isolation Forest |
Tree-based model that isolates anomalies by randomly partitioning data. |
Efficient for detecting outliers in high-dimensional datasets. |
Statistical Methods |
Techniques such as Z-score, IQR, or control charts to detect deviations. |
Baseline anomaly detection with low computational cost. |
LSTM (Long Short-Term Memory) |
Recurrent neural network designed for sequential data with memory retention. |
Capturing temporal patterns and detecting sequence anomalies. |
What is LSTM?
Long Short-Term Memory (LSTM) is a special type of Recurrent Neural Network (RNN) designed to model sequential data and capture long-range dependencies in time series. Unlike traditional RNNs, which struggle to remember information over long sequences due to the vanishing gradient problem, LSTMs are designed to retain relevant information for extended periods.
What Kind of Neural Networks Are They?
LSTMs are part of the Recurrent Neural Network (RNN) family—neural architectures where connections between nodes form a directed graph along a sequence. This structure allows LSTMs to maintain and update an internal "memory" state, making them ideal for tasks where the order and timing of data points are critical, such as speech recognition, natural language processing, and vibration time series analysis.
How Do LSTMs Work?
LSTMs process input sequences one timestep at a time, updating a cell state that carries information throughout the sequence. What makes LSTMs unique is their ability to decide what to remember, what to forget, and what to output at each step using specially designed components called gates.
LSTM Architecture
Each LSTM unit (or cell) consists of:
- Cell state: A memory line that runs through the entire sequence, modified at each timestep.
- Hidden state: The output at the current timestep, which also feeds into the next timestep.
- Three gates:
- Forget gate: Decides what information to discard from the cell state.
- Input gate: Determines what new information to add to the cell state.
- Output gate: Controls what information from the cell state is output at this timestep.
These gates are controlled by sigmoid activations, which output values between 0 and 1 to act like on/off switches for the flow of information.
Component |
Function |
Input (xₜ) |
The input data at time t. |
Hidden state (hₜ₋₁) |
Carries output information from the previous timestep. |
Cell state (Cₜ) |
Maintains long-term memory across timesteps. |
Forget gate |
Uses sigmoid(Wf · [hₜ₋₁, xₜ] + bf) to remove irrelevant info. |
Input gate |
Uses sigmoid(Wi · [hₜ₋₁, xₜ] + bi) to add relevant info. |
Output gate |
Uses sigmoid(Wo · [hₜ₋₁, xₜ] + bo) to decide output. |
Weights (W) |
Learnable parameters for gates and state transitions. |
Biases (b) |
Additional parameters to fine-tune gate outputs. |
Why Use LSTMs for Vibration Analysis?
LSTMs are particularly effective for vibration-based anomaly detection because:
- They model temporal dependencies, which are critical for capturing evolving patterns in machine behavior.
- They can handle noisy and irregular data better than simpler models.
- Once trained on normal operating conditions, they can detect subtle deviations that traditional methods might miss.
The test designed for anomaly detection
To demonstrate how machine learning—and specifically LSTM (Long Short-Term Memory) networks—can be applied to anomaly detection, a test rig has been developed. As shown in the figure below, the setup includes two pulleys: a drive pulley and a driven pulley. The drive pulley is adjustable, allowing the pulleys to be positioned closer together to simulate a loose belt condition. Additionally, the driven pulley can be shifted to mimic a misalignment fault.
Figure 1. Test rig used for anomaly detection and fault detection using LSTM based on vibration data.
To simulate a change in mass, two zip ties have been attached to the belt. As a result, four distinct operating conditions are represented in the dataset:
- Healthy
- Loose belt
- Misaligned pulley
- Mass change
We will explore two different scenarios for using LSTM in anomaly detection:
- Scenario A: The LSTM model is trained to distinguish between healthy and faulty conditions. In this setup, the healthy data is labeled as one class, while the other three conditions are grouped and labeled as "faulty." This approach allows the model to simply detect whether an anomaly exists.
- Scenario B: The model is trained using all four classes separately. This enables the LSTM to not only detect an anomaly but also identify its root cause—whether it is due to belt looseness, pulley misalignment, or added mass.
In summary, the first scenario focuses on binary anomaly detection (healthy vs. faulty), while the second allows for multi-class fault classification.
Data Analysis
Moving forward, we will provide scripts for performing essential data analysis tasks. These include data visualization, statistical analysis, histogram generation, and anomaly detection.
To run these scripts, users will need at least one healthy dataset (containing vibration data along the x, y, and z axes) as well as any faulty dataset they may have—this could be one or more datasets representing different fault conditions.
The scripts are designed to be user-friendly and require minimal modification. In most cases, the only input required from the user is the file path to their datasets.
Importing libraries
First, we need to import the necessary libraries required for data analysis and LSTM-based anomaly detection.
# Import libraries import decimal import idelib import endaq import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.preprocessing import MinMaxScaler from sklearn.metrics import roc_curve, auc, confusion_matrix, accuracy_score from keras.models import Sequential from keras.layers import LSTM, Dense, Dropout from sklearn.model_selection import train_test_split import seaborn as sns
Visualization
Next, we plot the data using the code below. The only modification required from the user is to provide the correct file paths to their datasets.
In this example, we are working with four datasets:
- healthy: Normal operating condition
- F_Loose: Simulated loose belt
- F_Misalignment: One pulley misaligned
- F_MassChange: Two zip ties added to simulate a change in mass
If, for instance, you have one healthy dataset and two faulty ones, you’ll need to specify three file paths accordingly.
import pandas as pd
import plotly.graph_objects as go
import endaq
from plotly.subplots import make_subplots
import plotly.express as px
# Updated: each label maps to a list of file paths
datasets = {
"Healthy": [
r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Healthy\DAQ16837_000146.ide',
# Add more Healthy files if needed
],
"Looseness": [
r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Loosness\DAQ16837_000158.ide'
],
"Missalignment": [
r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Misalignment\DAQ16837_000155.ide'
],
"Mass imbalance": [
r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Mass_imbalance\aeaa0a24-DAQ16837_000149-1744735366.ide'
]
}
# Column names to plot
columns_plot = ["X (40g)", "Y (40g)", "Z (40g)"]
# Assign consistent colors to each class label
color_palette = px.colors.qualitative.Set1
label_list = list(datasets.keys())
label_colors = {label: color_palette[i % len(color_palette)] for i, label in enumerate(label_list)}
# Plotting function
def plot_axes_combined(datasets, columns_plot, label_colors):
fig = make_subplots(
rows=3,
cols=1,
shared_xaxes=True,
vertical_spacing=0.05,
# subplot_titles=columns_plot
)
# For each label and list of file paths
for label, file_paths in datasets.items():
for idx, path in enumerate(file_paths):
try:
# Load and preprocess
doc = endaq.ide.get_doc(path)
accel = endaq.ide.to_pandas(doc.channels[80], time_mode='seconds')
accel = accel - accel.median() # Remove DC offset
for i, axis in enumerate(columns_plot):
if axis in accel.columns:
fig.add_trace(
go.Scatter(
x=accel.index,
y=accel[axis],
mode='lines',
name=label if idx == 0 and i == 0 else None, # show legend only once per label
legendgroup=label,
line=dict(color=label_colors[label]),
showlegend=(idx == 0 and i == 0)
),
row=i + 1, col=1
)
else:
print(f"[{label}] Column '{axis}' not found in {path}")
except Exception as e:
print(f"Error loading file {path}: {e}")
# Add axis labels
for i, axis in enumerate(columns_plot):
fig.update_yaxes(title_text=axis, row=i+1, col=1)
fig.update_xaxes(title_text="Time (seconds)", row=3, col=1)
# fig.update_layout(
# height=500,
# title_text="Acceleration Signals from All Datasets (X, Y, Z)",
# showlegend=True
# )
fig.update_layout(
height=400,
width=1100,
title_text="Acceleration Signals from All Datasets (X, Y, Z)",
showlegend=True,
legend=dict(
orientation="h", # Horizontal legend
yanchor="bottom",
y=1.02, # Slightly above the top of the plot
xanchor="center",
x=0.5, # Centered horizontally
traceorder="grouped",
title_text='', # Optional: remove legend title
itemsizing='constant',
valign='top'
),
legend_tracegroupgap=0, # No space between grouped traces
margin=dict(t=100) # Extra top margin to fit the legend
)
fig.show()
# Run the plot function
plot_axes_combined(datasets, columns_plot, label_colors)
To use the code below, make sure to change the file paths to point to your own data files. Replace the example paths with the actual locations of your CSV files. Each entry in the file_paths dictionary should correspond to one of your datasets. If you only have, for example, a healthy file and two faulty ones, just remove the unused line(s) from the dictionary.
# Example file paths – update these with your actual file locations
datasets = {
"Healthy": [
r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Healthy\DAQ16837_000146.ide',
# Add more Healthy files if needed
],
"Looseness": [
r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Loosness\DAQ16837_000158.ide'
],
"Missalignment": [
r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Misalignment\DAQ16837_000155.ide'
],
"Mass imbalance": [
r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Mass_imbalance\aeaa0a24-DAQ16837_000149-1744735366.ide'
]
}
You can also choose which vibration axes to plot by specifying the desired columns in the line below:
# Specify the column names you want to plot columns_plot = ["X (40g)", "Y (40g)", "Z (40g)"]
The vibration signals along the X, Y, and Z axes will be visualized, as shown in the example plot below:
Figure 2. Vibration data along X, Y, and Z axis.
To gain deeper insights into the vibration patterns, the plots are zoomed in for closer examination (You can zoom in on Plotly-generated plots by drawing a selection box over the area of interest).
Figure 3. Vibration data along X, Y, and Z axis (zoomed-in).
Here's what we can observe from the detailed view:
- X-axis Vibration: The vibrations in the Healthy and F_MassChange datasets fluctuate more than in the other two conditions. In contrast, the F_Loose and F_Misalignment datasets show less fluctuation. This observation aligns with expectations—when the belt is loose, vibrations tend to decrease. Based on this behavior, it's possible to begin distinguishing Healthy and Mass Change conditions from Loose and Misaligned states.
- Y-axis Vibration: No clear or consistent patterns emerge from the Y-axis data, making it difficult to extract meaningful insights for classification or anomaly detection in this direction.
- Z-axis Vibration: The F_MassChange dataset exhibits frequent spikes that rise noticeably above the other datasets. These jumps occur repeatedly throughout the data collection period, suggesting that mass changes have a significant and detectable impact along the Z-axis.
Statistics
Next, we perform a statistical analysis of the vibration datasets. The code below takes the file paths of the datasets and outputs a summary statistics table for each one:
import pandas as pd import endaq # Specify the start and end times for the analysis (in seconds) start_time = 10 end_time = 50 # Updated: dictionary maps labels to lists of file paths datasets = { "Healthy": [ r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Healthy\DAQ16837_000146.ide', # Add more Healthy files if needed ], "Looseness": [ r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Loosness\DAQ16837_000158.ide' ], "Missalignment": [ r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Misalignment\DAQ16837_000155.ide' ], "Mass imbalance": [ r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Mass_imbalance\aeaa0a24-DAQ16837_000149-1744735366.ide' ] } # Columns to analyze columns = ["X (40g)", "Y (40g)", "Z (40g)"] def get_statistics_table(datasets, columns, start_time=None, end_time=None): stats_data = [] for label, file_list in datasets.items(): label_stats = [] for file_path in file_list: try: doc = endaq.ide.get_doc(file_path) accel = endaq.ide.to_pandas(doc.channels[80], time_mode='seconds') accel = accel - accel.median() if start_time is not None: accel = accel[accel.index >= start_time] if end_time is not None: accel = accel[accel.index <= end_time] stats = {} for col in columns: if col in accel.columns: stats[f"{col} Min"] = accel[col].min() stats[f"{col} Mean"] = accel[col].mean() stats[f"{col} Max"] = accel[col].max() stats[f"{col} Std"] = accel[col].std() stats[f"{col} Median"] = accel[col].median() else: print(f"[{label}] Column '{col}' not found in {file_path}.") label_stats.append(stats) except Exception as e: print(f"Failed to process {file_path}: {e}") if label_stats: label_df = pd.DataFrame(label_stats) aggregated_stats = label_df.mean().to_dict() aggregated_stats = {"Dataset": label, **aggregated_stats} # Ensure Dataset is first stats_data.append(aggregated_stats) df = pd.DataFrame(stats_data) # Reorder columns to put 'Dataset' first (just in case) cols = ['Dataset'] + [col for col in df.columns if col != 'Dataset'] return df[cols] # Generate the statistics table statistics_table = get_statistics_table(datasets, columns, start_time, end_time) # Display the result display(statistics_table)
This table provides key statistical metrics—such as minimum, maximum, mean, and standard deviation—for each vibration axis.
From this basic analysis, several insights emerge:
- X-axis (40g): The Mass Change condition shows a significantly different minimum and maximum value compared to the other datasets. This indicates more extreme vibration fluctuations along this axis in response to added mass.
- Y-axis (40g): Similar behavior is observed here. The Mass Change dataset again stands out, with noticeably different minimum and maximum values compared to the others. This suggests that even though visual plots may not reveal clear differences on the Y-axis, statistical variations still exist.
- Z-axis (40g): The Loose Belt condition is clearly distinguishable based on its minimum and maximum values. These statistics reflect the dampened vibration levels typically associated with a slack belt.
These insights demonstrate that even basic statistical analysis can provide meaningful differentiation between operating conditions, serving as a valuable first step in any anomaly detection workflow.
Histograms
Next, we visualize the distribution of vibration data using histograms. We'll generate three plots—one for each axis (X, Y, and Z)—to better understand how the data varies across different conditions. The code below creates interactive histograms using Plotly:
import pandas as pd import plotly.graph_objects as go from plotly.subplots import make_subplots import plotly.express as px import endaq # Dataset dictionary: label -> list of .ide file paths datasets = { "Healthy": [ r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Healthy\DAQ16837_000146.ide', ], "Looseness": [ r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Loosness\DAQ16837_000158.ide' ], "Missalignment": [ r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Misalignment\DAQ16837_000155.ide' ], "Mass imbalance": [ r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Mass_imbalance\aeaa0a24-DAQ16837_000149-1744735366.ide' ] ### Big test rig # "Healthy": [ # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000020_Healthy_0.ide', # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000021_Healthy_50.ide' # ], # "Loose": [ # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000018_Loose_0.ide', # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000019_Loose_50.ide' # ], # "Grease": [ # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000022_Grease_0.ide', # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000023_Grease_50.ide' # ], # "Cut": [ # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000026_Cut_0.ide', # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000027_Cut_50.ide' # ], } # Columns (axes) to plot columns = ["X (40g)", "Y (40g)", "Z (40g)"] # Time range in seconds start_time = 10 end_time = 50 # Fixed color palette: one unique color per class color_palette = px.colors.qualitative.Set1 # or Set2, Bold, etc. label_list = list(datasets.keys()) label_colors = { "Healthy": "#d62728", # Blue "Looseness": "#1f77b4", # Orange "Missalignment": "#2ca02c", # Green "Mass imbalance": "#9467bd",# Red # "Loose": "#9467bd", # Purple # "Grease": "#8c564b", # Brown # "Cut": "#e377c2" # Pink } # label_colors = {label: color_palette[i % len(color_palette)] for i, label in enumerate(label_list)} # Create 3 subplots: one for X, Y, Z fig = make_subplots( rows=1, cols=3, subplot_titles=columns, shared_yaxes=True ) # Loop over each class label for label, file_list in datasets.items(): color = label_colors[label] legend_shown = False for file_path in file_list: try: doc = endaq.ide.get_doc(file_path) accel = endaq.ide.to_pandas(doc.channels[80], time_mode='seconds') accel = accel - accel.median() # Filter by time if start_time is not None: accel = accel[accel.index >= start_time] if end_time is not None: accel = accel[accel.index <= end_time] # Add histograms for each axis for j, col in enumerate(columns): if col in accel.columns: fig.add_trace( go.Histogram( x=accel[col], nbinsx=50, name=label if not legend_shown else None, legendgroup=label, opacity=0.5, marker=dict(color=color), showlegend=not legend_shown # Only show once ), row=1, col=j + 1 ) legend_shown = True # <--- Flip the flag after first trace else: print(f"[{label}] Column '{col}' not found in {file_path}") except Exception as e: print(f"[{label}] Failed to process {file_path}: {e}") # Final layout setup fig.update_layout( height=500, width=1100, title_text="Overlayed Histograms of Acceleration (X, Y, Z) per Class", barmode='overlay' ) fig.update_xaxes(title_text="Acceleration") fig.update_yaxes(title_text="Frequency") fig.show()
Tip: These histograms are interactive—clicking on a legend item will toggle its visibility. This makes it easy to compare specific conditions side by side.
Figure 4. Histogram of vibration data along each axis for heathy, loose, misaligned, and mass changed conditions.
Key Observations from the Histograms are:
- Healthy vs. Loose: In both the X (40g) and Z (40g) axes, the Loose condition exhibits a noticeably taller and narrower histogram compared to Healthy. This indicates a tighter range of lower-amplitude vibrations, which aligns with expectations for a loose belt condition:
Figure 5. Comparing histograms of healthy and loose conditions.
- Healthy vs. Misalignment: In the X-axis, the Misalignment condition has a taller histogram compared to Healthy, indicating a higher concentration of vibration values within a narrower range. This suggests more consistent vibrations in the X direction under misalignment. The Y and Z axes show similar distributions between Healthy and Misalignment, with no significant visual difference.
Figure 6. Comparing histograms of healthy and misaligned conditions.
- Healthy vs. Mass Change: The X-axis histogram for Mass Change is both taller and wider than that of Healthy, reflecting more frequent and extreme variations. However, the Y and Z axes do not show significant changes in distribution.
Figure 7. Comparing histograms of healthy and mass changed conditions.
These histogram comparisons reinforce the insights gathered from earlier visualizations and statistical summaries, and help further differentiate the types of faults based on their vibration signatures.
Anomaly Detection
So far, we’ve explored the vibration data in its raw form—visualizing it, analyzing statistical features, and comparing distributions across different operating conditions. These initial steps helped us extract meaningful insights and identify basic differences between healthy and faulty states.
In this section, we’ll go a step further and apply an LSTM (Long Short-Term Memory) model—a powerful and commonly used machine learning technique for time-series data—to detect anomalies in machinery vibration.
The goal is to train the LSTM model to distinguish between healthy and unhealthy operating conditions based on vibration patterns.
Important:
- The healthy dataset(s) should be kept separate.
- All faulty conditions (e.g., loose belt, misalignment, mass change) should be combined into a single dataset labeled as faulty (which will be done with the provided code).
- This binary classification approach will allow the model to learn the typical vibration behavior of a healthy machine and identify deviations from it.
Below is the code to build and train an LSTM model for anomaly detection:
import pandas as pd import numpy as np import endaq from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split from sklearn.metrics import roc_curve, auc, confusion_matrix, accuracy_score import matplotlib.pyplot as plt from keras.models import Sequential from keras.layers import LSTM, Dense, Dropout, Input from keras.utils import to_categorical # -------- Load and Label Data -------- # healthy_datasets = [ # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Healthy\DAQ16837_000146.ide', ] #["healthy_1.csv", "healthy_2.csv"] #===================================== Small test rig:=============== faulty_datasets = { # 1: [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\F_Loose\DAQ16837_000158.ide'], # 2: [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\F_missAlignment\DAQ16837_000155.ide'], # 3: [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Faulty\aeaa0a24-DAQ16837_000149-1744735366.ide'] } def load_and_label_data_from_endaq(file_path, channel_index, label): doc = endaq.ide.get_doc(file_path) accel = endaq.ide.to_pandas(doc.channels[channel_index], time_mode='seconds') accel = accel - accel.median() # Remove DC offset accel['label'] = label return accel def load_multiple_datasets_from_endaq(healthy_datasets, faulty_datasets): combined_data = pd.DataFrame() for file_path in healthy_datasets: healthy_data = load_and_label_data_from_endaq(file_path, channel_index=80, label=0) combined_data = pd.concat([combined_data, healthy_data], ignore_index=True) for fault_type, fault_files in faulty_datasets.items(): for file_path in fault_files: faulty_data = load_and_label_data_from_endaq(file_path, channel_index=80, label=fault_type) combined_data = pd.concat([combined_data, faulty_data], ignore_index=True) return combined_data # -------- Data Preparation -------- # data = load_multiple_datasets_from_endaq(healthy_datasets, faulty_datasets) # Select features and labels X = data.drop(columns=['label']).values y = data['label'].values # Convert multi-fault labels to binary: 0=healthy, 1=faulty y_binary = np.where(y != 0, 1, 0) # Standardize scaler = StandardScaler() X_scaled = scaler.fit_transform(X) # -------- Sequence Creation -------- # def create_sequences(X, y, time_steps=20): Xs, ys = [], [] for i in range(len(X) - time_steps): Xs.append(X[i:(i + time_steps)]) ys.append(y[i + time_steps]) # Label is from the last step in sequence return np.array(Xs), np.array(ys) time_steps = 20 X_seq, y_seq = create_sequences(X_scaled, y_binary, time_steps) y_encoded = to_categorical(y_seq) # -------- Train-Test Split -------- # X_train, X_test, y_train, y_test = train_test_split(X_seq, y_encoded, test_size=0.2, random_state=42) # -------- LSTM Model -------- # def build_lstm_model(input_shape, lstm_units=50): model = Sequential() model.add(Input(shape=input_shape)) model.add(LSTM(lstm_units, return_sequences=True)) model.add(Dropout(0.2)) model.add(LSTM(lstm_units)) model.add(Dropout(0.2)) model.add(Dense(y_encoded.shape[1], activation='softmax')) model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) return model model = build_lstm_model(X_train.shape[1:], lstm_units=50) history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test)) # -------- Evaluation -------- # y_pred = model.predict(X_test) y_pred_classes = np.argmax(y_pred, axis=1) y_test_classes = np.argmax(y_test, axis=1) accuracy = accuracy_score(y_test_classes, y_pred_classes) print(f'Accuracy: {accuracy:.4f}') # -------- ROC Curve -------- # fpr, tpr, _ = roc_curve(y_test_classes, y_pred[:, 1]) roc_auc = auc(fpr, tpr) plt.figure() plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})') plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--') plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.title('ROC Curve') plt.legend(loc='lower right') plt.grid(True) plt.show() # -------- Confusion Matrix with Percentages -------- # cm = confusion_matrix(y_test_classes, y_pred_classes) cm_percent = cm.astype('float') / cm.sum(axis=1, keepdims=True) * 100 plt.figure(figsize=(6, 5)) plt.imshow(cm_percent, interpolation='nearest', cmap=plt.cm.Blues) plt.title("Confusion Matrix (%)") plt.colorbar() class_names = ["Healthy", "Faulty"] tick_marks = np.arange(len(class_names)) plt.xticks(tick_marks, class_names) plt.yticks(tick_marks, class_names) thresh = cm_percent.max() / 2. for i in range(cm.shape[0]): for j in range(cm.shape[1]): plt.text(j, i, f'{cm[i, j]}\n({cm_percent[i, j]:.1f}%)', ha="center", va="center", color="white" if cm_percent[i, j] > thresh else "black", fontsize=12) plt.ylabel('True Label') plt.xlabel('Predicted Label') plt.tight_layout() plt.show()
To better understand how this works, let’s break the code into sections and explore what each one does.
import pandas as pd import numpy as np import endaq
pandas and numpy: Essential for data manipulation and numerical processing.
endaq: Used to read .ide files from the Mide/Endaq sensors, converting them to usable pandas DataFrames.
from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split from sklearn.metrics import roc_curve, auc, confusion_matrix, accuracy_score
Scikit-learn utilities for scaling data, splitting datasets, and evaluating model performance using metrics like ROC AUC and confusion matrix.
import matplotlib.pyplot as plt
For visualizing model performance (ROC curve, confusion matrix).
from keras.models import Sequential from keras.layers import LSTM, Dense, Dropout, Input from keras.utils import to_categorical
Keras deep learning library (TensorFlow backend).
Sequential: Easy model-building API.
LSTM: Core layer for learning time-dependent patterns.
Dense, Dropout, Input: Standard layers for model architecture.
to_categorical: Converts integer labels into one-hot encoded format (required for softmax output in multi-class classification).
healthy_datasets = [...] faulty_datasets = {...}
- Paths to healthy and faulty datasets.
- Faulty datasets are organized by fault type (e.g., loose, grease, cut), with integer labels.
def load_and_label_data_from_endaq(file_path, channel_index, label): doc = endaq.ide.get_doc(file_path) accel = endaq.ide.to_pandas(doc.channels[channel_index], time_mode='seconds') accel = accel - accel.median() # Remove DC offset accel['label'] = label return accel
Loads vibration data from an Endaq .ide file.
Extracts the signal from a specific sensor channel (e.g., channel 80 = Z-axis).
Removes DC offset to center the data.
Adds a label column to classify the dataset (0 for healthy, 1–6 for different faults).
def load_multiple_datasets_from_endaq(healthy_datasets, faulty_datasets): ...
Loads and combines all healthy and faulty datasets into a single DataFrame.
Each record is labeled accordingly.
data = load_multiple_datasets_from_endaq(...) X = data.drop(columns=['label']).values y = data['label'].values y_binary = np.where(y != 0, 1, 0)
Separates features (X) from labels (y).
Converts the multi-fault classification into binary classification:
- 0 = healthy
- 1 = any fault (loosely grouping all faults as "unhealthy")
def create_sequences(X, y, time_steps=20): ...
- Splits the continuous vibration signal into overlapping time sequences for LSTM input.
- Each sequence is time_steps long (e.g., 20 samples), and the label corresponds to the last point in the sequence.
X_seq, y_seq = create_sequences(X_scaled, y_binary, time_steps) y_encoded = to_categorical(y_seq)
X_seq: 3D array for LSTM: [samples, time_steps, features].
y_encoded: One-hot encoded labels for categorical output.
X_seq: 3D array for LSTM: [samples, time_steps, features].
y_encoded: One-hot encoded labels for categorical output.
X_train, X_test, y_train, y_test = train_test_split(...)
Splits data into training and testing sets (80/20).
def build_lstm_model(input_shape, lstm_units=50): ...
time_steps = 20:
- Defines the number of time steps in each sequence for the LSTM, meaning the model will learn from sequences of 20 time steps.
Input Layer (Input(shape=input_shape)):
- Accepts input data of shape (time_steps, features).
LSTM Layers:
- LSTM(lstm_units, return_sequences=True): First LSTM layer with 50 units, returning the full sequence for further LSTM processing.
- LSTM(lstm_units): Second LSTM layer that processes the sequence into a single output (last time step).
Dropout Layers:
- Dropout(0.2): Prevents overfitting by randomly dropping 20% of neurons during training.
Dense Layer:
- Dense(y_encoded.shape[1], activation='softmax'): Final layer with softmax activation, outputting class probabilities.
Model Compilation:
- Optimizer: adam, adjusts the learning rate for efficient training.
- Loss Function: categorical_crossentropy, for multi-class classification.
- Metrics: Accuracy is used to track model performance.
model = build_lstm_model(...) history = model.fit(...)
Compiles and trains the model using categorical_crossentropy (since labels are one-hot encoded).
Trains for 5 epochs.
y_pred = model.predict(X_test) y_pred_classes = np.argmax(y_pred, axis=1) y_test_classes = np.argmax(y_test, axis=1) accuracy = accuracy_score(...)
Predicts class probabilities.
Converts predictions and labels to integer class labels.
Calculates accuracy.
fpr, tpr, _ = roc_curve(...) roc_auc = auc(fpr, tpr) ...
Plots the ROC curve to visualize trade-offs between true positive and false positive rates.
Computes AUC (Area Under the Curve), a metric of overall classification performance.
cm = confusion_matrix(...) ... plt.imshow(...) ...
- Shows actual vs. predicted labels in a matrix format.
- Converts raw counts to percentages for easier interpretation.
- Annotates each cell with count and percentage.
After running the code, you will obtain three key outputs that evaluate the model's performance:
- Accuracy Score – a single number indicating the overall classification accuracy.
- ROC Curve – a visual representation of the model's ability to distinguish between healthy and faulty conditions.
- Confusion Matrix – a detailed breakdown of true vs. predicted labels, showing how well the model performs on each class.
These results are shown below:
Figure 8. ROC curve for anomaly detection.
The ROC (Receiver Operating Characteristic) curve is a graphical representation that shows the performance of a binary classification model. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across different classification thresholds.
- True Positive Rate (TPR), also known as sensitivity or recall, is the proportion of actual positives correctly identified by the model.
- False Positive Rate (FPR), on the other hand, is the proportion of actual negatives that are incorrectly classified as positives.
The area under the curve (AUC) measures the overall ability of the model to discriminate between the two classes. A higher AUC (closer to 1) indicates a better model.
In our case, the ROC curve is positioned towards the top-left, which indicates a high accuracy. A curve that is close to the top-left corner means that the model has a high True Positive Rate (TPR) while maintaining a low False Positive Rate (FPR). This suggests that our model is excellent at identifying faulty conditions while minimizing misclassifications, resulting in high reliability.
The AUC value can also help quantify this—if it’s near 1.0, the model's performance is outstanding.
Figure 9. Confusion matrix for anomaly detection.
The model achieved an impressive accuracy of 0.99, indicating a highly reliable performance in distinguishing between healthy and faulty conditions.
From the confusion matrix, we observe the following:
- 99.5% of healthy samples were correctly identified as healthy.
- Only 0.5% of healthy samples (189 instances) were mistakenly classified as faulty.
- On the other hand, almost all faulty samples were correctly identified, with only 232 instances incorrectly labeled as healthy.
These results confirm that the model is highly effective at detecting anomalies, with a very low rate of false positives and false negatives.
Here are a few practical tips to improve model performance if your LSTM model's accuracy is not as high as expected:
- Increase lstm_units (e.g., from 50 to 100 or more) to give the model more learning capacity.
model = build_lstm_model(X_train.shape[1:], lstm_units=100)
- Try more epochs (e.g., 20–50) to allow the model more time to learn.
history = model.fit(X_train, y_train, epochs=30, batch_size=32, validation_data=(X_test, y_test))
- Adjust batch_size (e.g., try 16, 64) to find a better training dynamic.
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test))
- Try longer sequences (e.g., time_steps = 40 or 60) if your vibration patterns have longer-term dependencies the model isn’t capturing.
time_steps = 40 # instead of 20 X_seq, y_seq = create_sequences(X_scaled, y_binary, time_steps)
- A deeper model (e.g., 3 LSTM layers) may help capture more complex patterns in the vibration data.
def build_lstm_model(input_shape, lstm_units=50): model = Sequential() model.add(Input(shape=input_shape)) model.add(LSTM(lstm_units, return_sequences=True)) model.add(Dropout(0.2)) model.add(LSTM(lstm_units, return_sequences=True)) # <--- 2nd LSTM Layer model.add(Dropout(0.2)) model.add(LSTM(lstm_units)) # <--- 3rd LSTM Layer (final one) model.add(Dropout(0.2)) model.add(Dense(y_encoded.shape[1], activation='softmax')) model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) return model
Fault detection
The LSTM model used in the previous section was designed to detect whether a machine is faulty or not—essentially a binary anomaly detection model. It was trained only on healthy machine data, so any deviation from that baseline was classified as a fault.
However, a common and more practical question users ask is:
"If a fault is detected, can we also determine what the root cause is?"
The answer is yes, but it depends heavily on the availability of labeled data for each specific fault type.
To enable the model to detect the type of fault, it must be trained not just on healthy data, but also on data corresponding to different known faults—each labeled accordingly. For example:
- Data from normal machine operation should be labeled as "Healthy".
- If you have a known issue such as belt looseness, you must collect data during that condition and label it as "Looseness".
- Similarly, for other conditions like misalignment, mass change, or any other fault type, labeled data must be collected for each.
As discussed in Section 3, we applied specific faults to our test rig (e.g., loose belt, pulley misalignment, added mass) and collected vibration data for each case. Using this multi-labeled dataset, we can now train an LSTM model to not only detect an anomaly, but also classify which specific fault it is.
The Code Below Performs:
- Training the LSTM model on multiple fault classes
- Predicting the specific fault type if a fault is detected
# ================== SETUP ================== import pandas as pd import numpy as np import os from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc from sklearn.preprocessing import MinMaxScaler from tensorflow.keras.models import Sequential from tensorflow.keras.layers import LSTM, Dense, Dropout from tensorflow.keras.utils import to_categorical import seaborn as sns import matplotlib.pyplot as plt from sklearn.utils.multiclass import unique_labels import endaq.ide import matplotlib.cm as mpl_cm # ================== PARAMETERS ================== start_time = 0 end_time = 50 time_steps_u = 60 test_size = 0.2 epochs_u = 2 batch_size_u = 32 lstm_units = 50 selected_columns = ['X (40g)', 'Y (40g)', 'Z (40g)'] # ================== DYNAMIC DATASET DICTIONARY ================== datasets = { #### Small test rig "Healthy": [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Healthy\DAQ16837_000146.ide'], "Looseness": [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Loosness\DAQ16837_000158.ide'], "Misalignment": [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Misalignment\DAQ16837_000155.ide'], "Mass imbalance": [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Mass_imbalance\aeaa0a24-DAQ16837_000149-1744735366.ide'] ### Big test rig # "Healthy": [ # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000020_Healthy_0.ide', # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000021_Healthy_50.ide' # ], # "Loose": [ # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000018_Loose_0.ide', # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000019_Loose_50.ide' # ], # "Grease": [ # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000022_Grease_0.ide', # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000023_Grease_50.ide' # ], # "Cut": [ # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000026_Cut_0.ide', # r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\3.SmartBelt\2025.05.12.Anomaly_ML\DAQ15719_000027_Cut_50.ide' # ], } # ================== LABEL ENCODING ================== class_names = list(datasets.keys()) class_to_id = {name: idx for idx, name in enumerate(class_names)} id_to_class = {idx: name for name, idx in class_to_id.items()} num_classes_u = len(class_to_id) # ================== DATA LOADING FUNCTIONS ================== def load_and_label_data(filepath, label, start_time=None, end_time=None): ext = os.path.splitext(filepath)[1].lower() if ext == '.csv': data = pd.read_csv(filepath) if start_time is not None: data = data[data['time'] >= start_time] if end_time is not None: data = data[data['time'] <= end_time] elif ext == '.ide': doc = endaq.ide.get_doc(filepath) data = endaq.ide.to_pandas(doc.channels[80], time_mode='seconds') if start_time is not None: data = data[data.index >= start_time] if end_time is not None: data = data[data.index <= end_time] data['time'] = data.index else: raise ValueError(f"Unsupported file type: {ext}") data['label'] = label return data def load_all_labeled_data(datasets_dict, start_time, end_time): all_data = pd.DataFrame() for class_name, file_list in datasets_dict.items(): label = class_to_id[class_name] for filepath in file_list: labeled_data = load_and_label_data(filepath, label, start_time, end_time) all_data = pd.concat([all_data, labeled_data], ignore_index=True) return all_data # ================== MODEL PREPARATION FUNCTIONS ================== def preprocess_data(data, selected_columns): scaler = MinMaxScaler(feature_range=(0, 1)) scaled_data = scaler.fit_transform(data[selected_columns]) return scaled_data, scaler def prepare_data_for_lstm(scaled_data, labels, time_steps=time_steps_u, num_classes=num_classes_u): X, y = [], [] for i in range(time_steps, len(scaled_data)): X.append(scaled_data[i-time_steps:i, :]) y.append(labels[i]) X = np.array(X) y = to_categorical(y, num_classes=num_classes) return X, y def build_lstm_model(input_shape, lstm_units=50, num_classes=num_classes_u): model = Sequential() model.add(LSTM(lstm_units, return_sequences=True, input_shape=input_shape)) model.add(Dropout(0.2)) model.add(LSTM(lstm_units, return_sequences=False)) model.add(Dropout(0.2)) model.add(Dense(num_classes, activation='softmax')) model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) return model # ================== EXECUTION ================== combined_data = load_all_labeled_data(datasets, start_time, end_time) scaled_data, scaler = preprocess_data(combined_data, selected_columns) labels = combined_data['label'].values X, y = prepare_data_for_lstm(scaled_data, labels, time_steps_u, num_classes_u) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42) model = build_lstm_model(X_train.shape[1:], lstm_units, num_classes_u) model.fit(X_train, y_train, epochs=epochs_u, batch_size=batch_size_u, validation_data=(X_test, y_test)) history = model.fit(X_train, y_train, epochs=epochs_u, batch_size=batch_size_u, validation_data=(X_test, y_test)) # ================== EVALUATION & PLOTTING ================== y_pred = model.predict(X_test) y_pred_classes = np.argmax(y_pred, axis=1) y_test_classes = np.argmax(y_test, axis=1) accuracy = accuracy_score(y_test_classes, y_pred_classes) print(f"\n✅ Accuracy: {accuracy:.4f}\n") class_indices = unique_labels(y_test_classes, y_pred_classes) labels_to_show = [id_to_class[i] for i in class_indices] cm = confusion_matrix(y_test_classes, y_pred_classes, labels=class_indices) cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100 annot = np.empty_like(cm).astype(str) for i in range(cm.shape[0]): for j in range(cm.shape[1]): count = cm[i, j] percent = cm_percent[i, j] annot[i, j] = f'{count}\n({percent:.1f}%)' plt.figure(figsize=(5.5, 5.5)) sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', xticklabels=labels_to_show, yticklabels=labels_to_show, cbar=False, square=True, linewidths=0.5, annot_kws={"size": 10}) plt.title('Confusion Matrix with Counts and Percentages') plt.xlabel('Predicted') plt.ylabel('True') plt.tight_layout() plt.show() # ROC Curves fpr = dict() tpr = dict() roc_auc = dict() for i in range(num_classes_u): fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i]) roc_auc[i] = auc(fpr[i], tpr[i]) plt.figure(figsize=(8, 6)) color_map = mpl_cm.get_cmap('tab10', num_classes_u) for i in range(num_classes_u): plt.plot( fpr[i], tpr[i], lw=2, color=color_map(i), label=f"ROC curve for {id_to_class[i]} (AUC = {roc_auc[i]:.2f})" ) plt.plot([0, 1], [0, 1], 'k--', lw=2) # plt.xlim([0.0, 1.0]) # plt.ylim([0.0, 1.05]) plt.xlabel("False Positive Rate") plt.ylabel("True Positive Rate") plt.title("Receiver Operating Characteristic (ROC) Curve") plt.legend(loc="lower right") plt.grid(True) plt.tight_layout() plt.show()
The code above is similar to the one used in the previous section for anomaly detection.
The key difference is that instead of using only healthy vs. faulty data, this version is trained using multiple labeled datasets, each corresponding to a specific fault condition (e.g., Loose Belt, Misalignment, Mass Change).
By doing this, the model learns not just to detect whether a fault exists, but also to classify the exact type of fault when one is detected.
This approach enables multi-class fault detection, making the model more informative and practical for real-world diagnostics.
# Import necessary libraries import os # For file path handling import numpy as np # Numerical operations import pandas as pd # Data handling with DataFrames import matplotlib.pyplot as plt # Plotting and visualization import matplotlib.patches as mpatches # Custom legend patches for plots # Machine learning and signal processing tools from sklearn.decomposition import PCA # Principal Component Analysis from sklearn.cluster import KMeans # KMeans clustering from sklearn.preprocessing import StandardScaler # Standardize data from scipy.stats import kurtosis, skew # Statistical features from scipy.fft import rfft, rfftfreq # Fast Fourier Transform for frequency features # Endaq IDE tool for loading .ide files import endaq.ide # ------------------ Load and Label Data ------------------ # This function loads sensor data and attaches a label indicating the condition (e.g., healthy, faulty) def load_and_label_data(dataset_name, label, start_time=None, end_time=None): ext = os.path.splitext(dataset_name)[1].lower() # Get the file extension # If the file is a CSV if ext == '.csv': data = pd.read_csv(dataset_name) # Apply time filtering if specified if start_time is not None: data = data[data['time'] >= start_time] if end_time is not None: data = data[data['time'] <= end_time] # If the file is a .ide format (Endaq format) elif ext == '.ide': doc = endaq.ide.get_doc(dataset_name) # Convert a specific channel (channel 80) into a pandas DataFrame data = endaq.ide.to_pandas(doc.channels[80], time_mode='seconds') # Filter data based on start and end time if start_time is not None: data = data[data.index >= start_time] if end_time is not None: data = data[data.index <= end_time] data['time'] = data.index # Add time column if not present else: raise ValueError(f"Unsupported file type: {ext}") data['label'] = label # Assign label to the dataset (e.g., 0 = healthy) return data # Loads and combines multiple datasets (healthy and faulty) def load_multiple_datasets(healthy_datasets, faulty_datasets, start_time=None, end_time=None): combined_data = pd.DataFrame() # Load and label each healthy dataset for dataset in healthy_datasets: healthy_data = load_and_label_data(dataset, label=0, start_time=start_time, end_time=end_time) combined_data = pd.concat([combined_data, healthy_data], ignore_index=True) # Load and label each faulty dataset based on its fault type for fault_type, fault_datasets in faulty_datasets.items(): for dataset in fault_datasets: faulty_data = load_and_label_data(dataset, label=fault_type, start_time=start_time, end_time=end_time) combined_data = pd.concat([combined_data, faulty_data], ignore_index=True) return combined_data # ------------------ Feature Extraction ------------------ # Extract statistical and frequency features from a vibration signal segment def extract_features(signal, sampling_rate=4052): N = len(signal) # Number of samples freqs = rfftfreq(N, d=1/sampling_rate) # Frequency bins fft_vals = np.abs(rfft(signal)) # FFT values (magnitude) # Return a dictionary of time-domain and frequency-domain features return { 'mean': np.mean(signal), 'std': np.std(signal), 'rms': np.sqrt(np.mean(signal**2)), 'kurtosis': kurtosis(signal), 'skewness': skew(signal), 'crest_factor': np.max(np.abs(signal)) / np.sqrt(np.mean(signal**2)), 'peak_to_peak': np.ptp(signal), 'freq_mean': np.mean(fft_vals), 'freq_std': np.std(fft_vals), 'freq_peak': freqs[np.argmax(fft_vals)] # Frequency with max amplitude } # ------------------ Define File Paths ------------------ # Paths to healthy condition datasets healthy_datasets = [ r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Healthy\DAQ16837_000146.ide', ] # Dictionary mapping fault types to datasets faulty_datasets = { 1: [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\F_Loose\DAQ16837_000158.ide'], 2: [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\F_missAlignment\DAQ16837_000155.ide'], 3: [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Faulty\aeaa0a24-DAQ16837_000149-1744735366.ide'] } # ------------------ Load and Process Data ------------------ combined_data = load_multiple_datasets(healthy_datasets, faulty_datasets) # Define a fixed window size for segmentation (e.g., 1-second if sampling rate is 1 kHz) window_size = 1000 features_list = [] labels_list = [] # Loop through each labeled group (e.g., all healthy, all loose, etc.) for label, group in combined_data.groupby("label"): values = group.iloc[:, 1] # Assuming vibration data is in the second column for i in range(0, len(values) - window_size, window_size): segment = values.iloc[i:i+window_size].values if len(segment) == window_size: features = extract_features(segment) # Extract features from segment features_list.append(features) labels_list.append(label) # ------------------ PCA and Clustering ------------------ # Convert list of features into a DataFrame features_df = pd.DataFrame(features_list) true_labels = np.array(labels_list) # Standardize features to have zero mean and unit variance scaler = StandardScaler() features_scaled = scaler.fit_transform(features_df) # Apply PCA to reduce to 2 dimensions for visualization pca = PCA(n_components=2) pca_result = pca.fit_transform(features_scaled) # Apply KMeans clustering to the feature space kmeans = KMeans(n_clusters=4, random_state=42) clusters = kmeans.fit_predict(features_scaled) # ------------------ Visualization: PCA by True Labels ------------------ # Map numeric labels to human-readable names label_names = { 0: "Healthy", 1: "Looseness", 2: "Misalignment", 3: "Mass change" } colors = ['green', 'red', 'blue', 'orange'] # Colors for plotting # Create scatter plot of PCA results colored by true condition label plt.figure(figsize=(10, 6)) for label in np.unique(true_labels): idx = true_labels == label plt.scatter( pca_result[idx, 0], pca_result[idx, 1], c=colors[label], label=label_names[label], s=50 ) plt.title("PCA Projection (Colored by True Labels)") plt.xlabel("PCA 1") plt.ylabel("PCA 2") plt.grid(True) # Add custom legend handles = [mpatches.Patch(color=colors[i], label=label_names[i]) for i in label_names] plt.legend(handles=handles, title="Data Type") plt.show() # ------------------ Visualization: PCA by Clusters ------------------ # Plot PCA result with colors based on KMeans clustering plt.figure(figsize=(10, 6)) plt.scatter(pca_result[:, 0], pca_result[:, 1], c=clusters, cmap='viridis', s=50) plt.title("PCA Projection (Colored by KMeans Clusters)") plt.xlabel("PCA 1") plt.ylabel("PCA 2") plt.grid(True) plt.colorbar(label="Cluster ID") plt.show() # ------------------ PCA Components (Feature Influence) ------------------ # Display the weights of each original feature in the PCA components pca_components = pd.DataFrame( pca.components_, columns=features_df.columns, index=["PCA 1", "PCA 2"] ) # Print PCA component contributions for interpretation print(pca_components.T)
✅ start_time = 0
- What it does:
Sets the starting point (in seconds) for the data segment to be used from each file. - Tip:
If the initial data contains startup noise or irrelevant behavior, consider increasing this value (e.g., start_time = 10).
✅ end_time = 50
- What it does:
Defines the end point (in seconds) for the data segment. - Tip:
Use a smaller window (like 30–60 seconds) to keep the dataset manageable during experimentation. For final models, increase if you want the model to learn long-term behavior.
✅ time_steps_u = 60
- What it does:
Sets the number of time steps (sequence length) for the LSTM input. Each sample fed into the LSTM includes 60 time steps of sensor data. - Tip:
If your vibration patterns have longer temporal dependencies, try increasing this to 80 or 100. For faster training, you can reduce it to 30–40.
✅ test_size = 0.2
- What it does:
Reserves 20% of the dataset for testing, while the remaining 80% is used for training. - Tip:
For smaller datasets, use a slightly higher test size like 0.3. For large datasets, 0.2 or even 0.1 may be sufficient.
✅ epochs_u = 2
- What it does:
Sets the number of complete training cycles (epochs) the model will go through. - Tip:
This is low for real training—start with 10–30 epochs to properly train your model. Use 2–5 epochs for quick debugging only.
✅ batch_size_u = 32
- What it does:
Specifies the number of samples the model processes before updating weights. - Tip:
Common choices are 16, 32, 64. Lower values (e.g., 16) may help with noisy data, while higher values (e.g., 64) can speed up training but might require more memory.
✅ lstm_units = 50
- What it does:
Sets the number of LSTM memory cells (neurons) in each LSTM layer. - Tip:
More units (e.g., 100 or 128) give the model more learning capacity. Use higher values if your data has complex patterns and you have enough training samples.
✅ selected_columns = ['X (40g)', 'Y (40g)', 'Z (40g)']
- What it does:
Chooses the input features from the vibration sensor—here, acceleration along the X, Y, and Z axes. - Tip:
If you know one axis captures most of the signal, you can use just ['X (40g)'] to reduce dimensionality. Otherwise, keeping all three gives a fuller picture.
Once you provide the correct file paths for your datasets and run the code, the model will:
- ✅ Train using the labeled vibration data,
- 📈 Return the accuracy of the model on the test set,
- 📊 Display a confusion matrix showing how well each fault type was classified,
- 📉 Show ROC curves for each class, giving a visual indication of how well the model separates different conditions.
These outputs help evaluate both the overall performance and the model’s ability to distinguish between specific fault types.
Figure 10. ROC curve for detecting fault type.
Figure 11. ROC curve for detecting fault type.
Based on the confusion matrix, we can conclude that the ML model accurately detects and classifies fault types with high reliability:
- Healthy: 99.2% of healthy samples are correctly classified as healthy. Only 0.8% are misclassified, mostly as "MassChange."
- Loose: 99.9% of loose condition data are accurately detected. A very small number are misclassified as healthy or another fault type.
- Misalignment: Nearly all misaligned samples are correctly labeled, indicating a distinct and consistent vibration signature that sets it apart from both healthy and other faulty conditions.
- MassChange: 99.8% of mass-changed cases are correctly identified, with very few misclassifications.
These results show that the model is not only able to detect faults but also to identify the root cause with very high precision across all fault categories.
Statistical Methods
Apart from Machine Learning (ML) approaches, statistical tools can also be very effective for detecting significant changes or anomalies in databases. Unlike ML, which often requires training on labeled data, statistical methods analyze the data directly—looking for shifts in distribution, central tendency, or spread.
These methods are especially useful when:
- You have limited labeled data.
- You want to perform quick and interpretable checks.
- You're dealing with simple or small-scale systems.
Here's a more advanced statistical and unsupervised learning methods table that includes powerful tools like PCA, t-SNE, UMAP, ICA, Autoencoders, and Factor Analysis—commonly used for anomaly detection, dimensionality reduction, or feature understanding:
Method |
Linear? |
Good for clustering? |
Notes |
PCA (Principal Component Analysis) |
✅ Yes |
⚠️ Limited |
Great for dimensionality reduction; highlights variance and outliers in a compact space. |
t-SNE (t-distributed Stochastic Neighbor Embedding) |
❌ No |
✅ Excellent |
Preserves local structure well; best for visualizing clusters and anomalies. |
UMAP (Uniform Manifold Approximation and Projection) |
❌ No |
✅ Excellent |
Faster than t-SNE; preserves both local and some global structure; great for large data. |
ICA (Independent Component Analysis) |
✅ Yes |
⚠️ Limited |
Useful for separating mixed signals; often used in signal processing and vibration data. |
Autoencoders |
❌ No |
✅ Excellent |
Deep learning-based; detects anomalies based on reconstruction error. |
Factor Analysis |
✅ Yes |
⚠️ Limited |
Identifies latent variables behind observed data; good for understanding data structure. |
Parallel Coordinates Plot (PCP) |
✅ Yes |
⚠️ Visual Aid Only |
A visualization method to detect patterns or separability across features. |
In this section, we'll perform PCA on the databases. Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of large datasets while preserving as much variability (information) as possible. It transforms the original variables into a new set of uncorrelated variables called principal components. These components are ordered by the amount of variance they capture from the data, allowing us to identify patterns, highlight similarities or differences, and simplify complex datasets for visualization or further analysis.
In the context of condition monitoring, PCA helps uncover hidden structure in sensor or measurement data by projecting it into a lower-dimensional space—making it easier to distinguish between different operating states or faults.
# ================== SETUP ================== import pandas as pd import numpy as np import os from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc from sklearn.preprocessing import MinMaxScaler from tensorflow.keras.models import Sequential from tensorflow.keras.layers import LSTM, Dense, Dropout from tensorflow.keras.utils import to_categorical import seaborn as sns import matplotlib.pyplot as plt from sklearn.utils.multiclass import unique_labels import endaq.ide import matplotlib.cm as mpl_cm # ================== PARAMETERS ================== start_time = 0 end_time = 50 time_steps_u = 60 test_size = 0.2 epochs_u = 2 batch_size_u = 32 lstm_units = 50 selected_columns = ['X (40g)', 'Y (40g)', 'Z (40g)'] # ================== DYNAMIC DATASET DICTIONARY ================== datasets = { #### Small test rig "Healthy": [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Healthy\DAQ16837_000146.ide'], "Loose": [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\F_Loose\DAQ16837_000158.ide'], "Misalignment": [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\F_missAlignment\DAQ16837_000155.ide'], "MassChange": [r'C:\Users\hdabiri\OneDrive - Hutchinson Industries Inc\HAMED\Marketing\Faulty\aeaa0a24-DAQ16837_000149-1744735366.ide'] } # ================== LABEL ENCODING ================== class_names = list(datasets.keys()) class_to_id = {name: idx for idx, name in enumerate(class_names)} id_to_class = {idx: name for name, idx in class_to_id.items()} num_classes_u = len(class_to_id) # ================== DATA LOADING FUNCTIONS ================== def load_and_label_data(filepath, label, start_time=None, end_time=None): ext = os.path.splitext(filepath)[1].lower() if ext == '.csv': data = pd.read_csv(filepath) if start_time is not None: data = data[data['time'] >= start_time] if end_time is not None: data = data[data['time'] <= end_time] elif ext == '.ide': doc = endaq.ide.get_doc(filepath) data = endaq.ide.to_pandas(doc.channels[80], time_mode='seconds') if start_time is not None: data = data[data.index >= start_time] if end_time is not None: data = data[data.index <= end_time] data['time'] = data.index else: raise ValueError(f"Unsupported file type: {ext}") data['label'] = label return data def load_all_labeled_data(datasets_dict, start_time, end_time): all_data = pd.DataFrame() for class_name, file_list in datasets_dict.items(): label = class_to_id[class_name] for filepath in file_list: labeled_data = load_and_label_data(filepath, label, start_time, end_time) all_data = pd.concat([all_data, labeled_data], ignore_index=True) return all_data # ================== MODEL PREPARATION FUNCTIONS ================== def preprocess_data(data, selected_columns): scaler = MinMaxScaler(feature_range=(0, 1)) scaled_data = scaler.fit_transform(data[selected_columns]) return scaled_data, scaler def prepare_data_for_lstm(scaled_data, labels, time_steps=time_steps_u, num_classes=num_classes_u): X, y = [], [] for i in range(time_steps, len(scaled_data)): X.append(scaled_data[i-time_steps:i, :]) y.append(labels[i]) X = np.array(X) y = to_categorical(y, num_classes=num_classes) return X, y def build_lstm_model(input_shape, lstm_units=50, num_classes=num_classes_u): model = Sequential() model.add(LSTM(lstm_units, return_sequences=True, input_shape=input_shape)) model.add(Dropout(0.2)) model.add(LSTM(lstm_units, return_sequences=False)) model.add(Dropout(0.2)) model.add(Dense(num_classes, activation='softmax')) model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) return model # ================== EXECUTION ================== combined_data = load_all_labeled_data(datasets, start_time, end_time) scaled_data, scaler = preprocess_data(combined_data, selected_columns) labels = combined_data['label'].values X, y = prepare_data_for_lstm(scaled_data, labels, time_steps_u, num_classes_u) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42) model = build_lstm_model(X_train.shape[1:], lstm_units, num_classes_u) model.fit(X_train, y_train, epochs=epochs_u, batch_size=batch_size_u, validation_data=(X_test, y_test)) # ================== EVALUATION & PLOTTING ================== y_pred = model.predict(X_test) y_pred_classes = np.argmax(y_pred, axis=1) y_test_classes = np.argmax(y_test, axis=1) accuracy = accuracy_score(y_test_classes, y_pred_classes) print(f"\n✅ Accuracy: {accuracy:.4f}\n") class_indices = unique_labels(y_test_classes, y_pred_classes) labels_to_show = [id_to_class[i] for i in class_indices] cm = confusion_matrix(y_test_classes, y_pred_classes, labels=class_indices) cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100 annot = np.empty_like(cm).astype(str) for i in range(cm.shape[0]): for j in range(cm.shape[1]): count = cm[i, j] percent = cm_percent[i, j] annot[i, j] = f'{count}\n({percent:.1f}%)' plt.figure(figsize=(6, 6)) sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', xticklabels=labels_to_show, yticklabels=labels_to_show, cbar=False, square=True, linewidths=0.5) plt.title('Confusion Matrix with Counts and Percentages') plt.xlabel('Predicted') plt.ylabel('True') plt.tight_layout() plt.show() # ROC Curves fpr = dict() tpr = dict() roc_auc = dict() for i in range(num_classes_u): fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i]) roc_auc[i] = auc(fpr[i], tpr[i]) plt.figure(figsize=(8, 6)) color_map = mpl_cm.get_cmap('tab10', num_classes_u) for i in range(num_classes_u): plt.plot( fpr[i], tpr[i], lw=2, color=color_map(i), label=f"ROC curve for {id_to_class[i]} (AUC = {roc_auc[i]:.2f})" ) plt.plot([0, 1], [0, 1], 'k--', lw=2) # plt.xlim([0.0, 1.0]) # plt.ylim([0.0, 1.05]) plt.xlabel("False Positive Rate") plt.ylabel("True Positive Rate") plt.title("Receiver Operating Characteristic (ROC) Curve") plt.legend(loc="lower right") plt.grid(True) plt.tight_layout() plt.show()
And the results will be:
Figure 12. PCA for healthy and faulty databases.
The PCA (Principal Component Analysis) plot reveals several insightful patterns in the data related to different belt conditions.
- Loose Condition: Data associated with the loose condition forms a narrow cluster along the PCA1 axis and spans a wider range along PCA2. This cluster is also positioned far from other conditions in terms of both PCA1 and PCA2, suggesting that PCA could potentially be effective in identifying loose belt conditions. However, more data from additional "loose" condition datasets is necessary to validate this finding.
- Mass-Changed Condition: Data for the mass-changed condition is spread out across both PCA1 and PCA2 axes and appears on the right side of the plot. This distribution may offer useful insights for clustering mass-changed conditions separately from other operational states.
- Misaligned and Healthy Conditions: These two conditions largely overlap in the PCA space, implying that PCA produces similar results for both. Consequently, PCA may not be a reliable method for distinguishing between healthy and misaligned belt states.
It is important to note that these observations are specific to the current test dataset and should not be generalized to all belt-driven systems in real-world applications. Nevertheless, the results suggest that PCA can be a valuable preliminary tool for distinguishing between different operational conditions—without relying on machine learning-based methods.
The components of PCA1 and PCA2 are provided in the table below, offering a deeper understanding of the features that contribute most to the variance in each principal component. This information can be particularly useful when interpreting the positioning of different condition data in the PCA plot.
Table 1. Components of PCA1 and PCA2.
Parameter |
PCA1 |
PCA2 |
Mean |
0.077504 |
-0.530687 |
Std |
0.386295 |
0.000617 |
RMS |
0.385350 |
0.050540 |
Kurtosis |
0.265002 |
0.395965 |
Skewness |
-0.108547 |
0.678720 |
Crest_factor |
0.295110 |
-0.150217 |
peak_to_peak |
0.378750 |
0.115777 |
freq_mean |
0.379757 |
0.028138 |
freq_std |
0.370553 |
0.128894 |
freq_peak |
0.319687 |
-0.212113 |
Conclusions
Vibration data plays a crucial role in condition monitoring and predictive maintenance of mechanical systems. By analyzing this data, we can detect anomalies early, helping to prevent costly damage and unexpected downtime. In this blog, we demonstrated how vibration data can be leveraged for fault detection and diagnosis.
We began by visualizing vibration signals along the X, Y, and Z axes, followed by statistical analysis and histogram representations to gain a clearer understanding of the signal characteristics. These visual and statistical insights provide a solid foundation for identifying abnormal patterns in system behavior.
To take this further, we employed Long Short-Term Memory (LSTM) networks—first for detecting anomalies in the system, and then for classifying specific fault types such as misalignment, looseness, and mass imbalance. LSTM's ability to capture temporal dependencies makes it well-suited for understanding the sequential nature of vibration signals.
One important takeaway is that detecting unhealthy conditions is relatively straightforward when we have access to healthy (baseline) vibration data, which can typically be collected during normal operation. However, identifying the exact fault type is more challenging, as it requires labeled vibration data for each specific fault condition—something that may not always be readily available in real-world scenarios.
Resources, Scripts & Files
Here are the files and scripts associated with this blog post.
Download this blog post as a PDF
White Paper: Vibration-Based Fault Detection in Belt Drive Systems Using FFT and LSTM Models: Presented at IWSHM Proceedings 2025