Authors:
Paola Jaramillo Garcia is a data scientist at MathWorks.
Mohamed Anas is a regional engineering manager at MathWorks.
Reading time: 5 minutes
By combining scalable software tools with tunable machine learning capabilities, engineers and scientists can efficiently identify the model that best fits their industrial data and meets their objectives, while guarding against overfitting.
Engineers and scientists are building smarter products and services, such as advanced driver assistance systems and predictive maintenance applications, driven by analytics based on industrial data. Analytics modeling is the ability to describe and predict a system's behavior from historical data using domain-specific techniques for data preparation, feature engineering and machine learning. Combining these capabilities with automatic code generation that targets edge-to-cloud deployment enables reuse while automating actions and decisions.
With the increased availability of 'big industrial data', compute power and scalable software tools, it is easier than ever to use machine learning in engineering applications. ML methods 'learn' information directly from the industrial data without relying on a predetermined equation as a model, which makes them particularly suitable for today's complex systems. However, two of the most common challenges faced by engineers and scientists modeling with machine learning are choosing a suitable ML model to classify their domain-specific data and eliminating overfitting.

Classification models assign items to a discrete category based on a specific set of engineering features extracted from the historical data. Determining the best model is often difficult because each data set, and each desired outcome, is unique. Overfitting occurs when a model is too closely aligned with limited training data that may contain noise or errors. An overfitted model cannot generalize well to data outside the training set, which limits its usefulness in a production system.
Choosing a classification model
Choosing a classification model type can be challenging because each has its own characteristics, which could be a strength or a weakness depending on the problem. For starters, you must answer a few questions about the type and purpose of the data: what is the model meant to accomplish? How much data is there, and of what type? How much detail is needed? Is storage a limiting factor? Answering these questions helps narrow the choices. You can then use cross-validation to estimate how accurately each candidate model will perform on unseen data and select the best-fitting classification model.
There are many types of classification models. A common type is logistic regression. Because of its simplicity, this model is often used as a baseline. It's applied to problems where the data falls into one of two possible classes. A logistic regression model returns the probability that a data point belongs to each class.
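To make this concrete, here is a minimal sketch in Python using scikit-learn (the article's workflow is tool-agnostic; the toy data below is invented for illustration):

```python
from sklearn.linear_model import LogisticRegression

# Toy two-class data: each row is a feature vector, each label is 0 or 1
X = [[0.2, 1.1], [0.4, 0.9], [1.8, 3.2], [2.1, 2.9]]
y = [0, 0, 1, 1]

model = LogisticRegression().fit(X, y)

# predict_proba returns one probability per class for each query point
print(model.predict_proba([[1.0, 2.0]]))
```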
Another common type is the k-nearest neighbor (kNN) classifier. This simple yet effective method categorizes data points based on their distance to other points in a training data set. It has a short training time, but it can mistake irrelevant attributes for important ones unless distance weights are applied, especially as the number of data points grows.
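A corresponding sketch on the same invented toy data: in scikit-learn, passing weights='distance' makes closer neighbors count more than distant ones, which is one way to apply the weighting mentioned above.

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0.2, 1.1], [0.4, 0.9], [1.8, 3.2], [2.1, 2.9]]
y = [0, 0, 1, 1]

# weights='distance' down-weights far-away neighbors during voting
knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)
print(knn.predict([[1.0, 2.0]]))
```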
A third common classification model type is the decision tree. This model can be visualized, and it's relatively easy to follow the decision path taken from root to leaf. It's especially useful when it's important to show how a conclusion was reached.
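A sketch of that transparency, using scikit-learn's export_text to print the root-to-leaf rules of a small tree trained on the same invented data (the feature names are arbitrary placeholders):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0.2, 1.1], [0.4, 0.9], [1.8, 3.2], [2.1, 2.9]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# export_text prints the learned if/else rules from root to leaf
print(export_text(tree, feature_names=['feature_0', 'feature_1']))
```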
A fourth common type is the support vector machine (SVM). This model finds a hyperplane that separates data into two classes (multiclass problems are handled by combining several binary classifiers). It's accurate, tends not to overfit and is relatively easy to interpret, but training time can be somewhat long, especially for larger data sets.
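A minimal SVM sketch on the same toy data; the linear kernel is an illustrative choice that keeps the separating hyperplane easy to reason about:

```python
from sklearn.svm import SVC

X = [[0.2, 1.1], [0.4, 0.9], [1.8, 3.2], [2.1, 2.9]]
y = [0, 0, 1, 1]

# A linear kernel learns a single separating hyperplane; scikit-learn
# handles multiclass input internally via pairwise binary classifiers
svm = SVC(kernel='linear').fit(X, y)
print(svm.predict([[1.0, 2.0]]))
```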
A fifth common type is the artificial neural network (ANN). These networks can be configured and trained to solve a variety of different problems, including classification and time series prediction. However, the trained models are known to be difficult to interpret.
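For completeness, a small feed-forward network on the same toy data; the hidden-layer size and iteration cap are arbitrary tuning choices, and a data set this tiny is for illustration only:

```python
from sklearn.neural_network import MLPClassifier

X = [[0.2, 1.1], [0.4, 0.9], [1.8, 3.2], [2.1, 2.9]]
y = [0, 0, 1, 1]

# One hidden layer of 8 units; max_iter raised so training converges
ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X, y)
print(ann.predict([[1.0, 2.0]]))
```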
You can simplify the decision-making process by using scalable software tools to determine which model best fits a set of features, assess classifier performance, compare and improve model accuracy and, finally, export the best model. These tools also help users explore the data, select features, specify validation schemes and train multiple models.
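In code, the compare-and-select loop those tools automate might look like the following sketch: cross-validated accuracy is estimated for each candidate, and the best scorer is kept. The candidate list, the built-in iris data and the five folds are all illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in for your engineered features

candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'kNN': KNeighborsClassifier(),
    'decision tree': DecisionTreeClassifier(random_state=0),
    'SVM': SVC(),
}

# Mean 5-fold cross-validated accuracy for each candidate model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print('best-fitting model:', best)
```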
Eliminating data overfitting
Overfitting occurs when a model fits a data set but doesn't generalize well to new data. It is often hard to avoid because it typically results from insufficient training data, especially when those responsible for the model didn't gather the data themselves. The best way to avoid overfitting is to use training data that accurately reflects the diversity and complexity of the system being modeled.
Data regularization and generalization are two additional methods you can apply to guard against overfitting. Regularization is a technique that prevents the model from over-relying on individual data points. Regularization algorithms introduce additional information into the model and handle multicollinearity and redundant predictors by making the model more parsimonious and accurate. These algorithms typically work by applying a penalty for complexity, such as adding the magnitude of the model's coefficients to the cost function being minimized or including a roughness penalty.
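As a concrete sketch, scikit-learn's LogisticRegression exposes such a penalty through its penalty and C parameters (C being the inverse of the regularization strength; both values below are illustrative). An L1 penalty shrinks the coefficients of redundant predictors to exactly zero, making the model more parsimonious:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # coefficient penalties assume comparable feature scales

# L1-penalized logistic regression: complexity is penalized via the
# magnitude of the coefficients, zeroing out redundant predictors
model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
print('predictors kept:', (model.coef_ != 0).sum(), 'of', X.shape[1])
```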
Generalization divides the available data into three subsets. The first is the training set; the second is the validation set. The error on the validation set is monitored during the training process, and the model is fine-tuned until it is accurate enough. The third subset is the test set, used on the fully trained classifier after the training and cross-validation phases to check that the model hasn't overfitted the training and validation data.
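A minimal sketch of that three-way split (the 60/20/20 ratio and the built-in iris data are arbitrary examples): train_test_split is applied twice, and the test set is touched only once, after tuning is finished.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Carve off a 20% test set first, then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of 80% = 20%

# Tune on (X_train, y_train) against (X_val, y_val);
# evaluate once on (X_test, y_test) when training is complete
print(len(X_train), len(X_val), len(X_test))
```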
There are six cross-validation methods that can help prevent overfitting (a sketch of several of them follows the list):
- K-fold partitions data into k randomly chosen subsets (or folds) of roughly equal size; one fold is used to validate the model trained with the remaining folds. The process is repeated k times, so that each fold is used exactly once for validation.
- Holdout separates data into two subsets of a specified ratio, one for training and one for validation.
- Leaveout is the k-fold approach with k equal to the total number of observations in the data.
- Repeated random subsampling performs Monte Carlo repetitions of randomly splitting the data and aggregates the results over all the runs.
- Stratify partitions data so that both the training and test sets have roughly the same class proportions in the response or target.
- Resubstitution uses the training data for validation without separating it. This method often produces overly optimistic performance estimates and should be avoided when enough data is available.
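Several of these schemes map directly onto scikit-learn splitter objects (KFold, LeaveOneOut, ShuffleSplit for repeated random subsampling, StratifiedKFold); a hedged sketch, with the fold counts and data chosen purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     StratifiedKFold, cross_val_score)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

schemes = {
    'k-fold (k=5)': KFold(n_splits=5, shuffle=True, random_state=0),
    'leaveout': LeaveOneOut(),
    'repeated random subsampling': ShuffleSplit(n_splits=20, test_size=0.2,
                                                random_state=0),
    'stratified k-fold': StratifiedKFold(n_splits=5, shuffle=True,
                                         random_state=0),
}

# Mean validation accuracy under each partitioning scheme
for name, cv in schemes.items():
    print(name, round(cross_val_score(model, X, y, cv=cv).mean(), 3))
```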
Machine learning veterans and beginners alike run into trouble with classification and overfitting. While the challenges can seem daunting, the right tools and the validation methods described above make it easier to apply machine learning to real-world projects.