
Machine Learning for Predictive Models

January 3, 2023

Dominic Ciarlette, ENTRUST, U.S., explains how machine learning can be employed to develop predictive models that can help utilities develop more efficient asset replacement strategies.


A medium‑sized U.S. gas utility company is engaged in a comprehensive asset review program. The goal of the program is to locate occurrences of a particular fitting within its distribution networks. The fitting, a type of mechanical tee, is associated with an increased risk of leaking. The utility is currently performing several thousand excavations and remediations to positively identify these fittings within its network. To aid this effort, the utility partnered with ENTRUST Solutions Group (ENTRUST) to produce a machine learning model capable of predicting occurrences of the target fitting within the networks. The model predictions validated and augmented the utility’s current asset review methodology and resulted in a reduction in costs and an increase in confidence in the program.


Natural gas distribution networks comprise main and service piping. The main piping carries gas throughout the network, and service piping delivers gas from the network to the end‑user. Service piping is commonly joined to mains via tee fittings, so named for their resemblance to the letter “T.” These tee fittings join two pipes and form a pressure seal between them. The subject of this article is a specific type of tee fitting. Following investigations, this tee type has been associated with an increased risk of leaking. In response, the utility is seeking to locate all such fittings in its system.

Unfortunately, complete and accurate construction records, which would definitively locate and identify all instances of a sought-after fitting, are rarely available to utilities. In some cases, the construction activities go back several decades. Furthermore, some records may have changed hands several times as a result of acquisitions and mergers. Lacking sufficient records to locate all target fittings, the utility pursued an analytical approach to improve target-fitting identification.

Predictive Model Development

Most predictive model development begins with reviewing the available data and consulting domain experts. Fortunately, we had at our disposal several thousand records of direct examinations, thanks to years of excavation activities undertaken by the utility. This set of excavated fitting records formed the basis for the data used to train the predictive model. Because each excavated record carries a confirmed fitting type, we could use these records as labeled training data and take a supervised learning approach. What followed was extensive feature engineering to cleanse, standardize and transform the training data to facilitate optimal predictive model training.

Feature Engineering

To make the data suitable as training and prediction input, we performed several feature engineering steps. First, we sought to expand the available dataset to include additional predictive information. Following discussions with utility subject matter experts (SMEs), we expanded the original dataset to include fields from other datasets within the utility’s data lake. We also learned about a possible causal relationship between fitting installation activities and weather, specifically rainfall. With this in mind, we gathered weather records that covered the time period and locations of fitting installations and were able to establish a statistically significant relationship between rainfall amount and fitting type.
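
The article does not say which statistical test was used to relate rainfall to fitting type. As one possible illustration, the sketch below runs a two-sided permutation test on the difference in mean rainfall between two groups of installation records. The function name, the rainfall values and the group names are all hypothetical; the numbers are made up and are not the utility's data.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Reassign records to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up rainfall totals (inches) on installation day, for illustration only.
rain_target = [0.0, 0.1, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0]
rain_other = [0.6, 0.9, 0.4, 1.1, 0.7, 0.5, 0.8, 1.0]
p_value = permutation_p_value(rain_target, rain_other)
print(f"p-value: {p_value:.4f}")
```

A small p-value would suggest that rainfall differs between the two groups by more than chance alone explains. It would not by itself show that rainfall caused the choice of fitting.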

Then, we cleansed and standardized the data. This addressed issues in the following areas:


  • Punctuation
  • Whitespace
  • Letter case
  • Inconsistent units
  • Improperly aggregated information


The cleansing steps helped to ensure that the model was receiving the information present in the data in a consistent, proper manner.
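
The cleansing categories listed above can be sketched with the Python standard library. This is a minimal illustration only: the field formats, the unit table and the function names are assumptions, since the article does not describe the utility's actual fields or units.

```python
import re

# Hypothetical unit table: conversion factors to inches. The article does not
# list the units found in the utility's data.
TO_INCHES = {"in": 1.0, "inch": 1.0, "inches": 1.0, "ft": 12.0, "mm": 1 / 25.4}


def clean_text(value: str) -> str:
    """Normalize punctuation, whitespace and letter case of a free-text field."""
    value = re.sub(r"[^\w\s]", " ", value)      # punctuation -> space
    value = re.sub(r"\s+", " ", value).strip()  # collapse runs of whitespace
    return value.upper()                        # use a single letter case


def to_inches(value: str) -> float:
    """Convert strings such as '2 in', '0.5ft' or '50.8 mm' to inches."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([a-zA-Z]+)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized measurement: {value!r}")
    number, unit = match.groups()
    return float(number) * TO_INCHES[unit.lower()]


def split_aggregated(value: str) -> dict:
    """Split a field that packs material and size together, e.g. 'PE / 2 in'."""
    material, size = value.split("/", 1)
    return {"material": clean_text(material), "size_in": to_inches(size)}


print(clean_text("  mech.  tee,  steel "))
print(split_aggregated("pe / 2 in"))
```

Normalizing values this way keeps variants such as "Mech. Tee" and "MECH TEE" from being treated as different categories by the model.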

Model Choice

Following feature engineering, we identified several model types to trial with the training data. The model types included:

  • Decision Tree
  • Random Forest
  • Logistic Regression
  • Gaussian Naïve Bayes
  • Gradient Boosting

We trained and evaluated models for all of the model types listed above. Model testing performance is listed in Figure 1.

We selected the Random Forest approach as the basis for the predictive model due to its favorable performance compared to other analytical modeling techniques. Following model selection, we optimized the parameters of the Random Forest model to boost predictive performance. Optimized parameters include maximum tree depth and estimator count.
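
The comparison and tuning steps could look roughly like the following sketch, assuming scikit-learn is the library in use (the article does not name one). The data is synthetic, and the scoring metric, fold count and parameter grid are illustrative assumptions rather than the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered feature table (1 = target fitting).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated F1 score for each model type.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

# Tune the two parameters the article names: tree depth and estimator count.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 6, None], "n_estimators": [50, 100]},
    cv=5,
    scoring="f1",
)
search.fit(X, y)
best_params = search.best_params_

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
print("Best Random Forest parameters:", best_params)
```

Cross-validated F1 is used here only as a convenient single number for ranking the candidates; the metric reported in Figure 1 may differ.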

Model Performance Testing

We evaluated the trained model’s predictive performance on a set of testing data, which is standard practice in machine learning applications. In this case, we set aside 20% of the labeled data and excluded it from the data used to train the model. This approach ensures that a model cannot simply memorize the correct answers and report a deceptively high performance score. Figure 2 shows the trained model’s predictions on the testing data as a confusion matrix.
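
A minimal sketch of the 20% hold-out described above, assuming a simple random split (the article does not say whether the split was random, stratified or otherwise structured). The record layout and function name are hypothetical.

```python
import random


def train_test_split_records(records, test_fraction=0.2, seed=42):
    """Shuffle labeled records and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled records are never shown to the model in training.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled records: (record_id, is_target_fitting)
labeled = [(i, i % 3 == 0) for i in range(1000)]
train, test = train_test_split_records(labeled)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, so that different model types are compared on the same held-out records.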


According to the results, this model has favorable precision[1] and sensitivity[2] for both target and non‑target asset prediction. This means that the model is successfully predicting asset type from the testing data.
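
Both metrics follow directly from the four cells of a two-class confusion matrix. The sketch below uses made-up counts, not the values from Figure 2, and treats the target fitting as the positive class.

```python
def precision(tp: int, fp: int) -> float:
    """Share of records predicted as a class that really belong to it."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Share of records that belong to a class and were predicted as it."""
    return tp / (tp + fn)


# Made-up confusion-matrix counts, for illustration only.
tp, fp, fn, tn = 90, 10, 30, 270

# Target class: true positives against false positives and false negatives.
print(precision(tp, fp), sensitivity(tp, fn))

# Non-target class: the roles of the cells are mirrored.
print(precision(tn, fn), sensitivity(tn, fp))
```

Reporting both metrics for both classes matters here because target fittings are likely a minority of all records, and overall accuracy alone can hide poor performance on the minority class.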

Predictive Model Application and Next Steps

Following model development, we generated predictions for all remaining records. By using these predictions, the utility is able to augment and validate its asset review program. With accurate model predictions, the utility can improve its resource allocation by focusing efforts on assets that the model predicted to be the target asset. This targeted approach should also increase the rate at which target assets are located, reducing the time that elevated-risk assets remain in the field.
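
One way to turn predictions into a work plan is to rank unexcavated records by the model's predicted probability and excavate the highest-ranked ones first. This is a hypothetical sketch: the article does not describe how the utility prioritizes excavations, and the record IDs, probabilities and function name are invented for illustration.

```python
def prioritize(predictions, budget):
    """Return the `budget` record IDs with the highest predicted probability."""
    ranked = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    return [record_id for record_id, _ in ranked[:budget]]


# Hypothetical record IDs mapped to the predicted probability of being the
# target fitting.
predicted = {"SVC-001": 0.12, "SVC-002": 0.91, "SVC-003": 0.55, "SVC-004": 0.78}
print(prioritize(predicted, budget=2))
```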

The utility’s asset review program is an ongoing effort. Results from the predictive model application are yet to be realized. Once sufficient numbers of prediction‑based excavations are available, the predictive model’s performance on non‑testing records can be analyzed.


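[1] Precision is the fraction of the model’s predictions of a given class that are correct: true positives divided by all predicted positives.

[2] Sensitivity is the fraction of the actual cases of a given class that the model correctly identifies: true positives divided by all actual positives.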