This project addresses the Machine Learning process in the context of a Civil Engineering problem, specifically one involving concrete. Instead of performing each task manually, we will build automation modules to develop our own AutoML system, without relying on third-party AutoML frameworks. We will use Spark MLlib on PySpark to apply Machine Learning techniques. In Civil Engineering, concrete is one of the most important materials. Its compressive strength is an essential property, which depends in a non-linear way on the age of the concrete and on the ingredients used in its composition. In this project, our objective is to build a predictive model capable of estimating the characteristic compressive strength of concrete, using the remaining variables in the dataset as predictors.
Given that we are predicting a numerical value from labeled input data, this is a supervised regression problem. To find the best predictive model, we will try different regression algorithms, such as LinearRegression, DecisionTreeRegressor, RandomForestRegressor, GBTRegressor and IsotonicRegression, and we will use hyperparameter optimization and tuning techniques to achieve the best possible performance. After training the models, we will make predictions on new data. This will allow us to assess the generalizability of the chosen model and its applicability in real situations. By developing our own AutoML system, we will have greater control over the Machine Learning process and will be able to explore different algorithms and techniques to obtain the best results. Based on a comparative performance analysis of the regression algorithms used, we will select the one that performs best on our specific problem of predicting the compressive strength of concrete.
Keywords: Python Language, Apache Spark, PySpark, Civil Engineering, Concrete Structures, Machine Learning, Data Analysis, AutoML, Spark MLlib, Linear Regression, DecisionTreeRegressor, RandomForestRegressor, GBTRegressor, IsotonicRegression
MLlib (Machine Learning Library) is a powerful and versatile machine learning library designed for Apache Spark, a distributed processing and big data analytics platform. With MLlib, it is possible to perform complex machine learning tasks at scale, taking advantage of Spark's distributed processing power. In MLlib, proper data preparation is a crucial step before training machine learning models. One of MLlib's essential requirements is that the input features of a dataframe are vectorized: the values in the feature columns must be combined into numerical vectors, typically a single vector column, so that they can be processed by the machine learning algorithms available in MLlib. Vectorizing the input columns is an important step because it allows MLlib's algorithms to work with the data efficiently and accurately. Vectorization transforms the features of each row into one numerical vector that quantitatively represents the information they contain, and MLlib performs its mathematical and statistical computations on these vectors when training and evaluating models.
AutoML (Automated Machine Learning) is an approach that aims to automate the process of building machine learning models. Traditionally, developing machine learning models involves several steps, such as selecting and preparing data, choosing and tuning algorithms, optimizing hyperparameters, and evaluating models. AutoML seeks to simplify and streamline this process, allowing users with different levels of experience to benefit from machine learning without necessarily becoming experts in data science. The goal of AutoML is to provide automated tools and techniques that can perform multiple steps of the model-building process, reducing the need for manual intervention. This includes automating tasks such as data pre-processing, algorithm selection, hyperparameter tuning, and model evaluation.
There are several AutoML tools and platforms available that offer a variety of features. These tools can use approaches such as automated hyperparameter search, automated model selection, and even automatic code generation. In addition, many AutoML tools provide features to address common challenges such as detecting and handling missing data, handling imbalanced classes, and model interpretability. While AutoML is useful and efficient in many cases, it is important to keep in mind that it is not a substitute for human experience and knowledge. In some complex or unusual scenarios, more detailed or custom manual tuning may be required. Nevertheless, AutoML plays an important role in democratizing access to machine learning, allowing a wide range of professionals to explore and use this technology in their projects without deep knowledge of data science.
This project uses MLlib, the machine learning library developed for Apache Spark, to train machine learning models. Before training, the data must be prepared according to MLlib's requirements. One fundamental requirement is that the dataframe's input features are vectorized, that is, converted into numerical vectors, so that the machine learning algorithms can process the data correctly. In the provided code, several steps are taken to prepare the data before training the models. First, the data file is read using Spark and basic information about the dataframe is displayed, such as the number of records and a sample of the data.
Next, a function called func_modulo_prep_data is defined to automate data preparation. This function handles missing values, converts categorical variables to numerical ones, handles outliers, vectorizes the input columns, and standardizes the data when necessary; at the end, it returns a new dataframe with the prepared columns. After preparing the data, the correlation between the dataframe variables is checked using the Correlation.corr function. This makes it possible to identify multicollinearity that could affect the performance of the machine learning models. Next, a function called func_modulo_ml is created to automate the use of the various regression algorithms. This function creates, trains, and evaluates each model using different combinations of hyperparameters, with the objective of identifying the model with the best performance.
The func_modulo_ml function is applied to a list of regression algorithms: Linear Regression, Decision Tree Regression, Random Forest Regression, GBT Regression and Isotonic Regression. For each algorithm, the code trains the model using cross-validation and evaluates its performance with the RMSE (Root Mean Squared Error) and R2 metrics. Once all models are trained, a table summarizing the performance of each algorithm is displayed. Based on these results, the Gradient-Boosted Tree (GBT) model is selected as the best performer and will be used in production. The code also includes an example of how to make predictions with the trained model: test data is provided and the GBT model is used to generate predictions for it. In summary, this project uses MLlib and Spark to prepare data, train regression models and make predictions, automating several steps of the process and making it possible to choose the best model based on performance.
https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength