Stop Thinking, Just Do!

Sungsoo Kim's Blog

Google Cloud BigQuery ML Using SQL


12 July 2022


Article Source


Google Cloud BigQuery ML Using SQL


As machine learning models are being used in more and more fields, and as technologies advance (better hardware and high-level programming languages), it is getting easier to build a working machine learning model that fits your needs. The process of building machine learning models is getting simpler, and the amount of code required keeps shrinking.

In this article, I will show you how to build a custom ML model with just SQL using Google BigQuery ML.

But first, what is Google BigQuery?

Google BigQuery is a serverless, cost-effective, and highly scalable data warehouse provided by Google Cloud Platform, with lots of features for business agility. BigQuery ML is one such feature, which helps data scientists, data analysts, and data engineers (like me) build operational, production-grade ML models. We can create a model on structured as well as unstructured data. Most importantly, we can do it in a short time using nothing but SQL queries.

In this article we will complete the following tasks:

  1. Use the “German Credit data” sample data set to predict whether a loan application is good or bad.
  2. Divide the sample data into train and test sets.
  3. Create BigQuery tables and load the data into them.
  4. Create a “logistic regression” model just by using SQL.
  5. And, finally, evaluate the model and predict loan types as good or bad.

Sample Data That We Will Be Using

We will be using a sample structured data set “German credit data”.

You can also download the files from the Git repository below:

https://github.com/aakash-rathore/BigQueryML/tree/master/data

├── data
│   ├── germanCredit.csv
│   ├── GermanCredit.xls
│   ├── testData.csv
│   └── trainData.csv
├── README.md
└── test_train.py

In the repository, “GermanCredit.xls” is the raw data file, which gives the details of all the columns. I have encoded all the categorical data and created the file “germanCredit.csv”, and “test_train.py” is a Python script that divides the data into test and train data sets (testData.csv and trainData.csv).
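As a side note, if the full encoded data set were already loaded into BigQuery, a similar 80/20 split could also be done purely in SQL. The sketch below assumes a hypothetical “mltestDataset.germanCredit” table holding all rows and hashes the “obs” column to split them; the repository’s test_train.py script is what was actually used here.

# Alternative split done in SQL (sketch; assumes a hypothetical germanCredit table)
CREATE OR REPLACE TABLE `mltestDataset.trainData` AS
SELECT
  *
FROM
  `mltestDataset.germanCredit`
WHERE
  MOD(ABS(FARM_FINGERPRINT(CAST(obs AS STRING))), 10) < 8;

# The remaining ~20% of rows become the test set
CREATE OR REPLACE TABLE `mltestDataset.testData` AS
SELECT
  *
FROM
  `mltestDataset.germanCredit`
WHERE
  MOD(ABS(FARM_FINGERPRINT(CAST(obs AS STRING))), 10) >= 8;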

Loading Data In BigQuery

Log in to the Google Cloud Platform console and navigate to the BigQuery tab using the left options panel.

Selecting the BigQuery tool in the GCP console

Create a data set using the option in the BigQuery console.

Creating a dataset

Create tables from trainData.csv and testData.csv inside the previously created data set.
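If you prefer to stay in SQL instead of the console UI, the data set and tables can also be created with DDL and LOAD DATA statements. The sketch below assumes the CSV files have already been uploaded to a Cloud Storage bucket; the gs:// URIs are just placeholders for your own bucket.

# Sketch: creating the data set and loading the CSV files with SQL
# (the gs:// paths below are placeholders)
CREATE SCHEMA IF NOT EXISTS mltestDataset;

LOAD DATA OVERWRITE mltestDataset.trainData
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://your-bucket/trainData.csv']);

LOAD DATA OVERWRITE mltestDataset.testData
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://your-bucket/testData.csv']);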

Creating a table with the train data and a table with the test data

We can preview the sample data in the created tables using the “Preview” option after selecting the respective table.

Previewing table data

Now we can start creating an ML model. In our sample data, the “response” column is the label, which holds the result (1=good, 0=bad), and the other columns are input features. The amazing thing is that we will be using only SQL queries to create and evaluate our ML model.
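Before training, it is also worth sanity-checking the label distribution in the training table. A minimal query for that (using the table and column names above) looks like this:

# Checking how many good (1) and bad (0) loans are in the training data
SELECT
  response,
  COUNT(*) AS num_loans
FROM
  `mltestDataset.trainData`
GROUP BY
  response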

Creating ML Model

We will create a logistic regression model, which is a classification model; in our case we will use it to classify whether a loan application is good or not based on previous credit report data.

SQL query to create an ML model:

# Creating logistic regression model using data from trainData table
CREATE OR REPLACE MODEL
  `mltestDataset.credit_classification_model` OPTIONS ( model_type='logistic_reg',
    input_label_cols=['response'] ) AS
SELECT
  * EXCEPT(obs)
FROM
  `mltestDataset.trainData`

Creating the model from the table containing the training data
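Once the CREATE OR REPLACE MODEL statement finishes, the training run itself can be inspected with ML.TRAINING_INFO, which returns the loss for each training iteration:

# Inspecting training iterations and loss for the created model
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `mltestDataset.credit_classification_model`)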

Evaluate Created ML Model

There are different performance metrics for evaluating an ML model; for our model, roc_auc is one such metric, and it is a simple queryable field when evaluating the trained model.

SQL query to evaluate the ML model:

# Evaluating logistic regression model using data from testData table
SELECT
  roc_auc,
  CASE
    WHEN roc_auc > .8 THEN 'good'
    WHEN roc_auc > .7 THEN 'fair'
    WHEN roc_auc > .6 THEN 'not great'
  ELSE
  'poor'
END
  AS model_quality
FROM
  ML.EVALUATE(MODEL mltestDataset.credit_classification_model,
    (
    SELECT
      * EXCEPT(obs)
    FROM
      `mltestDataset.testData` ) )

Output:

Using the ROC area under the curve to evaluate the model performance

As you can see, the performance is only fair; we could improve it by tuning the model and doing feature engineering. But for simplicity, we will use this model to analyze the predictions.
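One possible tuning direction, which we will not explore further here, is to retrain the model with regularization and automatic class weighting; these are standard CREATE MODEL options for logistic regression, and the values below are purely illustrative.

# Illustrative only: retraining with L2 regularization and class weighting
CREATE OR REPLACE MODEL
  `mltestDataset.credit_classification_model` OPTIONS ( model_type='logistic_reg',
    input_label_cols=['response'],
    l2_reg=0.1,
    auto_class_weights=TRUE ) AS
SELECT
  * EXCEPT(obs)
FROM
  `mltestDataset.trainData`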

Prediction Using Created ML Model

Now we will use this model to predict the loan type (1=Good, 0=Bad). The query to get predictions from the model is given below:

# Getting prediction for our ML model using data from testData table
SELECT
  *
FROM
  ml.PREDICT(MODEL `mltestDataset.credit_classification_model`,
    (
    SELECT
      * EXCEPT(obs)
    FROM
      `mltestDataset.testData` ) );

Output:

We will check the first 5 predictions:

Checking predictions

We can see the results of the first 5 predictions (a query that produces this side-by-side view is sketched after the list):

  1. For the 1st record, our model predicts the loan as good (response=1) with 84% confidence, and actually, it is good (response=1).
  2. For the 2nd record, the model predicts the loan as good (response=1) with 93% confidence, and actually, it is good (response=1).
  3. For the 3rd record, the model predicts the loan as good (response=1) with 63% confidence, and actually, it is good (response=1).
  4. For the 4th record, the model predicts the loan as good (response=1) with 74% confidence, and actually, it is good (response=1).
  5. For the 5th record, the model predicts the loan as bad (response=0) with 66% confidence, and actually, it is bad (response=0).
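ML.PREDICT appends a predicted_response column and a predicted_response_probs column (the per-class probabilities) to the input rows, so a side-by-side view like the one above can be produced with a query of roughly this shape:

# Comparing predicted and actual labels for a few rows
SELECT
  predicted_response,
  predicted_response_probs,
  response
FROM
  ml.PREDICT(MODEL `mltestDataset.credit_classification_model`,
    (
    SELECT
      * EXCEPT(obs)
    FROM
      `mltestDataset.testData` ) )
LIMIT 5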

Conclusion

In this article, I explained how we can create and evaluate an ML model in Google BigQuery using SQL. We then analyzed the predictions made by this model.

