The prevailing trend in machine learning involves transferring data to the model's environment for training. However, what if we reversed this process? Given that modern databases are significantly larger than machine learning models, wouldn't it be more efficient to move the models to the datasets?
This is the fundamental concept behind PostgresML – the data remains in its location, and you bring your code to the database. This inverted approach to machine learning offers numerous practical advantages that challenge conventional notions of a "database."
PostgresML: An Overview and its Advantages
PostgresML is a comprehensive machine learning platform built upon the widely-used PostgreSQL database. It introduces a novel approach called "in-database" machine learning, enabling you to execute various ML tasks within SQL without needing separate tools for each step.
Despite its relative novelty, PostgresML offers several key benefits: training, tuning, deployment, and prediction all happen in SQL, right next to the data, with no separate serving infrastructure to maintain.
This tutorial will demonstrate these features using a typical machine learning workflow: loading a dataset, training and tuning models, deploying the best one, and generating predictions. All of these steps will be performed within a Postgres database. Let's begin!
A Complete Supervised Learning Workflow with PostgresML
After signup, you'll access your PostgresML console for managing projects and resources.
The "Manage" section allows you to scale your environment based on computational needs.
PostgresML requires PostgreSQL. Installation guides for various platforms are available on the official PostgreSQL website.
For WSL2, the following commands suffice:
```shell
sudo apt update
sudo apt install postgresql postgresql-contrib
sudo passwd postgres  # Set a new Postgres password
# Close and reopen your terminal
```
Verify the installation:
```shell
psql --version
```
For a more user-friendly experience than the terminal, consider the VSCode extension.
Use the connection details from your PostgresML console:
Connect using `psql`:

```shell
psql -h "host" -U "username" -p 6432 -d "database_name"
```
Alternatively, use the VSCode extension as described in its documentation.
Enable the pgml extension:
```sql
CREATE EXTENSION IF NOT EXISTS pgml;
```
Verify the installation:
```sql
SELECT pgml.version();
```
We'll use the Diamonds dataset from Kaggle. Download it as a CSV or use this Python snippet:
```python
import seaborn as sns

diamonds = sns.load_dataset("diamonds")
diamonds.to_csv("diamonds.csv", index=False)
```
Create the table (the column is named `table_` because `table` is a reserved word in SQL):

```sql
CREATE TABLE IF NOT EXISTS diamonds (
    index SERIAL PRIMARY KEY,
    carat FLOAT,
    cut VARCHAR(255),
    color VARCHAR(255),
    clarity VARCHAR(255),
    depth FLOAT,
    table_ FLOAT,
    price INT,
    x FLOAT,
    y FLOAT,
    z FLOAT
);
```
Populate the table:
`INSERT ... FROM` is not valid SQL for loading a file; use psql's client-side `\copy` command instead:

```sql
\copy diamonds (carat, cut, color, clarity, depth, table_, price, x, y, z) FROM '~/full/path/to/diamonds.csv' DELIMITER ',' CSV HEADER;
```
Verify the data:
```sql
SELECT * FROM diamonds LIMIT 10;
```
Train an XGBoost regressor:
```sql
SELECT pgml.train(
    project_name => 'Diamond prices prediction',
    task => 'regression',
    relation_name => 'diamonds',
    y_column_name => 'price',
    algorithm => 'xgboost'
);
```
Train a multi-class classifier:
```sql
SELECT pgml.train(
    project_name => 'Diamond cut quality prediction',
    task => 'classification',
    relation_name => 'diamonds',
    y_column_name => 'cut',
    algorithm => 'xgboost',
    test_size => 0.1
);
```
Train a random forest model with preprocessing:
```sql
SELECT pgml.train(
    project_name => 'Diamond prices prediction',
    task => 'regression',
    relation_name => 'diamonds',
    y_column_name => 'price',
    algorithm => 'random_forest',
    preprocess => '{
        "carat": {"scale": "standard"},
        "depth": {"scale": "standard"},
        "table_": {"scale": "standard"},
        "cut": {"encode": "target", "scale": "standard"},
        "color": {"encode": "target", "scale": "standard"},
        "clarity": {"encode": "target", "scale": "standard"}
    }'::JSONB
);
```
PostgresML provides various preprocessing options (encoding, imputing, scaling).
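As a sketch of how these options can be combined on a single column (key names follow the PostgresML preprocessing documentation — `impute`, `encode`, `scale` — but verify the exact option values supported by your version):

```sql
-- Hypothetical preprocess configuration: impute missing depths with the
-- mean, one-hot encode cut, and min-max scale carat.
SELECT pgml.train(
    project_name => 'Diamond prices prediction',
    task => 'regression',
    relation_name => 'diamonds',
    y_column_name => 'price',
    algorithm => 'random_forest',
    preprocess => '{
        "depth": {"impute": "mean", "scale": "standard"},
        "cut":   {"encode": "one_hot"},
        "carat": {"scale": "min_max"}
    }'::JSONB
);
```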
Train an XGBoost regressor with custom hyperparameters:
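A sketch of what that call might look like (the keys inside `hyperparams` are standard XGBoost parameters; check the PostgresML docs for the `hyperparams` argument in your version):

```sql
SELECT pgml.train(
    project_name => 'Diamond prices prediction',
    task => 'regression',
    relation_name => 'diamonds',
    y_column_name => 'price',
    algorithm => 'xgboost',
    hyperparams => '{
        "n_estimators": 100,
        "max_depth": 5,
        "eta": 0.1
    }'::JSONB
);
```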
Perform a grid search:
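A hedged sketch of a grid search over two XGBoost hyperparameters (the `search` and `search_params` argument names follow the PostgresML training docs; confirm them against your version):

```sql
-- Try every combination of the listed values and keep the best model.
SELECT pgml.train(
    project_name => 'Diamond prices prediction',
    task => 'regression',
    relation_name => 'diamonds',
    y_column_name => 'price',
    algorithm => 'xgboost',
    search => 'grid',
    search_params => '{
        "max_depth": [3, 5, 7],
        "n_estimators": [100, 200]
    }'::JSONB
);
```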
Use `pgml.predict` for predictions:
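A minimal sketch, assuming a model trained on the numeric columns only (models that include categorical features must be given the full feature set they were trained with — see the PostgresML prediction docs for the exact input formats your version accepts):

```sql
-- Predict prices for existing rows and compare with the actual price.
SELECT price,
       pgml.predict(
           'Diamond prices prediction',
           ARRAY[carat, depth, table_, x, y, z]
       ) AS predicted_price
FROM diamonds
LIMIT 5;
```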
To use a specific model, specify its ID:
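A sketch, assuming `pgml.predict` accepts a model id in place of a project name (as described in the PostgresML docs; the id `1` here is purely illustrative):

```sql
-- Use model 1 explicitly instead of the project's deployed model.
SELECT pgml.predict(
    1,  -- model_id
    ARRAY[carat, depth, table_, x, y, z]
) AS predicted_price
FROM diamonds
LIMIT 5;
```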
Retrieve model IDs:
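PostgresML records trained models in its own schema, so a query along these lines lists them (table and column names assume the `pgml.models` / `pgml.projects` schema used by the PostgresML dashboard; verify against your installation):

```sql
-- List trained models with their project, algorithm, and metrics.
SELECT m.id, p.name AS project, m.algorithm, m.metrics
FROM pgml.models m
JOIN pgml.projects p ON p.id = m.project_id;
```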
PostgresML automatically deploys the best-performing model. For finer control, use `pgml.deploy`:
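A sketch of two common deployment calls (the `strategy` values follow the PostgresML deployment docs; check your version for the full set):

```sql
-- Redeploy the model with the best test-set score for this project.
SELECT * FROM pgml.deploy(
    'Diamond prices prediction',
    strategy => 'best_score'
);

-- Or roll back to the previously deployed model.
SELECT * FROM pgml.deploy(
    'Diamond prices prediction',
    strategy => 'rollback'
);
```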
Deployment strategies include `best_score`, `most_recent`, and `rollback`.
Further Exploration of PostgresML
PostgresML extends beyond supervised learning. The homepage features a SQL editor for experimentation. Building a consumer-facing ML service might involve:

- A Postgres client library such as psycopg2 (Python) or pg-promise (Node.js) for database interaction.
- Calling `pgml.predict` from your application upon user interaction.

Conclusion
PostgresML offers a novel approach to machine learning. To further your understanding, explore the PostgresML documentation and consider resources like DataCamp's SQL courses and AI fundamentals tutorials.