Migrating sub-select MySQL query to Presto
When you have working SQL query there might not be so simple and straightforward it’s migration to Presto SQL. Recently I had a job to extract data from one table but with multiple added columns which requires sub-select. To demonstrate the problem and solution I have prepared simple user and location table where location entry represents where and where have users moved.
User
+----+-------+
| id | name |
+----+-------+
| 1 | Tom |
| 2 | Sally |
| 3 | Ben |
+----+-------+
Location
+----+---------+-------------+----------------------+
| id | user_id | name | moved_on |
+----+---------+-------------+----------------------+
| 1 | 1 | 'New York' | '2010-1-1 00:00:00' |
| 2 | 1 | 'Columbus' | '2011-1-1 00:00:00' |
| 3 | 1 | 'Miami' | '2012-1-1 00:00:00' |
| 4 | 2 | 'Miami' | '2010-1-1 00:00:00' |
| 5 | 2 | 'Columbus' | '2011-1-1 00:00:00' |
| 6 | 3 | 'Miami' | '2010-1-1 00:00:00' |
| 7 | 3 | 'New York' | '2011-1-1 00:00:00' |
+----+---------+-------------+----------------------+
Solution is using Presto’s window function
If you have trouble understanding windows functions by reading their docs, please continue reading this post and learn “by example”.
Task
Find all users who ever lived in Miami and display their current location and if their current location is Miami.
PostgreSQL
PostgreSQL solution would look something like the first next code example. Note the current_location inside SELECT clause.
SELECT
"user".id,
"user".name,
(
SELECT l.name
FROM location l
WHERE location.user_id=l.user_id
ORDER BY l.moved_on DESC
LIMIT 1
) current_location,
CASE WHEN
EXISTS (
SELECT l.id
FROM location l
WHERE
location.user_id = l.user_id AND
location.moved_on > l.moved_on
)
THEN 'True'
ELSE 'False'
END AS is_miami_latest_location
FROM location
JOIN "user"
ON "user".id = location.user_id
WHERE
location.name = 'Miami'
;
To test it out you can run postgreSQL in docker.
docker run --name pg -e POSTGRES_PASSWORD=mysecretpassword -d postgres
docker exec -it pg psql -U postgres
Psql shell will pop up where you create and seed db.
CREATE DATABASE subquery
\c subquery
CREATE TABLE IF NOT EXISTS "user" (
id INT PRIMARY KEY,
name VARCHAR (50) NOT NULL
);
CREATE TABLE IF NOT EXISTS "location" (
id INT PRIMARY KEY,
user_id INT NOT NULL,
name VARCHAR (50) NOT NULL,
moved_on TIMESTAMP NOT NULL,
CONSTRAINT fk_user
FOREIGN KEY (user_id)
REFERENCES "user" (id)
);
INSERT INTO "user" VALUES
(1, 'tom'),
(2, 'sally'),
(3, 'ben');
INSERT INTO location VALUES
(1, 1, 'New York', '2010-1-1 00:00:00'),
(2, 1, 'Columbus', '2011-1-1 00:00:00'),
(3, 1, 'Miami', '2012-1-1 00:00:00'),
(4, 2, 'Miami', '2010-1-1 00:00:00'),
(5, 2, 'Columbus', '2011-1-1 00:00:00'),
(6, 3, 'Miami', '2010-1-1 00:00:00'),
(7, 3, 'New York', '2011-1-1 00:00:00');
Presto
Solution in presto looks like the first next code example. See how I added WITH clause before SELECT clause and then used it with LEFT JOIN.
WITH city AS (
SELECT user_id, name, moved_on,
row_number() over (partition by user_id order by moved_on desc) city_freshness
FROM location
),
latest_location AS (SELECT * FROM city where city_freshness=1)
SELECT
"user".id,
"user".name,
latest_location.name as current_location,
CASE WHEN
EXISTS (
SELECT l.id
FROM location l
WHERE
location.user_id = l.user_id AND
location.moved_on > l.moved_on
)
THEN 'True'
ELSE 'False'
END AS is_miami_latest_location
FROM location
JOIN "user"
ON "user".id = location.user_id
LEFT join latest_location on location.user_id = latest_location.user_id
WHERE
location.name = 'Miami'
;
To test it out you can run Presto in docker.
docker run --name presto ahanaio/prestodb-sandbox
docker exec -it presto presto-cli --catalog memory
Presto shell will pop up where you can create and seed db.
CREATE SCHEMA subquery;
USE subquery;
CREATE TABLE IF NOT EXISTS "user" (
id INT,
name VARCHAR (50)
);
CREATE TABLE IF NOT EXISTS "location" (
id INT,
user_id INT,
name VARCHAR (50),
moved_on TIMESTAMP
);
INSERT INTO "user" VALUES
(1, 'tom'),
(2, 'sally'),
(3, 'ben');
INSERT INTO location VALUES
(1, 1, 'New York', TIMESTAMP '2010-1-1 00:00:00'),
(2, 1, 'Columbus', TIMESTAMP '2011-1-1 00:00:00'),
(3, 1, 'Miami', TIMESTAMP '2012-1-1 00:00:00'),
(4, 2, 'Miami', TIMESTAMP '2010-1-1 00:00:00'),
(5, 2, 'Columbus', TIMESTAMP '2011-1-1 00:00:00'),
(6, 3, 'Miami', TIMESTAMP '2010-1-1 00:00:00'),
(7, 3, 'New York', TIMESTAMP '2011-1-1 00:00:00');
Solution output
id | name | current_location | is_miami_latest_location
----+-------+------------------+--------------------------
1 | tom | Miami | True
2 | sally | Columbus | False
3 | ben | New York | False
(3 rows)
Alternatives
I have tried using sql conversion tools but it did not work. Converted query was not valid for presto engine. If you know why, please let me know.
Conclusion
Using window function in presto sql can be very powerful. I found this template very useful when querying data warehouse. I cannot learn it by heart therefore I copy it and slightly modify naming. I hope you will find it useful to copy & paste it as well. 😊