Migrating sub-select MySQL query to Presto

When you have working SQL query there might not be so simple and straightforward it’s migration to Presto SQL. Recently I had a job to extract data from one table but with multiple added columns which requires sub-select. To demonstrate the problem and solution I have prepared simple user and location table where location entry represents where and where have users moved.

User
+----+-------+
| id | name  |
+----+-------+
|  1 | Tom   |
|  2 | Sally |
|  3 | Ben   |
+----+-------+
Location
+----+---------+-------------+----------------------+
| id | user_id |     name    |       moved_on       |
+----+---------+-------------+----------------------+
|  1 |       1 |  'New York' |  '2010-1-1 00:00:00' |
|  2 |       1 |  'Columbus' |  '2011-1-1 00:00:00' |
|  3 |       1 |  'Miami'    |  '2012-1-1 00:00:00' |
|  4 |       2 |  'Miami'    |  '2010-1-1 00:00:00' |
|  5 |       2 |  'Columbus' |  '2011-1-1 00:00:00' |
|  6 |       3 |  'Miami'    |  '2010-1-1 00:00:00' |
|  7 |       3 |  'New York' |  '2011-1-1 00:00:00' |
+----+---------+-------------+----------------------+

Solution is using Presto’s window function

presto

If you have trouble understanding windows functions by reading their docs, please continue reading this post and learn “by example”.

Task

Find all users who ever lived in Miami and display their current location and if their current location is Miami.

PostgreSQL

PostgreSQL solution would look something like the first next code example. Note the current_location inside SELECT clause.

SELECT 
  "user".id,
  "user".name,
  (
    SELECT l.name
    FROM location l
    WHERE location.user_id=l.user_id
    ORDER BY l.moved_on DESC
    LIMIT 1
  ) current_location,
  CASE WHEN 
    EXISTS (
      SELECT l.id 
      FROM location l
      WHERE 
        location.user_id = l.user_id AND 
        location.moved_on > l.moved_on
    )
    THEN 'True'
    ELSE 'False'
  END AS is_miami_latest_location
FROM location
JOIN "user"
  ON "user".id = location.user_id
WHERE
  location.name = 'Miami'
;

To test it out you can run postgreSQL in docker.

docker run --name pg -e POSTGRES_PASSWORD=mysecretpassword -d postgres
docker exec -it pg psql -U postgres

Psql shell will pop up where you create and seed db.

CREATE DATABASE subquery
\c subquery

CREATE TABLE IF NOT EXISTS "user" (
   id INT PRIMARY KEY,
   name VARCHAR (50) NOT NULL
);


CREATE TABLE IF NOT EXISTS "location" (
   id INT PRIMARY KEY,
   user_id INT NOT NULL,
   name VARCHAR (50) NOT NULL,
   moved_on TIMESTAMP NOT NULL,
   CONSTRAINT fk_user
    FOREIGN KEY (user_id)
      REFERENCES "user" (id)
);

INSERT INTO "user" VALUES
  (1, 'tom'),
  (2, 'sally'),
  (3, 'ben');

INSERT INTO location VALUES
  (1, 1, 'New York', '2010-1-1 00:00:00'),
  (2, 1, 'Columbus', '2011-1-1 00:00:00'),
  (3, 1, 'Miami', '2012-1-1 00:00:00'),

  (4, 2, 'Miami', '2010-1-1 00:00:00'),
  (5, 2, 'Columbus', '2011-1-1 00:00:00'),

  (6, 3, 'Miami', '2010-1-1 00:00:00'),
  (7, 3, 'New York', '2011-1-1 00:00:00');

Presto

Solution in presto looks like the first next code example. See how I added WITH clause before SELECT clause and then used it with LEFT JOIN.

WITH city AS (
  SELECT user_id, name, moved_on,
    row_number() over (partition by user_id order by moved_on desc) city_freshness
  FROM location
),

latest_location AS (SELECT * FROM city where city_freshness=1)

SELECT 
  "user".id,
  "user".name,
  latest_location.name as current_location,
  CASE WHEN 
    EXISTS (
      SELECT l.id 
      FROM location l
      WHERE 
        location.user_id = l.user_id AND 
        location.moved_on > l.moved_on
    )
    THEN 'True'
    ELSE 'False'
  END AS is_miami_latest_location
FROM location
JOIN "user"
  ON "user".id = location.user_id
LEFT join latest_location on location.user_id = latest_location.user_id
WHERE
  location.name = 'Miami'
;

To test it out you can run Presto in docker.

docker run --name presto ahanaio/prestodb-sandbox
docker exec -it presto presto-cli --catalog memory

Presto shell will pop up where you can create and seed db.

CREATE SCHEMA subquery;
USE subquery;

CREATE TABLE IF NOT EXISTS "user" (
   id INT,
   name VARCHAR (50)
);

CREATE TABLE IF NOT EXISTS "location" (
   id INT,
   user_id INT,
   name VARCHAR (50),
   moved_on TIMESTAMP
);

INSERT INTO "user" VALUES
  (1, 'tom'),
  (2, 'sally'),
  (3, 'ben');

INSERT INTO location VALUES
  (1, 1, 'New York', TIMESTAMP '2010-1-1 00:00:00'),
  (2, 1, 'Columbus', TIMESTAMP '2011-1-1 00:00:00'),
  (3, 1, 'Miami', TIMESTAMP '2012-1-1 00:00:00'),

  (4, 2, 'Miami', TIMESTAMP '2010-1-1 00:00:00'),
  (5, 2, 'Columbus', TIMESTAMP '2011-1-1 00:00:00'),

  (6, 3, 'Miami', TIMESTAMP '2010-1-1 00:00:00'),
  (7, 3, 'New York', TIMESTAMP '2011-1-1 00:00:00');

Solution output

 id | name  | current_location | is_miami_latest_location
----+-------+------------------+--------------------------
  1 | tom   | Miami            | True
  2 | sally | Columbus         | False
  3 | ben   | New York         | False
(3 rows)

Alternatives

I have tried using sql conversion tools but it did not work. Converted query was not valid for presto engine. If you know why, please let me know.

Conclusion

Using window function in presto sql can be very powerful. I found this template very useful when querying data warehouse. I cannot learn it by heart therefore I copy it and slightly modify naming. I hope you will find it useful to copy & paste it as well. 😊

comments powered by Disqus