trino-taxi-data

This demo will

Install the required Stackable operators.
Spin up the following data products:
- Superset: A modern data exploration and visualization platform. This demo utilizes Superset to retrieve data from Trino via SQL queries and build dashboards on top of that data.
- Trino: A fast distributed SQL query engine for big data analytics that helps you explore your data universe. This demo uses it to enable SQL access to the data.
- MinIO: A S3 compatible object store. This demo uses it as persistent storage to store all the data used
- Hive metastore: A service that stores metadata related to Apache Hive and other services. This demo uses it as metadata storage for Trino.
- Open policy agent (OPA): An open-source, general-purpose policy engine unifying policy enforcement across the stack. This demo uses it as the authorizer for Trino, which decides which user can query which data.
Load test data into S3. It contains 2.5 years of New York City taxi trips.
Make data accessible via SQL in Trino.
Create Superset dashboards for visualization of the data.

You can see the deployed products and their relationship in the following diagram:

System Requirements

To run this demo, your system needs at least:

7 cpu units (core/hyperthread)
16GiB memory
28GiB disk storage

List Deployed Stacklets

To list the installed Stackable services, run the following command:

$ stackablectl stacklets list
┌─────────┬──────────────┬───────────┬──────────────────────────────────────────────┐
│ PRODUCT ┆ NAME         ┆ NAMESPACE ┆ ENDPOINTS                                    │
╞═════════╪══════════════╪═══════════╪══════════════════════════════════════════════╡
│ hive    ┆ hive-iceberg ┆ default   ┆ hive                172.18.0.4:30637         │
│         ┆              ┆           ┆ metrics             172.18.0.4:30176         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ opa     ┆ opa          ┆ default   ┆ http                http://172.18.0.2:32470  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ trino   ┆ trino        ┆ default   ┆ coordinator-metrics 172.18.0.2:32402         │
│         ┆              ┆           ┆ coordinator-https   https://172.18.0.2:31605 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ minio   ┆ minio        ┆ default   ┆ http                http://172.18.0.2:30357  │
│         ┆              ┆           ┆ console-http        http://172.18.0.2:30310  │
└─────────┴──────────────┴───────────┴──────────────────────────────────────────────┘

When a product instance has not finished starting yet, the service will have no endpoint. Depending on your internet connectivity, creating all the product instances might take considerable time. A warning might be shown if the product is not ready yet.

Inspect Data in S3

The S3 provided by MinIO is used as a persistent storage to store all the data used. You can look at the test data within the MinIO web interface by opening the endpoint console-http from your stackablectl stacklets list command output. You have to use the endpoint from your command output. In this case, it is http://172.18.0.3:30503. Open it with your favourite browser.

Click on the blue button Browse on the bucket demo and open the folders ny-taxi-data → raw.

The demo uploaded 1GB of parquet files, one file per month. The data contains taxi rides in New York City. The file size (and therefore the number of rides) decreased drastically because of the COVID-19 pandemic starting from 2020-03. Parquet is an open-source, column-oriented data file format for efficient storage and retrieval.

Use Trino Web Interface

Trino offers SQL access to the data within S3. Open the endpoint coordinator-https in your browser (https://172.18.0.3:30141 in this case). If you get a warning regarding the self-signed certificate (e.g. Warning: Potential Security Risk Ahead), you must tell your browser to trust the website and continue.

When you start executing SQL queries, you will see the queries getting processed here.

Use Superset Web Interface

Superset gives the ability to execute SQL queries and build dashboards. Open the endpoint external-superset in your browser (http://172.18.0.4:32295 in this case).

View the Dashboard

On the top, click on the Dashboards tab.

Click on the dashboard called Taxi data. It might take some time until the dashboards render all included charts.

You can clearly see the impact of COVID-19 on the taxi business.

Execute Arbitrary SQL Statements

Within Superset, you can create dashboards and run arbitrary SQL statements. On the top, click on the tab SQL Lab → SQL Editor.

On the left, select the database Trino, the schema demo and set See table schema to ny_taxi_data.

In the right textbox, enter the desired SQL statement. If you do not want to make on up, you can use the following:

select
  format_datetime(tpep_pickup_datetime, 'YYYY/MM') as month,
  count(*) as trips,
  sum(total_amount) as sales,
  avg(duration_min) as avg_duration_min
from ny_taxi_data
group by 1
order by 1

Summary

The demo loaded 2.5 years of taxi trip data from New York City with 68 million records and a total size of 1GB in parquet files. The data was put into the S3 storage. Trino enables you to query the data via SQL. Superset was used as a web-based frontend to execute SQL statements and build dashboards.

Where to go from here

There are multiple paths to go from here. The following sections can give you some ideas on what to explore next. You can find the description of the taxi data on the New York City website.

Execute Arbitrary SQL Statements

Within Superset you can execute arbitrary SQL statements to explore the taxi data. Can you answer the following questions by executing SQL statements? The Trino documentation on their SQL language might help you.

How many taxi trips there where in the year 2021?

See the answer

select
  count(*) as trips
from ny_taxi_data
where year(tpep_pickup_datetime) = 2021

returns 30.903.982 trips.

What was the maximum amount of passengers?

See the Answer

select
  max(passenger_count) as max_passenger_count
from ny_taxi_data;

Returns 112 passengers. Well that’s weird. Let’s examine the passengers distribution.

select
  passenger_count,
  count(*) as frequency
from ny_taxi_data
group by 1
order by 1 desc
limit 100

returns

 passenger_count | frequency
-----------------+-----------
           112.0 |         1
            96.0 |         1
             9.0 |        98
             8.0 |       156
             7.0 |       229
             6.0 |   1089568
             5.0 |   1715439
             4.0 |   1052834
             3.0 |   2504112
             2.0 |   9575299
             1.0 |  48133494
             0.0 |   1454268
            NULL |   2698591

We can see that one trip had 112 and another one 96 passengers. All the other trips start with a more "realistic" number of 9 passengers. As a bonus question: What exactly did the large number of passenger do?

select *
from ny_taxi_data
where passenger_count > 50

returns

 vendorid |  tpep_pickup_datetime   |  tpep_dropoff_datetime  | duration_min | passenger_count | trip_distance | payment_type | fare_amount | tip_amount | total_amount
----------+-------------------------+-------------------------+--------------+-----------------+---------------+--------------+-------------+------------+--------------
        2 | 2021-08-01 19:47:43.000 | 2021-08-01 19:57:54.000 |           10 |           112.0 |           1.8 | Credit card  |         9.0 |       2.46 |        14.76
        2 | 2021-08-03 11:51:58.000 | 2021-08-03 12:09:29.000 |           17 |            96.0 |          1.56 | Credit card  |        11.5 |       2.22 |        17.02

Pretty cheap for that amount of people! This probably are invalid records.

What was the highest tip (measured in percentage of the original fee) ever given?

See the Answer

select
  total_amount as fee,
  tip_amount as tip,
  tip_amount / total_amount * 100 as tip_percentage
from ny_taxi_data
where total_amount > 0
order by 3 desc
limit 5

returns

 fee  | tip  |   tip_percentage
------+------+--------------------
  4.2 | 10.0 |  238.0952380952381
 18.2 | 25.0 | 137.36263736263737
 8.24 | 9.24 | 112.13592233009709
 0.66 | 0.66 |              100.0
 0.01 | 0.01 |              100.0

Create Additional Dashboards

You also have the possibility to create additional charts and bundle them together in a Dashboard. Have a look at the Superset documentation on how to do that.

Load Additional Data

You can use the MinIO webinterface to upload additional data. As an alternative you can use the S3 API with an S3 client like s3cmd. It is recommended to put the data into a folder (prefix) in the demo bucket.

Have a look at the defined tables inside the hive.demo schema on how to inform Trino about the newly available data.

Table Definitions

show create table hive.demo.ny_taxi_data_raw

produces something like

CREATE TABLE IF NOT EXISTS hive.demo.ny_taxi_data_raw (
  VendorID BIGINT,
  tpep_pickup_datetime TIMESTAMP,
  tpep_dropoff_datetime TIMESTAMP,
  passenger_count DOUBLE,
  trip_distance DOUBLE,
  payment_type BIGINT,
  Fare_amount DOUBLE,
  Tip_amount DOUBLE,
  Total_amount DOUBLE
) WITH (
  external_location = 's3a://demo/ny-taxi-data/raw/',
  format = 'parquet'
)

If you want to transform or filter your data in any way before using it e.g. in Superset you can create a view as follows:

show create view hive.demo.ny_taxi_data

produces something like

create or replace view hive.demo.ny_taxi_data as
select
  vendorid,
  tpep_pickup_datetime,
  tpep_dropoff_datetime,
  date_diff('minute', tpep_pickup_datetime, tpep_dropoff_datetime) as duration_min,
  passenger_count,
  trip_distance,
  case payment_type when 1 then 'Credit card' when 2 then 'Cash' when 3 then 'No charge' when 4 then 'Dispute' when 6 then 'Voided trino' else 'Unknown' end as payment_type,
  fare_amount,
  tip_amount,
  total_amount
from hive.demo.ny_taxi_data_raw
where tpep_pickup_datetime >= from_iso8601_timestamp('2019-12-01T00:00:00')
and tpep_pickup_datetime <= from_iso8601_timestamp('2022-05-31T00:00:00')

Connect to Trino via CLI, Python or DBeaver

If you prefer running your SQL statements via command-line, a Python script or a graphical Database manager like DBeaver please have a look at the the Trino documentation on how to do that.