Geospatial Query Performance Test of Presto and PostGIS

Yifeng Jiang
Dec 23, 2019


On a recent project, I did a geospatial query performance test of PostGIS and Presto. I thought it worth sharing the observations I gained from a non-geospatial expert's point of view.

Background

In this big data project, we need to process, ingest and query a huge amount of geospatial and other data. We started with PostGIS, the popular geospatial extension of PostgreSQL. PostGIS does well in terms of rich geospatial function support and ease of use from the application. However, as data kept coming in, PostGIS soon became the bottleneck:

  • Runs on a single node, so it does not leverage our infrastructure well.
  • Low ingestion throughput. The ingestion rate of the PostGIS DB (which holds 20–100x less data than the main NoSQL DB) is 4–5x slower than that of the main NoSQL DB.
  • High CPU load spikes. This may be improved by tuning queries and PostgreSQL.
  • Table size limited to 16TB.
  • No built-in high availability (at least not easily achieved).

To address these challenges, we need a distributed geospatial database. There are multiple options available. After some research under the following principles, we narrowed the options down to Presto and another distributed geospatial database built on top of the Hadoop big data stack.

  • Easy to scale natively.
  • Leverage our infrastructure to deliver high throughput at consistent low latency.
  • Simple to setup, use and manage.
  • Can be deployed anywhere.
  • Open source preferred, with a small code base and an active community.

This post focuses on the geospatial query performance comparison of Presto and PostGIS.

Presto Introduction

Presto is a distributed SQL query engine for big data. It was open-sourced by Facebook and is now hosted under the Linux Foundation. Presto is optimized for low-latency, interactive queries, which is important for us because the geospatial database powers our REST API. Presto has a well-organized code base and an active community around it. It supports standard geospatial functions with an ST_-prefixed syntax similar to PostGIS.
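As an illustration, a simple distance filter looks nearly identical in the two systems. The table and column names below are hypothetical; in Presto, WKT text is parsed with ST_GeometryFromText (PostGIS uses ST_GeomFromText for the same purpose):

```sql
-- Hypothetical table places(name, geom), with geom stored as WKT text.
SELECT name
FROM places
WHERE ST_Distance(
        ST_GeometryFromText(geom),   -- PostGIS equivalent: ST_GeomFromText(geom)
        ST_Point(139.767, 35.681)    -- a sample lon/lat point
      ) < 0.1;
```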

One interesting thing about Presto is that it does not store or manage the data itself; instead, it has a connector mechanism to query data where it lives, including Hive, Redis, relational databases and many other data stores. A single Presto query can join data from different data stores, allowing analytics across multiple sources. This architecture separates compute from storage, which enables independent, flexible and easy scaling of resources.
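For example, a single query can join an S3-backed Hive table with a table living in PostgreSQL. This is only a sketch; the catalog, schema and table names below are assumptions:

```sql
-- "hive" and "postgresql" are hypothetical catalog names configured in Presto.
SELECT r.region_name, count(*) AS event_count
FROM hive.geo.events e
JOIN postgresql.public.regions r
  ON e.region_id = r.region_id
GROUP BY r.region_name;
```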

Presto Connector

Presto and S3

Pure Storage FlashBlade, a high-performance scale-out all-flash storage system, plays a critical role in our infrastructure. FlashBlade supports multiple storage protocols, including S3. Presto queries data stored in S3 via the Hive connector. Data is stored in well-known formats (CSV, ORC, Parquet, etc.) in an S3 bucket, and Presto accesses it through a Hive external table backed by S3.

Presto Hive Connector

Being able to leverage S3 is a good fit for us as we can easily build a scalable data pipeline with the other big data stack (Hive, Spark) we are already using.
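A minimal sketch of such an external table, created through Presto's Hive connector (the schema, bucket and column names are hypothetical):

```sql
-- External table whose data lives in an S3 bucket rather than in Presto itself.
CREATE TABLE hive.geo.events (
  event_id   BIGINT,
  geom       VARCHAR,   -- geometry stored as WKT text
  region_id  BIGINT,
  event_date VARCHAR
)
WITH (
  external_location = 's3a://geo-bucket/events/',
  format = 'ORC'
);
```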

Geospatial Query with Presto

At a high level, our geospatial data pipeline looks like this:

Geospatial Data Pipeline with Presto

PostgreSQL (with the PostGIS extension) used to serve our geospatial queries. PostGIS became slower as the data grew, especially on the ingestion/write path and for big queries on the read path. In this test, we replaced it with Presto. With Presto, a Spark job writes geospatial data as ORC files directly to FlashBlade S3. The geospatial column is stored in Well-Known Text (WKT) format in the table. On the read path, Presto fetches table schema and partition information from the Hive Metastore, compiles SQL into Presto tasks, reads data from S3 and performs geospatial computation on multiple nodes.
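A read-path query under this layout might look like the sketch below, assuming a hypothetical table partitioned by date. Presto prunes partitions via the Hive Metastore, then parses the WKT and evaluates the geometry predicate on the worker nodes in parallel:

```sql
SELECT event_id
FROM hive.geo.events
WHERE event_date = '2019-12-01'   -- partition pruning via the Hive Metastore
  AND ST_Contains(
        ST_GeometryFromText('POLYGON ((139 35, 140 35, 140 36, 139 36, 139 35))'),
        ST_GeometryFromText(geom)  -- WKT parsed on the worker nodes
      );
```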

With this new architecture, better scalability is expected as Spark, FlashBlade S3 and Presto are all scale-out systems.

Comparing Geospatial Query Performance of Presto and PostGIS

We tested typical geospatial queries with both Presto and PostGIS. For ingestion, we measured the completion time of a single Spark job writing query-ready geospatial data. For read queries, we measured latency for typical geospatial queries in single-session and concurrency scenarios. We ran the ingestion job and queries multiple times and took the average speed as the result.

Because of resource limits, we ran the tests on a small setup this time.

  • 4-node Presto cluster.
  • 1-node PostgreSQL/PostGIS instance.
  • 3-node Spark cluster.
  • Each node is a virtual machine with 8 vCores and 32GB RAM.
  • FlashBlade with 15 blades (definitely over-spec compared to the compute, but this is what we had for the test).

The test dataset is in the tens of GBs and is splittable.

Ingestion Performance

We tested two different input types. Below are the results.

Presto is 1.4–3.5x faster for ingestion. This might be a little misleading, because Presto is not really involved in the write path: the Spark job writes geospatial data directly into the FlashBlade S3 bucket. This differs from PostGIS, where data is written through the database layer running on a single node. We consider this the main reason ingestion is faster for Presto, since everything in the Presto pipeline is distributed.

Another reason is that with PostGIS, multiple indices were created on the geospatial table for fast lookup. This slows down writes, as PostGIS needs to update the indices during ingestion. With Presto, there are no true indices. For fast reads, some extra logic is put into the Spark job to optimize the data layout for Presto to query later, by leveraging Hive partitions and sorted columns in ORC.
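That layout optimization can be sketched as a Presto CTAS (in our case the equivalent logic lives in the Spark job; all names here are hypothetical). Partitioning enables partition pruning, and sorting within buckets lets ORC min/max statistics skip row groups:

```sql
CREATE TABLE hive.geo.events_optimized
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['event_date'],   -- partition columns must come last in the SELECT
  bucketed_by = ARRAY['region_id'],
  bucket_count = 16,
  sorted_by = ARRAY['region_id']          -- rows sorted within each bucket
)
AS
SELECT event_id, geom, region_id, event_date
FROM hive.geo.events;
```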

We expect the performance gap to be bigger with a larger dataset and more Spark nodes.

Query Performance

Typical geospatial queries tested include computing a simplified geometry (ST_Simplify), the distance between two geometries (ST_Distance), the relationship between geometries (ST_Contains), the convex hull of multiple geometries (ST_ConvexHull) and so on. Many queries are simple lookups; some include joins on a geometry column.
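In Presto syntax, these query shapes look roughly like the sketches below. The table and column names are hypothetical, and note that Presto names its simplification function simplify_geometry rather than PostGIS's ST_Simplify:

```sql
-- Simplified geometry:
SELECT ST_AsText(simplify_geometry(ST_GeometryFromText(geom), 0.01))
FROM shapes;

-- Distance between two geometries:
SELECT ST_Distance(ST_GeometryFromText(a.geom), ST_GeometryFromText(b.geom))
FROM shapes a JOIN shapes b ON a.pair_id = b.pair_id;

-- Relationship between geometries:
SELECT count(*)
FROM points p, areas r
WHERE ST_Contains(ST_GeometryFromText(r.geom), ST_Point(p.lon, p.lat));

-- Convex hull of multiple geometries:
SELECT ST_AsText(convex_hull_agg(ST_GeometryFromText(geom)))
FROM shapes;
```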

Here are the results for single-session small and big queries.

Single Session, Small Query
Single Session, Big Query

In the concurrency test, we simulated 10 sessions (equivalent to 10 users), where each session runs the same query 10 times.

Concurrency Test

PostGIS is faster for small queries that hit table indices, but it became very slow for big queries and for queries that do not hit an index.
Presto is faster for big queries. It is more than 6x faster than PostGIS for query 6, the second-largest query in the test. For the largest query 5, Presto took 11s, while PostGIS timed out after not returning within 5 minutes. High CPU load was observed on both PostGIS and Presto for queries 5 and 6, indicating our geospatial queries are CPU-bound.

The same trend holds for both the single-session and concurrency tests. All Presto nodes were under high CPU load in the concurrency test, which is good: it means load was evenly distributed across the nodes in the cluster, and it is very likely the system can scale by adding more nodes.

It is also interesting to see that Presto delivered consistent, relatively low latency for the small queries.

Conclusion

As expected, PostGIS was fast for small queries, while Presto was good for big queries. In our case, it seems better to use Presto for the big geospatial tables and keep using PostGIS for the small metadata tables.

I only ran the test on a small setup this time, but I definitely would like to test with a larger dataset and faster, more numerous servers. Stay tuned.

As a final note, I sent a pull request to Presto to extend ST_Points to support the major well-known spatial object types. I found it straightforward to set up the development and unit-test environment for the code base. The PR was kindly reviewed by one of the Presto committers. Another +1 for Presto for scalable geospatial queries.
