Skip to content

Parsing and analysis of Vertica, Hive, and Presto SQL.

License

Notifications You must be signed in to change notification settings

uber/queryparser

Repository files navigation

queryparser

Parsing and analysis of Vertica, Hive, and Presto SQL.

Gentle introduction

Queryparser supports parsing for three sql-dialects (Vertica, Hive, and Presto).

Each dialect implements its own tokenization and parsing logic. There is a single abstract syntax tree (AST) for representing queries of any dialect. This AST is defined in Database/Sql/Type.hs and Database/Sql/Type/Query.hs.

The parsing logic produces an AST with table and column identifiers that are "raw" or optionally qualified. Frequently, it is desirable to convert the AST over raw names to an AST over resolved names, where identifiers are fully qualified. This transformation is called "name resolution" or simply "resolution". It requires as input the full list of columns in every table and the full list of tables in every schema, otherwise known as "catalog information".

An example of resolution:

  • SELECT column_c FROM table_t is a query with raw names
  • SELECT schema_s.table_t.column_c FROM schema_s.table_t is the same query with resolved names

Various utility functions ("analyses") have been defined for ASTs over resolved names, such as:

  • what tables appear in the query?
  • what columns appear in the query, per clause?
  • what is the table/column lineage of a query?
  • what sets of columns does a query compare for equality?

The analyses are the main value-add of this library.

Demo

Enough talk, let's show what it can do!

$ stack ghci
...
...> :set prompt "> "
> import Demo
> -- for the purposes of our demo, we have two tables: foo with columns a,b,c and bar with columns x,y,z
> demoAllAnalyses "SELECT * FROM foo" -- note that the SELECT * expands to a,b,c
Tables accessed:
    public.foo
Columns accessed by clause:
    public.foo.a	SELECT
    public.foo.b	SELECT
    public.foo.c	SELECT
Joins:
    no joins
Table lineage:
    no tables modified
> demoAllAnalyses "SELECT * FROM bar" -- and here the SELECT * expands to x,y,z
Tables accessed:
    public.bar
Columns accessed by clause:
    public.bar.x	SELECT
    public.bar.y	SELECT
    public.bar.z	SELECT
Joins:
    no joins
Table lineage:
    no tables modified
> demoAllAnalyses "SELECT x, count(1) FROM foo JOIN bar ON foo.a = bar.y WHERE z IS NOT NULL GROUP BY 1 ORDER BY 2 DESC, b"
Tables accessed:
    public.bar
    public.foo
Columns accessed by clause:
    public.bar.x	GROUPBY
    public.bar.x	SELECT
    public.bar.y	JOIN
    public.bar.z	WHERE
    public.foo.a	JOIN
    public.foo.b	ORDER
Joins:
    public.bar.y <-> public.foo.a
Table lineage:
    no tables modified
> -- let's play with some queries that modify table-data!
> demoTableLineage "INSERT INTO foo SELECT * FROM bar"
public.foo after the query depends on public.bar, public.foo before the query
> demoTableLineage "TRUNCATE TABLE foo"
public.foo no longer has data
> demoTableLineage "ALTER TABLE bar, foo RENAME TO baz, bar"
public.bar after the query depends on public.foo before the query
public.baz after the query depends on public.bar before the query
public.foo no longer has data
> -- let's explore a few subtler behaviors of the "joins" analysis (admittedly, something of a misnomer)
> demoJoins "SELECT * FROM foo JOIN bar ON a=x AND b+c = y+z"
public.bar.x <-> public.foo.a
public.bar.y <-> public.foo.b
public.bar.y <-> public.foo.c
public.bar.z <-> public.foo.b
public.bar.z <-> public.foo.c
> demoJoins "SELECT a FROM foo UNION SELECT x FROM bar"
public.bar.x <-> public.foo.a

Spin up your own ghci and paste in your own queries!

Requirements

To build, you need:

  • stack
  • docker (recommended)

OS X

You can install stack on OS X with brew:

brew install ghc haskell-stack

To install docker, use the default installer found on https://docs.docker.com/mac/

To allow stack to see the docker daemon, add eval "$(docker-machine env default)" to your bashrc or equivalent. (This is following https://docs.docker.com/machine/reference/env/)

Linux & Other

Follow the directions at https://github.com/commercialhaskell/stack#how-to-install

Installation

Once you've got what you need, check out the repo and change to the directory.

Stack will download other dependencies for you:

stack setup

Now you can build the package with:

stack build

Or run the tests with:

stack test

Or run the benchmarks with:

stack bench

Or pull things up in ghci with:

stack ghci

Contributing

If you'd like to contribute to the repo, use the above installation instructions to get started.

When you're ready, make sure that the code compiles with the Development flag, i.e.:

stack build --flag queryparser:development

Use

Mostly it boils down to this function:

parse :: Text -> Either ParseError SqlQuery

To parse some sql from the repl,

parse "SELECT 1;"

Areas of future interest

There is substantial room for future work in Queryparser. For more details, see Areas of future interest.