Skip to content

BatchLabs/charlatan

Repository files navigation

Charlatan

Build Status

Charlatan is a query engine for lists or streams of records in different formats. It natively supports CSV and JSON formats but can easily be extended to others.

It supports an SQL-like query language that is defined below. Queries are applied to records to extract values depending on zero or more criteria.

Query Syntax

SELECT <fields> FROM <source> [ WHERE <value> ] [ STARTING AT <index> ] [ LIMIT <count> ]
  • <fields> is a list of comma-separated field names. Each field name must exist in the source. When reading CSV files, the field names are the column names, while when reading JSON they represent keys.
  • <source> is the filename from which the data is read. The API is agnostique on this and one can implement support for any source type.
  • <value> is a SQL-like value, which can be either a constant (e.g. WHERE 1), a field (e.g. WHERE archived) or any operation using comparison operators (=, !=, <, <=, >, >=, AND, OR) and optionally parentheses (e.g. WHERE (foo > 2) AND (bar = "yo")). The parser allows to use && instead of AND and || instead of OR. It also support inclusive range tests, like WHERE age BETWEEN 20 AND 30.
  • STARTING AT <index> can be used to skip the first N records.
  • LIMIT <count> can be used to keep only the first N matched records.

Constant values include strings, integers, floats, booleans and the null value.

Examples

SELECT CountryName FROM sample/csv/population.csv WHERE Year = 2010 AND Value > 50000000 AND Value < 70000000
SELECT name, age FROM sample/json/people.jsons WHERE stats.walking > 30 AND stats.biking < 300
SELECT name, age FROM sample/json/people.jsons WHERE stats.walking BETWEEN 20 AND 100

Type Coercion Rules

  • int: same value if the constant is an integer. Truncated value if it’s a float. 1 if it’s a true boolean. 0 for everything else.
  • float: same value if the constant is an integer or a float. 1.0 if it’s a true boolean. 0.0 for everything else.
  • boolean: true if it’s a string (even if it’s empty), a true boolean, a non-zero integer or float. false for everything else.
  • string: the string representation of the constant. null becomes "null"

These rules mean that e.g. WHERE 0 is equivalent to WHERE false and WHERE "" is equivalent to WHERE true.

API

The library is responsible for parsing the query and executing against records. Everything else is up to you, including how fields are retrieved from records.

Note: code examples below don’t include error handling for clarity purposes.

// parse the query
query, _ := charlatan.QueryFromString("SELECT foo FROM myfile.json WHERE foo > 2")

// open the source file
reader, _ := os.Open(query.From())

defer reader.Close()

// skip lines if the query contains "STARTING AT <n>"
skip := query.StartingAt()

decoder := json.NewDecoder(reader)

for {
    skip--
    if skip >= 0 {
        continue
    }

    // get a new JSON record
    r, err := record.NewJSONRecordFromDecoder(decoder)

    if err == io.EOF {
        break
    }

    // evaluate the query against the record to test if it matches
    if match, _ := query.Evaluate(r); !match {
        continue
    }

    // extract the values and print them
    values, _ := query.FieldsValues(r)
    fmt.Printf("%v\n", values)
}

Two record types are included: JSONRecord and CSVRecord. Implementing a record only requires one method: Find(*Field) (*Const, error), which takes a field and return its value.

As an example, let’s implement a LineRecord that’ll be used to get specific characters on each line of a file, c0 being the first character:

type LineRecord struct { Line string }

func (r *LineRecord) Find(f *charlatan.Field) (*charlatan.Const, error) {

    // this is the field value we must return
    name := f.Name()

    // we reject fields that doesn't start with 'c'
    if len(name) < 2 || name[0] != 'c' {
        return nil, fmt.Errorf("Unknown field '%s'", name)
    }

    // we extract the character index from the field name.
    index, err := strconv.ParseInt(name[1:], 10, 64)
    if err != nil {
        return nil, err
    }

    // let's not be too strict and accept out-of-range indexes
    if index < 0 || index >= int64(len(r.Line)) {
        return charlatan.StringConst(""), nil
    }

    return charlatan.StringConst(fmt.Sprintf("%c", r.Line[index])), nil
}

One can now loop over a file’s content, construct LineRecords from its lines and evaluate queries against them:

query, _ := charlatan.QueryFromString("SELECT c1 FROM myfile WHERE c0 = 'a'")

f, _ := os.Open(query.From())
defer f.Close()

s := bufio.NewScanner(f)
for s.Scan() {
    r := &LineRecord{Line: s.Text()}

    if m, _ := query.Evaluate(r); !m {
        continue
    }

    values, _ := query.FieldsValues(r)
    fmt.Printf("%v\n", values)
}

Examples

Two examples are included in the repository under sample/csv/ and sample/json/.

Authors