parquet-go is a pure-go implementation of reading and writing the parquet format file.
- Support Read/Write Nested/Flat Parquet File
- Support all Types in Parquet
- Very simple to use
- High performance
- git.apache.org/thrift.git/lib/go/thrift
- github.com/golang/snappy
Add the parquet-go library to your $GOPATH/src:
go get github.com/xitongsys/parquet-go
Look at a few examples in example/
.
go run example/local_flat.go
There are two types in Parquet: Base Type and Logical Type. The type definitions is in ParquetType.go. The relationship between the parquet type and go type is shown in following table.
Parquet Type | Go Type | Example |
---|---|---|
BOOLEAN | bool | parquet:"name=name, type=BOOLEAN" |
INT32 | int32 | parquet:"name=name, type=INT32" |
INT64 | int64 | parquet:"name=name, type=INT64" |
INT96 | string | parquet:"name=name, type=INT96" |
FLOAT | float32 | parquet:"name=name, type=FLOAT" |
DOUBLE | float64 | parquet:"name=name, type=DOUBLE" |
BYTE_ARRAY | string | parquet:"name=name, type=BYTE_ARRAY" |
FIXED_LEN_BYTE_ARRAY | string | parquet:"name=name, type=FIXED_LEN_BYTE_ARRAY, length=10" |
UTF8 | string | parquet:"name=name, type=UTF8" |
INT_8 | int32 | parquet:"name=name, type=INT_8" |
INT_16 | int32 | parquet:"name=name, type=INT_16" |
INT_32 | int32 | parquet:"name=name, type=INT_32" |
INT_64 | int64 | parquet:"name=name, type=INT_64" |
UINT_8 | uint32 | parquet:"name=name, type=UINT_8" |
UINT_16 | uint32 | parquet:"name=name, type=UINT_16" |
UINT_32 | uint32 | parquet:"name=name, type=UINT_32" |
UINT_64 | uint64 | parquet:"name=name, type=UINT_64" |
DATE | int32 | parquet:"name=name, type=DATE" |
TIME_MILLIS | int32 | parquet:"name=name, type=TIME_MILLIS" |
TIME_MICROS | int64 | parquet:"name=name, type=TIME_MICROS" |
TIMESTAMP_MILLIS | int64 | parquet:"name=name, type=TIMESTAMP_MILLIS" |
TIMESTAMP_MICROS | int64 | parquet:"name=name, type=TIMESTAMP_MICROS" |
INTERVAL | string | parquet:"name=name, type=INTERVAL" |
DECIMAL | string | parquet:"name=name, type=DECIMAL, scale=2, precision=2" |
List | slice | parquet:"name=name, type=INT64" |
Map | map | parquet:"name=name, type=INT64, keytype=INT64" |
There are three repetition types in Parquet: REQUIRED, OPTIONAL, REPEATED.
Repetition Type | Example | Description |
---|---|---|
REQUIRED | V1 int32 `parquet:"name=v1, type=INT32"` |
No extra description |
OPTIONAL | V1 *int32 `parquet:"name=v1, type=INT32"` |
Declare as pointer |
REPEATED | V1 []int32 `parquet:"name=v1, type=int32, repetitontype=repeated"` |
Add 'repetitiontype=repeated' in tags |
The different between a List and a REPEATED variable is the 'repetitiontype' in tags. Although both of them are stored as slice in go, they are different in parquet. You can find the detail of List in parquet at here.
The core data structure named "Table":
type Table struct {
RepetitionType parquet.FieldRepetitionType
Type parquet.Type
Path []string
MaxDefinitionLevel int32
MaxRepetitionLevel int32
Values []interface{}
DefinitionLevels []int32
RepetitionLevels []int32
}
Values is the column data; RepetitionLevels is the repetition levels of the values; DefinitionLevels is the definition levels of the values. The architecture of the data struct is following:
Tables -> Page
Pages -> Chunk
Chunks -> RowGroup
RowGroups -> ParquetFile
read/write a parquet file need a ParquetFile interface implemented
type ParquetFile interface {
Seek(offset int, pos int) (int64, error)
Read(b []byte) (n int, err error)
Write(b []byte) (n int, err error)
Close()
Open(name string) (ParquetFile, error)
Create(name string) (ParquetFile, error)
}
Using this interface, parquet-go can read/write parquet file on any plantform(local/hdfs/s3...)
The following is a simple example of read/write parquet file on local disk. It can be found in example directory:
package main
import (
"github.com/xitongsys/parquet-go/ParquetFile"
"github.com/xitongsys/parquet-go/ParquetReader"
"github.com/xitongsys/parquet-go/ParquetWriter"
"log"
"time"
)
type Student struct {
Name string `parquet:"name=name, type=UTF8"`
Age int32 `parquet:"name=age, type=INT32"`
Id int64 `parquet:"name=id, type=INT64"`
Weight float32 `parquet:"name=weight, type=FLOAT"`
Sex bool `parquet:"name=sex, type=BOOLEAN"`
Day int32 `parquet:"name=day, type=DATE"`
}
func main() {
fw, _ := ParquetFile.NewLocalFileWriter("flat.parquet")
//write flat
pw, _ := ParquetWriter.NewParquetWriter(fw, new(Student), 4)
num := 10
for i := 0; i < num; i++ {
stu := Student{
Name: "StudentName",
Age: int32(20 + i%5),
Id: int64(i),
Weight: float32(50.0 + float32(i)*0.1),
Sex: bool(i%2 == 0),
Day: int32(time.Now().Unix() / 3600 / 24),
}
pw.Write(stu)
}
pw.Flush(true)
pw.WriteStop()
log.Println("Write Finished")
fw.Close()
///read flat
fr, _ := ParquetFile.NewLocalFileReader("flat.parquet")
pr, err := ParquetReader.NewParquetReader(fr, new(Student), 4)
if err != nil {
log.Println("Failed new reader", err)
}
num = int(pr.GetNumRows())
for i := 0; i < num; i++ {
stus := make([]Student, 1)
pr.Read(&stus)
log.Println(stus)
}
pr.ReadStop()
fr.Close()
}
If you just want to get some columns data, your can use column reader
///read flat
fr, _ := ParquetFile.NewLocalFileReader("column.parquet")
pr, err := ParquetReader.NewParquetColumnReader(fr, 4)
if err != nil {
log.Println("Failed new reader", err)
}
num = int(pr.GetNumRows())
names := make([]interface{}, num)
pr.ReadColumnByPath("name", &names)
log.Println(names)
ids := make([]interface{}, num)
pr.ReadColumnByIndex(2, &ids)
log.Println(ids)
pr.ReadStop()
fr.Close()
Read/Write initial functions have a parallel parameters np which is the number of goroutines in reading/writing.
func NewParquetReader(pFile ParquetFile.ParquetFile, obj interface{}, np int64) (*ParquetReader, error)
func NewParquetWriter(pFile ParquetFile.ParquetFile, obj interface{}, np int64) (*ParquetWriter, error)
Plugin is used for some special purpose and will be added gradually.
This plugin is used for data format similar with CSV(not nested).
func main() {
md := []CSVWriter.MetadataType{
{Type: "UTF8", Name: "Name"},
{Type: "INT32", Name: "Age"},
{Type: "INT64", Name: "Id"},
{Type: "FLOAT", Name: "Weight"},
{Type: "BOOLEAN", Name: "Sex"},
}
//write flat
fw, _ := ParquetFile.NewLocalFileWriter("csv.parquet")
pw, _ := CSVWriter.NewCSVWriter(md, fw, 1)
num := 10
for i := 0; i < num; i++ {
data := []string{
fmt.Sprintf("%s_%d", "Student Name", i),
fmt.Sprintf("%d", 20+i%5),
fmt.Sprintf("%d", i),
fmt.Sprintf("%f", 50.0+float32(i)*0.1),
fmt.Sprintf("%t", i%2 == 0),
}
rec := make([]*string, len(data))
for j := 0; j < len(data); j++ {
rec[j] = &data[j]
}
pw.WriteString(rec)
data2 := []interface{}{
ParquetType.UTF8("Student Name"),
ParquetType.INT32(20 + i*5),
ParquetType.INT64(i),
ParquetType.FLOAT(50.0 + float32(i)*0.1),
ParquetType.BOOLEAN(i%2 == 0),
}
pw.Write(data2)
}
pw.Flush(true)
pw.WriteStop()
log.Println("Write Finished")
fw.Close()
}