Title: | Read and Write 'Parquet' Files |
Version: | 0.4.2 |
Description: | Self-sufficient reader and writer for flat 'Parquet' files. Can read most 'Parquet' data types. Can write many 'R' data types, including factors and temporal types. See docs for limitations. |
Depends: | R (≥ 4.0.0) |
License: | MIT + file LICENSE |
URL: | https://github.com/r-lib/nanoparquet, https://nanoparquet.r-lib.org/ |
BugReports: | https://github.com/r-lib/nanoparquet/issues |
Encoding: | UTF-8 |
Suggests: | arrow, bit64, DBI, duckdb, hms, mockery, pillar, processx, rprojroot, spelling, testthat, tzdb, withr |
RoxygenNote: | 7.3.2.9000 |
Config/testthat/edition: | 3 |
Config/Needs/website: | tidyverse/tidytemplate, r-lib/pkgdown, dplyr, gt, gtExtras, knitr, nycflights13, prettyunits, quarto, rmarkdown, sessioninfo, svglite |
Language: | en-US |
Biarch: | true |
NeedsCompilation: | yes |
Packaged: | 2025-02-22 10:23:40 UTC; gaborcsardi |
Author: | Gábor Csárdi [aut, cre],
Hannes Mühleisen |
Maintainer: | Gábor Csárdi <csardi.gabor@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-02-22 10:50:03 UTC |
nanoparquet: Read and Write 'Parquet' Files
Description
Self-sufficient reader and writer for flat 'Parquet' files. Can read most 'Parquet' data types. Can write many 'R' data types, including factors and temporal types. See docs for limitations.
Details
nanoparquet
is a reader and writer for a common subset of Parquet files.
Features:
Read and write flat (i.e. non-nested) Parquet files.
Can read most Parquet data types.
Can read a subset of columns from a Parquet file.
Can write many R data types, including factors and temporal types to Parquet.
Can append a data frame to a Parquet file without first reading and then rewriting the whole file.
Completely dependency free.
Supports Snappy, Gzip and Zstd compression.
Competitive with other tools in terms of speed, memory use and file size.
Limitations:
Nested Parquet types are not supported.
Some Parquet logical types are not supported: INTERVAL, UNKNOWN.
Only Snappy, Gzip and Zstd compression is supported.
Encryption is not supported.
Reading files from URLs is not supported.
nanoparquet always reads the data (or the selected subset of it) into memory. It does not work with out-of-memory data in Parquet files like Apache Arrow and DuckDB do.
Installation
Install the R package from CRAN:
install.packages("nanoparquet")
Usage
Read
Call read_parquet()
to read a Parquet file:
df <- nanoparquet::read_parquet("example.parquet")
To see the columns of a Parquet file and how their types are mapped to
R types by read_parquet()
, call read_parquet_schema()
first:
nanoparquet::read_parquet_schema("example.parquet")
Folders of similar-structured Parquet files (e.g. produced by Spark) can be read like this:
df <- data.table::rbindlist(lapply(
  Sys.glob("some-folder/part-*.parquet"),
  nanoparquet::read_parquet
))
Write
Call write_parquet()
to write a data frame to a Parquet file:
nanoparquet::write_parquet(mtcars, "mtcars.parquet")
To see how the columns of the data frame will be mapped to Parquet types
by write_parquet()
, call infer_parquet_schema()
first:
nanoparquet::infer_parquet_schema(mtcars)
Inspect
Call read_parquet_info()
, read_parquet_schema()
, or
read_parquet_metadata()
to see various kinds of metadata from a Parquet
file:
- read_parquet_info() shows a basic summary of the file.
- read_parquet_schema() shows all columns, including non-leaf columns, and how they are mapped to R types by read_parquet().
- read_parquet_metadata() shows the most complete metadata information: file meta data, the schema, the row groups and column chunks of the file.
nanoparquet::read_parquet_info("mtcars.parquet")
nanoparquet::read_parquet_schema("mtcars.parquet")
nanoparquet::read_parquet_metadata("mtcars.parquet")
If you find a file that should be supported but isn't, please open an issue here with a link to the file.
Options
See also ?parquet_options()
for further details.
- nanoparquet.class: extra class to add to data frames returned by read_parquet(). If it is not defined, the default is "tbl", which changes how the data frame is printed if the pillar package is loaded.
- nanoparquet.compression_level: see ?parquet_options() for the defaults and the possible values for each compression method. Inf selects maximum compression for each method.
- nanoparquet.num_rows_per_row_group: the number of rows to put into a row group by write_parquet(), if row groups are not specified explicitly. It should be an integer scalar. Defaults to 10 million.
- nanoparquet.use_arrow_metadata: unless this is set to FALSE, read_parquet() will make use of Arrow metadata in the Parquet file. Currently this is used to detect factor columns.
- nanoparquet.write_arrow_metadata: unless this is set to FALSE, write_parquet() will add Arrow metadata to the Parquet file. This helps preserve column classes, e.g. factors will be read back as factors, both by nanoparquet and Arrow.
- nanoparquet.write_data_page_version: data page version to write by default. Possible values are 1 and 2. Default is 1.
- nanoparquet.write_minmax_values: whether write_parquet() should write minimum and maximum values per row group, for data types that support this.
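For example, the options can be set globally for the rest of the session before calling the reader or writer. A minimal sketch using two of the options listed above (the output file name is made up):
# Maximum compression and no min/max statistics for subsequent writes.
options(
  nanoparquet.compression_level = Inf,
  nanoparquet.write_minmax_values = FALSE
)
nanoparquet::write_parquet(mtcars, "mtcars-small.parquet")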
License
MIT
Author(s)
Maintainer: Gábor Csárdi csardi.gabor@gmail.com
Authors:
Hannes Mühleisen (ORCID) [copyright holder]
Other contributors:
Google Inc. [copyright holder]
Apache Software Foundation [copyright holder]
Posit Software, PBC [copyright holder]
RAD Game Tools [copyright holder]
Valve Software [copyright holder]
Tenacious Software LLC [copyright holder]
Facebook, Inc. [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/r-lib/nanoparquet/issues
Append a data frame to an existing Parquet file
Description
The schema of the data frame must be compatible with the schema of the file.
Usage
append_parquet(
x,
file,
compression = c("snappy", "gzip", "zstd", "uncompressed"),
encoding = NULL,
row_groups = NULL,
options = parquet_options()
)
Arguments
x |
Data frame to append. |
file |
Path to the output file. |
compression |
Compression algorithm to use for the newly written data. See write_parquet(). |
encoding |
Encoding to use for the newly written data. It does not have to be the same as the encoding of the data already in file. |
row_groups |
Row groups of the new, extended Parquet file. |
options |
Nanoparquet options, for the new data, see parquet_options(). |
Warning
This function is not atomic! If it is interrupted, it may leave the file in a corrupt state. To work around this, create a copy of the original file, append the new data to the copy, and then rename the new, extended file to the original one.
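A minimal sketch of that workaround with base R file operations; new_rows stands in for a hypothetical data frame to append:
# Append to a copy of the file, then swap the copy into place.
file.copy("data.parquet", "data-new.parquet", overwrite = TRUE)
nanoparquet::append_parquet(new_rows, "data-new.parquet")
file.rename("data-new.parquet", "data.parquet")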
About row groups
A Parquet file may be partitioned into multiple row groups, and indeed
most large Parquet files are. append_parquet()
is only able to update
the existing file along the row group boundaries. There are two
possibilities:
- append_parquet() keeps all existing row groups in file, and creates new row groups for the new data. This mode can be forced by the keep_row_groups option in options, see parquet_options().
- Alternatively, append_parquet() will overwrite the last row group in file, with its existing contents plus (the beginning of) the new data. This mode makes more sense if the last row group is small, because many small row groups are inefficient.
By default append_parquet() chooses between the two modes automatically, aiming to create row groups with at least num_rows_per_row_group (see parquet_options()) rows. You can customize this behavior with the keep_row_groups option and the row_groups argument.
See Also
Infer Parquet schema of a data frame
Description
Infer Parquet schema of a data frame
Usage
infer_parquet_schema(df, options = parquet_options())
Arguments
df |
Data frame. |
options |
Return value of parquet_options(). |
Value
Data frame, the inferred schema. It has the same columns as
the return value of read_parquet_schema()
:
file_name
, name
, r_type
, type
, type_length
, repetition_type
, converted_type
, logical_type
, num_children
, scale
, precision
, field_id
.
See Also
read_parquet_schema()
to read the schema of a Parquet file,
parquet_schema()
to create a Parquet schema from scratch.
nanoparquet's type maps
Description
How nanoparquet maps R types to Parquet types.
R's data types
When writing out a data frame, nanoparquet maps R's data types to Parquet logical types. The following table is a summary of the mapping. For the details see below.
R type | Parquet type | Default | Notes |
character | STRING (BYTE_ARRAY) | x | I.e. STRSXP. Converted to UTF-8. |
" | BYTE_ARRAY | ||
" | FIXED_LEN_BYTE_ARRAY | ||
" | ENUM | ||
" | UUID | ||
Date | DATE | x | |
difftime | INT64 | x | If not hms::hms. Arrow metadata marks it as Duration(NS). |
factor | STRING | x | Arrow metadata marks it as a factor. |
" | ENUM | ||
hms::hms | TIME(true, MILLIS) | x | Sub-millisecond precision is lost. |
integer | INT(32, true) | x | I.e. INTSXP. |
" | INT64 | ||
" | INT96 | ||
" | DECIMAL (INT32) | ||
" | DECIMAL (INT64) | ||
" | INT(8, *) | ||
" | INT(16, *) | ||
" | INT(32, signed) | ||
list | BYTE_ARRAY | x | Must be a list of raw vectors. Missing values are NULL. |
" | FIXED_LEN_BYTE_ARRAY | | Must be a list of raw vectors of the same length. Missing values are NULL. |
logical | BOOLEAN | x | I.e. LGLSXP. |
numeric | DOUBLE | x | I.e. REALSXP. |
" | INT96 | ||
" | FLOAT | ||
" | DECIMAL (INT32) | ||
" | DECIMAL (INT64) | ||
" | INT(*, *) | ||
" | FLOAT16 | ||
POSIXct | TIMESTAMP(true, MICROS) | x | Sub-microsecond precision is lost. |
The non-default mappings can be selected via the schema
argument. E.g.
to write out a factor column called 'name' as ENUM
, use
write_parquet(..., schema = parquet_schema(name = "ENUM"))
The detailed mapping rules are listed below, in order of preference. These rules will likely change until nanoparquet reaches version 1.0.0.
- Factors (i.e. vectors that inherit the factor class) are converted to character vectors using as.character(), then written as a STRSXP (character vector) type. The fact that a column is a factor is stored in the Arrow metadata (see below), unless the nanoparquet.write_arrow_metadata option is set to FALSE.
- Dates (i.e. the Date class) are written as the DATE logical type, which is an INT32 type internally.
- hms objects (from the hms package) are written as the TIME(true, MILLIS) logical type, which is internally the INT32 Parquet type. Sub-millisecond precision is lost.
- POSIXct objects are written as the TIMESTAMP(true, MICROS) logical type, which is internally the INT64 Parquet type. Sub-microsecond precision is lost.
- difftime objects (that are not hms objects, see above) are written as an INT64 Parquet type, noting in the Arrow metadata (see below) that this column has type Duration with NANOSECONDS unit.
- Integer vectors (INTSXP) are written as the INT(32, true) logical type, which corresponds to the INT32 type.
- Real vectors (REALSXP) are written as the DOUBLE type.
- Character vectors (STRSXP) are written as the STRING logical type, which has the BYTE_ARRAY type. They are always converted to UTF-8 before writing.
- Logical vectors (LGLSXP) are written as the BOOLEAN type.
- Other vectors currently throw an error.
You can use infer_parquet_schema()
on a data frame to map R data types
to Parquet data types.
To change the default R to Parquet mapping, use parquet_schema()
and
the schema
argument of write_parquet()
. Currently supported
non-default mappings are:
- integer to INT64,
- integer to INT96,
- double to INT96,
- double to FLOAT,
- character to BYTE_ARRAY,
- character to FIXED_LEN_BYTE_ARRAY,
- character to ENUM,
- factor to ENUM,
- integer to DECIMAL & INT32,
- integer to DECIMAL & INT64,
- double to DECIMAL & INT32,
- double to DECIMAL & INT64,
- integer to INT(8, *), INT(16, *), INT(32, signed),
- double to INT(*, *),
- character to UUID,
- double to FLOAT16,
- list of raw vectors to BYTE_ARRAY,
- list of raw vectors to FIXED_LEN_BYTE_ARRAY.
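A sketch that combines a few of these non-default mappings; the data frame and column names are made up, and the parameterized forms are described in parquet_schema():
d <- data.frame(id = 1:3, ratio = c(0.1, 0.2, 0.3), code = c("a", "b", "c"))
nanoparquet::write_parquet(
  d, "d.parquet",
  schema = nanoparquet::parquet_schema(
    id    = "INT64",   # integer -> INT64
    ratio = "FLOAT",   # double -> FLOAT
    code  = "ENUM"     # character -> ENUM
  )
)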
Parquet's data types
When reading a Parquet file nanoparquet also relies on logical types and the Arrow metadata (if present, see below) in addition to the low level data types. The following table summarizes the mappings. See more details below.
Parquet type | R type | Notes |
Logical types | ||
BSON | character | |
DATE | Date | |
DECIMAL | numeric | REALSXP, potentially losing precision. |
ENUM | character | |
FLOAT16 | numeric | REALSXP |
INT(8, *) | integer | |
INT(16, *) | integer | |
INT(32, *) | integer | Large unsigned values may overflow! |
INT(64, *) | numeric | REALSXP |
INTERVAL | list(raw) | Missing values are NULL . |
JSON | character | |
LIST | Not supported. | |
MAP | Not supported. | |
STRING | factor | If Arrow metadata says it is a factor. Also UTF8. |
" | character | Otherwise. Also UTF8. |
TIME | hms::hms | Also TIME_MILLIS and TIME_MICROS. |
TIMESTAMP | POSIXct | Also TIMESTAMP_MILLIS and TIMESTAMP_MICROS. |
UUID | character | In 00112233-4455-6677-8899-aabbccddeeff form. |
UNKNOWN | Not supported. | |
Primitive types | ||
BOOLEAN | logical | |
BYTE_ARRAY | factor | If Arrow metadata says it is a factor. |
" | list(raw) | Otherwise. Missing values are NULL . |
DOUBLE | numeric | REALSXP |
FIXED_LEN_BYTE_ARRAY | list(raw) | Missing values are NULL . |
FLOAT | numeric | REALSXP |
INT32 | integer | |
INT64 | numeric | REALSXP |
INT96 | POSIXct | |
The exact rules are below. These rules will likely change until nanoparquet reaches version 1.0.0.
- The BOOLEAN type is read as a logical vector (LGLSXP).
- The STRING logical type and the UTF8 converted type are read as a character vector with UTF-8 encoding.
- The DATE logical type and the DATE converted type are read as a Date R object.
- The TIME logical type and the TIME_MILLIS and TIME_MICROS converted types are read as an hms object, see the hms package.
- The TIMESTAMP logical type and the TIMESTAMP_MILLIS and TIMESTAMP_MICROS converted types are read as POSIXct objects. If the logical type has the UTC flag set, then the time zone of the POSIXct object is set to UTC.
- INT32 is read as an integer vector (INTSXP).
- INT64, DOUBLE and FLOAT are read as real vectors (REALSXP).
- INT96 is read as a POSIXct vector with the tzone attribute set to "UTC". It was an old convention to store time stamps as INT96 objects.
- The DECIMAL converted type (FIXED_LEN_BYTE_ARRAY or BYTE_ARRAY type) is read as a real vector (REALSXP), potentially losing precision.
- The ENUM logical type is read as a character vector.
- The UUID logical type is read as a character vector that uses the 00112233-4455-6677-8899-aabbccddeeff form.
- The FLOAT16 logical type is read as a real vector (REALSXP).
- BYTE_ARRAY is read as a factor object if the file was written by Arrow and the original data type of the column was a factor. (See 'The Arrow metadata' below.)
- Otherwise BYTE_ARRAY is read as a list of raw vectors, with missing values denoted by NULL.
Other logical and converted types are read as their annotated low level types:
- INT(8, true), INT(16, true) and INT(32, true) are read as integer vectors because they are INT32 internally in Parquet.
- INT(64, true) is read as a real vector (REALSXP).
- Unsigned integer types INT(8, false), INT(16, false) and INT(32, false) are read as integer vectors (INTSXP). Large positive values may overflow into negative values; this is a known issue that we will fix.
- INT(64, false) is read as a real vector (REALSXP). Large positive values may overflow into negative values; this is a known issue that we will fix.
- INTERVAL is a fixed length byte array, and nanoparquet reads it as a list of raw vectors. Missing values are denoted by NULL.
- JSON columns are read as character vectors (STRSXP).
- BSON columns are read as raw vectors (RAWSXP).
These types are not yet supported:
- Nested types (LIST, MAP) are not supported.
- The UNKNOWN logical type is not supported.
You can use the read_parquet_schema()
function to see how R would read
the columns of a Parquet file. Look at the r_type
column.
The Arrow metadata
Apache Arrow (i.e. the arrow R package) adds additional metadata to
Parquet files when writing them in arrow::write_parquet()
. Then,
when reading the file in arrow::read_parquet()
, it uses this metadata
to recreate the same Arrow and R data types as before writing.
nanoparquet::write_parquet()
also adds the Arrow metadata to Parquet
files, unless the nanoparquet.write_arrow_metadata
option is set to
FALSE
.
Similarly, nanoparquet::read_parquet()
uses the Arrow metadata in the
Parquet file (if present), unless the nanoparquet.use_arrow_metadata
option is set to FALSE.
The Arrow metadata is stored in the file level key-value metadata, with
key ARROW:schema
.
Currently nanoparquet uses the Arrow metadata for two things:
- It uses it to detect factors. Without the Arrow metadata factors are read as string vectors.
- It uses it to detect difftime objects. Without the Arrow metadata these are read as INT64 columns, containing the time difference in nanoseconds.
See Also
nanoparquet-package for options that modify the type mappings.
Map between R and Parquet data types
Description
Note that this function is now deprecated. Please use
read_parquet_schema()
for files, and infer_parquet_schema()
for
data frames.
Usage
parquet_column_types(x, options = parquet_options())
Arguments
x |
Path to a Parquet file, or a data frame. |
options |
Nanoparquet options, see parquet_options(). |
Details
This function works two ways. It can map the R types of a data frame to
Parquet types, to see how write_parquet()
would write out the data
frame. It can also map the types of a Parquet file to R types, to see
how read_parquet()
would read the file into R.
Value
Data frame with columns:
- file_name: file name.
- name: column name.
- type: (low level) Parquet data type.
- r_type: the R type that corresponds to the Parquet type. Might be NA if read_parquet() cannot read this column. See nanoparquet-types for the type mapping rules.
- repetition_type: whether the column is REQUIRED (cannot be NA) or OPTIONAL (may be NA). REPEATED columns are not currently supported by nanoparquet.
- logical_type: Parquet logical type in a list column. An element has at least an entry called type, and potentially additional entries, e.g. bit_width, is_signed, etc.
See Also
read_parquet_metadata()
to read more metadata,
read_parquet_info()
for a very short summary.
read_parquet_schema()
for the complete Parquet schema.
read_parquet()
, write_parquet()
, nanoparquet-types.
Nanoparquet options
Description
Create a list of nanoparquet options.
Usage
parquet_options(
class = getOption("nanoparquet.class", "tbl"),
compression_level = getOption("nanoparquet.compression_level", NA_integer_),
keep_row_groups = FALSE,
num_rows_per_row_group = getOption("nanoparquet.num_rows_per_row_group", 10000000L),
use_arrow_metadata = getOption("nanoparquet.use_arrow_metadata", TRUE),
write_arrow_metadata = getOption("nanoparquet.write_arrow_metadata", TRUE),
write_data_page_version = getOption("nanoparquet.write_data_page_version", 1L),
write_minmax_values = getOption("nanoparquet.write_minmax_values", TRUE)
)
Arguments
class |
The extra class or classes to add to data frames created in read_parquet(). |
compression_level |
The compression level in write_parquet(). |
keep_row_groups |
This option is used when appending to a Parquet file with append_parquet(). |
num_rows_per_row_group |
The number of rows to put into a row group, if row groups are not specified explicitly. It should be an integer scalar. Defaults to 10 million. |
use_arrow_metadata |
If this option is TRUE, then read_parquet() will make use of the Arrow metadata in the Parquet file, e.g. to detect factor columns. |
write_arrow_metadata |
Whether to add the Apache Arrow types as metadata to the file. |
write_data_page_version |
Data version to write by default. Possible values are 1 and 2. Default is 1. |
write_minmax_values |
Whether to write minimum and maximum values per row group, for data types that support this in write_parquet(). |
Value
List of nanoparquet options.
Examples
# the effect of using Arrow metadata
tmp <- tempfile(fileext = ".parquet")
d <- data.frame(
fct = as.factor("a"),
dft = as.difftime(10, units = "secs")
)
write_parquet(d, tmp)
read_parquet(tmp, options = parquet_options(use_arrow_metadata = TRUE))
read_parquet(tmp, options = parquet_options(use_arrow_metadata = FALSE))
Create a Parquet schema
Description
You can use this schema to specify how to write out a data frame to
a Parquet file with write_parquet()
.
Usage
parquet_schema(...)
Arguments
... |
Parquet type specifications, see below. For backwards compatibility, you can supply a file name here, and then parquet_schema() behaves like read_parquet_schema(). |
Details
A schema is a list of potentially named type specifications. A schema
is stored in a data frame. Each (potentially named) argument of
parquet_schema
may be a character scalar, or a list. Parameterized
types need to be specified as a list. Primitive Parquet types may be
specified as a string or a list.
Value
Data frame with the same columns as read_parquet_schema()
:
file_name
, name
, r_type
, type
, type_length
, repetition_type
, converted_type
, logical_type
, num_children
, scale
, precision
, field_id
.
Possible types:
Special type:
- "AUTO": this is not a Parquet type, but it tells write_parquet() to map the R type to Parquet automatically, using the default mapping rules.
Primitive Parquet types:
- "BOOLEAN"
- "INT32"
- "INT64"
- "INT96"
- "FLOAT"
- "DOUBLE"
- "BYTE_ARRAY"
- "FIXED_LEN_BYTE_ARRAY": fixed-length byte array. It needs a type_length parameter, an integer between 0 and 2^31-1.
Parquet logical types:
- "STRING"
- "ENUM"
- "UUID"
- "INTEGER": signed or unsigned integer. It needs a bit_width and an is_signed parameter. bit_width must be 8, 16, 32 or 64. is_signed must be TRUE or FALSE.
- "INT": same as "INTEGER". The Parquet documentation uses "INT", but the actual specification uses "INTEGER". Both are supported in nanoparquet.
- "DECIMAL": decimal number of specified scale and precision. It needs the precision and primitive_type parameters. It also supports the scale parameter, which defaults to zero if not specified.
- "FLOAT16"
- "DATE"
- "TIME": needs an is_adjusted_utc (TRUE or FALSE) and a unit parameter. unit must be "MILLIS", "MICROS" or "NANOS".
- "TIMESTAMP": needs an is_adjusted_utc (TRUE or FALSE) and a unit parameter. unit must be "MILLIS", "MICROS" or "NANOS".
- "JSON"
- "BSON"
Logical types MAP, LIST and UNKNOWN are not supported currently.
Converted types are deprecated in the Parquet specification in favor of
logical types, but parquet_schema()
accepts some converted types as a
syntactic shortcut for the corresponding logical types:
- INT_8 means list("INT", bit_width = 8, is_signed = TRUE).
- INT_16 means list("INT", bit_width = 16, is_signed = TRUE).
- INT_32 means list("INT", bit_width = 32, is_signed = TRUE).
- INT_64 means list("INT", bit_width = 64, is_signed = TRUE).
- TIME_MICROS means list("TIME", is_adjusted_utc = TRUE, unit = "MICROS").
- TIME_MILLIS means list("TIME", is_adjusted_utc = TRUE, unit = "MILLIS").
- TIMESTAMP_MICROS means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MICROS").
- TIMESTAMP_MILLIS means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MILLIS").
- UINT_8 means list("INT", bit_width = 8, is_signed = FALSE).
- UINT_16 means list("INT", bit_width = 16, is_signed = FALSE).
- UINT_32 means list("INT", bit_width = 32, is_signed = FALSE).
- UINT_64 means list("INT", bit_width = 64, is_signed = FALSE).
Missing values
Each type might also have a repetition_type
parameter, with possible
values "REQUIRED"
, "OPTIONAL"
or "REPEATED"
. "REQUIRED"
columns
do not allow missing values. Missing values are allowed in "OPTIONAL"
columns. "REPEATED"
columns are currently not supported in
write_parquet()
.
Examples
parquet_schema(
c1 = "INT32",
c2 = list("INT", bit_width = 64, is_signed = TRUE),
c3 = list("STRING", repetition_type = "OPTIONAL")
)
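# A second sketch with parameterized logical types; the parameter names
# follow the list above (DECIMAL, TIME, FIXED_LEN_BYTE_ARRAY).
parquet_schema(
  price = list("DECIMAL", precision = 10, scale = 2, primitive_type = "INT64"),
  when  = list("TIME", is_adjusted_utc = TRUE, unit = "MICROS"),
  tag   = list("FIXED_LEN_BYTE_ARRAY", type_length = 16)
)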
Parquet encodings
Description
Various Parquet encodings
Nanoparquet defaults
Currently the defaults are decided based on the R types. This might change in the future. In general, the defaults will likely change until nanoparquet reaches version 1.0.0.
Current encoding defaults:
- Definition levels always use RLE. (Nanoparquet does not currently write repetition levels, but they'll also use RLE, once implemented.)
- factor columns use RLE_DICTIONARY.
- logical columns use RLE if the average run length of the first 10,000 values is at least 15. Otherwise they use the PLAIN encoding.
- integer, double and character columns use RLE_DICTIONARY if at least two thirds of their values are repeated. Otherwise they use the PLAIN encoding.
- list columns of raw vectors always use the PLAIN encoding currently.
Parquet encodings
See https://github.com/apache/parquet-format/blob/master/Encodings.md for more details on Parquet encodings.
PLAIN
encoding
Supported types: all.
In general values are written back to back:
- Integer types are little endian.
- Floating point types follow the IEEE standard.
- BYTE_ARRAY: for each element, there is a little endian 4-byte length and then the bytes themselves.
- FIXED_LEN_BYTE_ARRAY: bytes are written back to back.
Nanoparquet can read and write this encoding for all primitive types.
RLE_DICTIONARY
encoding
Supported types: dictionary indices in data pages.
This encoding combines run-length encoding and bit-packing.
Repeated sequences of the same value can be run-length encoded, and
non-repeated parts are bit packed.
It is used for data pages of dictionaries.
The dictionary pages themselves are PLAIN
encoded.
The deprecated PLAIN_DICTIONARY
name is treated the same as
RLE_DICTIONARY
.
Nanoparquet can read and write this encoding.
RLE
encoding
Supported types: BOOLEAN
. Also for definition and repetition levels.
This is the same encoding as RLE_DICTIONARY
, with a slightly different
header. It combines run-length encoding and bit packing.
It is used for BOOLEAN
columns, and also for definition and
repetition levels.
Nanoparquet can read and write this encoding.
BIT_PACKED
encoding (deprecated in favor of RLE
)
Supported types: none. Only for definition and repetition levels, but
RLE
should be used instead.
This is a simple bit packing encoding for integers that was previously used for encoding definition and repetition levels. It is not used in new Parquet files because the RLE encoding includes it and is better.
Nanoparquet currently cannot read or write the BIT_PACKED
encoding.
DELTA_BINARY_PACKED
encoding
Supported types: INT32
, INT64
.
This encoding efficiently encodes integer columns if the differences between consecutive elements are often the same, and/or the differences between consecutive elements are small. The extreme case of an arithmetic sequence can be encoded in O(1) space.
Nanoparquet can read this encoding, but cannot currently write it.
DELTA_LENGTH_BYTE_ARRAY
encoding
Supported types: BYTE_ARRAY
.
This encoding uses DELTA_BINARY_PACKED
to encode the length of all
byte array elements. It is especially efficient for short byte array
elements, i.e. a column of short strings.
Nanoparquet can read this encoding, but cannot currently write it.
DELTA_BYTE_ARRAY
encoding
Supported types: BYTE_ARRAY
, FIXED_LEN_BYTE_ARRAY
.
This encoding is efficient if consecutive byte array elements share the same prefix, because each element can reuse a prefix of the previous element.
Nanoparquet can read this encoding, but cannot currently write it.
BYTE_STREAM_SPLIT
encoding
Supported types: FLOAT
, DOUBLE
, INT32
, INT64
,
FIXED_LEN_BYTE_ARRAY
.
This encoding stores the first bytes of the elements first, then the second bytes, etc. It does not reduce the size in itself, but may allow more efficient compression.
Nanoparquet can read this encoding, but cannot currently write it.
See Also
write_parquet()
on how to select a non-default encoding when
writing Parquet files.
Read a Parquet file into a data frame
Description
Converts the contents of the named Parquet file to an R data frame.
Usage
read_parquet(file, col_select = NULL, options = parquet_options())
Arguments
file |
Path to a Parquet file. It may also be an R connection, in which case it first reads all data from the connection, writes it into a temporary file, then reads the temporary file, and deletes it. The connection might be open, in which case it must be a binary connection. If it is not open, then read_parquet() will open it and also close it when done. |
col_select |
Columns to read. It can be a numeric vector of column indices, or a character vector of column names. It is an error to select the same column multiple times. The order of the columns in the result is the same as the order in col_select. |
options |
Nanoparquet options, see parquet_options(). |
Value
A data.frame
with the file's contents.
See Also
See write_parquet()
to write Parquet files,
nanoparquet-types for the R <-> Parquet type mapping.
See read_parquet_info()
, for general information,
read_parquet_schema()
for information about the
columns, and read_parquet_metadata()
for the complete metadata.
Examples
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
parquet_df <- nanoparquet::read_parquet(file_name)
print(str(parquet_df))
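# Read only the first two columns; col_select also accepts column names.
parquet_df2 <- nanoparquet::read_parquet(file_name, col_select = 1:2)
print(str(parquet_df2))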
Short summary of a Parquet file
Description
Short summary of a Parquet file
Usage
read_parquet_info(file)
parquet_info(file)
Arguments
file |
Path to a Parquet file. |
Value
Data frame with columns:
- file_name: file name.
- num_cols: number of (leaf) columns.
- num_rows: number of rows.
- num_row_groups: number of row groups.
- file_size: file size in bytes.
- parquet_version: Parquet version.
- created_by: a string scalar, usually the name of the software that created the file. NA if not available.
See Also
read_parquet_metadata()
to read more metadata,
read_parquet_schema()
for column information.
read_parquet()
, write_parquet()
, nanoparquet-types.
Read the metadata of a Parquet file
Description
This function should work on all files, even if read_parquet()
is
unable to read them, because of an unsupported schema, encoding,
compression or other reason.
Usage
read_parquet_metadata(file, options = parquet_options())
parquet_metadata(file)
Arguments
file |
Path to a Parquet file. |
options |
Options that potentially alter the default Parquet to R type mappings, see parquet_options(). |
Value
A named list with entries:
- file_meta_data: a data frame with file meta data:
  - file_name: file name.
  - version: Parquet version, an integer.
  - num_rows: total number of rows.
  - key_value_metadata: list column of data frames with two character columns called key and value. This is the key-value metadata of the file. Arrow stores its schema here.
  - created_by: a string scalar, usually the name of the software that created the file.
- schema: data frame, the schema of the file. It has one row for each node (inner node or leaf node). For flat files this means one root node (inner node), always the first one, and then one row for each "real" column. For nested schemas, the rows are in depth-first search order. Most important columns are:
  - file_name: file name.
  - name: column name.
  - r_type: the R type that corresponds to the Parquet type. Might be NA if read_parquet() cannot read this column. See nanoparquet-types for the type mapping rules.
  - type: data type. One of the low level data types.
  - type_length: length for fixed length byte arrays.
  - repetition_type: character, one of REQUIRED, OPTIONAL or REPEATED.
  - logical_type: a list column, the logical types of the columns. An element has at least an entry called type, and potentially additional entries, e.g. bit_width, is_signed, etc.
  - num_children: number of child nodes. Should be a non-negative integer for the root node, and NA for a leaf node.
- row_groups: a data frame, information about the row groups. Some important columns:
  - file_name: file name.
  - id: row group id, an integer from zero to the number of row groups minus one.
  - total_byte_size: total uncompressed size of all column data.
  - num_rows: number of rows.
  - file_offset: where the row group starts in the file. This is optional, so it might be NA.
  - total_compressed_size: total byte size of all compressed (and potentially encrypted) column data in this row group. This is optional, so it might be NA.
  - ordinal: ordinal position of the row group in the file, starting from zero. This is optional, so it might be NA. If NA, then the order of the row groups is as they appear in the metadata.
- column_chunks: a data frame, information about all column chunks, across all row groups. Some important columns:
  - file_name: file name.
  - row_group: which row group this chunk belongs to.
  - column: which leaf column this chunk belongs to. The order is the same as in schema, but only leaf columns (i.e. columns with NA children) are counted.
  - file_path: which file the chunk is stored in. NA means the same file.
  - file_offset: where the column chunk begins in the file.
  - type: low level parquet data type.
  - encodings: encodings used to store this chunk. It is a list column of character vectors of encoding names. Current possible encodings: "PLAIN", "GROUP_VAR_INT", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT".
  - path_in_schema: list column of character vectors. It is the path from the root node; for flat schemas it is simply the column name.
  - codec: compression codec used for the column chunk. Possible values are: "UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "ZSTD".
  - num_values: number of values in this column chunk.
  - total_uncompressed_size: total uncompressed size in bytes.
  - total_compressed_size: total compressed size in bytes.
  - data_page_offset: absolute position of the first data page of the column chunk in the file.
  - index_page_offset: absolute position of the first index page of the column chunk in the file, or NA if there are no index pages.
  - dictionary_page_offset: absolute position of the first dictionary page of the column chunk in the file, or NA if there are no dictionary pages.
  - null_count: the number of missing values in the column chunk. It may be NA.
  - min_value: list column of raw vectors, the minimum value of the column, in binary. If NULL, then it is not specified. This column is experimental.
  - max_value: list column of raw vectors, the maximum value of the column, in binary. If NULL, then it is not specified. This column is experimental.
  - is_min_value_exact: whether the minimum value is an actual value of a column, or a bound. It may be NA.
  - is_max_value_exact: whether the maximum value is an actual value of a column, or a bound. It may be NA.
See Also
read_parquet_info()
for a much shorter summary.
read_parquet_schema()
for column information.
read_parquet()
to read, write_parquet()
to write Parquet files,
nanoparquet-types for the R <-> Parquet type mappings.
Examples
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet::read_parquet_metadata(file_name)
Read a page from a Parquet file
Description
Read a page from a Parquet file
Usage
read_parquet_page(file, offset)
Arguments
file |
Path to a Parquet file. |
offset |
Integer offset of the start of the page in the file. See read_parquet_pages() for a list of all pages and their offsets. |
Value
Named list. Many entries correspond to the columns of the result of read_parquet_pages(). Additional entries are:
- codec: compression codec. Possible values:
- has_repetition_levels: whether the page has repetition levels.
- has_definition_levels: whether the page has definition levels.
- schema_column: which schema column the page corresponds to. Note that only leaf columns have pages.
- data_type: low level Parquet data type. Possible values:
- repetition_type: whether the column the page belongs to is REQUIRED, OPTIONAL or REPEATED.
- page_header: the bytes of the page header in a raw vector.
- num_null: number of missing (NA) values. Only set in V2 data pages.
- num_rows: this is the same as num_values for flat tables, i.e. files without repetition levels.
- compressed_data: the data of the page in a raw vector. It includes repetition and definition levels, if any.
- data: the uncompressed data, if nanoparquet supports the compression codec of the file (GZIP and SNAPPY at the time of writing), or if the file is not compressed. In the latter case it is the same as compressed_data.
See Also
read_parquet_pages()
for a summary of all pages.
Examples
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet:::read_parquet_pages(file_name)
options(max.print = 100) # otherwise long raw vector
nanoparquet:::read_parquet_page(file_name, 4L)
Metadata of all pages of a Parquet file
Description
Metadata of all pages of a Parquet file
Usage
read_parquet_pages(file)
Arguments
file |
Path to a Parquet file. |
Details
Reading all the page headers might be slow for large files, especially if the file has many small pages.
Value
Data frame with columns:
- file_name: file name.
- row_group: id of the row group the page belongs to, an integer between 0 and the number of row groups minus one.
- column: id of the column, an integer between 0 and the number of leaf columns minus one. Note that only leaf columns are considered, as non-leaf columns do not have any pages.
- page_type: DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE or DATA_PAGE_V2.
- page_header_offset: offset of the data page (its header) in the file.
- uncompressed_page_size: does not include the page header, as per the Parquet spec.
- compressed_page_size: without the page header.
- crc: integer, checksum, if present in the file, can be NA.
- num_values: number of data values in this page, including NULL (NA in R) values.
- encoding: encoding of the page, current possible encodings: "PLAIN", "GROUP_VAR_INT", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT".
- definition_level_encoding: encoding of the definition levels, see encoding for possible values. This can be missing in V2 data pages, where they are always RLE encoded.
- repetition_level_encoding: encoding of the repetition levels, see encoding for possible values. This can be missing in V2 data pages, where they are always RLE encoded.
- data_offset: offset of the actual data in the file.
- page_header_length: size of the page header, in bytes.
See Also
read_parquet_page()
to read a page.
Examples
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet:::read_parquet_pages(file_name)
Read the schema of a Parquet file
Description
This function should work on all files, even if read_parquet()
is
unable to read them, because of an unsupported schema, encoding,
compression or other reason.
Usage
read_parquet_schema(file, options = parquet_options())
Arguments
file |
Path to a Parquet file. |
options |
Return value of parquet_options(). |
Value
Data frame, the schema of the file. It has one row for each node (inner node or leaf node). For flat files this means one root node (inner node), always the first one, and then one row for each "real" column. For nested schemas, the rows are in depth-first search order. Most important columns are:
- file_name: file name.
- name: column name.
- r_type: the R type that corresponds to the Parquet type. Might be NA if read_parquet() cannot read this column. See nanoparquet-types for the type mapping rules.
- type: data type. One of the low level data types.
- type_length: length for fixed length byte arrays.
- repetition_type: character, one of REQUIRED, OPTIONAL or REPEATED.
- logical_type: a list column, the logical types of the columns. An element has at least an entry called type, and potentially additional entries, e.g. bit_width, is_signed, etc.
- num_children: number of child nodes. Should be a non-negative integer for the root node, and NA for a leaf node.
See Also
read_parquet_metadata()
to read more metadata,
read_parquet_info()
to show only basic information.
read_parquet()
, write_parquet()
, nanoparquet-types.
RLE decode integers
Description
RLE decode integers
Usage
rle_decode_int(
x,
bit_width = attr(x, "bit_width"),
length = attr(x, "length") %||% NA
)
Arguments
x |
Raw vector of the encoded integers. |
bit_width |
Bit width used for the encoding. |
length |
Length of the output. If |
Value
The decoded integer vector.
See Also
Other encodings:
rle_encode_int()
RLE encode integers
Description
RLE encode integers
Usage
rle_encode_int(x)
Arguments
x |
Integer vector. |
Value
Raw vector, the encoded integers. It has two attributes:
- bit_length: the number of bits needed to encode the input, and
- length: length of the original integer input.
See Also
Other encodings:
rle_decode_int()
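A minimal round-trip sketch for these low-level helpers. They are accessed with ::: here in case they are not exported; the decoder's defaults assume the attributes set by the encoder:
x <- c(rep(1L, 20), rep(3L, 5), 0:3)
enc <- nanoparquet:::rle_encode_int(x)
# The attributes on `enc` (bit width, original length) let the decoder
# reconstruct the input; pass bit_width and length explicitly if your
# version names the attributes differently.
dec <- nanoparquet:::rle_decode_int(enc)
identical(dec, x)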
Write a data frame to a Parquet file
Description
Writes the contents of an R data frame into a Parquet file.
Usage
write_parquet(
x,
file,
schema = NULL,
compression = c("snappy", "gzip", "zstd", "uncompressed"),
encoding = NULL,
metadata = NULL,
row_groups = NULL,
options = parquet_options()
)
Arguments
x |
Data frame to write. |
file |
Path to the output file. If this is the string ":raw:", then the Parquet file is returned in a raw vector instead of being written to a file. |
schema |
Parquet schema. Specify a schema to tweak the default nanoparquet R -> Parquet type mappings. Use parquet_schema() to create a schema. |
compression |
Compression algorithm to use. Currently "snappy" (the default), "gzip", "zstd" and "uncompressed" are supported. |
encoding |
Encoding to use. If a specified encoding is invalid for a certain column type, or nanoparquet does not implement it, write_parquet() throws an error. See parquet-encodings for more about encodings and the possible values. |
metadata |
Additional key-value metadata to add to the file. This must be a named character vector, or a data frame with two character columns called key and value. |
row_groups |
Row groups of the Parquet file. If not specified, the number of rows per row group is taken from the num_rows_per_row_group option, see parquet_options(). |
options |
Nanoparquet options, see parquet_options(). |
Details
write_parquet()
converts string columns to UTF-8 encoding by calling
base::enc2utf8()
. It does the same for factor levels.
Value
NULL
, unless file
is ":raw:"
, in which case the Parquet
file is returned as a raw vector.
See Also
read_parquet_metadata()
, read_parquet()
.
Examples
# add row names as a column, because `write_parquet()` ignores them.
mtcars2 <- cbind(name = rownames(mtcars), mtcars)
write_parquet(mtcars2, "mtcars.parquet")
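# Write to an in-memory raw vector instead of a file (see `file = ":raw:"`).
buf <- write_parquet(mtcars2, ":raw:")
length(buf)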