Title: Read and Write 'Parquet' Files
Version: 0.3.1
Description: Self-sufficient reader and writer for flat 'Parquet' files. Can read most 'Parquet' data types. Can write many 'R' data types, including factors and temporal types. See docs for limitations.
Depends: R (≥ 4.0.0)
License: MIT + file LICENSE
URL: https://github.com/r-lib/nanoparquet, https://r-lib.github.io/nanoparquet/
BugReports: https://github.com/r-lib/nanoparquet/issues
Encoding: UTF-8
Suggests: arrow, bit64, DBI, duckdb, hms, mockery, pillar, processx, rprojroot, spelling, testthat, withr
RoxygenNote: 7.3.1
Config/testthat/edition: 3
Config/Needs/website: tidyverse/tidytemplate
Language: en-US
NeedsCompilation: yes
Packaged: 2024-07-01 11:36:06 UTC; gaborcsardi
Author: Gábor Csárdi [aut, cre], Hannes Mühleisen
Maintainer: Gábor Csárdi <csardi.gabor@gmail.com>
Repository: CRAN
Date/Publication: 2024-07-01 17:10:02 UTC
nanoparquet: Read and Write 'Parquet' Files
Description
Self-sufficient reader and writer for flat 'Parquet' files. Can read most 'Parquet' data types. Can write many 'R' data types, including factors and temporal types. See docs for limitations.
Details
nanoparquet is a reader and writer for a common subset of Parquet files.
Features:
Read and write flat (i.e. non-nested) Parquet files.
Can read most Parquet data types.
Can write many R data types, including factors and temporal types to Parquet.
Completely dependency free.
Supports Snappy, Gzip and Zstd compression.
Limitations:
Nested Parquet types are not supported.
Some Parquet logical types are not supported: FLOAT16, INTERVAL, UNKNOWN.
Only Snappy, Gzip and Zstd compression is supported.
Encryption is not supported.
Reading files from URLs is not supported.
Being single-threaded and not fully optimized, nanoparquet is probably not well suited for large data sets. It should be fine for a couple of gigabytes. Reading or writing a ~250MB file that has 32 million rows and 14 columns takes about 10-15 seconds on an M2 MacBook Pro. For larger files, use Apache Arrow or DuckDB.
Installation
Install the R package from CRAN:
install.packages("nanoparquet")
Usage
Read
Call read_parquet() to read a Parquet file:
df <- nanoparquet::read_parquet("example.parquet")
To see the columns of a Parquet file and how their types are mapped to
R types by read_parquet(), call parquet_column_types() first:
nanoparquet::parquet_column_types("example.parquet")
Folders of similar-structured Parquet files (e.g. produced by Spark) can be read like this:
df <- data.table::rbindlist(lapply(
Sys.glob("some-folder/part-*.parquet"),
nanoparquet::read_parquet
))
Write
Call write_parquet() to write a data frame to a Parquet file:
nanoparquet::write_parquet(mtcars, "mtcars.parquet")
To see how the columns of the data frame will be mapped to Parquet types
by write_parquet(), call parquet_column_types() first:
nanoparquet::parquet_column_types(mtcars)
Inspect
Call parquet_info(), parquet_column_types(), parquet_schema() or
parquet_metadata() to see various kinds of metadata from a Parquet
file:
- parquet_info() shows a basic summary of the file.
- parquet_column_types() shows the leaf columns; these are the ones that read_parquet() reads into R.
- parquet_schema() shows all columns, including non-leaf columns.
- parquet_metadata() shows the most complete metadata information: file meta data, the schema, and the row groups and column chunks of the file.
nanoparquet::parquet_info("mtcars.parquet")
nanoparquet::parquet_column_types("mtcars.parquet")
nanoparquet::parquet_schema("mtcars.parquet")
nanoparquet::parquet_metadata("mtcars.parquet")
If you find a file that should be supported but isn't, please open an issue here with a link to the file.
Options
See also ?parquet_options().
- nanoparquet.class: extra class to add to data frames returned by read_parquet(). If it is not defined, the default is "tbl", which changes how the data frame is printed if the pillar package is loaded.
- nanoparquet.use_arrow_metadata: unless this is set to FALSE, read_parquet() will make use of Arrow metadata in the Parquet file. Currently this is used to detect factor columns.
- nanoparquet.write_arrow_metadata: unless this is set to FALSE, write_parquet() will add Arrow metadata to the Parquet file. This helps preserve the classes of columns, e.g. factors will be read back as factors, both by nanoparquet and Arrow.
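As a minimal sketch of how these options are used, they are ordinary R options set with options() and picked up by subsequent calls (using the sample file that ships with the package):

```r
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
# Ignore the Arrow metadata for this session: factor columns (if any)
# are then read back as character vectors instead of factors.
options(nanoparquet.use_arrow_metadata = FALSE)
df <- nanoparquet::read_parquet(file_name)
# Restore the default afterwards.
options(nanoparquet.use_arrow_metadata = NULL)
```

The same settings can be made per-call instead, via the options argument of read_parquet() and write_parquet(), see ?parquet_options().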
License
MIT
Author(s)
Maintainer: Gábor Csárdi csardi.gabor@gmail.com
Authors:
Hannes Mühleisen (ORCID) [copyright holder]
Other contributors:
Google Inc. [copyright holder]
Apache Software Foundation [copyright holder]
Posit Software, PBC [copyright holder]
RAD Game Tools [copyright holder]
Valve Software [copyright holder]
Tenacious Software LLC [copyright holder]
Facebook, Inc. [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/r-lib/nanoparquet/issues
nanoparquet's type maps
Description
How nanoparquet maps R types to Parquet types.
R's data types
When writing out a data frame, nanoparquet maps R's data types to Parquet logical types. This is how the mapping is performed.
These rules will likely change until nanoparquet reaches version 1.0.0.
- Factors (i.e. vectors that inherit the factor class) are converted to character vectors using as.character(), then written as a STRSXP (character vector) type. The fact that a column is a factor is stored in the Arrow metadata (see below), unless the nanoparquet.write_arrow_metadata option is set to FALSE.
- Dates (i.e. the Date class) are written as the DATE logical type, which is an INT32 type internally.
- hms objects (from the hms package) are written as the TIME(true, MILLIS) logical type, which is internally the INT32 Parquet type. Sub-millisecond precision is lost.
- POSIXct objects are written as the TIMESTAMP(true, MICROS) logical type, which is internally the INT64 Parquet type. Sub-microsecond precision is lost.
- difftime objects (that are not hms objects, see above) are written as an INT64 Parquet type, noting in the Arrow metadata (see below) that this column has type Duration with NANOSECONDS unit.
- Integer vectors (INTSXP) are written as the INT(32, true) logical type, which corresponds to the INT32 type.
- Real vectors (REALSXP) are written as the DOUBLE type.
- Character vectors (STRSXP) are written as the STRING logical type, which has the BYTE_ARRAY type. They are always converted to UTF-8 before writing.
- Logical vectors (LGLSXP) are written as the BOOLEAN type.
- Other vectors currently throw an error.
You can use parquet_column_types() on a data frame to map R data types
to Parquet data types.
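As a sketch of these rules in action, the mapping for a small data frame with a factor, a Date and a POSIXct column can be inspected without writing anything:

```r
d <- data.frame(
  f = factor(c("a", "b")),
  d = as.Date("2024-01-01") + 0:1,
  t = as.POSIXct("2024-01-01 12:00:00", tz = "UTC") + 0:1
)
# One row per column; per the rules above, the factor maps to STRING,
# the Date to DATE (INT32) and the POSIXct to TIMESTAMP (INT64).
ct <- nanoparquet::parquet_column_types(d)
ct
```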
Parquet's data types
When reading a Parquet file nanoparquet also relies on logical types and the Arrow metadata (if present, see below) in addition to the low level data types. The exact rules are below.
These rules will likely change until nanoparquet reaches version 1.0.0.
- The BOOLEAN type is read as a logical vector (LGLSXP).
- The STRING logical type and the UTF8 converted type are read as a character vector with UTF-8 encoding.
- The DATE logical type and the DATE converted type are read as a Date R object.
- The TIME logical type and the TIME_MILLIS and TIME_MICROS converted types are read as an hms object, see the hms package.
- The TIMESTAMP logical type and the TIMESTAMP_MILLIS and TIMESTAMP_MICROS converted types are read as POSIXct objects. If the logical type has the UTC flag set, then the time zone of the POSIXct object is set to UTC.
- INT32 is read as an integer vector (INTSXP).
- INT64, DOUBLE and FLOAT are read as real vectors (REALSXP).
- INT96 is read as a POSIXct vector with the tzone attribute set to "UTC". It was an old convention to store time stamps as INT96 objects.
- The DECIMAL converted type (FIXED_LEN_BYTE_ARRAY or BYTE_ARRAY type) is read as a real vector (REALSXP), potentially losing precision.
- The ENUM logical type is read as a character vector.
- The UUID logical type is read as a character vector that uses the 00112233-4455-6677-8899-aabbccddeeff form.
- BYTE_ARRAY is read as a factor object if the file was written by Arrow and the original data type of the column was a factor. (See 'The Arrow metadata' below.) Otherwise BYTE_ARRAY is read as a list of raw vectors, with missing values denoted by NULL.
Other logical and converted types are read as their annotated low level types:
- INT(8, true), INT(16, true) and INT(32, true) are read as integer vectors because they are INT32 internally in Parquet.
- INT(64, true) is read as a real vector (REALSXP).
- Unsigned integer types INT(8, false), INT(16, false) and INT(32, false) are read as integer vectors (INTSXP). Large positive values may overflow into negative values; this is a known issue that we will fix.
- INT(64, false) is read as a real vector (REALSXP). Large positive values may overflow into negative values; this is a known issue that we will fix.
- FLOAT16 is a fixed length byte array, and nanoparquet reads it as a list of raw vectors. Missing values are denoted by NULL.
- INTERVAL is a fixed length byte array, and nanoparquet reads it as a list of raw vectors. Missing values are denoted by NULL.
- JSON and BSON are read as character vectors (STRSXP).
These types are not yet supported:
- Nested types (LIST, MAP) are not supported.
- The UNKNOWN logical type is not supported.
You can use the parquet_column_types() function to see how R would read
the columns of a Parquet file. Look at the r_type column.
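For example, write a small file and look at the r_type column of the result (a sketch):

```r
tmp <- tempfile(fileext = ".parquet")
nanoparquet::write_parquet(
  data.frame(i = 1:3, x = c(1.5, 2.5, 3.5), s = c("a", "b", "c")),
  tmp
)
ct <- nanoparquet::parquet_column_types(tmp)
# r_type is the R type read_parquet() will produce for each column,
# e.g. "integer" for INT32, "double" for DOUBLE, "character" for STRING.
ct[, c("name", "type", "r_type")]
```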
The Arrow metadata
Apache Arrow (i.e. the arrow R package) adds additional metadata to
Parquet files when writing them in arrow::write_parquet(). Then,
when reading the file in arrow::read_parquet(), it uses this metadata
to recreate the same Arrow and R data types as before writing.
nanoparquet::write_parquet() also adds the Arrow metadata to Parquet
files, unless the nanoparquet.write_arrow_metadata option is set to
FALSE.
Similarly, nanoparquet::read_parquet() uses the Arrow metadata in the
Parquet file (if present), unless the nanoparquet.use_arrow_metadata
option is set to FALSE.
The Arrow metadata is stored in the file level key-value metadata, with
key ARROW:schema.
Currently nanoparquet uses the Arrow metadata for two things:
It uses it to detect factors. Without the Arrow metadata factors are read as string vectors.
It uses it to detect difftime objects. Without the Arrow metadata these are read as INT64 columns, containing the time difference in nanoseconds.
See Also
nanoparquet-package for options that modify the type mappings.
Map between R and Parquet data types
Description
This function works two ways. It can map the R types of a data frame to
Parquet types, to see how write_parquet() would write out the data
frame. It can also map the types of a Parquet file to R types, to see
how read_parquet() would read the file into R.
Usage
parquet_column_types(x, options = parquet_options())
Arguments
x: Path to a Parquet file, or a data frame.
options: Nanoparquet options, see parquet_options().
Value
Data frame with columns:
- file_name: file name.
- name: column name.
- type: (low level) Parquet data type.
- r_type: the R type that corresponds to the Parquet type. Might be NA if read_parquet() cannot read this column. See nanoparquet-types for the type mapping rules.
- repetition_type: whether the column is REQUIRED (cannot be NA) or OPTIONAL (may be NA). REPEATED columns are not currently supported by nanoparquet.
- logical_type: Parquet logical type in a list column. An element has at least an entry called type, and potentially additional entries, e.g. bit_width, is_signed, etc.
See Also
parquet_metadata() to read more metadata,
parquet_info() for a very short summary.
parquet_schema() for the complete Parquet schema.
read_parquet(), write_parquet(), nanoparquet-types.
Short summary of a Parquet file
Description
Short summary of a Parquet file
Usage
parquet_info(file)
Arguments
file: Path to a Parquet file.
Value
Data frame with columns:
- file_name: file name.
- num_cols: number of (leaf) columns.
- num_rows: number of rows.
- num_row_groups: number of row groups.
- file_size: file size in bytes.
- parquet_version: Parquet version.
- created_by: a string scalar, usually the name of the software that created the file. NA if not available.
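A minimal example, using the sample file shipped with the package:

```r
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
# A single row describing the whole file: column and row counts,
# row groups, file size, Parquet version, and the creating software.
info <- nanoparquet::parquet_info(file_name)
info
```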
See Also
parquet_metadata() to read more metadata,
parquet_column_types() and parquet_schema() for column information.
read_parquet(), write_parquet(), nanoparquet-types.
Read the metadata of a Parquet file
Description
This function should work on all files, even if read_parquet() is
unable to read them, because of an unsupported schema, encoding,
compression or other reason.
Usage
parquet_metadata(file)
Arguments
file: Path to a Parquet file.
Value
A named list with entries:
- file_meta_data: a data frame with file meta data:
  - file_name: file name.
  - version: Parquet version, an integer.
  - num_rows: total number of rows.
  - key_value_metadata: list column of data frames with two character columns called key and value. This is the key-value metadata of the file. Arrow stores its schema here.
  - created_by: a string scalar, usually the name of the software that created the file.
- schema: data frame, the schema of the file. It has one row for each node (inner node or leaf node). For flat files this means one root node (inner node), always the first one, and then one row for each "real" column. For nested schemas, the rows are in depth-first search order. Most important columns are:
  - file_name: file name.
  - name: column name.
  - type: data type. One of the low level data types.
  - type_length: length for fixed length byte arrays.
  - repetition_type: character, one of REQUIRED, OPTIONAL or REPEATED.
  - logical_type: a list column, the logical types of the columns. An element has at least an entry called type, and potentially additional entries, e.g. bit_width, is_signed, etc.
  - num_children: number of child nodes. Should be a non-negative integer for the root node, and NA for a leaf node.
- row_groups: a data frame, information about the row groups.
- column_chunks: a data frame, information about all column chunks, across all row groups. Some important columns:
  - file_name: file name.
  - row_group: which row group this chunk belongs to.
  - column: which leaf column this chunk belongs to. The order is the same as in schema, but only leaf columns (i.e. columns with NA children) are counted.
  - file_path: which file the chunk is stored in. NA means the same file.
  - file_offset: where the column chunk begins in the file.
  - type: low level Parquet data type.
  - encodings: encodings used to store this chunk. It is a list column of character vectors of encoding names. Current possible encodings: "PLAIN", "GROUP_VAR_INT", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT".
  - path_in_schema: list column of character vectors. It is simply the path from the root node. For flat schemas it is simply the column name.
  - codec: compression codec used for the column chunk. Possible values are: "UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "ZSTD".
  - num_values: number of values in this column chunk.
  - total_uncompressed_size: total uncompressed size in bytes.
  - total_compressed_size: total compressed size in bytes.
  - data_page_offset: absolute position of the first data page of the column chunk in the file.
  - index_page_offset: absolute position of the first index page of the column chunk in the file, or NA if there are no index pages.
  - dictionary_page_offset: absolute position of the first dictionary page of the column chunk in the file, or NA if there are no dictionary pages.
See Also
parquet_info() for a much shorter summary.
parquet_column_types() and parquet_schema() for column information.
read_parquet() to read, write_parquet() to write Parquet files,
nanoparquet-types for the R <-> Parquet type mappings.
Examples
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet::parquet_metadata(file_name)
Nanoparquet options
Description
Create a list of nanoparquet options.
Usage
parquet_options(
class = getOption("nanoparquet.class", "tbl"),
use_arrow_metadata = getOption("nanoparquet.use_arrow_metadata", TRUE),
write_arrow_metadata = getOption("nanoparquet.write_arrow_metadata", TRUE)
)
Arguments
class: The extra class or classes to add to data frames created in read_parquet(). Defaults to the nanoparquet.class option, which is "tbl" if unset.
use_arrow_metadata: If this option is TRUE (the default), read_parquet() makes use of the Arrow metadata in the Parquet file, e.g. to detect factor columns.
write_arrow_metadata: Whether to add the Apache Arrow types as metadata to the file when writing.
Value
List of nanoparquet options.
Examples
# the effect of using Arrow metadata
tmp <- tempfile(fileext = ".parquet")
d <- data.frame(
fct = as.factor("a"),
dft = as.difftime(10, units = "secs")
)
write_parquet(d, tmp)
read_parquet(tmp, options = parquet_options(use_arrow_metadata = TRUE))
read_parquet(tmp, options = parquet_options(use_arrow_metadata = FALSE))
Metadata of all pages of a Parquet file
Description
Metadata of all pages of a Parquet file
Usage
parquet_pages(file)
Arguments
file: Path to a Parquet file.
Details
Reading all the page headers might be slow for large files, especially if the file has many small pages.
Value
Data frame with columns:
- file_name: file name.
- row_group: id of the row group the page belongs to, an integer between 0 and the number of row groups minus one.
- column: id of the column, an integer between 0 and the number of leaf columns minus one. Note that only leaf columns are considered, as non-leaf columns do not have any pages.
- page_type: DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE or DATA_PAGE_V2.
- page_header_offset: offset of the data page (its header) in the file.
- uncompressed_page_size: does not include the page header, as per the Parquet spec.
- compressed_page_size: without the page header.
- crc: integer, checksum, if present in the file; can be NA.
- num_values: number of data values in this page, including NULL (NA in R) values.
- encoding: encoding of the page; current possible encodings: "PLAIN", "GROUP_VAR_INT", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT".
- definition_level_encoding: encoding of the definition levels, see encoding for possible values. This can be missing in V2 data pages, where they are always RLE encoded.
- repetition_level_encoding: encoding of the repetition levels, see encoding for possible values. This can be missing in V2 data pages, where they are always RLE encoded.
- data_offset: offset of the actual data in the file.
- page_header_length: size of the page header, in bytes.
See Also
read_parquet_page() to read a page.
Examples
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet:::parquet_pages(file_name)
Read the schema of a Parquet file
Description
This function should work on all files, even if read_parquet() is
unable to read them, because of an unsupported schema, encoding,
compression or other reason.
Usage
parquet_schema(file)
Arguments
file: Path to a Parquet file.
Value
Data frame, the schema of the file. It has one row for each node (inner node or leaf node). For flat files this means one root node (inner node), always the first one, and then one row for each "real" column. For nested schemas, the rows are in depth-first search order. Most important columns are:
- file_name: file name.
- name: column name.
- type: data type. One of the low level data types.
- type_length: length for fixed length byte arrays.
- repetition_type: character, one of REQUIRED, OPTIONAL or REPEATED.
- logical_type: a list column, the logical types of the columns. An element has at least an entry called type, and potentially additional entries, e.g. bit_width, is_signed, etc.
- num_children: number of child nodes. Should be a non-negative integer for the root node, and NA for a leaf node.
See Also
parquet_metadata() to read more metadata,
parquet_column_types() to show the columns R would read,
parquet_info() to show only basic information.
read_parquet(), write_parquet(), nanoparquet-types.
Read a Parquet file into a data frame
Description
Converts the contents of the named Parquet file to an R data frame.
Usage
read_parquet(file, options = parquet_options())
Arguments
file: Path to a Parquet file.
options: Nanoparquet options, see parquet_options().
Value
A data.frame with the file's contents.
See Also
See write_parquet() to write Parquet files,
nanoparquet-types for the R <-> Parquet type mapping.
See parquet_info(), for general information,
parquet_column_types() and parquet_schema() for information about the
columns, and parquet_metadata() for the complete metadata.
Examples
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
parquet_df <- nanoparquet::read_parquet(file_name)
print(str(parquet_df))
Read a page from a Parquet file
Description
Read a page from a Parquet file
Usage
read_parquet_page(file, offset)
Arguments
file: Path to a Parquet file.
offset: Integer offset of the start of the page in the file. See parquet_pages() for the offsets of all pages in a file.
Value
Named list. Many entries correspond to the columns of
the result of parquet_pages(). Additional entries are:
- codec: compression codec.
- has_repetition_levels: whether the page has repetition levels.
- has_definition_levels: whether the page has definition levels.
- schema_column: which schema column the page corresponds to. Note that only leaf columns have pages.
- data_type: low level Parquet data type.
- repetition_type: whether the column the page belongs to is REQUIRED, OPTIONAL or REPEATED.
- page_header: the bytes of the page header in a raw vector.
- num_null: number of missing (NA) values. Only set in V2 data pages.
- num_rows: this is the same as num_values for flat tables, i.e. files without repetition levels.
- compressed_data: the data of the page in a raw vector. It includes repetition and definition levels, if any.
- data: the uncompressed data, if nanoparquet supports the compression codec of the file (GZIP and SNAPPY at the time of writing), or if the file is not compressed. In the latter case it is the same as compressed_data.
See Also
parquet_pages() for a summary of all pages.
Examples
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet:::parquet_pages(file_name)
options(max.print = 100) # otherwise long raw vector
nanoparquet:::read_parquet_page(file_name, 4L)
RLE decode integers
Description
RLE decode integers
Usage
rle_decode_int(
x,
bit_width = attr(x, "bit_width"),
length = attr(x, "length") %||% NA
)
Arguments
x: Raw vector of the encoded integers.
bit_width: Bit width used for the encoding.
length: Length of the output, if known.
Value
The decoded integer vector.
See Also
Other encodings:
rle_encode_int()
RLE encode integers
Description
RLE encode integers
Usage
rle_encode_int(x)
Arguments
x: Integer vector.
Value
Raw vector, the encoded integers. It has two attributes:
- bit_width: the number of bits needed to encode the input, and
- length: the length of the original integer input.
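A round-trip sketch. These are low-level helpers, so this assumes they are exported in your version (otherwise use the nanoparquet::: form), and that rle_decode_int() can pick up the bit width from the attribute set by the encoder, as its default argument suggests:

```r
x <- c(1L, 1L, 1L, 2L, 2L, 3L)
enc <- nanoparquet::rle_encode_int(x)
# Decode, taking the bit width from the attribute on `enc` and
# passing the output length explicitly.
dec <- nanoparquet::rle_decode_int(enc, length = length(x))
dec
```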
See Also
Other encodings:
rle_decode_int()
Write a data frame to a Parquet file
Description
Writes the contents of an R data frame into a Parquet file.
Usage
write_parquet(
x,
file,
compression = c("snappy", "gzip", "zstd", "uncompressed"),
metadata = NULL,
options = parquet_options()
)
Arguments
x: Data frame to write.
file: Path to the output file. If this is the string ":raw:", then the file is written to a memory buffer and returned as a raw vector.
compression: Compression algorithm to use. Currently "snappy" (the default), "gzip", "zstd" and "uncompressed" are supported.
metadata: Additional key-value metadata to add to the file. This must be a named character vector, or a data frame with two character columns called key and value.
options: Nanoparquet options, see parquet_options().
Details
write_parquet() converts string columns to UTF-8 encoding by calling
base::enc2utf8(). It does the same for factor levels.
Value
NULL, unless file is ":raw:", in which case the Parquet
file is returned as a raw vector.
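For example, to get the file as an in-memory raw vector instead of writing to disk:

```r
buf <- nanoparquet::write_parquet(mtcars, ":raw:")
# `buf` holds a complete Parquet file in memory; it can be written
# out later with writeBin(), or served directly over a connection.
length(buf)
```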
See Also
parquet_metadata(), read_parquet().
Examples
# add row names as a column, because `write_parquet()` ignores them.
mtcars2 <- cbind(name = rownames(mtcars), mtcars)
write_parquet(mtcars2, "mtcars.parquet")