Simple wrapper to convert Athena backend file types

Utilises AWS Athena to convert AWS S3 backend file types. It also also to create more efficient file types i.e. "parquet" and "orc" from SQL queries.

dbConvertTable(conn, obj, name, ...)

# S4 method for AthenaConnection
dbConvertTable(
  conn,
  obj,
  name,
  partition = NULL,
  s3.location = NULL,
  file.type = c("NULL", "csv", "tsv", "parquet", "json", "orc"),
  compress = TRUE,
  data = TRUE,
  ...
)

Arguments

conn: An AthenaConnection object, produced by [DBI::dbConnect()]
obj: Athena table or SQL DML query to be converted. For SQL, the query need to be wrapped with DBI::SQL() and follow AWS Athena DML format link
name: Name of destination table
...: Extra parameters, currently not used
partition: Partition Athena table
s3.location: location to store output file, must be in s3 uri format for example ("s3://mybucket/data/").
file.type: File type for name, currently support ["NULL","csv", "tsv", "parquet", "json", "orc"]. "NULL" will let Athena set the file type for you.
compress: Compress name, currently can only compress ["parquet", "orc"] (AWS Athena CTAS)
data: If name should be created with data or not.

Value

dbConvertTable() returns TRUE but invisible.

Examples

if (FALSE) {
# Note:
# - Require AWS Account to run below example.
# - Different connection methods can be used please see `RAthena::dbConnect` documnentation

library(DBI)
library(RAthena)

# Demo connection to Athena using profile name
con <- dbConnect(athena())

# write iris table to Athena in defualt delimited format
dbWriteTable(con, "iris", iris)

# convert delimited table to parquet
dbConvertTable(con,
  obj = "iris",
  name = "iris_parquet",
  file.type = "parquet"
)

# Create partitioned table from non-partitioned
# iris table using SQL DML query
dbConvertTable(con,
  obj = SQL("select
                            iris.*,
                            date_format(current_date, '%Y%m%d') as time_stamp
                          from iris"),
  name = "iris_orc_partitioned",
  file.type = "orc",
  partition = "time_stamp"
)

# disconnect from Athena
dbDisconnect(con)
}