NEWS.md
* `AWS_ROLE_ARN`. This caused confusion when connecting through web identity (#177).
* Support `dbplyr::in_catalog` when working with `dplyr::tbl` (#178).
* Data scanned could be reported as a negative value, e.g. `INFO: (Data scanned: -43839744 Bytes)`.
* New `clear_s3_resource` parameter for `RAthena_options` to prevent the AWS Athena output AWS S3 resource being cleared up by `dbClearResult` (#168). Thanks to @juhoautio for the request.
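For illustration, a minimal sketch of how this option could be set (the value shown is illustrative, not a recommendation):

```r
library(RAthena)

# keep the AWS S3 output of each query instead of letting dbClearResult remove it
RAthena_options(clear_s3_resource = FALSE)
```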
* `boto3.session.Session` class and `client` method (#169).
* New `endpoint_override` parameter allows the default endpoint for each service to be overridden accordingly (#169). Thanks to @aoyh for the request and for checking the package in development.
* `RAthena_options` can change one parameter at a time without affecting other pre-configured settings.
* `retry_quiet` parameter in the `RAthena_options` function.
* Add support for the `dbplyr` 2.0.0 backend API.
* Allow `dplyr` to benefit from AWS Athena `unload` methods (noctua #174).
* `dbGetQuery`, `dbExecute`, `dbSendQuery` and `dbSendStatement` work on older versions of R (noctua #170). Thanks to @tyner for identifying the issue.
* Support `AWS Athena UNLOAD` (noctua #160). This is to take advantage of the read/write speed `parquet` has to offer:
```python
import awswrangler as wr
import getpass

bucket = getpass.getpass()
path = f"s3://{bucket}/data/"

if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")

cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])  # Read 10 files from the 1890 decade (~1GB)

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="awswrangler_test",
    table="noaa"
);

wr.catalog.table(database="awswrangler_test", table="noaa")
```
```r
library(DBI)

con <- dbConnect(RAthena::athena())

# Query ran using CSV output
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa")
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  57.004   8.430 160.567

RAthena::RAthena_options(cache_size = 1)

# Query ran using UNLOAD Parquet output
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  21.622   2.350  39.232

# Query ran using cache
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  13.738   1.886  11.029
```
* `sql_translate_env` correctly translates the R functions `quantile` and `median` to their AWS Athena equivalents (noctua #153). Thanks to @ellmanj for spotting the issue.
* Support the AWS Athena `timestamp with time zone` data type.
* Support R `list` columns when converting data to AWS Athena `SQL` format:
```r
library(data.table)
library(DBI)

x = 5

dt = data.table(
  var1 = sample(LETTERS, size = x, T),
  var2 = rep(list(list("var3" = 1:3, "var4" = list("var5" = letters[1:5]))), x)
)

con <- dbConnect(RAthena::athena())

#> Version: 2.2.0
sqlData(con, dt)

# Registered S3 method overwritten by 'jsonify':
#   method     from
#   print.json jsonlite
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
#    var1                                                    var2
# 1:    1 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 2:    2 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 3:    3 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 4:    4 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 5:    5 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}

#> Version: 2.1.0
sqlData(con, dt)

# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
#    var1                                        var2
# 1:    1 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 2:    2 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 3:    3 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 4:    4 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 5:    5 1:3|list(var5 = c("a", "b", "c", "d", "e"))
```
v2.2.0 now converts lists into JSON lines format so that AWS Athena can parse them with SQL `array`/`mapping`/`json` functions. A small downside is that an S3 method conflict occurs when `jsonify` is called to convert lists into JSON lines. `jsonify` was chosen over `jsonlite` due to its performance improvements (noctua #156).
* `dbIsValid` wrongly stated that a connection was valid for the result class when the connection class had been disconnected.
* `sql_translate_env.paste` broke with the latest version of `dbplyr`. The new method is compatible with `dbplyr >= 1.4.3` (noctua #149).
* `sql_translate_env`: add support for `stringr`/`lubridate` style functions, similar to the Postgres backend.
* `dbConnect`: add a `timezone` parameter so that the time zone between R and AWS Athena is consistent (noctua #149).
* `AthenaConnection` class: `ptr` and `info` slots changed from `list` to `environment` within the `AthenaConnection` class. This allows the class to be updated by reference and simplifies notation when viewing the class from the RStudio environment tab.
* `AthenaResult` class: `info` slot changed from `list` to `environment`. This allows the class to be updated by reference.
* By utilising environments for `AthenaConnection` and `AthenaResult`, all `AthenaResult` classes created from an `AthenaConnection` will point to the same `ptr` and `info` environments for its connection. Previously `ptr` and `info` would be copied, which meant that modifying one would not affect the child or parent class, for example:
```r
# Old Method
library(DBI)

con <- dbConnect(RAthena::athena(),
                 rstudio_conn_tab = F)

res <- dbExecute(con, "select 'helloworld'")

# modifying parent class to influence child
con@info$made_up <- "helloworld"

# nothing happened
res@connection@info$made_up
# > NULL

# modifying child class to influence parent
res@connection@info$made_up <- "oh no!"

# nothing happened
con@info$made_up
# > "helloworld"
```

```r
# New Method
library(DBI)

con <- dbConnect(RAthena::athena(),
                 rstudio_conn_tab = F)

res <- dbExecute(con, "select 'helloworld'")

# modifying parent class to influence child
con@info$made_up <- "helloworld"

# picked up change
res@connection@info$made_up
# > "helloworld"

# modifying child class to influence parent
res@connection@info$made_up <- "oh no!"

# picked up change
con@info$made_up
# > "oh no!"
```
* Support the AWS Athena data types `[array, row, map, json, binary, ipaddress]` (noctua #135). Conversion types can be changed through `dbConnect` and `RAthena_options`:
```r
library(DBI)
library(RAthena)

# default conversion methods
con <- dbConnect(RAthena::athena())

# change json conversion method
RAthena_options(json = "character")
RAthena:::athena_option_env$json
# [1] "character"

# change json conversion to custom method
RAthena_options(json = jsonify::from_json)
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

# change bigint conversion without affecting custom json conversion methods
RAthena_options(bigint = "numeric")
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

RAthena:::athena_option_env$bigint
# [1] "numeric"

# change binary conversion without affecting bigint or json methods
RAthena_options(binary = "character")
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

RAthena:::athena_option_env$bigint
# [1] "numeric"

RAthena:::athena_option_env$binary
# [1] "character"

# no conversion for json objects
con2 <- dbConnect(RAthena::athena(), json = "character")

# use custom json parser
con <- dbConnect(RAthena::athena(), json = jsonify::from_json)
```
* New `rstudio_conn_tab` parameter within `dbConnect`.
* AWS Athena uses the `float` data type for DDL only, and `RAthena` was wrongly parsing the `float` data type back into R. AWS Athena instead uses the data type `real` in SQL functions such as `select cast` (https://docs.aws.amazon.com/athena/latest/ug/data-types.html). `RAthena` now correctly parses `real` to R's data type `double` (noctua #133).
* Paginate over `AWS` returns to get all results from the `AWS Glue` catalogue (noctua #137).
* `dbGetPartition` gains a `.format` option that simply tidies up the default AWS Athena partition format:
```r
library(DBI)
library(RAthena)

con <- dbConnect(athena())

dbGetPartition(con, "test_df2", .format = T)
# Info: (Data scanned: 0 Bytes)
#    year month day
# 1: 2020    11  17

dbGetPartition(con, "test_df2")
# Info: (Data scanned: 0 Bytes)
#                   partition
# 1: year=2020/month=11/day=17
```
* `bigint` handling now aligns with other DBI interfaces, i.e. `RPostgres`. `bigint` can now be returned in the following formats: ["integer64", "integer", "numeric", "character"]:

```r
library(DBI)

con <- dbConnect(RAthena::athena(), bigint = "numeric")
```
When switching between the different file parsers, the `bigint` representation changes to match the file parser, i.e. `data.table`: "integer64" -> `vroom`: "I".
* `dbRemoveTable`: check if the key has "." or ends with "/" before adding "/" to the end (noctua #125).
* `sql_escape_date` added into the `dplyr_integration.R` backend (#121). Thanks to @OssiLehtinen for developing the Athena date translation.
* Allow `RAthena` to append to a static AWS S3 location using a uuid.
* `use_deprecated_int96_timestamps` set to `TRUE`. This puts the POSIXct data type into a `java.sql.Timestamp`-compatible format, such as `yyyy-MM-dd HH:mm:ss[.f...]`. Thanks to Christian N Wolz for highlighting this issue.
* `s3_upload_location` simplified how the S3 location is built. Now the `s3.location` parameter isn't affected and instead only additional components, e.g. name, schema and partition, are added.
* The `dbplyr` v2.0.0 function `in_schema` now wraps strings in quotes, which breaks `db_query_fields.AthenaConnection`. Now `db_query_fields.AthenaConnection` removes any quotation from the string so that it can search `AWS GLUE` for table metadata (noctua #117).
* `RAthena` would return a `data.frame` for utility `SQL` queries regardless of the backend file parser. This is due to AWS Athena outputting `SQL UTILITY` queries as a text file that has to be read in line by line. Now `RAthena` returns the correct data format based on the file parser set in `RAthena_options`, for example: `RAthena_options("vroom")` will return `tibbles`.
* `dbClearResult` when the user doesn't have permission to delete AWS S3 objects (noctua #96).
* `RAthena_options` contains 2 new parameters to control how `RAthena` handles retries.
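A minimal sketch of how the retry settings might be adjusted; the `retry` parameter name and the values shown are assumptions for illustration (`retry_quiet` is mentioned above):

```r
library(RAthena)

# "retry" (number of retries) is an assumed parameter name for illustration
RAthena_options(retry = 5, retry_quiet = TRUE)
```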
* `dbFetch` is able to return data from AWS Athena in chunks. This has been achieved by passing `NextToken` to the `AthenaResult` S4 class. This method won't be as fast as `n = -1`, as each chunk has to be processed into data frame format:
```r
library(DBI)

con <- dbConnect(RAthena::athena())

res <- dbExecute(con, "select * from some_big_table limit 10000")
dbFetch(res, 5000)
```
* `dbWriteTable` opts to use `alter table` instead of the standard `msck repair table`. This is to improve performance when appending to tables with a high number of existing partitions.
* `dbWriteTable` now allows json to be appended to json DDLs created with the Openx-JsonSerDe library.
* `dbConvertTable` brings `dplyr::compute` functionality to the base package, allowing `RAthena` to use the power of AWS Athena to convert tables and queries to more efficient file formats in AWS S3 (#37).
* `dplyr::compute` updated to give the same functionality as `dbConvertTable`.
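For illustration, a minimal sketch of what a `dbConvertTable` call could look like; the argument names (`obj`, `name`, `file.type`) and values are assumptions, not confirmed from the source:

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# convert an existing Athena table into a Parquet-backed table
# (argument names are assumptions for illustration)
dbConvertTable(con,
               obj = "iris",
               name = "iris_parquet",
               file.type = "parquet")
```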
* The error message raised when `boto3` is not detected has been updated. This is due to several users not being sure how to get `RAthena` set up:

```r
stop("Boto3 is not detected please install boto3 using either: `pip install boto3 numpy` in terminal or `install_boto()`.",
     "\nIf this doesn't work please set the python you are using with `reticulate::use_python()` or `reticulate::use_condaenv()`",
     call. = FALSE)
```
* `region_name` check before making a connection to AWS Athena (#110).
* `dbWriteTable` would throw a `throttling error` every now and again; `retry_api_call` has been built to handle the passing of data between R and AWS S3.
* `dbWriteTable` did not clear down all metadata when uploading to AWS Athena.
* `dbWriteTable` added support for DDL structures for users who have created DDLs outside of `RAthena`.
* `RAthena` retry functionality.
* Examples wrapped in `\dontrun` (#108).
* Similar to `pyathena`, `RAthena_options` now has a new parameter `cache_size`. This implements local caching in R environments instead of using AWS `list_query_executions`. This is down to `dbClearResult` clearing S3's Athena output when caching isn't disabled.
* `RAthena_options` now has a `clear_cache` parameter to clear down all cached data.
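A minimal sketch of toggling the caching options described above (the `cache_size` value is arbitrary):

```r
library(RAthena)

# cache up to 10 query executions locally
RAthena_options(cache_size = 10)

# clear down all cached data
RAthena_options(clear_cache = TRUE)
```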
* `dbRemoveTable` now utilises `AWS Glue` to remove tables from the `AWS Glue` catalogue. This gives a performance enhancement:
```r
library(DBI)

con = dbConnect(RAthena::athena())

# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)

# Athena method
system.time(dbRemoveTable(con, "iris2", confirm = T))
#   user  system elapsed
#  0.131   0.037   2.404

# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)

# Glue method
system.time(dbRemoveTable(con, "iris2", confirm = T))
#   user  system elapsed
#  0.065   0.009   1.303
```
* `dbWriteTable` now supports uploading the JSON lines (http://jsonlines.org/) format to AWS Athena (#88):
```r
library(DBI)

con = dbConnect(RAthena::athena())

dbWriteTable(con, "iris2", iris, file.type = "json")

dbGetQuery(con, "select * from iris2")
```
* `dbWriteTable`: when appending to an existing table, the compress file type was incorrectly returned.
* `install_boto` added `numpy` to the `RAthena` environment install, as `reticulate` appears to favour environments with `numpy` (https://github.com/rstudio/reticulate/issues/216).
* The `Rstudio connection tab` ran into an issue when a Glue Table wasn't stored correctly (#92).
* `fwrite` (>= 1.12.4): https://github.com/Rdatatable/data.table/blob/master/NEWS.md
* `sql_translate_env` now translates dates correctly (#44):
```r
# Before
dbplyr::translate_sql("2019-01-01", con = con)
# '2019-01-01'

# Now
dbplyr::translate_sql("2019-01-01", con = con)
# DATE '2019-01-01'
```
* `paste`/`paste0` would use the default `dplyr` sql-translate-env (`concat_ws`). `paste0` now uses Presto's `concat` function and `paste` now uses pipes to get extra flexibility for custom separating values:
```r
# R code:
paste("hi", "bye", sep = "-")

# SQL translation:
('hi'||'-'||'bye')
```
* If `append` is set to `TRUE` then the existing `s3.location` will be utilised (#73).
* `db_compute` returned the table name, however this caused an issue when a user wished to write the table to another location (#74). An error would be raised: `Error: SYNTAX_ERROR: line 2:6: Table awsdatacatalog.default.temp.iris does not exist`. This has now been fixed, with `db_compute` returning `dbplyr::in_schema`:
```r
library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena())

tbl(con, "iris") %>%
  compute(name = "temp.iris")
```
* `dbListFields` didn't display partitioned columns. This has now been fixed: the call to AWS Glue has been altered to include more metadata, allowing column names and partitions to be returned.
* `dbListFields`
* `RAthena_options`
* `vroom` has been restricted to >= 1.2.0 due to integer64 support and changes to the `vroom` api.
* `dbStatistics` is a wrapper around the `boto3` `get_query_execution` call to return statistics for `RAthena::dbSendQuery` results (#67).
* `dbGetQuery` has a new parameter `statistics` to print out `dbStatistics` before returning Athena results (#67).
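For illustration, a minimal sketch of both options (the table name is illustrative and `statistics` is assumed to be a logical flag):

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# return query statistics for a result object
res <- dbSendQuery(con, "select * from iris2")
dbStatistics(res)

# print statistics before returning the query results
df <- dbGetQuery(con, "select * from iris2", statistics = TRUE)
```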
* `s3.location` now follows the new syntax `s3://bucket/{schema}/{table}/{partition}/{table_file}` to align with `Pyathena` and to allow tables with the same name but in different schemas to be uploaded to S3 (#73).
* `dplyr::tbl` is faster when calling Athena using the ident method (noctua #64):
```r
library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena())

# ident method:
t1 <- system.time(tbl(con, "iris"))

# sub query method:
t2 <- system.time(tbl(con, sql("select * from iris")))

# ident method
#   user  system elapsed
#  0.082   0.012   0.288

# sub query method
#   user  system elapsed
#  0.993   0.138   3.660
```
* The backend file parser can now be switched from `data.table` to `vroom`. From now on it is possible to change the file parser using `RAthena_options`, for example:
```r
library(RAthena)

RAthena_options("vroom")
```
* New `dbGetTables` method that returns the Athena hierarchy as a data.frame.
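A minimal sketch of the call:

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# Athena hierarchy (e.g. schema / table names) as a data.frame
dbGetTables(con)
```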
* The `dbWriteTable` `append` parameter checks and uses the existing AWS Athena DDL file type. If `file.type` doesn't match the Athena DDL file type then the user will receive a warning message:

```r
warning('Appended `file.type` is not compatible with the existing Athena DDL file type and has been converted to "', File.Type, '".', call. = FALSE)
```
* `tolower` conversion removed due to request #41.
* `dbRemoveTable` can now remove the S3 files for the AWS Athena table being removed.
* `as.character` was getting wrongly translated (#45).
* `INTEGER` was being incorrectly translated in `sql_translate_env.R`.
* `dbWriteTable` will now split `gzip` compressed files to improve AWS Athena performance. By default `gzip` compressed files will be split into 20.

Performance results:
```r
library(DBI)

X <- 1e8

df <- data.frame(w = runif(X),
                 x = 1:X,
                 y = sample(letters, X, replace = T),
                 z = sample(c(TRUE, FALSE), X, replace = T))

con <- dbConnect(RAthena::athena())

# upload dataframe with different splits
dbWriteTable(con, "test_split1", df, compress = T, max.batch = nrow(df), overwrite = T)        # no splits
dbWriteTable(con, "test_split2", df, compress = T, max.batch = 0.05 * nrow(df), overwrite = T) # 20 splits
dbWriteTable(con, "test_split3", df, compress = T, max.batch = 0.1 * nrow(df), overwrite = T)  # 10 splits
```
AWS Athena performance results from the AWS console (query executed: `select count(*) from ....`):
```r
library(DBI)

X <- 1e8

df <- data.frame(w = runif(X),
                 x = 1:X,
                 y = sample(letters, X, replace = T),
                 z = sample(c(TRUE, FALSE), X, replace = T))

con <- dbConnect(RAthena::athena())

# default will now split the compressed file into 20 equal size files
dbWriteTable(con, "test_split1", df, compress = T, overwrite = T)
```
* Added an information message to inform the user about what files have been added to the S3 location if the user is overwriting an Athena table.
* `dbWriteTable` now correctly writes `POSIXct` to Athena. This class was being converted incorrectly and AWS Athena would return NA instead. `RAthena` will now correctly convert `POSIXct` to timestamp and will also correctly read timestamps back into `POSIXct`.
* Support for `NA` in string format. Previously `RAthena` would return `NA` in string class as `""`; this has now been fixed.
* `RAthena` would translate output into a vector with the current method when `dbFetch` is called with `n = 0`.
* Added a custom `sql_translate_env`. Previously `RAthena` would take the default `dplyr::sql_translate_env`; now `RAthena` has a custom method that uses data types from https://docs.aws.amazon.com/athena/latest/ug/data-types.html and window functions from https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.html.
* The `s3.location` parameter in `dbWriteTable` can now be made nullable.
* The `sqlCreateTable` info message will now only inform the user if colnames have changed, and will display the column names that have changed.
* `upload_data` has been rebuilt; the old "horrible" if statement with `paste` has been removed and the function now relies on `sprintf` to construct the S3 location path. This method is a lot clearer in how the S3 location is created and it enables `dbWriteTable` to be simplified. `dbWriteTable` can now upload data to the default s3_staging directory created in `dbConnect`, which simplifies `dbWriteTable` to:
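A minimal sketch of the simplified call, assuming the staging directory was supplied in `dbConnect` (the `iris` table name is illustrative):

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# no s3.location required: data is uploaded to the default s3_staging directory
dbWriteTable(con, "iris", iris)
```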
* `data.table::fread`: this enables data types to be read in correctly and does not require a second stage to convert data types once the data has been read into R.
* `data.table::fread` and `data.table::fwrite` have been disabled.
* Removed `util` functions from the namespace: `write.table`, `read.csv`.
* Added `data.table` to the namespace.
* `bigint` values are converted into R's `bit64::integer64` and vice versa.
* `dbConnect` method.
* `dbFetch` with chunk sizes between 0 and 999: fixed an error where the `for loop` would return an error instead of breaking.
* `py_error` function: set the `call.` parameter to `FALSE`.
* `AthenaQuery` S4 class changed to `AthenaResult`.
* `dbFetch` added datatype collection.
* `dbFetch` replaced the S3 search for the query key with the output location from Athena.
* `dbClearResult` changed error handling to return the Python error as a warning, to warn that the user doesn't have permission to delete the S3 resource.
* `dbClearResult` replaced the S3 search for the query key with the output location from Athena.
* `dbListTables` now returns a vector of tables from `aws glue` instead of using an AWS Athena query. This increases the speed of the call.
* `dbListFields` now returns column names from `aws glue` instead of using an AWS Athena query. This increases the speed of the call.
* `dbExistsTable` now returns a boolean from `aws glue` instead of using an AWS Athena query. This increases the speed of the call.
* `create_work_group`: creates a workgroup with the specified name.
* `delete_work_group`: deletes the workgroup with the specified name.
* `list_work_group`: lists available workgroups for the account.
* `get_work_group`: returns information about the workgroup with the specified name.
* `update_work_group`: updates the workgroup with the specified name. The workgroup's name cannot be changed.
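For illustration, a minimal sketch of creating and inspecting a workgroup; argument names beyond the connection and workgroup name are assumptions, not confirmed from the source:

```r
library(DBI)
library(RAthena)

con <- dbConnect(RAthena::athena())

# create a workgroup ("description" is an assumed argument name for illustration)
create_work_group(con, work_group = "rathena_demo", description = "demo work group")

# list and inspect workgroups
list_work_group(con)
get_work_group(con, work_group = "rathena_demo")

# remove the demo workgroup again
delete_work_group(con, work_group = "rathena_demo")
```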
* `get_session_token` to create temporary session credentials.
* `assume_role` to assume an AWS ARN role.
* `dbConnect`
* `set_aws_env` to set AWS tokens as environmental variables.
* `get_aws_env` to return expected results from system variables.
* `tag_options` to create tag options for `create_work_group`.
* `work_group_config` and `work_group_config_update` to create the config of a work group.
* `AthenaConnection` `dbColumnInfo` method: returns a data.frame containing `field_name` and `type`.
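A minimal sketch of `dbColumnInfo` on a result object (the query is illustrative):

```r
library(DBI)

con <- dbConnect(RAthena::athena())

res <- dbSendQuery(con, "select * from iris2")

# data.frame with field_name and type for each column in the result
dbColumnInfo(res)
```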
* `time_check` to check how long is left on the Athena connection; if less than 15 minutes, a warning message is outputted to notify the user.
* `db_collect` for better integration with dplyr.
* `db_save_query` for better integration with dplyr.
* `db_copy_to` for better integration with dplyr.
* `AthenaConnection`:
    * `request`: build Athena query request
    * `db_desc`
    * `dbConnect`