artemis.io.filehandler
Generic tool for reading raw bytes into an Arrow buffer. Aimed at handling ASCII-encoded files, e.g. tab-delimited or legacy data. Supports chunking data in bytes, scanning for the line delimiter, and extracting metadata from a header.
Module Contents
-
class artemis.io.filehandler.FileHandlerOptions
-
blocksize
-
num_rows = 4095
-
delimiter = ,
-
linesep =
-
header_offset = 0
-
header =
-
nsamples = 1
-
filetype = csv
-
encoding = utf8
-
schema = []
-
header_rows = 1
-
class artemis.io.filehandler.FileHandlerTool(name, **kwargs)
Bases: artemis.core.algo.IOAlgoBase
-
initialize(self)
-
property size_bytes(self)
-
cache_header(self, header)
-
cache_schema(self, schema)
-
cache_header_offset(self, offset)
-
validate(self, header, header_offset, schema)
-
prepare_csv(self, stream)
-
prepare_legacy(self, stream)
Assumes the schema is supplied to the parser tool.
-
prepare_sas(self, stream)
-
prepare_ipc(self, filepath_or_buffer)
-
exec_csv(self, stream)
-
exec_legacy(self, stream)
-
exec_sas(self, stream)
-
exec_ipc(self, filepath_or_buffer)
-
exec_blocks(self, stream)
-
execute(self, filepath_or_buffer)
-
_build_table_from_file(self, file_id)
-
_update(self, filepath_or_buffer)
-
_create_header(self, schema)
-
_readline(self, stream, size=-1)
Reads a line from a pyarrow input_stream, using the readline logic from CPython's _pyio.
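As an illustration only, the byte-at-a-time loop below mirrors the fallback path of CPython's `_pyio.IOBase.readline` for streams that offer `read()` but no `peek()`. It is a sketch, not the Artemis implementation: the free function `readline` and the use of `io.BytesIO` in place of a pyarrow stream are assumptions made to keep the example self-contained.

```python
import io

def readline(stream, size=-1):
    """Read up to and including the next newline byte, or at most `size` bytes."""
    line = bytearray()
    while size < 0 or len(line) < size:
        byte = stream.read(1)      # one byte at a time, as _pyio does without peek()
        if not byte:               # empty read means EOF
            break
        line += byte
        if line.endswith(b"\n"):   # stop once the line terminator is consumed
            break
    return bytes(line)

# Stand-in for a pyarrow input stream (assumption: only read() is required).
stream = io.BytesIO(b"id,name\n1,alice\n2,bob")
header = readline(stream)
first = readline(stream)
```

Reading a single byte per call is slow on large inputs, but it never consumes bytes past the line terminator, which matters when the caller goes on to read fixed-size blocks from the same stream.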
-
_seek_delimiter(self, file_, delimiter, blocksize)
Dask-like line-delimiter seek: reads by bytes and seeks to the nearest line delimiter. Default blocksize is 2**16 bytes (64 KiB).
BUG: Last block is not at EOF???
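A minimal sketch of a Dask-style delimiter seek, written against `io.BytesIO` rather than the Artemis class. The function name `seek_delimiter`, its return value, and the sample data are assumptions for illustration. The function reads `blocksize` bytes at a time, keeps a short tail so a multi-byte delimiter split across two reads is still found, and leaves the file positioned just past the delimiter.

```python
import io

def seek_delimiter(file_, delimiter, blocksize=2**16):
    """Move file_ to just past the next delimiter and return that position."""
    if file_.tell() == 0:
        return 0                       # start of file counts as a valid boundary
    tail = b""
    while True:
        chunk = file_.read(blocksize)
        if not chunk:                  # EOF reached without finding a delimiter
            return file_.tell()
        window = tail + chunk
        idx = window.find(delimiter)
        if idx >= 0:
            # window starts at file_.tell() - len(window) in file coordinates
            pos = file_.tell() - len(window) + idx + len(delimiter)
            file_.seek(pos)
            return pos
        tail = window[-len(delimiter):]  # keep enough bytes to catch a split delimiter

data = b"a,b\nc,d\ne,f\n"
f = io.BytesIO(data)
f.seek(1)                              # land in the middle of the first row
pos = seek_delimiter(f, b"\n", blocksize=2)
```

With the sample data, `data[pos:]` begins at the second row, so a block that starts at `pos` never begins mid-record.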
-
_get_block(self, file_, offset, length, size, delimiter=None)
Dask-like block read of data in bytes. Returns the length in bytes to read for a block. Starts at the last position in the file; does not ensure that the file is already positioned after a delimiter.
Requires the starting offset to be after a delimiter.
# TODO, if offset not a delimiter seek to the next one
# BUG last seek goes past EOF
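One way to compute a block length while guarding against the past-EOF seek noted above is to clamp the tentative end to the file size before scanning for the delimiter. The sketch below is hypothetical: `block_length`, its `scan` parameter, and the sample data are assumptions, not the Artemis code. Like the method described here, it assumes `offset` already sits just after a delimiter.

```python
import io

def block_length(file_, offset, length, size, delimiter=b"\n", scan=2**16):
    """Bytes from offset through the first delimiter found from offset + length on."""
    end = min(offset + length, size)   # clamp so the seek never passes EOF
    file_.seek(end)
    tail = b""
    while end < size:
        chunk = file_.read(scan)
        if not chunk:                  # no further delimiter: block runs to EOF
            end = size
            break
        window = tail + chunk
        idx = window.find(delimiter)
        if idx >= 0:
            end = file_.tell() - len(window) + idx + len(delimiter)
            break
        tail = window[-len(delimiter):]
    return end - offset

data = b"a,b\nc,d\ne,f\n"
f = io.BytesIO(data)
blocks, offset = [], 0
while offset < len(data):
    n = block_length(f, offset, 5, len(data))
    blocks.append(data[offset:offset + n])
    offset += n                        # next block starts right after a delimiter
```

Because each returned length ends on a delimiter, advancing `offset` by that length keeps the precondition satisfied for the next call, and the blocks tile the file without overlap.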
-
_read_block(self, file_, offset, length, delimiter=None)
Dask-like block read of data in bytes. Ensures the start point of a block is after a delimiter.
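The sketch below shows one possible shape for a delimiter-aligned block read, in the spirit of Dask's byte-block utilities. The names `read_block` and `_skip_past_delimiter` and the sample data are assumptions for illustration; the actual Artemis method may differ. When a delimiter is given, both the start and the end of the requested range are pushed forward to the next delimiter, so naive fixed-size offsets still yield whole records.

```python
import io

def _skip_past_delimiter(file_, delimiter, blocksize=2**16):
    """Advance file_ to just past the next delimiter (no-op at position 0)."""
    if file_.tell() == 0:
        return
    tail = b""
    while True:
        chunk = file_.read(blocksize)
        if not chunk:                  # EOF: leave the position where it is
            return
        window = tail + chunk
        idx = window.find(delimiter)
        if idx >= 0:
            file_.seek(file_.tell() - len(window) + idx + len(delimiter))
            return
        tail = window[-len(delimiter):]

def read_block(file_, offset, length, delimiter=None):
    """Read about `length` bytes from `offset`, aligned to delimiter boundaries."""
    if delimiter:
        file_.seek(offset)
        _skip_past_delimiter(file_, delimiter)
        start = file_.tell()           # aligned start of this block
        length -= start - offset       # keep the nominal end where it was
        file_.seek(start + length)
        _skip_past_delimiter(file_, delimiter)
        offset, length = start, file_.tell() - start
    file_.seek(offset)
    return file_.read(length)

data = b"a,b\nc,d\ne,f\n"
f = io.BytesIO(data)
parts = [read_block(f, off, 5, b"\n") for off in (0, 5, 10)]
```

In this run the last nominal block comes back empty because its rows were already absorbed by the previous block; concatenating `parts` reproduces `data` exactly.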
-