artemis.io.filehandler

Generic tool for reading raw bytes into an Arrow buffer. Aimed at handling ASCII-encoded files, e.g. tab-delimited or legacy data. Supports chunking data in bytes, scanning for a line delimiter, and extracting metadata from a header.
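As a rough illustration of the header-extraction step, the sketch below reads the header rows from a binary stream and returns the raw header bytes, the byte offset where the data begins, and the column names. It uses only the standard library. The function name read_header and its return shape are hypothetical and are not part of artemis; the parameter names merely echo the option names listed under FileHandlerOptions.

```python
import io


def read_header(stream, delimiter=",", encoding="utf8", header_rows=1):
    """Hypothetical helper: read header rows from a binary stream.

    Returns (raw header bytes, byte offset of the first data row,
    column names parsed from the last header row).
    """
    lines = [stream.readline() for _ in range(header_rows)]
    header = b"".join(lines)
    names = []
    if lines:
        # Column names are taken from the last header row.
        names = lines[-1].decode(encoding).rstrip("\r\n").split(delimiter)
    return header, stream.tell(), names


stream = io.BytesIO(b"id,name\n1,a\n2,b\n")
header, header_offset, names = read_header(stream)
```

The returned offset is the position a block reader would start from so that the header is not parsed as data.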

Module Contents

class artemis.io.filehandler.FileHandlerOptions
blocksize
num_rows = 4095
delimiter = ,
linesep =
header_offset = 0
header =
footer_size = 0
footer =
nsamples = 1
filetype = csv
encoding = utf8
schema = []
header_rows = 1
class artemis.io.filehandler.FileHandlerTool(name, **kwargs)

Bases: artemis.core.algo.IOAlgoBase

initialize(self)
property size_bytes(self)
cache_header(self, header)
cache_schema(self, schema)
cache_header_offset(self, offset)
validate(self, header, header_offset, schema)
prepare_csv(self, stream)
prepare_legacy(self, stream)

Assumes the schema is supplied to the parser tool.

prepare_sas(self, stream)
prepare_ipc(self, filepath_or_buffer)
exec_csv(self, stream)
exec_legacy(self, stream)
exec_sas(self, stream)
exec_ipc(self, filepath_or_buffer)
exec_blocks(self, stream)
execute(self, filepath_or_buffer)
_build_table_from_file(self, file_id)
_update(self, filepath_or_buffer)
_create_header(self, schema)
_readline(self, stream, size=-1)

For a pyarrow input_stream, uses the readline logic from CPython's _pyio module.
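A minimal sketch of that readline logic, assuming only that the stream exposes read(). It follows the byte-at-a-time fallback path of CPython's pure-Python io implementation and is not the artemis code itself; the standalone function name readline is used here for illustration.

```python
import io


def readline(stream, size=-1):
    """Read up to and including the next b"\\n", or at most `size` bytes."""
    res = bytearray()
    while size < 0 or len(res) < size:
        b = stream.read(1)  # one byte at a time: no peek() is assumed
        if not b:
            break  # EOF
        res += b
        if res.endswith(b"\n"):
            break
    return bytes(res)


stream = io.BytesIO(b"ab\ncd")
first = readline(stream)
```

Reading one byte per call is slow on large inputs, which is why a block-based reader that scans for delimiters is the more common approach for bulk data.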

_seek_delimiter(self, file_, delimiter, blocksize)

Dask-like line delimiter seek: reads by bytes and seeks to the nearest line delimiter. The default blocksize is 2**16 bytes (64 KiB).

BUG: last block is not at EOF?
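The delimiter seek can be sketched as follows. This is an illustrative standalone function modelled on the Dask-style approach, not the method's actual body; the name seek_to_delimiter and the returned position are assumptions made for the example.

```python
import io


def seek_to_delimiter(file_, delimiter, blocksize=2**16):
    """Advance file_ to just past the next delimiter and return the position.

    A file positioned at 0 is left alone, since the start of the file
    already counts as a valid block start.
    """
    if file_.tell() == 0:
        return 0
    tail = b""
    while True:
        chunk = file_.read(blocksize)
        if not chunk:
            return file_.tell()  # EOF reached without finding a delimiter
        # Prepend the tail of the previous chunk so a delimiter that
        # straddles two reads is still found.
        window = tail + chunk
        idx = window.find(delimiter)
        if idx >= 0:
            pos = file_.tell() - len(window) + idx + len(delimiter)
            file_.seek(pos)
            return pos
        tail = window[-len(delimiter):]


buf = io.BytesIO(b"a,b\nc,d\ne,f\n")
buf.seek(1)
pos = seek_to_delimiter(buf, b"\n", blocksize=2)
```

When no delimiter remains, the sketch stops at EOF rather than raising, so the caller has to check for an empty final block.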

_get_block(self, file_, offset, length, size, delimiter=None)

Dask-like block read of data in bytes. Returns the length in bytes to read for a block. Starts at the last position in the file; does not ensure that the file is already positioned after a delimiter.

Requires the starting offset to be after a delimiter.

# TODO: if the offset is not after a delimiter, seek to the next one

# BUG: last seek goes past EOF
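One way to compute such a block length is sketched below, assuming a newline delimiter and an offset that already sits just after a delimiter. The function name block_length is hypothetical. The sketch clamps the result at the file size so the final block cannot extend past EOF; whether the real method behaves this way is not stated here.

```python
import io


def block_length(file_, offset, length, size):
    """Bytes to read from offset so that the block ends just after a newline.

    Assumes offset is already positioned just after a delimiter.
    """
    target = offset + length
    if target >= size:
        return size - offset  # final block: stop at EOF
    file_.seek(target)
    file_.readline()  # consume up to and including the next b"\n"
    return min(file_.tell(), size) - offset


data = b"a,b\nc,d\ne,f\n"
buf = io.BytesIO(data)
offset, blocks = 0, []
while offset < len(data):
    n = block_length(buf, offset, 5, len(data))
    blocks.append(data[offset:offset + n])
    offset += n
```

Because each block ends right after a delimiter, the next offset always satisfies the precondition and the blocks tile the file without splitting a row.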

_read_block(self, file_, offset, length, delimiter=None)

Dask-like block read of data in bytes. Ensures the start point of a block is after a delimiter.
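The start-alignment rule can be illustrated on an in-memory bytes object. This is a simplified sketch rather than the method's implementation: it takes the data directly instead of a file handle, and the name read_block_aligned is hypothetical.

```python
def read_block_aligned(data, offset, length, delimiter=b"\n"):
    """Return the block for [offset, offset + length), aligned to delimiters.

    The start moves forward to just after the first delimiter at or after
    offset (except at offset 0); the end extends to just after the first
    delimiter at or after offset + length.
    """
    start = offset
    if offset > 0:
        i = data.find(delimiter, offset)
        start = len(data) if i < 0 else i + len(delimiter)
    j = data.find(delimiter, offset + length)
    end = len(data) if j < 0 else j + len(delimiter)
    return data[start:end]


data = b"a,b\nc,d\ne,f\n"
# Fixed-size nominal offsets; each block is then snapped to row boundaries.
blocks = [read_block_aligned(data, off, 5) for off in range(0, len(data), 5)]
```

With fixed nominal offsets, the end of one block coincides with the start of the next, so every row lands in exactly one block. Some blocks can come back empty when a single row spans more than one nominal block.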

class artemis.io.filehandler.FileFactory

Some ideas taken from github.com/claudep/tabimport. Abstracts away the stream type. Assumes everything is a file read in as bytes.

classmethod _sniff_format(cls, dfile)
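To illustrate what abstracting away the stream type could look like for FileFactory, the sketch below normalises a path, raw bytes, or an already open stream into a binary file-like object. The function open_stream is hypothetical and does not reflect how FileFactory or _sniff_format are actually written.

```python
import io
import os


def open_stream(filepath_or_buffer):
    """Hypothetical helper: return a binary file-like object for the input."""
    if isinstance(filepath_or_buffer, (bytes, bytearray, memoryview)):
        # Raw bytes are wrapped so callers can seek and read uniformly.
        return io.BytesIO(bytes(filepath_or_buffer))
    if isinstance(filepath_or_buffer, (str, os.PathLike)):
        return open(filepath_or_buffer, "rb")
    # Anything else is assumed to already be a binary stream.
    return filepath_or_buffer


stream = open_stream(b"a,b\n1,2\n")
first_line = stream.readline()
```

Downstream code can then rely on read, seek, and tell regardless of where the bytes came from.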