artemis.io.writer

Writer classes that manage output data streams and collect record batches into Arrow, Parquet, or CSV file formats.

Module Contents

class artemis.io.writer.BufferOutputOptions
BUFFER_MAX_SIZE = 2147483648
write_csv = True
class artemis.io.writer.BufferOutputWriter(name, **kwargs)

Bases: artemis.core.algo.IOAlgoBase

Manages output data with an in-memory buffer. The buffer is flushed to disk when the maximum buffer size is reached. The only supported data sink is Arrow::BufferOutputStream.

property total_records(self)
property total_batches(self)
property total_files(self)
initialize(self)
flush(self)

If all else fails, clear everything

_validate_metainfo(self)

Validate the payload in the Arrow table: number of batches in the file, number of rows, number of columns, and schema.
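The checks listed above amount to comparing expected metadata against what is actually found in a written file. The following is a minimal sketch of that idea using plain Python containers in place of Arrow tables. `FileSummary` and `validate_metainfo` are hypothetical names introduced for illustration; they are not part of artemis, and the real method may behave differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileSummary:
    """Hypothetical stand-in for the metadata recorded about one output file."""
    num_batches: int
    num_rows: int
    num_columns: int
    schema: Tuple[Tuple[str, str], ...]  # (column name, type name) pairs


def validate_metainfo(expected: FileSummary, actual: FileSummary) -> List[str]:
    """Return the names of the fields that disagree (empty list means valid)."""
    mismatches = []
    for field in ("num_batches", "num_rows", "num_columns", "schema"):
        if getattr(expected, field) != getattr(actual, field):
            mismatches.append(field)
    return mismatches


schema = (("id", "int64"), ("value", "double"))
expected = FileSummary(num_batches=3, num_rows=300, num_columns=2, schema=schema)
actual = FileSummary(num_batches=3, num_rows=299, num_columns=2, schema=schema)
print(validate_metainfo(expected, actual))  # ['num_rows']
```

Returning the list of mismatched fields, rather than a single boolean, makes it easier to report which of the four checks failed.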

_finalize_file(self)
_finalize(self)

Close the final writer, close the final buffer, and gather statistics.

expected_sizeof(self, batch)
_reset(self)

Reset for a new stream.

_new_sink(self)

Return a new BufferOutputStream.

_write_buffer(self)
_build_table_from_file(self, file_id)

Build a table schema from the inferred file schema.

Parameters

file_id (uuid) –

_write_file(self)
_new_writer(self)

Return a new writer. This requires closing the current writer, flushing the buffer, and writing the buffer to file.

_can_write(self, batch)
write(self, payload)

Manages writing a collection of batches. Caches a batch if it would go beyond the max buffer size.

This should function as a consumer of batches. RecordBatches are given as a generator to ensure all batches are pushed to a buffer.
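The consumer pattern described above can be sketched with the standard library alone. This is an illustrative stand-in, not the artemis implementation: `io.BytesIO` replaces the Arrow buffer, raw `bytes` objects replace RecordBatches, and `write_batches` and `generate_batches` are hypothetical names. A batch that would push the buffer past the limit is held back until the current buffer has been flushed.

```python
import io
from typing import Iterable, Iterator, List


def write_batches(batches: Iterable[bytes], max_size: int) -> List[bytes]:
    """Consume serialized batches and return the contents of each flushed 'file'."""
    files: List[bytes] = []
    buffer = io.BytesIO()
    for batch in batches:
        # A batch that would push the buffer past max_size is held back:
        # the current buffer is flushed first, then the batch starts a new one.
        if buffer.tell() > 0 and buffer.tell() + len(batch) > max_size:
            files.append(buffer.getvalue())
            buffer = io.BytesIO()
        buffer.write(batch)
    if buffer.tell() > 0:
        files.append(buffer.getvalue())  # final flush of whatever remains
    return files


def generate_batches() -> Iterator[bytes]:
    """Hypothetical producer yielding three 4-byte batches."""
    for label in (b"aaaa", b"bbbb", b"cccc"):
        yield label


files = write_batches(generate_batches(), max_size=10)
print([len(f) for f in files])  # [8, 4]
```

Because the batches arrive through a generator, the consumer sees every batch exactly once and none is dropped when a buffer fills up.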

static to_csv(buf, path_or_buf=None, sep=', ', na_rep='', float_format=None, columns=None, header=True, index=False, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, date_format=None, doublequote=True, escapechar=None, decimal='.')
Write DataFrame to a comma-separated values (CSV) file. Obtained from pandas.core.frame.

buf : pyarrow.buffer

Arrow buffer of a RecordBatchFile

path_or_buf : string or file handle, default None

File path or object, if None is provided the result is returned as a string.

sep : character, default ,

Field delimiter for the output file.

na_rep : string, default ''

Missing data representation

float_format : string, default None

Format string for floating point numbers

columns : sequence, optional

Columns to write

header : boolean or list of string, default True

Write out the column names. If a list of strings is given it is assumed to be aliases for the column names

index : boolean, default False

Write row names (index)

index_label : string or sequence, or False, default None

Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R

mode : str

Python write mode, default ‘w’

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

compression : string, optional

A string representing the compression to use in the output file. Allowed values are ‘gzip’, ‘bz2’, ‘zip’, ‘xz’. This input is only used when the first argument is a filename.

line_terminator : string, default '\n'

The newline character or character sequence to use in the output file

quoting : optional constant from csv module

Defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric

quotechar : string (length 1), default '"'

Character used to quote fields

doublequote : boolean, default True

Control quoting of quotechar inside a field

escapechar : string (length 1), default None

Character used to escape sep and quotechar when appropriate

chunksize : int or None

Rows to write at a time

date_format : string, default None

Format string for datetime objects

decimal : string, default ‘.’

Character recognized as decimal separator. E.g. use ‘,’ for European data

bytes
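
The interaction between float_format and quoting noted above can be seen with the standard-library csv module alone. This is a small sketch under stated assumptions: it does not call to_csv itself, and `render` is a hypothetical helper that formats floats before handing them to `csv.writer`.

```python
import csv
import io

rows = [("alpha", 1.5), ("beta", 2.25)]


def render(rows, float_format=None, quoting=csv.QUOTE_MINIMAL):
    """Write rows to an in-memory CSV string, optionally pre-formatting floats."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=quoting, lineterminator="\n")
    for name, value in rows:
        if float_format is not None:
            value = float_format % value  # the float becomes a str here
        writer.writerow([name, value])
    return out.getvalue()


# Floats stay numeric, so QUOTE_NONNUMERIC leaves them unquoted.
print(render(rows, quoting=csv.QUOTE_NONNUMERIC))
# Once formatted, the values are strings and get quoted like any other text.
print(render(rows, float_format="%.2f", quoting=csv.QUOTE_NONNUMERIC))
```

In the first call the numeric column is written as `1.5` and `2.25`; in the second it is written as `"1.50"` and `"2.25"`, because the formatted values are strings by the time the writer sees them.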