`artemis.io.writer`¶

Writer classes that manage output data streams and collect record batches into Arrow, Parquet, or CSV files.

Module Contents¶

class artemis.io.writer.BufferOutputOptions¶

BUFFER_MAX_SIZE = 2147483648¶

write_csv = True¶

class artemis.io.writer.BufferOutputWriter(name, **kwargs)¶

Bases: artemis.core.algo.IOAlgoBase

Manages output data with an in-memory buffer. The buffer is flushed to disk when a maximum buffer size is reached. The only supported data sink is Arrow::BufferOutputStream.
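The flush-on-size policy described above can be sketched with the standard library alone. This is a minimal illustration, not the Artemis implementation: `CappedBuffer`, its file naming, and the use of `io.BytesIO` in place of an Arrow `BufferOutputStream` are all assumptions made for the example.

```python
import io
import tempfile
from pathlib import Path


class CappedBuffer:
    """Hypothetical stand-in for a size-capped output buffer (not the Artemis class)."""

    def __init__(self, out_dir, max_size):
        self.out_dir = Path(out_dir)
        self.max_size = max_size
        self._buf = io.BytesIO()
        self.total_batches = 0
        self.total_files = 0

    def write(self, batch: bytes) -> None:
        # Flush first if this batch would push a non-empty buffer past the cap.
        if self._buf.tell() and self._buf.tell() + len(batch) > self.max_size:
            self.flush()
        self._buf.write(batch)
        self.total_batches += 1

    def flush(self) -> None:
        # Write the buffered bytes to a new file, then start a fresh buffer.
        if self._buf.tell() == 0:
            return
        path = self.out_dir / f"part-{self.total_files}.bin"
        path.write_bytes(self._buf.getvalue())
        self.total_files += 1
        self._buf = io.BytesIO()


def demo():
    # Four small "batches" with a 10-byte cap produce two files.
    with tempfile.TemporaryDirectory() as tmp:
        buf = CappedBuffer(tmp, max_size=10)
        for batch in (b"aaaa", b"bbbb", b"cccc", b"dd"):
            buf.write(batch)
        buf.flush()  # write whatever remains
        sizes = sorted(p.stat().st_size for p in Path(tmp).iterdir())
        return buf.total_batches, buf.total_files, sizes
```

The counters here loosely mirror the `total_batches` and `total_files` properties listed below; how the real class computes them is not shown in this page.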

property total_records(self)¶

property total_batches(self)¶

property total_files(self)¶

initialize(self)¶

flush(self)¶: If all else fails, clear everything

_validate_metainfo(self)¶: Validate the payload in the Arrow table: number of batches in the file, number of rows, number of columns, and schema.

_finalize_file(self)¶

_finalize(self)¶: Close the final writer, close the final buffer, and gather statistics.

expected_sizeof(self, batch)¶

_reset(self)¶: Reset for a new stream.

_new_sink(self)¶: Return a new BufferOutputStream.

_write_buffer(self)¶

_build_table_from_file(self, file_id)¶

Build a table schema from the inferred file schema.

Parameters: file_id (uuid) –

_write_file(self)¶

_new_writer(self)¶: Return a new writer. This requires closing the current writer, flushing the buffer, and writing the buffer to file.

_can_write(self, batch)¶

write(self, payload)¶

Manages writing a collection of batches and caches a batch if it would exceed the maximum buffer size.

This should function as a consumer of batches. RecordBatches are given as a generator to ensure all batches are pushed to a buffer.
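One way to picture the consumer pattern is a function that drains a generator of batches and starts a new group whenever the next batch would not fit. The sketch below is an assumption-laden illustration: `group_batches` is a hypothetical name, batches are plain `bytes` rather than Arrow RecordBatches, and the carried-over batch stands in for the cached batch mentioned above.

```python
from typing import Iterable, Iterator, List


def group_batches(batches: Iterable[bytes], max_size: int) -> Iterator[List[bytes]]:
    """Yield groups of batches whose combined size stays within max_size.

    A single batch larger than max_size still gets a group of its own,
    so no batch is ever dropped.
    """
    group: List[bytes] = []
    size = 0
    for batch in batches:
        if group and size + len(batch) > max_size:
            # The batch does not fit: emit the current group and carry
            # the batch over as the first entry of the next group.
            yield group
            group, size = [], 0
        group.append(batch)
        size += len(batch)
    if group:
        # Emit the final, possibly partial, group.
        yield group
```

Because the input is consumed lazily and every batch lands in exactly one group, the caller can treat each yielded group as the contents of one output buffer.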

static to_csv(buf, path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=False, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, date_format=None, doublequote=True, escapechar=None, decimal='.')¶

Write DataFrame to a comma-separated values (csv) file. Obtained from pandas.core.frame.

Parameters:
buf (pyarrow.buffer) – Arrow buffer of a RecordBatchFile.
path_or_buf (string or file handle, default None) – File path or object. If None is provided, the result is returned as a string.
sep (character, default ',') – Field delimiter for the output file.
na_rep (string, default '') – Missing data representation.
float_format (string, default None) – Format string for floating point numbers.
columns (sequence, optional) – Columns to write.
header (boolean or list of string, default True) – Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.
index (boolean, default False) – Write row names (index).
index_label (string or sequence, or False, default None) – Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False, do not print fields for index names. Use index_label=False for easier importing in R.
mode (str, default 'w') – Python write mode.
encoding (string, optional) – A string representing the encoding to use in the output file. Defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
compression (string, optional) – A string representing the compression to use in the output file. Allowed values are 'gzip', 'bz2', 'zip', 'xz'. This input is only used when the first argument is a filename.
line_terminator (string, default '\n') – The newline character or character sequence to use in the output file.
quoting (optional constant from csv module) – Defaults to csv.QUOTE_MINIMAL. If you have set a float_format, then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric.
quotechar (string (length 1), default '"') – Character used to quote fields.
doublequote (boolean, default True) – Control quoting of quotechar inside a field.
escapechar (string (length 1), default None) – Character used to escape sep and quotechar when appropriate.
chunksize (int or None) – Rows to write at a time.
date_format (string, default None) – Format string for datetime objects.
decimal (string, default '.') – Character recognized as decimal separator. E.g. use ',' for European data.

Returns: bytes
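To make a few of these options concrete, here is a small standard-library sketch of how `sep`, `na_rep`, `header`, and `line_terminator` shape the output. It does not call pandas or pyarrow and is not how `to_csv` is implemented; `rows_to_csv` and its row-list input are hypothetical, chosen only to show what each option means.

```python
import csv
import io


def rows_to_csv(columns, rows, sep=",", na_rep="", header=True, line_terminator="\n"):
    """Render rows (lists that may contain None) as UTF-8 encoded CSV bytes."""
    out = io.StringIO()
    writer = csv.writer(
        out,
        delimiter=sep,
        lineterminator=line_terminator,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
    )
    if header:
        writer.writerow(columns)
    for row in rows:
        # Substitute the missing-data marker for None values.
        writer.writerow([na_rep if value is None else value for value in row])
    return out.getvalue().encode("utf-8")


data = rows_to_csv(["a", "b"], [[1, None], [2, "x,y"]], na_rep="NA")
```

With `csv.QUOTE_MINIMAL`, the field `x,y` is quoted because it contains the delimiter; switching `sep` to `;` removes the need for quoting.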
