artemis.io.writer¶
Writer classes that manage output data streams and collect record batches into Arrow, Parquet, or CSV file formats.
Module Contents¶
-
class
artemis.io.writer.BufferOutputWriter(name, **kwargs)¶ Bases:
artemis.core.algo.IOAlgoBase
Manage output data with an in-memory buffer. The buffer is flushed to disk when a maximum buffer size is reached. The only data sink supported is Arrow::BufferOutputStream.
-
property
total_records(self)¶
-
property
total_batches(self)¶
-
property
total_files(self)¶
-
initialize(self)¶
-
flush(self)¶ If all else fails, clear everything
-
_validate_metainfo(self)¶ Validate the payload in the Arrow table: number of batches in the file, number of rows, number of columns, and schema.
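The four checks listed above can be sketched without Arrow at all. The following is a minimal illustration only: `FileSummary` and `validate_summary` are hypothetical names invented here, not part of Artemis, and the real method presumably inspects an Arrow table rather than a plain dataclass.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FileSummary:
    """Hypothetical summary of one written file (not an Artemis type)."""
    num_batches: int
    num_rows: int
    schema: Tuple[str, ...]  # column names, in order


def validate_summary(expected: FileSummary, actual: FileSummary) -> List[str]:
    """Return a list of mismatch messages; an empty list means valid."""
    problems = []
    if expected.num_batches != actual.num_batches:
        problems.append(
            f"batches: expected {expected.num_batches}, got {actual.num_batches}"
        )
    if expected.num_rows != actual.num_rows:
        problems.append(f"rows: expected {expected.num_rows}, got {actual.num_rows}")
    if len(expected.schema) != len(actual.schema):
        problems.append(
            f"columns: expected {len(expected.schema)}, got {len(actual.schema)}"
        )
    if expected.schema != actual.schema:
        problems.append("schema: column names differ")
    return problems


expected = FileSummary(num_batches=2, num_rows=10, schema=("id", "value"))
actual = FileSummary(num_batches=2, num_rows=9, schema=("id", "value"))
problems = validate_summary(expected, actual)
```

Collecting all mismatches, rather than stopping at the first one, is a design choice of this sketch; it makes a single validation pass report everything that is wrong with a file.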
-
_finalize_file(self)¶
-
_finalize(self)¶ Close the final writer, close the final buffer, and gather statistics.
-
expected_sizeof(self, batch)¶
-
_reset(self)¶ reset for new stream
-
_new_sink(self)¶ return a new BufferOutputStream
-
_write_buffer(self)¶
-
_build_table_from_file(self, file_id)¶ build a table schema from inferred file schema
- Parameters
file_id (uuid) –
-
_write_file(self)¶
-
_new_writer(self)¶ Return a new writer. This requires closing the current writer, flushing the buffer, and writing the buffer to file.
-
_can_write(self, batch)¶
-
write(self, payload)¶ Manage writing a collection of batches. A batch is cached if it is beyond the maximum buffer size.
This should function as a consumer of batches. RecordBatches are given as a generator to ensure all batches are pushed to a buffer.
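The consumer pattern described for `write` can be illustrated with a standard-library sketch. Everything here is an assumption made for illustration: `BufferedSink` is a hypothetical class, raw `bytes` objects stand in for record batches, and a list of byte strings stands in for files on disk. The sketch also rolls over to a new buffer before writing an oversized batch, which may differ from how Artemis caches the batch.

```python
import io
from typing import Iterable, Iterator, List


class BufferedSink:
    """Hypothetical stand-in for a size-capped in-memory output buffer."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._buffer = io.BytesIO()
        self.files: List[bytes] = []  # each entry stands in for one file on disk
        self.total_batches = 0

    def _can_write(self, batch: bytes) -> bool:
        # A batch fits if adding it keeps the buffer at or under the cap.
        return self._buffer.tell() + len(batch) <= self.max_size

    def flush(self) -> None:
        # "Write" the buffer out (if non-empty) and start a fresh one.
        if self._buffer.tell() > 0:
            self.files.append(self._buffer.getvalue())
        self._buffer = io.BytesIO()

    def write(self, payload: Iterable[bytes]) -> None:
        # Consume every batch from the iterable, rolling over when full.
        for batch in payload:
            if not self._can_write(batch):
                self.flush()
            self._buffer.write(batch)
            self.total_batches += 1


def batches() -> Iterator[bytes]:
    # Generator of fake 4-byte "record batches".
    for i in range(5):
        yield bytes([i]) * 4


sink = BufferedSink(max_size=10)
sink.write(batches())
sink.flush()  # flush the remainder at end of stream
```

With a 10-byte cap and five 4-byte batches, the sink ends up with three "files" of 8, 8, and 4 bytes. Passing a generator means the sink pulls each batch exactly once and nothing is left behind in an intermediate list.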
-
static
to_csv(buf, path_or_buf=None, sep=', ', na_rep='', float_format=None, columns=None, header=True, index=False, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, date_format=None, doublequote=True, escapechar=None, decimal='.')¶
Write DataFrame to a comma-separated values (CSV) file. Obtained from pandas.core.frame.
- buf : pyarrow.buffer
Arrow buffer of a RecordBatchFile
- path_or_buf : string or file handle, default None
File path or object. If None is provided, the result is returned as a string.
- sep : character, default ','
Field delimiter for the output file.
- na_rep : string, default ''
Missing data representation
- float_format : string, default None
Format string for floating point numbers
- columns : sequence, optional
Columns to write
- header : boolean or list of string, default True
Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names
- index : boolean, default False
Write row names (index)
- index_label : string or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False, do not print fields for index names. Use index_label=False for easier importing in R
- mode : str
Python write mode, default ‘w’
- encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
- compression : string, optional
A string representing the compression to use in the output file. Allowed values are ‘gzip’, ‘bz2’, ‘zip’, ‘xz’. This input is only used when the first argument is a filename.
- line_terminator : string, default '\n'
The newline character or character sequence to use in the output file
- quoting : optional constant from csv module
Defaults to csv.QUOTE_MINIMAL. If you have set a float_format, then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric
- quotechar : string (length 1), default '"'
Character used to quote fields
- doublequote : boolean, default True
Control quoting of quotechar inside a field
- escapechar : string (length 1), default None
Character used to escape sep and quotechar when appropriate
- chunksize : int or None
Rows to write at a time
- date_format : string, default None
Format string for datetime objects
- decimal : string, default ‘.’
Character recognized as decimal separator. E.g. use ‘,’ for European data
bytes
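The interaction between `float_format` and `quoting` noted in the parameter list can be reproduced with Python's standard `csv` module alone. This is a sketch of the underlying behaviour, not a call to `to_csv` itself; the `render` helper and the sample rows are made up for illustration.

```python
import csv
import io

rows = [["a", 1.5], ["b", 2.25]]


def render(rows, float_format=None):
    """Write rows with QUOTE_NONNUMERIC, optionally pre-formatting floats."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for label, value in rows:
        if float_format is not None:
            # Formatting turns the float into a string before it is written.
            value = float_format % value
        writer.writerow([label, value])
    return out.getvalue()


plain = render(rows)                           # floats stay numeric, unquoted
formatted = render(rows, float_format="%.2f")  # floats become quoted strings
```

In `plain` the numbers are written bare (`"a",1.5`), while in `formatted` they are quoted (`"a","1.50"`) because the writer only sees strings by the time the row is emitted.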
-
property