Metadata
This documentation describes the Artemis metadata model(s) and the Cronus metadata service. Artemis metadata schemas are implemented as protocol buffers (protobufs).
To ensure flexibility and reproducibility, and to separate the definition of the business process model (BPM) from its execution, Artemis requires a metadata model with a persistent representation. The persistent, serialized metadata must be flexible to work with, to store, and to use within Artemis and other applications. The Artemis metadata model must also support external application development and use within a scalable data production infrastructure.
The Artemis metadata model is defined in the Google protobuf messaging format. Protocol buffers are a language-neutral, platform-neutral, extensible mechanism for serializing structured data: think XML, but smaller, faster, and simpler. Protocol buffers were developed by Google to support their infrastructure for service and application communication. The protobuf message format enables Artemis to define how to structure metadata once; special source code is then generated to easily write and read the structured metadata to and from a variety of data streams, using a variety of languages (Google developer pages).
Protobuf messages can also be read through reflection, without writing code against a specific message type, which makes the format extremely flexible in terms of application development, use, and persistence. The serialized protobuf is a bytestring, in other words a BLOB (binary large object), which is a flexible, lightweight storage mechanism for metadata. The language-neutral message format means that any application can be built to interact with the data, while the BLOB simplifies storage.
The messages can be persisted simply as files on a local filesystem or, for improved metadata management, cataloged in a simple key-value store. Applications can then persist and retrieve configurations using a key.
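As an illustration of the key-value idea, the following minimal sketch catalogs serialized metadata BLOBs in SQLite using only the Python standard library. The `BlobStore` class, its table layout, and the choice of a SHA-256 content hash as the key are assumptions made for this example and are not the Cronus implementation. The payload here is an arbitrary bytestring; in practice it would be the output of a protobuf message's `SerializeToString()`.

```python
import hashlib
import sqlite3


class BlobStore:
    """Hypothetical key-value catalog for serialized metadata BLOBs."""

    def __init__(self, path=":memory:"):
        # An in-memory database by default; pass a file path to persist.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS metadata "
            "(key TEXT PRIMARY KEY, payload BLOB NOT NULL)"
        )

    def put(self, payload: bytes) -> str:
        # Assumption: the key is the content hash of the serialized message.
        key = hashlib.sha256(payload).hexdigest()
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO metadata (key, payload) VALUES (?, ?)",
                (key, payload),
            )
        return key

    def get(self, key: str) -> bytes:
        row = self._db.execute(
            "SELECT payload FROM metadata WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return bytes(row[0])


store = BlobStore()
# Stand-in for message.SerializeToString(); any bytestring works here.
serialized = b"\x0a\x07artemis"
key = store.put(serialized)
restored = store.get(key)
```

Because the store treats the payload as opaque bytes, it does not need to know the message type; the application that retrieves the BLOB decides which message class to parse it with.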
The idea of persisting the configuration, as well as managing the state of Artemis, was derived from experimentation with the Pachyderm data science workflow tool and Kubernetes. Moreover, Arrow intends to develop a secure, over-the-wire message transport layer in the Arrow Flight project using gRPC and protobuf. Artemis can leverage Arrow Flight along with gRPC to build a scalable, secure data processing architecture that is flexible enough for cloud-native and HPC deployments.
Artemis Information Management Model
Artemis Metadata Model
The Artemis metadata model has three primary components:
1. The definition of the data processing job, i.e. all the required metadata to execute a business process model.
   - Definition of the data source(s)
   - Definition of the business process model for the particular data source(s)
   - The configuration of algorithms, tools, and services required to execute the business process model
2. Job processing metadata, i.e. metadata required to support the execution of the BPM and to retain data provenance.
   - The current state of the job
   - Metadata related to the raw data source
   - Metadata associating raw input data, intermediate data, and output data (provenance)
3. Summary metadata
   - Statistical information gathered during the processing of the data
   - Cost information (timing distributions) for processing stages, algorithm and tool execution
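The three components can be pictured as a nested data structure. The sketch below uses Python dataclasses purely for illustration; every class name, field name, and job state is hypothetical and does not reflect the actual Artemis protobuf schemas.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class JobState(Enum):
    # Hypothetical states; the model only says the current state is tracked.
    CONFIGURED = "configured"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class JobDefinition:
    """Component 1: metadata required to execute the business process model."""
    data_sources: List[str] = field(default_factory=list)
    process_model: str = ""
    tool_config: Dict[str, str] = field(default_factory=dict)


@dataclass
class JobProcessing:
    """Component 2: execution state and data provenance."""
    state: JobState = JobState.CONFIGURED
    source_metadata: Dict[str, str] = field(default_factory=dict)
    # Maps each intermediate or output dataset to the inputs it came from.
    provenance: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class JobSummary:
    """Component 3: statistics and cost information."""
    statistics: Dict[str, float] = field(default_factory=dict)
    # Timing samples in seconds, keyed by stage, algorithm, or tool name.
    timings: Dict[str, List[float]] = field(default_factory=dict)


@dataclass
class JobMetadata:
    definition: JobDefinition = field(default_factory=JobDefinition)
    processing: JobProcessing = field(default_factory=JobProcessing)
    summary: JobSummary = field(default_factory=JobSummary)


job = JobMetadata(
    definition=JobDefinition(data_sources=["raw.csv"], process_model="example_bpm")
)
job.processing.state = JobState.RUNNING
job.processing.provenance["out.arrow"] = ["raw.csv"]
job.summary.timings.setdefault("parse", []).append(0.42)
```

Keeping the definition separate from the processing and summary records mirrors the stated goal of separating the definition of the BPM from its execution: the definition can be stored and reused unchanged, while the other two are filled in as a job runs.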
Detailed information on the model can be found in the Appendix.