artemis.generators.simutable.febrlgen

Modifier class to generate errors given a record. This class is a wrapper of the FEBRL data generator modifier functions.

Febrl (Freely Extensible Biomedical Record Linkage) is a freely available record linkage tool written in Python. It offers both a programming interface and a GUI, and supports several record linkage algorithms. In addition, the tool includes a data generator that produces two record data sets suitable for performing record linkage. The data generator creates the two datasets with the following random variables:

- field given a frequency table (histogram)
- date
- phone
- identification

The second (duplicate) dataset draws randomly from the first (original). The following probabilities are defined to modify the duplicate records. Each field defines a dictionary of probabilities:

- Modify a given field
- Misspell
- Insert a character
- Delete a character
- Substitute a character
- Swap two characters
- Swap two fields
- Swap words in a field
- Split a word
- Merge words
- Null field
- Insert new value

PDFs for the number of duplicates for each record:

- Uniform
- Poisson
- Zipf

Each duplicate applies modifications up to:

- (Fixed) Max N modifications for a given record
- (Random) Max N modifications for a given field

Straightforward to implement. Requires suitable dictionaries for generating proper Canadian addresses. Requires a dictionary of commonly misspelled words (with a list of misspellings). This will also serve well for the Postal data.

Requires a way to select from a probability distribution: https://eli.thegreenplace.net/2010/01/22/weighted-random-generation-python
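Selecting from a discrete probability distribution can be sketched with a cumulative-weight lookup. This is a minimal illustration, not the code used by this module: the function name `weighted_choice` and its parameters are hypothetical.

```python
import bisect
import itertools
import random


def weighted_choice(items, weights, rng=random):
    """Return one element of `items`, chosen with probability
    proportional to the matching entry in `weights`.

    Hypothetical helper for illustration only.
    """
    if not items or len(items) != len(weights):
        raise ValueError("items and weights must be non-empty and of equal length")
    # Running totals: cumulative[i] is the upper edge of bucket i.
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Draw a point in [0, total) and find the bucket that contains it.
    point = rng.random() * total
    index = bisect.bisect_right(cumulative, point)
    # Guard against a floating-point edge case at the upper boundary.
    return items[min(index, len(items) - 1)]
```

The standard library's `random.choices(population, weights=...)` performs the same kind of selection and may be preferable when no custom behaviour is needed.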

FEBRLGEN Comments

For all fields, the following keys must be given:

- select_prob: Probability of selecting a field for introducing one or more modifications (set this to 0.0 if no modifications should be introduced into this field ever). Note: The sum of these select probabilities over all defined fields must be 100.
- misspell_prob: Probability to swap an original value with a randomly chosen misspelling from the corresponding misspelling dictionary (can only be set to larger than 0.0 if such a misspellings dictionary is defined for the given field).
- ins_prob: Probability to insert a character into a field value.
- del_prob: Probability to delete a character from a field value.
- sub_prob: Probability to substitute a character in a field value with another character.
- trans_prob: Probability to transpose two characters in a field value.
- val_swap_prob: Probability to swap the value in a field with another (randomly selected) value for this field (taken from this field’s look-up table).
- wrd_swap_prob: Probability to swap two words in a field (given there are at least two words in a field).
- spc_ins_prob: Probability to insert a space into a field value (thus splitting a word).
- spc_del_prob: Probability to delete a space (if available) in a field (and thus merging two words).
- miss_prob: Probability to set a field value to missing (empty).
- new_val_prob: Probability to insert a new value given the original value was empty.

Note: The sum over the probabilities ins_prob, del_prob, sub_prob, trans_prob, val_swap_prob, wrd_swap_prob, spc_ins_prob, spc_del_prob, and miss_prob for each defined field must be 1.0; or 0.0 if no modification should be done at all on a given field.
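The constraint in the note can be checked with a few lines of code. The sketch below is an assumption about how such a check might look: the key names come from the list above, while `check_field_probabilities`, `MODIFICATION_KEYS` and the example values are hypothetical.

```python
import math

# Keys named in the note; their sum must be 1.0, or 0.0 for "never modify".
MODIFICATION_KEYS = (
    "ins_prob", "del_prob", "sub_prob", "trans_prob", "val_swap_prob",
    "wrd_swap_prob", "spc_ins_prob", "spc_del_prob", "miss_prob",
)


def check_field_probabilities(field_config):
    """Return True if the modification probabilities of one field
    sum to 1.0 (or to 0.0, meaning the field is never modified)."""
    total = sum(field_config.get(key, 0.0) for key in MODIFICATION_KEYS)
    # Compare with a tolerance, since sums of floats are rarely exact.
    return (math.isclose(total, 1.0, abs_tol=1e-9)
            or math.isclose(total, 0.0, abs_tol=1e-9))


# Example values are made up for illustration.
surname_config = {
    "select_prob": 20.0, "misspell_prob": 0.0,
    "ins_prob": 0.1, "del_prob": 0.1, "sub_prob": 0.2, "trans_prob": 0.1,
    "val_swap_prob": 0.1, "wrd_swap_prob": 0.1, "spc_ins_prob": 0.1,
    "spc_del_prob": 0.1, "miss_prob": 0.1, "new_val_prob": 0.0,
}
```

With these values, `check_field_probabilities(surname_config)` returns True, while a configuration containing only `{"ins_prob": 0.5}` would be rejected.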

Module Contents

class artemis.generators.simutable.febrlgen.Modifier(fake, generators, schema, modifiers)

Bases: object

Base modification class for a row of data

get_stats(self)
reset_fake(self, fake)
validate(self)

Ensure the defined metafield probabilities sum to 1

_reset(self)

Reset per-record counters

field_pdf(self)

Create a map of duplicates and probabilities according to a PDF (e.g. uniform) and store it for re-use on each original event. The current version is taken directly from FEBRL and needs review: does the number of duplicates stored start at 2?
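One way to build such a map is to evaluate the chosen PDF at each possible duplicate count and normalise. The sketch below is not the FEBRL code: `duplicate_distribution` and its parameters are hypothetical, counts are assumed to start at 1, and the Poisson and Zipf shapes are truncated at `max_duplicates` and renormalised.

```python
import math


def duplicate_distribution(kind, max_duplicates, mean=1.0, exponent=1.0):
    """Map each duplicate count 1..max_duplicates to a probability.

    Hypothetical helper: `kind` is one of "uniform", "poisson", "zipf".
    """
    counts = range(1, max_duplicates + 1)
    if kind == "uniform":
        weights = [1.0 for _ in counts]
    elif kind == "poisson":
        # Poisson mass at k, truncated to the allowed range.
        weights = [math.exp(-mean) * mean ** k / math.factorial(k) for k in counts]
    elif kind == "zipf":
        # Zipf-like weights: proportional to 1 / k**exponent.
        weights = [1.0 / k ** exponent for k in counts]
    else:
        raise ValueError("unknown distribution: %r" % kind)
    total = sum(weights)
    # Normalise so that the probabilities sum to 1.
    return {k: w / total for k, w in zip(counts, weights)}
```

The resulting dictionary can be computed once and reused for every original record, for example as the weights of a weighted random selection.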

random_select(self, prob)
modify(self, row)

Modify a given row (or tuple of rows):

- loop over the number of fields to modify
- randomly select a field to modify
- draw a random number of modifications for the field
- randomly select a modification, e.g. mod = random_select(field_dict['prob_list']) selects a modification according to the PDF
- apply the modifications in the field
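The loop described above might look roughly like the following. This is a sketch under assumptions, not the body of `modify`: the function `apply_modifications`, the shape of its arguments, and the way the per-record and per-field caps interact are all hypothetical.

```python
import random


def apply_modifications(record, field_weights, modifiers,
                        max_per_record, max_per_field, rng=random):
    """Return a modified copy of `record` (a dict of field -> str).

    field_weights: dict of field -> selection weight (hypothetical layout).
    modifiers: dict of field -> list of (function, weight) pairs, where
               each function takes a string and returns a string.
    """
    modified = dict(record)  # leave the original record untouched
    fields = list(field_weights)
    weights = [field_weights[name] for name in fields]
    remaining = max_per_record
    while remaining > 0:
        # Randomly select the field to modify.
        field = rng.choices(fields, weights=weights, k=1)[0]
        # Random number of modifications, capped by the record budget.
        n_mods = min(rng.randint(1, max_per_field), remaining)
        functions, func_weights = zip(*modifiers[field])
        for _ in range(n_mods):
            # Select a modification according to its weight and apply it.
            func = rng.choices(functions, weights=func_weights, k=1)[0]
            modified[field] = func(modified[field])
        remaining -= n_mods
    return modified
```

In this sketch exactly `max_per_record` modifications are applied in total; a real implementation may stop earlier, for example when a selected modification cannot be applied to the value.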

_modify(self, field, value)

- determine whether to modify a field in a row
- select a modification
- apply the modification

character_range(self, data)

FEBRL defines the character type in the original configuration; here it should be derived from our own data model. For now, we brute-force the look-up of the data type each time. This also assumes everything is a string; in the future we likely want proper data types.

insert(self, field, data)

insert single character according to type

delete(self, field, data)
substitute(self, field, data)

substitute random character

misspell(self, field, data)

Dictionary of commonly misspelled words

transpose(self, field, data)

transpose two characters
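A transposition of two adjacent characters can be sketched as follows. The helper `transpose_at` is hypothetical and takes the position explicitly, so that the example is deterministic; in the module the position would presumably come from a random draw.

```python
def transpose_at(value, position):
    """Swap the characters at `position` and `position + 1`.

    Returns `value` unchanged if there is no character pair at that
    position. Hypothetical helper for illustration only.
    """
    if len(value) < 2 or not 0 <= position < len(value) - 1:
        return value
    return (value[:position]
            + value[position + 1]
            + value[position]
            + value[position + 2:])
```

For example, `transpose_at("smith", 1)` yields `"simth"`.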

replace(self, field, data)

replace

swap(self, field, data)

swap – randomly swap two words if field has at least two words

split(self, field, data)

split word

merge(self, field, data)

merge one or more words
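The three word-level modifications (swap, split, merge) can be illustrated with plain string operations. These are sketches with hypothetical names and simplifying assumptions: `swap_words` only swaps adjacent words, `split_word` takes the split position explicitly, and `merge_words` removes a single space.

```python
import random


def swap_words(value, rng=random):
    """Swap two adjacent words, if the value has at least two words."""
    words = value.split(" ")
    if len(words) < 2:
        return value
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def split_word(value, position):
    """Insert a space before the character at `position`."""
    if not 0 < position < len(value):
        return value
    return value[:position] + " " + value[position:]


def merge_words(value, rng=random):
    """Delete one randomly chosen space, merging two words."""
    spaces = [i for i, char in enumerate(value) if char == " "]
    if not spaces:
        return value
    i = rng.choice(spaces)
    return value[:i] + value[i + 1:]
```

For example, `split_word("newtown", 3)` gives `"new town"`, and `merge_words("new town")` gives `"newtown"`.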

nullify(self, field, data)

random null value

fill(self, field, data)

fill

select_position(self, input_string, len_offset)

dsgen::error_position. Randomly select the position of a character in a string at which to introduce an error.

FEBRL description: function that randomly calculates an error position within the given input string and returns the position as an integer number 0 or larger. The argument ‘len_offset’ can be set to an integer (e.g. -1, 0, or 1) and provides an offset relative to the string length of the maximal error position that can be returned. Errors are unlikely to appear at the beginning of a word, so a Gaussian distribution is used, with the mean being one position behind half the string length (sigma = 1.0), to simulate errors.
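The description can be turned into a short sketch. It follows the text above rather than the actual source: including `len_offset` in the mean, returning None for strings with no valid position, and the optional `rng` parameter are assumptions.

```python
import random


def select_position(input_string, len_offset, rng=random):
    """Draw an error position from a Gaussian centred just past the
    middle of the string, clamped to the range [0, max_position]."""
    length = len(input_string)
    max_position = length - 1 + len_offset
    if length == 0 or max_position < 0:
        return None  # assumption: no valid position to return
    # Mean is one position behind half the (offset-adjusted) length.
    mean = (length + len_offset) / 2.0 + 1.0
    position = int(round(rng.gauss(mean, 1.0)))
    return min(max(position, 0), max_position)
```

A `len_offset` of 1 allows a position just past the last character, which is useful for insertions; -1 keeps the position at least one character away from the end, which suits transpositions.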

error_character(self, input_char, char_range)

A function which returns a character created randomly. It uses row and column keyboard dictionaries. Taken directly from FEBRL dsgen.
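The idea of keyboard-based errors can be sketched as below. This is not the FEBRL code: the neighbour tables are deliberately partial and made up for illustration, the probabilities `row_prob` and `col_prob` are hypothetical, and the `char_range` values "digit", "alpha" and "alphanum" are an assumption.

```python
import random
import string

# Hypothetical, partial tables: keys adjacent in the same keyboard row
# or in the same keyboard column.
ROW_NEIGHBOURS = {"a": "s", "s": "ad", "d": "sf", "e": "wr", "r": "et", "t": "ry"}
COLUMN_NEIGHBOURS = {"a": "qz", "s": "wx", "d": "ec", "e": "d", "r": "f", "t": "g"}


def error_character(input_char, char_range, rng=random,
                    row_prob=0.4, col_prob=0.3):
    """Return a replacement character for `input_char`.

    With probability `row_prob` pick a row neighbour, with probability
    `col_prob` a column neighbour, otherwise (or if no neighbour is
    known) pick a random character from the given range.
    """
    draw = rng.random()
    if draw < row_prob:
        candidates = ROW_NEIGHBOURS.get(input_char, "")
    elif draw < row_prob + col_prob:
        candidates = COLUMN_NEIGHBOURS.get(input_char, "")
    else:
        candidates = ""
    if candidates:
        return rng.choice(candidates)
    # Fall back to a uniformly random character of the requested type.
    if char_range == "digit":
        pool = string.digits
    elif char_range == "alpha":
        pool = string.ascii_lowercase
    else:
        pool = string.ascii_lowercase + string.digits
    return rng.choice(pool)
```

Note that the fallback can return the input character itself; a caller that needs a guaranteed change would have to redraw in that case.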