batch

Module for mini-batching data.
class batch.DayBatcher(datafile, skiprow=0, delimiter=', ')

Gives batches from a csv file on a per-day basis. The first field is assumed to be the day field, and days are assumed to be sorted in ascending order (no out-of-order days in the csv file). Intended for batching data too large to fit into memory; written for a single pass over the data. A usage sketch follows the parameter list.

Parameters:
- datafile – (str) File to read lines from.
- skiprow – (int) How many lines to ignore at the beginning of the file (e.g. if the file has a header).
- delimiter – (str) The delimiter for the csv file.
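A minimal usage sketch. DayBatcher's methods are not listed above, so the next_batch() call below is an assumption modeled on the other one-pass batchers in this module, and the file name is a placeholder:

    import batch  # this module

    # Hypothetical csv whose first field is the (ascending) day.
    day_batcher = batch.DayBatcher('day_features.csv', skiprow=1, delimiter=',')

    # Assumption: next_batch() returns one day of rows as a numpy array
    # and None once the single pass over the file is complete.
    mat = day_batcher.next_batch()
    while mat is not None:
        print('day batch shape:', mat.shape)
        mat = day_batcher.next_batch()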
class batch.NormalizingReplayOnlineBatcher(datafile, batch_size, skipheader=False, delimiter=', ', size_check=None, refresh_ratio=0.5, ratio=(1, 0), pool_size=5, alpha=0.5, datastart_index=3)

For replay batching on the aggregate DNN model. Intended for batching data too large to fit into memory; written for a single pass over the data.

Parameters:
- datafile – File to read data from.
- batch_size – For mini-batching.
- skipheader – Use if there is a header on the data file.
- delimiter – Typically ' ' or ',' ; whatever delimits columns in the data file.
- size_check – Ignore this.
- refresh_ratio – The proportion of the new mini-batch to use in refreshing the pool.
- ratio – (tuple) (num_new, num_replay) The batcher will provide num_new new batches of data points and then num_replay batches of old data points from the pool.
- pool_size – The scale of the pool. The pool will hold pool_size * batch_size data points.
- alpha – (float) For the exponential running mean and variance. Lower alpha discounts older observations faster; the higher the alpha, the further the past is taken into consideration.
- datastart_index – The csv field where the real-valued features to be normalized begin. All features from datastart_index to the end of the line are assumed to be real valued.
new_batch(initialize=False)

Returns: Each time it is called, returns the next mini-batch of lines from the csv file as a numpy array, until the end of the datafile is reached. The final portion of the file is returned as an array smaller than the batch size. Returns None when no more data is available (one-pass batcher).
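A minimal loop using the documented new_batch() call; the file name and the training step are placeholders, and the purpose of the initialize flag is not documented above:

    import batch  # this module

    replay_batcher = batch.NormalizingReplayOnlineBatcher(
        'agg_features.csv',   # hypothetical feature file
        batch_size=256,
        skipheader=True,
        delimiter=',',
        ratio=(1, 1),         # one new batch, then one replay batch from the pool
        pool_size=5)

    data = replay_batcher.new_batch()   # new_batch(initialize=True) also exists; its use is not documented here
    while data is not None:
        # `data` is a numpy array of up to batch_size rows; run a training step on it here.
        data = replay_batcher.new_batch()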
class batch.OnlineBatcher(datafile, batch_size, skipheader=False, delimiter=', ', alpha=0.5, size_check=None, datastart_index=3, norm=False)

Gives batches from a csv file. Intended for batching data too large to fit into memory; written for a single pass over the data. A usage sketch follows the parameter list.

Parameters:
- datafile – (str) File to read lines from.
- batch_size – (int) Mini-batch size.
- skipheader – (bool) Whether or not to skip the first line of the file.
- delimiter – (str) Delimiter of the csv file.
- alpha – (float) For the exponential running mean and variance. Lower alpha discounts older observations faster; the higher the alpha, the further the past is taken into consideration.
- size_check – (int) Expected number of fields per csv line. Used to check for data corruption.
- datastart_index – (int) The csv field where the real-valued features to be normalized begin. All features from datastart_index to the end of the line are assumed to be real valued.
- norm – (bool) Whether or not to normalize the real-valued data features.
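A minimal usage sketch. OnlineBatcher's batching method is not listed above; the next_batch() call below is assumed to behave like NormalizingReplayOnlineBatcher.new_batch(), returning a numpy array per call and None at end of file, and the file name is a placeholder:

    import batch  # this module

    online = batch.OnlineBatcher('dnn_features.csv',   # hypothetical feature file
                                 batch_size=256,
                                 skipheader=True,
                                 delimiter=',',
                                 norm=True)            # normalize real-valued fields

    # Assumption: a next_batch()-style method returns up to batch_size rows per call
    # and None once the single pass over the file is complete.
    data = online.next_batch()
    while data is not None:
        # feed `data` to the model here
        data = online.next_batch()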
class batch.OnlineLMBatcher(datafile, initial_state_triple, batch_size=100, num_steps=5, delimiter=' ', skiprows=0)

For use with tiered_lm.py. The batcher keeps track of user states in the upper-tier RNN.

Parameters:
- datafile – (str) CSV file to read data from.
- initial_state_triple – (tuple) Initial state for users in the lstm.
- batch_size – (int) How many users in a mini-batch.
- num_steps – (int) How many log lines to get for each user.
- delimiter – (str) Delimiter for the csv file.
- skiprows – (int) How many rows to skip at the beginning of the csv file.
flush_batch()

Called when EOF is encountered. Returns either a first-stage flush batch or a second-stage flush batch.
get_batch_from_overflow()

Called at the beginning of each new batch to see if any users have a premade matrix of events ready.
get_state_triples()

Returns: (dict) Current states for all users in this mini-batch.
new_batch()

First checks user_batch_overflow to see if there are user batches ready for the new mini-batch. Then iterates over the file, adding each user's log lines to user_logs. When a user accumulates num_steps log lines, those num_steps log lines are added to the batch, or to user_batch_overflow if the user is already present in the batch. Once batch_size user batches have been collected, they are returned as a batch. At most one user batch per user is allowed in a mini-batch.
next_batch()

Returns: (tuple) (batch, state_triples), where batch is a three-way array and state_triples contains the current user states for the upper-tier lstm. At the beginning of the file, batch has shape (batch_size X num_steps X num_feats). At the end of the file, during the first stage of flushing, batch has shape (num_unique_users X num_steps X num_feats); during the second stage of flushing, batch has shape (min(batch_size X num_steps, num_unique_users) X num_feats).
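A sketch of the intended loop with tiered_lm.py, using only the methods documented here. The initial state triple, the hidden size, and the end-of-data check are assumptions; the RNN step itself is a placeholder:

    import numpy as np
    import batch  # this module

    hidden_size = 64  # assumed upper-tier hidden size
    # Hypothetical initial state triple (context_list, state_list, hidden_list);
    # the exact structure expected by tiered_lm.py is defined there.
    init_triple = (np.zeros((1, hidden_size)),
                   [np.zeros((1, hidden_size))],
                   [np.zeros((1, hidden_size))])

    lm_batcher = batch.OnlineLMBatcher('auth_log.txt', init_triple,
                                       batch_size=100, num_steps=5)

    out = lm_batcher.next_batch()
    while out:  # assumption: a falsy return signals end of data
        data, state_triples = out
        # ... run one training step of the tiered RNN on `data`, feeding
        # `state_triples` as the upper-tier initial states ...
        new_triples = state_triples  # placeholder for the updated states returned by the RNN
        lm_batcher.update_state_triples(new_triples)
        out = lm_batcher.next_batch()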
update_state_triples(new_triples)

Called after the training step of the RNN to save the current states of users.

Parameters:
- new_triples – (3-tuple) (context_list, state_list, hidden_list), where context_list is an np.array of shape (users X context_rnn_hidden_size), state_list is a list of np.arrays of shape (users X context_rnn_hidden_size), and hidden_list is the same type as state_list.
class batch.StateTrackingBatcher(pipe_name, specs, batchsize=21, num_steps=3, layers=10, replay_ratio=(1, 0), next_step=False, warm_start_state=None, delimiter=', ', skipheader=False, alpha=0.5, datastart_index=3, standardize=True)

Aggregate RNN batcher. Reads line by line from a file or pipe being fed csv-format features by a feature extractor. Keeps track of a window of user history for truncated backpropagation through time with a shifting set of users.

Parameters:
- pipe_name – (str) Name of the file or pipe to read from.
- specs – (dict) From a json specification of the purpose of fields in the csv input file (see docs for formatting).
- batchsize – (int) The maximum number of events in a mini-batch.
- num_steps – (int) The maximum number of events for any given user per mini-batch (effectively a window size).
- layers – (list) A list of the sizes of hidden layers for the stacked lstm.
- replay_ratio – (tuple) (num_new_batches, num_replay_batches) Describes the ratio of new batches to replay batches.
- next_step – (bool) False if autoencoding, True if predicting the next time step.
- warm_start_state – (tuple) Tuple of numpy arrays for warm-starting the RNN state of all users.
- delimiter – (str) Delimiter of the csv file.
- skipheader – (bool) Whether or not to skip the first line of the csv file.
- alpha – (float) For the exponential running mean and variance. Lower alpha discounts older observations faster; the higher the alpha, the further the past is taken into consideration.
- datastart_index – (int) The csv field where the real-valued features to be normalized begin. All features from datastart_index to the end of the line are assumed to be real valued.
- standardize – (bool) Whether or not to standardize the data using the running mean and variance.
avg_state()

Returns: (list) The average of all the most recent states for each batch entity.
blank_slate()

Creates and returns a zero state for one time step for a single user.

Returns: (list) A list, one entry per layer, of 1 X state_size numpy arrays.
event_padding_random(rowtext)

Creates and returns a 'non-event' with random entries, for event-history padding of a newly encountered user.

Parameters:
- rowtext – (int) A log line for the user.

Returns: (np.array) A random event with the user's meta data attached.
get_eval_indices(key_map)

Parameters:
- key_map – (dict)

Returns: (list) Data structure which keeps track of where to evaluate the RNN.
get_events()

Returns: (np.array) Three-way array of shape (num_time_steps, num_users, num_features).
get_new_events()

Gets new events when not replaying old events.

Returns: (int) 1 if not EOF, 0 if EOF.
get_states()

Fetches the saved RNN states of users in the current mini-batch.

Returns: (list) List of user states.
new_batch()

Returns: (dict) A dictionary with keys matching placeholders and values of numpy matrices. Entries are as follows:
- states – A structured list of numpy arrays to feed as the initial state for the next round of training.
- inputs – A three-way numpy array of dimensions (timestep X user X (feature_size + target_size + meta_size)), where meta_size is the number of fields not used in training (user_id, timestamp, etc.).
- eval_indices – A num_time_steps long list of numpy vectors containing the indices of hidden-state outputs to evaluate for each time step in this batch of training.
- Other entries are split from the 'inputs' matrix using the specs dictionary, which describes the indices of matrices to extract.
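A minimal sketch of how the returned dictionary might be consumed; the spec file, pipe name, and layer sizes are placeholders, and only the keys documented above are relied on:

    import json
    import batch  # this module

    # Hypothetical json spec describing the csv fields (see docs for the real format).
    with open('agg_spec.json') as spec_file:
        specs = json.load(spec_file)

    tracker = batch.StateTrackingBatcher('features.pipe', specs,
                                         batchsize=64, num_steps=3,
                                         layers=[128, 128])

    datadict = tracker.new_batch()
    # 'states', 'inputs', and 'eval_indices' are documented keys; other keys
    # come from the spec-driven split of the 'inputs' matrix.
    states = datadict['states']
    inputs = datadict['inputs']              # (timestep X user X features) array
    eval_indices = datadict['eval_indices']
    # ... feed these to the aggregate RNN's placeholders for one training step ...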
next_batch()

Returns: (dict) A dictionary of numpy arrays obtained by splitting a 3d (numsteps X mb_size X num_csv_fields) array into subarrays, with keys pertaining to their use in training.
package_data(batch)

Parameters:
- batch – (np.array) An assembled three-way array of data collected from the stream, with shape (num_time_steps, num_users, num_features).

Returns: (dict) A dictionary of numpy arrays of the diced three-way feature array.
replay

Whether or not a replay batch was just processed.
replay_batch()

Returns: (dict) A dictionary with keys matching placeholders and values of numpy matrices. Entries are as follows:
- states – A structured list of numpy arrays to feed as the initial state for the next round of training.
- inputs – A three-way numpy array of dimensions (timestep X user X (feature_size + target_size + meta_size)), where meta_size is the number of fields not used in training (user_id, timestamp, etc.).
- eval_indices – A num_time_steps long list of numpy vectors containing the indices of hidden-state outputs to evaluate for each time step in this batch of training.
- Other entries are split from the 'inputs' matrix using the specs dictionary, which describes the indices of matrices to extract.
batch.split_batch(batch, spec)

Splits a numpy matrix into separate data fields according to the spec dictionary.

Parameters:
- batch – (np.array) Array with shape (batch_size, num_features) of data collected from the stream.
- spec – (dict) A python dict containing information about which indices in the incoming data point correspond to which features. Entries for continuous features list the indices for the feature, while entries for categorical features contain a dictionary {'index': i, 'num_classes': c}, where i and c are the index into the data point and the number of distinct categories for the category in question.

Returns: (dict) A dictionary of numpy arrays of the split 2d feature array.
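An illustrative call on a toy layout. The spec key names below ('counts', 'event_type') are hypothetical, and it is assumed that the returned dictionary is keyed by the spec keys; only the continuous-index-list and {'index', 'num_classes'} conventions come from the description above:

    import numpy as np
    import batch  # this module

    # Toy rows: two continuous fields followed by one categorical field.
    rows = np.array([[0.1, 2.3, 1.0],
                     [0.4, 1.7, 3.0]])

    # Continuous features are listed by index; categorical features give
    # {'index': i, 'num_classes': c}.
    spec = {'counts': [0, 1],
            'event_type': {'index': 2, 'num_classes': 5}}

    split = batch.split_batch(rows, spec)
    for name, arr in split.items():
        print(name, arr.shape)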