batch

Module for mini-batching data.
class batch.DayBatcher(datafile, skiprow=0, delimiter=', ')

Gives batches from a csv file on a per-day basis. The first field is assumed to be the day field, and days are assumed to be sorted in ascending order (no out-of-order days in the csv file). Intended for batching data too large to fit into memory; written for a single pass over the data. A usage sketch follows the parameter list.

Parameters:
- datafile – (str) File to read lines from.
- skiprow – (int) How many lines to ignore at the beginning of the file (e.g. if the file has a header).
- delimiter – (str) The delimiter for the csv file.
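A minimal usage sketch. DayBatcher's methods are not listed above, so the next_batch() call below is an assumption modeled on the other one-pass batchers in this module, and the file name is a placeholder:

    import batch  # this module

    # Hypothetical csv whose first field is the (ascending) day.
    day_batcher = batch.DayBatcher('day_features.csv', skiprow=1, delimiter=',')

    # Assumption: next_batch() returns one day of rows as a numpy array
    # and None once the single pass over the file is complete.
    mat = day_batcher.next_batch()
    while mat is not None:
        print('day batch shape:', mat.shape)
        mat = day_batcher.next_batch()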
class batch.NormalizingReplayOnlineBatcher(datafile, batch_size, skipheader=False, delimiter=', ', size_check=None, refresh_ratio=0.5, ratio=(1, 0), pool_size=5, alpha=0.5, datastart_index=3)

For replay batching on the aggregate DNN model. Intended for batching data too large to fit into memory; written for a single pass over the data.

Parameters:
- datafile – File to read data from.
- batch_size – For mini-batching.
- skipheader – Use if there is a header on the data file.
- delimiter – Typically ' ' or ',' ; whatever delimits columns in the data file.
- size_check – Ignore this.
- refresh_ratio – The proportion of the new mini-batch to use in refreshing the pool.
- ratio – (tuple) (num_new, num_replay) The batcher will provide num_new new batches of data points and then num_replay batches of old data points from the pool.
- pool_size – The scale of the pool. The pool will hold pool_size * batch_size data points.
- alpha – (float) For the exponential running mean and variance. Lower alpha discounts older observations faster; the higher the alpha, the further the past is taken into consideration.
- datastart_index – The csv field where the real-valued features to be normalized begin. All features from datastart_index to the end of the line are assumed to be real valued.
new_batch(initialize=False)

Returns: Each time it is called, returns the next mini-batch of lines from the csv file as a numpy array, until the end of the datafile is reached. The final portion of the file is returned as an array smaller than the batch size. Returns None when no more data is available (one-pass batcher).
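A minimal loop using the documented new_batch() call; the file name and the training step are placeholders, and the purpose of the initialize flag is not documented above:

    import batch  # this module

    replay_batcher = batch.NormalizingReplayOnlineBatcher(
        'agg_features.csv',   # hypothetical feature file
        batch_size=256,
        skipheader=True,
        delimiter=',',
        ratio=(1, 1),         # one new batch, then one replay batch from the pool
        pool_size=5)

    data = replay_batcher.new_batch()   # new_batch(initialize=True) also exists; its use is not documented here
    while data is not None:
        # `data` is a numpy array of up to batch_size rows; run a training step on it here.
        data = replay_batcher.new_batch()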
class batch.OnlineBatcher(datafile, batch_size, skipheader=False, delimiter=', ', alpha=0.5, size_check=None, datastart_index=3, norm=False)

Gives batches from a csv file. Intended for batching data too large to fit into memory; written for a single pass over the data. A usage sketch follows the parameter list.

Parameters:
- datafile – (str) File to read lines from.
- batch_size – (int) Mini-batch size.
- skipheader – (bool) Whether or not to skip the first line of the file.
- delimiter – (str) Delimiter of the csv file.
- alpha – (float) For the exponential running mean and variance. Lower alpha discounts older observations faster; the higher the alpha, the further the past is taken into consideration.
- size_check – (int) Expected number of fields per csv line. Used to check for data corruption.
- datastart_index – (int) The csv field where the real-valued features to be normalized begin. All features from datastart_index to the end of the line are assumed to be real valued.
- norm – (bool) Whether or not to normalize the real-valued data features.
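A minimal usage sketch. OnlineBatcher's batching method is not listed above; the next_batch() call below is assumed to behave like NormalizingReplayOnlineBatcher.new_batch(), returning a numpy array per call and None at end of file, and the file name is a placeholder:

    import batch  # this module

    online = batch.OnlineBatcher('dnn_features.csv',   # hypothetical feature file
                                 batch_size=256,
                                 skipheader=True,
                                 delimiter=',',
                                 norm=True)            # normalize real-valued fields

    # Assumption: a next_batch()-style method returns up to batch_size rows per call
    # and None once the single pass over the file is complete.
    data = online.next_batch()
    while data is not None:
        # feed `data` to the model here
        data = online.next_batch()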
class batch.OnlineLMBatcher(datafile, initial_state_triple, batch_size=100, num_steps=5, delimiter=' ', skiprows=0)

For use with tiered_lm.py. The batcher keeps track of user states in the upper-tier RNN.

Parameters:
- datafile – (str) CSV file to read data from.
- initial_state_triple – (tuple) Initial state for users in the lstm.
- batch_size – (int) How many users in a mini-batch.
- num_steps – (int) How many log lines to get for each user.
- delimiter – (str) Delimiter for the csv file.
- skiprows – (int) How many rows to skip at the beginning of the csv file.
flush_batch()

Called when EOF is encountered. Returns either a first-stage flush batch or a second-stage flush batch.
get_batch_from_overflow()

Called at the beginning of each new batch to see if any users have a premade matrix of events ready.
get_state_triples()

Returns: (dict) Current states for all users in this mini-batch.
new_batch()

First checks user_batch_overflow to see if there are user batches ready for the new mini-batch. Then iterates over the file, adding each user's log lines to user_logs. When a user accumulates num_steps log lines, those num_steps log lines are added to the batch, or to user_batch_overflow if the user is already present in the batch. Once batch_size user batches have been collected, they are returned as a batch. At most one user batch per user is allowed in a mini-batch.
next_batch()

Returns: (tuple) (batch, state_triples), where batch is a three-way array and state_triples contains the current user states for the upper-tier lstm. At the beginning of the file, batch has shape (batch_size X num_steps X num_feats). At the end of the file, during the first stage of flushing, batch has shape (num_unique_users X num_steps X num_feats); during the second stage of flushing, batch has shape (min(batch_size X num_steps, num_unique_users) X num_feats).
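A sketch of the intended loop with tiered_lm.py, using only the methods documented here. The initial state triple, the hidden size, and the end-of-data check are assumptions; the RNN step itself is a placeholder:

    import numpy as np
    import batch  # this module

    hidden_size = 64  # assumed upper-tier hidden size
    # Hypothetical initial state triple (context_list, state_list, hidden_list);
    # the exact structure expected by tiered_lm.py is defined there.
    init_triple = (np.zeros((1, hidden_size)),
                   [np.zeros((1, hidden_size))],
                   [np.zeros((1, hidden_size))])

    lm_batcher = batch.OnlineLMBatcher('auth_log.txt', init_triple,
                                       batch_size=100, num_steps=5)

    out = lm_batcher.next_batch()
    while out:  # assumption: a falsy return signals end of data
        data, state_triples = out
        # ... run one training step of the tiered RNN on `data`, feeding
        # `state_triples` as the upper-tier initial states ...
        new_triples = state_triples  # placeholder for the updated states returned by the RNN
        lm_batcher.update_state_triples(new_triples)
        out = lm_batcher.next_batch()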
update_state_triples(new_triples)

Called after the training step of the RNN to save the current states of users.

Parameters:
- new_triples – (3-tuple) (context_list, state_list, hidden_list), where context_list is an np.array of shape (users X context_rnn_hidden_size), state_list is a list of np.arrays of shape (users X context_rnn_hidden_size), and hidden_list is the same type as state_list.
class batch.StateTrackingBatcher(pipe_name, specs, batchsize=21, num_steps=3, layers=10, replay_ratio=(1, 0), next_step=False, warm_start_state=None, delimiter=', ', skipheader=False, alpha=0.5, datastart_index=3, standardize=True)

Aggregate RNN batcher. Reads line by line from a file or pipe being fed csv-format features by a feature extractor. Keeps track of a window of user history for truncated backpropagation through time with a shifting set of users.

Parameters:
- pipe_name – (str) Name of the file or pipe to read from.
- specs – (dict) From a json specification of the purpose of fields in the csv input file (see docs for formatting).
- batchsize – (int) The maximum number of events in a mini-batch.
- num_steps – (int) The maximum number of events for any given user per mini-batch (effectively a window size).
- layers – (list) A list of the sizes of hidden layers for the stacked lstm.
- replay_ratio – (tuple) (num_new_batches, num_replay_batches) Describes the ratio of new batches to replay batches.
- next_step – (bool) False if autoencoding, True if predicting the next time step.
- warm_start_state – (tuple) Tuple of numpy arrays for warm-starting the RNN state of all users.
- delimiter – (str) Delimiter of the csv file.
- skipheader – (bool) Whether or not to skip the first line of the csv file.
- alpha – (float) For the exponential running mean and variance. Lower alpha discounts older observations faster; the higher the alpha, the further the past is taken into consideration.
- datastart_index – (int) The csv field where the real-valued features to be normalized begin. All features from datastart_index to the end of the line are assumed to be real valued.
- standardize – (bool) Whether or not to standardize the data using the running mean and variance.
avg_state()

Returns: (list) The average of all the most recent states for each batch entity.
blank_slate()

Creates and returns a zero state for one time step for a single user.

Returns: (list) A list, one entry per layer, of 1 X state_size numpy arrays.
event_padding_random(rowtext)

Creates and returns a 'non-event' with random entries, for event-history padding of a newly encountered user.

Parameters:
- rowtext – (int) A log line for the user.

Returns: (np.array) A random event with the user's meta data attached.
get_eval_indices(key_map)

Parameters:
- key_map – (dict)

Returns: (list) Data structure which keeps track of where to evaluate the RNN.
get_events()

Returns: (np.array) Three-way array of shape (num_time_steps, num_users, num_features).
get_new_events()

Gets new events when not replaying old events.

Returns: (int) 1 if not EOF, 0 if EOF.
get_states()

Fetches the saved RNN states of users in the current mini-batch.

Returns: (list) List of user states.
new_batch()

Returns: (dict) A dictionary with keys matching placeholders and values of numpy matrices. Entries are as follows:
- states – A structured list of numpy arrays to feed as the initial state for the next round of training.
- inputs – A three-way numpy array of dimensions (timestep X user X (feature_size + target_size + meta_size)), where meta_size is the number of fields not used in training (user_id, timestamp, etc.).
- eval_indices – A num_time_steps long list of numpy vectors containing the indices of hidden-state outputs to evaluate for each time step in this batch of training.
- Other entries are split from the 'inputs' matrix using the specs dictionary, which describes the indices of matrices to extract.
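A minimal sketch of how the returned dictionary might be consumed; the spec file, pipe name, and layer sizes are placeholders, and only the keys documented above are relied on:

    import json
    import batch  # this module

    # Hypothetical json spec describing the csv fields (see docs for the real format).
    with open('agg_spec.json') as spec_file:
        specs = json.load(spec_file)

    tracker = batch.StateTrackingBatcher('features.pipe', specs,
                                         batchsize=64, num_steps=3,
                                         layers=[128, 128])

    datadict = tracker.new_batch()
    # 'states', 'inputs', and 'eval_indices' are documented keys; other keys
    # come from the spec-driven split of the 'inputs' matrix.
    states = datadict['states']
    inputs = datadict['inputs']              # (timestep X user X features) array
    eval_indices = datadict['eval_indices']
    # ... feed these to the aggregate RNN's placeholders for one training step ...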
next_batch()

Returns: (dict) A dictionary of numpy arrays obtained by splitting a 3d (numsteps X mb_size X num_csv_fields) array into subarrays, with keys pertaining to their use in training.
package_data(batch)

Parameters:
- batch – (np.array) An assembled three-way array of data collected from the stream, with shape (num_time_steps, num_users, num_features).

Returns: (dict) A dictionary of numpy arrays of the diced three-way feature array.
replay

Whether or not a replay batch was just processed.
replay_batch()

Returns: (dict) A dictionary with keys matching placeholders and values of numpy matrices. Entries are as follows:
- states – A structured list of numpy arrays to feed as the initial state for the next round of training.
- inputs – A three-way numpy array of dimensions (timestep X user X (feature_size + target_size + meta_size)), where meta_size is the number of fields not used in training (user_id, timestamp, etc.).
- eval_indices – A num_time_steps long list of numpy vectors containing the indices of hidden-state outputs to evaluate for each time step in this batch of training.
- Other entries are split from the 'inputs' matrix using the specs dictionary, which describes the indices of matrices to extract.
batch.split_batch(batch, spec)

Splits a numpy matrix into separate data fields according to the spec dictionary.

Parameters:
- batch – (np.array) Array with shape (batch_size, num_features) of data collected from the stream.
- spec – (dict) A python dict containing information about which indices in the incoming data point correspond to which features. Entries for continuous features list the indices for the feature, while entries for categorical features contain a dictionary {'index': i, 'num_classes': c}, where i and c are the index into the data point and the number of distinct categories for the category in question.

Returns: (dict) A dictionary of numpy arrays of the split 2d feature array.
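An illustrative call on a toy layout. The spec key names below ('counts', 'event_type') are hypothetical, and it is assumed that the returned dictionary is keyed by the spec keys; only the continuous-index-list and {'index', 'num_classes'} conventions come from the description above:

    import numpy as np
    import batch  # this module

    # Toy rows: two continuous fields followed by one categorical field.
    rows = np.array([[0.1, 2.3, 1.0],
                     [0.4, 1.7, 3.0]])

    # Continuous features are listed by index; categorical features give
    # {'index': i, 'num_classes': c}.
    spec = {'counts': [0, 1],
            'event_type': {'index': 2, 'num_classes': 5}}

    split = batch.split_batch(rows, spec)
    for name, arr in split.items():
        print(name, arr.shape)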