Aggregate Multivariate Autoencoders

images/system.png

dnn_agg

usage: dnn_agg [-h] [-learnrate LEARNRATE] [-numlayers NUMLAYERS]
               [-hiddensize HIDDENSIZE] [-mb MB] [-act ACT] [-norm NORM]
               [-keep_prob KEEP_PROB] [-debug] [-dist DIST]
               [-maxbadcount MAXBADCOUNT] [-embedding_ratio EMBEDDING_RATIO]
               [-min_embed MIN_EMBED] [-max_embed MAX_EMBED]
               [-verbose VERBOSE] [-variance_floor VARIANCE_FLOOR]
               [-initrange INITRANGE] [-decay_rate DECAY_RATE]
               [-decay_steps DECAY_STEPS] [-alpha ALPHA] [-input_norm]
               [-refresh_ratio REFRESH_RATIO] [-ratio RATIO [RATIO ...]]
               [-pool_size POOL_SIZE] [-random_seed RANDOM_SEED] [-replay]
               [-delimiter DELIMITER] [-skipheader]
               datafile results_folder dataspecs
Positional arguments:
datafile The CSV data file for our unsupervised training. Fields: day, user, redcount, [count1, count2, …, count408]
results_folder The folder to print results to.
dataspecs Filename of json file with specification of feature indices.
Options:
-learnrate=0.001
 Step size for gradient descent.
-numlayers=3 Number of hidden layers.
-hiddensize=20 Number of hidden units in hidden layers.
-mb=256 The mini batch size for stochastic gradient descent.
-act=tanh Activation function. May be “tanh” or “relu”.
-norm=none Normalization type. Can be “layer”, “batch”, or “none”.
-keep_prob Fraction of nodes to keep in dropout layers.
-debug=False Use this flag to print feed dictionary contents and dimensions.
-dist=diag “diag” or “ident”. Selects whether the multivariate Gaussian is modeled with an identity covariance matrix or an arbitrary diagonal covariance matrix.
-maxbadcount=20
 Threshold for early stopping.
-embedding_ratio=0.75
 For determining size of embeddings for categorical features.
-min_embed=2 Minimum size for embeddings of categorical features.
-max_embed=1000
 Maximum size for embeddings of categorical features.
-verbose=0 1 to print full loss contributors.
-variance_floor=0.01
 Parameter for diagonal MVN learning.
-initrange=1.0 For weight initialization.
-decay_rate=1.0
 Exponential learn rate decay for gradient descent.
-decay_steps=20
 Number of updates per learn rate decay period.
-alpha=0.99 Parameter for exponential moving average and variance
-input_norm=False
 Use this flag for online normalization
-refresh_ratio=0.5
 The proportion of the new mini-batch to use in refreshing the pool.
-ratio=[1, 1] (tuple) (x, y): x is the number of new batches of data points and y is the number of old data points.
-pool_size=9000
 The scale of the pool.
-random_seed For reproducible results
-replay=False Use this flag for replay learning
-delimiter= Delimiter for input text file. Use ‘ ‘ (a space) for the day-shuffled CERT data.
-skipheader=False
 Whether or not to skip first line of input file.
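
How -learnrate, -decay_rate, and -decay_steps interact can be illustrated with a small sketch. It assumes the common exponential decay schedule, learnrate * decay_rate ^ (step / decay_steps); this document does not confirm that dnn_agg uses exactly this formula, and the function name decayed_learnrate is hypothetical.

```python
def decayed_learnrate(learnrate, decay_rate, decay_steps, step):
    """Learning rate after `step` updates.

    ASSUMPTION: the standard exponential decay schedule,
    learnrate * decay_rate ** (step / decay_steps).
    The actual schedule used by dnn_agg is not stated in this document.
    """
    return learnrate * decay_rate ** (step / decay_steps)


# With the documented defaults (-learnrate=0.001, -decay_rate=1.0,
# -decay_steps=20) the rate stays constant, since 1.0 ** x == 1.0.
print(decayed_learnrate(0.001, 1.0, 20, step=1000))

# A hypothetical -decay_rate=0.5 would halve the rate every 20 updates.
print(decayed_learnrate(0.001, 0.5, 20, step=40))
```

Under this assumption, the default -decay_rate=1.0 disables decay entirely, so -decay_steps only matters once -decay_rate is set below 1.0.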

lstm_agg

Cert Aggregate Feature LSTM.

usage: lstm_agg [-h] [-num_steps NUM_STEPS] [-learnrate LEARNRATE]
                [-initrange INITRANGE] [-numlayers NUMLAYERS]
                [-hiddensize HIDDENSIZE] [-verbose VERBOSE] [-mb MB]
                [-embedding_ratio EMBEDDING_RATIO]
                [-min_embedding MIN_EMBEDDING] [-max_embedding MAX_EMBEDDING]
                [-use_next_time_step USE_NEXT_TIME_STEP] [-act ACT]
                [-dist DIST] [-variance_floor VARIANCE_FLOOR] [-norm NORM]
                [-keep_prob KEEP_PROB] [-debug] [-random_seed RANDOM_SEED]
                [-replay_ratio REPLAY_RATIO [REPLAY_RATIO ...]]
                [-delimiter DELIMITER] [-maxbadcount MAXBADCOUNT] [-residual]
                [-skipheader] [-alpha ALPHA] [-input_norm]
                datafile results_folder dataspecs
Positional arguments:
datafile Path to data file.
results_folder Folder where to write losses.
dataspecs Name of json file with specs for splitting data.
Options:
-num_steps=5 Number of time steps for truncated backpropagation.
-learnrate=0.01
 Step size for gradient descent.
-initrange=0.0001
 For initialization of weights.
-numlayers=3 Number of hidden layers
-hiddensize=3 Number of hidden nodes per layer
-verbose=1 Level to print training progress and/or other details.
-mb=21 The max number of events in the structured mini_batch.
-embedding_ratio=0.5
 Embedding_ratio * num_classes = embedding size.
-min_embedding=5
 Minimum embedding size.
-max_embedding=500
 Maximum embedding size.
-use_next_time_step=0
 Whether to predict next time step or autoencode.
-act=relu A string denoting the activation function.
-dist=diag A string denoting the multivariate normal type for prediction.
-variance_floor=0.1
 Float to derive variance floor.
-norm “layer” for layer normalization. Default is None.
-keep_prob Fraction of nodes to keep in dropout layers.
-debug=False Use this flag to print feed dictionary contents and dimensions.
-random_seed=5 Random seed for reproducible experiments.
-replay_ratio=(1, 0)
 Undocumented
-delimiter= Delimiter for input text file. Use ‘ ‘ (a space) for the day-shuffled CERT data.
-maxbadcount=100
 For stopping training when loss does not improve.
-residual=False
 Flag for calculating residual (difference between sequential actions) instead of next action
-skipheader=False
 Whether or not to skip first line of input file.
-alpha=0.99 Parameter for exponential moving average and variance
-input_norm=False
 Use this flag for online normalization
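
The embedding options of lstm_agg (-embedding_ratio, -min_embedding, -max_embedding) can be read as a simple sizing rule: embedding_ratio * num_classes, kept within the minimum and maximum. The sketch below shows one way that rule could work. Truncating to an integer and the order of clipping are assumptions, and embedding_size is a hypothetical helper, not a function from the tool.

```python
def embedding_size(num_classes, embedding_ratio=0.5,
                   min_embedding=5, max_embedding=500):
    """Embedding width for a categorical feature with `num_classes` values.

    Defaults mirror the documented lstm_agg defaults.
    ASSUMPTION: the product is truncated to an int and then clipped
    to the range [min_embedding, max_embedding].
    """
    size = int(embedding_ratio * num_classes)
    return max(min_embedding, min(max_embedding, size))


# Small vocabularies are raised to the minimum, large ones capped.
for n in (4, 100, 5000):
    print(n, embedding_size(n))
```

With these defaults, a feature with 4 classes would get the minimum width, a feature with 100 classes would get half its class count, and a feature with 5000 classes would be capped at the maximum.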