Anomaly Detection using Gaussian (Normal) Distribution
For training and evaluating the Gaussian distribution algorithms, we split the data into train, cross-validation, and test sets using the ratios below.
1) Train: 60% of the genuine records (y=0), no fraud records (y=1). Since the model is fit only on normal data, the training set needs no labels.
2) CV: 20% of the genuine records (y=0), 50% of the fraud records (y=1)
3) Test: the remaining 20% of the genuine records (y=0) and the remaining 50% of the fraud records (y=1)
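The split above can be sketched as follows. This is a minimal illustration, assuming `X` is the feature matrix and `y` the label vector (0 = genuine, 1 = fraud); the function name and seed are hypothetical.

```python
import numpy as np

def split_for_anomaly_detection(X, y, seed=0):
    rng = np.random.default_rng(seed)
    genuine = rng.permutation(np.where(y == 0)[0])
    fraud = rng.permutation(np.where(y == 1)[0])
    n60 = int(0.6 * len(genuine))   # 60% of genuine -> train
    n80 = int(0.8 * len(genuine))   # next 20% of genuine -> CV
    n50 = len(fraud) // 2           # 50% of fraud -> CV, rest -> test
    X_train = X[genuine[:n60]]      # unlabeled: genuine only
    cv_idx = np.concatenate([genuine[n60:n80], fraud[:n50]])
    test_idx = np.concatenate([genuine[n80:], fraud[n50:]])
    return X_train, X[cv_idx], y[cv_idx], X[test_idx], y[test_idx]
```

Shuffling before slicing keeps each subset representative of the original data.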
Procedure for anomaly detection:
1) Fit the model p(x) on the training set.
2) On cross validation/test data, predict
y = 1 if p(x) < epsilon (anomaly)
y = 0 if p(x) >= epsilon (normal)
3) We use the cross-validation set to choose the parameter epsilon, evaluating candidates with precision, recall, and the F1-score.
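Step 3 can be sketched as a scan over candidate epsilon values, keeping the one with the best F1-score on the cross-validation set. This assumes `p_cv` holds the densities p(x) for the CV examples and `y_cv` their labels; the function name is hypothetical.

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = (p_cv < eps).astype(int)   # y = 1 when p(x) < epsilon
        tp = np.sum((preds == 1) & (y_cv == 1))
        fp = np.sum((preds == 1) & (y_cv == 0))
        fn = np.sum((preds == 0) & (y_cv == 1))
        if tp == 0:
            continue                       # precision/recall undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```

F1 is preferred over accuracy here because fraud is rare, so a classifier that always predicts y=0 would still score a high accuracy.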
We could use a couple of Gaussian distribution models for anomaly detection.
1) Gaussian (Normal) Distribution – the normal distribution is parametrized in terms of the mean and the variance.
2) Multivariate Normal Distribution – The probability density function for multivariate_normal is parametrized in terms of the mean and the covariance.
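The contrast between the two parametrizations can be sketched on toy 2-D data using `scipy.stats` (the data here is made up for illustration):

```python
import numpy as np
from scipy import stats

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 7.0]])

# 1) Independent univariate Gaussians: one mean and variance per feature;
#    p(x) is the product of the per-feature densities.
mu, var = X.mean(axis=0), X.var(axis=0)
p_uni = stats.norm.pdf(X, loc=mu, scale=np.sqrt(var)).prod(axis=1)

# 2) Multivariate normal: mean vector plus full covariance matrix,
#    so correlations between features are modelled automatically.
cov = np.cov(X, rowvar=False)
p_multi = stats.multivariate_normal.pdf(X, mean=mu, cov=cov)
```

The univariate model treats the features as independent; the multivariate model pays for the covariance matrix with extra computation but needs no hand-crafted interaction features.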
1) For this dataset, we are going to use the multivariate normal probability density function, since it automatically captures the relationships (correlations) between variables when computing probabilities, so we do not need to derive new features. Because the features are the outcome of PCA, it is difficult for us to interpret the relationships between them.
2) However, the multivariate normal probability density function is computationally expensive compared to the univariate Gaussian probability density function. On very large datasets, we might prefer the univariate Gaussian density to speed up the process, and do feature engineering based on subject-matter expertise instead.
1) The features we choose for these algorithms should be approximately normally distributed. Otherwise we need to transform them toward normality using log, sqrt, etc.
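A log transform of a right-skewed feature is a common example of such a fix. The sketch below uses synthetic exponential data for illustration; the `skewness` helper is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=2.0, size=10_000)  # heavily right-skewed

# log1p handles zeros safely; sqrt is a milder alternative.
transformed = np.log1p(skewed)

def skewness(x):
    # Crude sample skewness: third standardized moment.
    return np.mean(((x - x.mean()) / x.std()) ** 3)
```

After the transform, the feature's skewness should move much closer to zero, which is what the Gaussian density assumes.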
2) Choose features that might take on unusually large or small values in the event of an anomaly. We looked at the distributions at the beginning using distplot, so it is wise to choose features whose distribution for fraud records is clearly different from that for genuine records.
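Beyond eyeballing the distplots, the visual comparison can be made quantitative. The sketch below ranks features by the two-sample Kolmogorov–Smirnov statistic, a technique not used in the original analysis but one way to score how differently a feature is distributed across the two classes; the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def rank_features_by_separation(X, y):
    # Higher KS statistic = more different fraud vs. genuine distributions.
    scores = []
    for j in range(X.shape[1]):
        ks = stats.ks_2samp(X[y == 0, j], X[y == 1, j]).statistic
        scores.append((j, ks))
    return sorted(scores, key=lambda t: -t[1])
```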