A Complete Anomaly Detection Algorithm From Scratch in Python

The Formulas and Process

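Don’t be confused by the Σ symbol in the multivariate Gaussian formula! Here it is not a summation sign: it denotes the covariance matrix, which in this article is a diagonal matrix with the variance of each feature on its diagonal.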
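
For reference, the quantities that the code in this article computes can be written out as follows. The notation is an assumption chosen to match the variable names in the code: m is the number of examples, k is the number of features, and x^(i) is the i-th example.

```latex
\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad
\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j\right)^2, \qquad
\Sigma = \operatorname{diag}\left(\sigma_1^2, \dots, \sigma_k^2\right)

p(x) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}}
       \exp\left(-\frac{1}{2}\,(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right)
```

In the first two expressions the sum really is a summation over the examples; only the capital Σ in the last two denotes the matrix.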

An F1 score of 1 is perfect, and 0 is the worst possible score.
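
In terms of the true positives (tp), false positives (fp), and false negatives (fn) counted later in the code, where a "positive" is an example flagged as anomalous, the score is built up like this:

```latex
\text{prec} = \frac{tp}{tp + fp}, \qquad
\text{rec} = \frac{tp}{tp + fn}, \qquad
F_1 = \frac{2 \cdot \text{prec} \cdot \text{rec}}{\text{prec} + \text{rec}}
```

Note that the score is undefined when tp is 0, because precision and recall are then both 0 (or themselves undefined); a common convention is to treat that case as a score of 0.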

Anomaly Detection Algorithm

 
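import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # used for the scatter plot

# The sheet has no header row, so the columns are named 0 and 1
df = pd.read_excel('ex8data1.xlsx', sheet_name='X', header=None)
df.head()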

plt.figure()
plt.scatter(df[0], df[1])
plt.show()

# Mean of each feature
m = len(df)
s = np.sum(df, axis=0)
mu = s/m
mu

0    14.112226
1    14.997711
dtype: float64

# Variance of each feature (divides by m, not m - 1)
vr = np.sum((df - mu)**2, axis=0)
variance = vr/m
variance

0    1.832631
1    1.709745
dtype: float64

# Put the variances on the diagonal of a matrix
var_dia = np.diag(variance)
var_dia

array([[1.83263141, 0.        ],
       [0.        , 1.70974533]])

# Probability of each training example under the Gaussian
k = len(mu)
X = df - mu
p = 1/((2*np.pi)**(k/2)*(np.linalg.det(var_dia)**0.5))* np.exp(-0.5* np.sum(X @ np.linalg.pinv(var_dia) * X,axis=1))
p
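
Because the covariance matrix is diagonal, the multivariate density is the same as the product of one-dimensional Gaussian densities, one per feature. The sketch below checks this numerically on synthetic data; the data and the names `p_multi` and `p_uni` are assumptions made for illustration and are not part of the original dataset.

```python
import numpy as np

# Synthetic stand-in for a two-column data set (an assumption, not ex8data1).
rng = np.random.default_rng(0)
data = rng.normal(loc=[14.0, 15.0], scale=[1.3, 1.3], size=(50, 2))

mu = data.mean(axis=0)
variance = data.var(axis=0)  # population variance (divides by m)

# Multivariate form with a diagonal covariance matrix.
k = data.shape[1]
cov = np.diag(variance)
centered = data - mu
p_multi = (1 / ((2 * np.pi) ** (k / 2) * np.linalg.det(cov) ** 0.5)
           * np.exp(-0.5 * np.sum(centered @ np.linalg.pinv(cov) * centered, axis=1)))

# Product of one-dimensional Gaussian densities, one per feature.
p_uni = np.prod(
    np.exp(-centered ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance),
    axis=1,
)

print(np.allclose(p_multi, p_uni))
```

The per-feature form avoids the determinant and the pseudo-inverse, which can be convenient when there are many features.
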

def probability(df):
    s = np.sum(df, axis=0)
    m = len(df)
    mu = s/m
    vr = np.sum((df - mu)**2, axis=0)
    variance = vr/m
    var_dia = np.diag(variance)
    k = len(mu)
    X = df - mu
    p = 1/((2*np.pi)**(k/2)*(np.linalg.det(var_dia)**0.5))* np.exp(-0.5* np.sum(X @ np.linalg.pinv(var_dia) * X,axis=1))
    return p
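
Note that the function above estimates the mean and variance from whatever data it receives. A common variant, sketched here as an assumption rather than as this article's method, estimates the parameters once on the training data and then reuses them to score any other data. The names `fit_gaussian` and `score` are hypothetical.

```python
import numpy as np

def fit_gaussian(train):
    """Estimate the per-feature mean and population variance."""
    train = np.asarray(train, dtype=float)
    return train.mean(axis=0), train.var(axis=0)

def score(data, mu, variance):
    """Density of each row under a Gaussian with diagonal covariance."""
    data = np.asarray(data, dtype=float)
    k = data.shape[1]
    norm = (2 * np.pi) ** (k / 2) * np.sqrt(np.prod(variance))
    return np.exp(-0.5 * np.sum((data - mu) ** 2 / variance, axis=1)) / norm

# Toy arrays standing in for training and held-out data (assumed values).
train = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
held_out = np.array([[2.0, 4.0], [10.0, 20.0]])

mu, variance = fit_gaussian(train)
p_held_out = score(held_out, mu, variance)
```

With this split, a point far from the training data gets a low probability even when the held-out set contains many such points, because the held-out set no longer influences the estimated mean and variance.
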

This dataset comes with separate cross-validation data and labels. For your own data, you can simply hold out a portion of the original data for cross-validation.
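
A minimal sketch of such a hold-out split is shown below. The function name `holdout_split`, the 20% fraction, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def holdout_split(data, cv_fraction=0.2, seed=0):
    """Shuffle the rows and return (train, cv) pieces of the data."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_cv = int(round(len(data) * cv_fraction))
    return data[order[n_cv:]], data[order[:n_cv]]

# Toy data standing in for your own data set (assumed values).
data = np.arange(20).reshape(10, 2)
train, cv = holdout_split(data, cv_fraction=0.2)
```

Keep in mind that the held-out rows are only useful for choosing a threshold if you know which of them are anomalous, so the same row order has to be applied to the labels.
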

cvx = pd.read_excel('ex8data1.xlsx', sheet_name='Xval', header=None)
cvx.head()

cvy = pd.read_excel('ex8data1.xlsx', sheet_name='y', header=None)
cvy.head()

The purpose of the cross-validation data is to find the threshold probability. Examples in df whose probability is at or below that threshold will then be treated as anomalous.

p1 = probability(cvx)
y = np.array(cvy)

#Part of the array
array([[0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],

p.describe()

count    3.070000e+02
mean     5.378568e-02
std      1.928081e-02
min      1.800521e-30
25%      4.212979e-02
50%      5.935014e-02
75%      6.924909e-02
max      7.864731e-02
dtype: float64

def tpfpfn(ep, p):
    tp, fp, fn = 0, 0, 0
    for i in range(len(y)):
        if p[i] <= ep and y[i][0] == 1:
            tp += 1
        elif p[i] <= ep and y[i][0] == 0:
            fp += 1
        elif p[i] > ep and y[i][0] == 1:
            fn += 1
    return tp, fp, fn

eps = [i for i in p1 if i <= p1.mean()]
len(eps)
128

def f1(ep, p):
    tp, fp, fn = tpfpfn(ep, p)  # counts at threshold ep
    prec = tp/(tp + fp)
    rec = tp/(tp + fn)
    return 2*prec*rec/(prec + rec)
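
The same counts can also be computed without an explicit loop. The sketch below is a vectorized alternative, not the article's code: the function name `f1_score_at` and the toy values are assumptions, and it returns 0 when there are no true positives instead of dividing by zero.

```python
import numpy as np

def f1_score_at(ep, p, y):
    """F1 score when every example with p <= ep is flagged as an anomaly."""
    p = np.asarray(p, dtype=float).ravel()
    y = np.asarray(y).ravel()
    flagged = p <= ep
    tp = int(np.sum(flagged & (y == 1)))
    fp = int(np.sum(flagged & (y == 0)))
    fn = int(np.sum(~flagged & (y == 1)))
    if tp == 0:
        return 0.0  # precision and recall are both 0 (or undefined)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy probabilities and labels (assumed values, for illustration only).
p_toy = np.array([0.001, 0.2, 0.3, 0.002, 0.25])
y_toy = np.array([[1], [0], [0], [0], [1]])
score = f1_score_at(0.01, p_toy, y_toy)
```

In the toy example the threshold 0.01 flags two examples, one of which is a real anomaly, and misses one anomaly, so precision and recall are both 0.5.
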

f = []
for i in eps:
    f.append(f1(i, p1))
f

[0.16470588235294117,
 0.208955223880597,
 0.15384615384615385,
 0.3181818181818182,
 0.15555555555555556,
 0.125,
 0.56,
 0.13333333333333333,
 0.16867469879518074,
 0.12612612612612614,
 0.14583333333333331,
 0.22950819672131148,
 0.15053763440860213,
 0.16666666666666666,
 0.3888888888888889,
 0.12389380530973451,

np.array(f).argmax()
127
e = eps[127]
e
0.00014529639061630078
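
The whole search can be wrapped in one function that returns the best threshold together with its score. This is a sketch under the same assumptions as the steps above (candidates are the probabilities at or below the mean, and ties go to the first maximum, as with argmax); the name `select_threshold` and the toy values are hypothetical.

```python
import numpy as np

def select_threshold(p, y):
    """Return (best_threshold, best_f1) over candidates taken from p itself."""
    p = np.asarray(p, dtype=float).ravel()
    y = np.asarray(y).ravel()
    best_ep, best_f1 = None, -1.0
    for ep in p[p <= p.mean()]:  # candidate thresholds
        flagged = p <= ep
        tp = np.sum(flagged & (y == 1))
        fp = np.sum(flagged & (y == 0))
        fn = np.sum(~flagged & (y == 1))
        # 2*tp / (2*tp + fp + fn) is algebraically equal to the F1 score.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_ep, best_f1 = float(ep), float(f1)
    return best_ep, best_f1

# Toy probabilities and labels (assumed values, for illustration only).
p_toy = np.array([0.001, 0.2, 0.3, 0.002, 0.25, 0.0005])
y_toy = np.array([1, 0, 0, 0, 1, 1])
best_ep, best_f1 = select_threshold(p_toy, y_toy)
```
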

Find out the Anomalous Examples

label = []
for i in range(len(df)):
    if p[i] <= e:
        label.append(1)
    else:
        label.append(0)
label

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,

df['label'] = np.array(label)
df.head()
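
The labeling loop can also be written as a single comparison. The sketch below uses toy probabilities (assumed values) together with the threshold found above.

```python
import numpy as np

# Toy probabilities (assumed values, for illustration only).
p_toy = np.array([0.05, 0.0001, 0.07, 0.00002])
e_toy = 0.00014529639061630078  # the threshold found above

# 1 where the probability is at or below the threshold, else 0.
label_toy = (p_toy <= e_toy).astype(int)

# Row positions of the flagged examples.
anomalies = np.flatnonzero(label_toy)
```

With the variables from this article, the equivalent one-liner would be df['label'] = (p <= e).astype(int).
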

Do the flagged examples make sense? A quick check is to plot them against the rest of the data.

Conclusion
