 ## The Formulas and Process

Don’t be confused by the Σ symbol in this formula! It is not a summation sign here. It stands for the covariance matrix, which in this case is simply the variances placed on the diagonal of a matrix.
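For reference, the density that this article's code evaluates can be written out as below. This is a reconstruction from that code, so treat the notation as an assumption: $k$ is the number of features, $\mu$ the vector of feature means, and $\Sigma$ the diagonal matrix holding each feature's variance $\sigma_j^2$.

$$
p(x) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right),
\qquad
\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_k^2)
$$

Because $\Sigma$ is diagonal, $|\Sigma| = \prod_j \sigma_j^2$ and the exponent reduces to $-\frac{1}{2}\sum_j (x_j-\mu_j)^2/\sigma_j^2$.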

An F1 score of 1 is perfect, and 0 is the worst possible score.
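To make the two extremes concrete, here is a minimal sketch of the F1 calculation from true-positive, false-positive, and false-negative counts. The function name `f1_score` and the choice to return 0.0 when there are no true positives are assumptions made for this illustration.

```python
def f1_score(tp, fp, fn):
    """Return the F1 score from true-positive, false-positive and false-negative counts."""
    # With no true positives, precision and recall are both 0 (or undefined);
    # returning 0.0 is a convention assumed for this sketch.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of flagged examples that are real anomalies
    recall = tp / (tp + fn)     # share of real anomalies that were flagged
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=7, fp=0, fn=0))  # 1.0: every anomaly found, no false alarms
print(f1_score(tp=0, fp=3, fn=4))  # 0.0: no anomaly found
```

The score only reaches 1 when both precision and recall are 1, which is why it is a useful single number for picking a threshold.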

## Anomaly Detection Algorithm

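`import pandas as pd`
`import numpy as np`
`import matplotlib.pyplot as plt  # needed for the scatter plot below`
`# Load the training data; the sheet has no header row`
`df = pd.read_excel('ex8data1.xlsx', sheet_name='X', header=None)`
`df.head()`
`# Plot the first feature against the second`
`plt.figure()`
`plt.scatter(df[0], df[1])`
`plt.show()`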
`m = len(df)`
`s = np.sum(df, axis=0)`
`mu = s/m`
`mu`
`0    14.112226`
`1    14.997711`
`dtype: float64`
`vr = np.sum((df - mu)**2, axis=0)`
`variance = vr/m`
`variance`
`0    1.832631`
`1    1.709745`
`dtype: float64`
`var_dia = np.diag(variance)`
`var_dia`
`array([[1.83263141, 0.        ],`
`       [0.        , 1.70974533]])`
`k = len(mu)`
`X = df - mu`
`p = 1/((2*np.pi)**(k/2)*(np.linalg.det(var_dia)**0.5)) * np.exp(-0.5 * np.sum(X @ np.linalg.pinv(var_dia) * X, axis=1))`
`p`
`def probability(df):`
`    s = np.sum(df, axis=0)`
`    m = len(df)`
`    mu = s/m`
`    vr = np.sum((df - mu)**2, axis=0)`
`    variance = vr/m`
`    var_dia = np.diag(variance)`
`    k = len(mu)`
`    X = df - mu`
`    p = 1/((2*np.pi)**(k/2)*(np.linalg.det(var_dia)**0.5)) * np.exp(-0.5 * np.sum(X @ np.linalg.pinv(var_dia) * X, axis=1))`
`    return p`
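Because the covariance matrix here is diagonal, the density computed by `probability` should equal the product of one-dimensional Gaussian densities, one per feature. The sketch below checks that with the standard library only. It uses the means and variances printed above, while the `point` value and both helper names are hypothetical, chosen only for illustration.

```python
import math

def univariate_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def diagonal_gaussian_pdf(x, mu, var):
    """Multivariate Gaussian density for a diagonal covariance matrix.

    With a diagonal matrix, the determinant is the product of the variances
    and the quadratic form is a sum of squared, variance-scaled deviations.
    """
    k = len(mu)
    det = math.prod(var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (k / 2) * math.sqrt(det))

mu = [14.112226, 14.997711]  # feature means printed above
var = [1.832631, 1.709745]   # feature variances printed above
point = [13.0, 15.0]         # hypothetical observation, for illustration only

joint = diagonal_gaussian_pdf(point, mu, var)
product = univariate_pdf(point[0], mu[0], var[0]) * univariate_pdf(point[1], mu[1], var[1])
print(math.isclose(joint, product))  # True
```

This is also why no off-diagonal terms are needed: with a diagonal matrix the features are treated as independent.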

This dataset comes with separate cross-validation data. For your own data, you can simply hold out a portion of the original data for cross-validation.
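If you need to hold out data yourself, one simple approach is a shuffled split. The sketch below is only an illustration: `holdout_split`, the 20% fraction, and the fixed seed are all assumptions, and the held-out rows still need known anomaly labels before an F1 score can be computed on them.

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle rows and return (training_rows, validation_rows)."""
    shuffled = list(rows)                      # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)      # seeded for a reproducible split
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

rows = list(range(10))  # stand-in for real observations
train, validation = holdout_split(rows)
print(len(train), len(validation))  # 8 2
```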

`cvx = pd.read_excel('ex8data1.xlsx', sheet_name='Xval', header=None)`
`cvx.head()`
`cvy = pd.read_excel('ex8data1.xlsx', sheet_name='y', header=None)`
`cvy.head()`

The purpose of the cross-validation data is to calculate the threshold probability. We will then use that threshold to find the anomalous examples in df.

`p1 = probability(cvx)`
`y = np.array(cvy)`
`# Output (truncated): a column array holding one 0/1 label per cross-validation example`
`p.describe()`
`count    3.070000e+02`
`mean     5.378568e-02`
`std      1.928081e-02`
`min      1.800521e-30`
`25%      4.212979e-02`
`50%      5.935014e-02`
`75%      6.924909e-02`
`max      7.864731e-02`
`dtype: float64`
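`def tpfpfn(ep, p):`
`    # Count true positives, false positives and false negatives for threshold ep.`
`    # Relies on the global label array y defined above.`
`    tp, fp, fn = 0, 0, 0`
`    for i in range(len(y)):`
`        if p[i] <= ep and y[i] == 1:`
`            tp += 1`
`        elif p[i] <= ep and y[i] == 0:`
`            fp += 1`
`        elif p[i] > ep and y[i] == 1:`
`            fn += 1`
`    return tp, fp, fn`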
`eps = [i for i in p1 if i <= p1.mean()]`
`len(eps)`
`128`
`def f1(ep, p):`
`    tp, fp, fn = tpfpfn(ep, p)  # pass p through; tpfpfn takes two arguments`
`    prec = tp/(tp + fp)`
`    rec = tp/(tp + fn)`
`    score = 2*prec*rec/(prec + rec)`
`    return score`
`f = []`
`for i in eps:`
`    f.append(f1(i, p1))`
`f`
`[0.16470588235294117, 0.208955223880597, 0.15384615384615385, 0.3181818181818182, 0.15555555555555556, 0.125, 0.56, 0.13333333333333333, 0.16867469879518074, 0.12612612612612614, 0.14583333333333331, 0.22950819672131148, 0.15053763440860213, 0.16666666666666666, 0.3888888888888889, 0.12389380530973451,`
`np.array(f).argmax()`
`127`
`e = eps[127]`
`e`
`0.00014529639061630078`

## Find out the Anomalous Examples

`label = []`
`for i in range(len(df)):`
`    if p[i] <= e:`
`        label.append(1)`
`    else:`
`        label.append(0)`
`label`
`[0, 0, 0, 0, 0, 0, 0, 0, 0, 0,`
`df['label'] = np.array(label)`
`df.head()`

## Conclusion
