Until now, the margin has been strictly determined by the support vectors. This defines what is called a hard classifier, that is, the margins are sharply defined and no points are allowed on the wrong side of them.
Suppose now that the classes overlap in feature space, as shown in the figure here. One way to deal with this problem, before we introduce the so-called kernel approach, is to allow some slack, in the sense that we permit some points to lie on the wrong side of the margin.
We thus introduce the so-called slack variables \( \boldsymbol{\xi} =[\xi_1,\xi_2,\dots,\xi_n] \) and modify our previous equation $$ y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)=1, $$ to $$ y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)=1-\xi_i, $$ with the requirement \( \xi_i\geq 0 \). The total violation is now \( \sum_i\xi_i \). The value \( \xi_i \) in the last constraint corresponds to the amount by which the prediction \( y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b) \) falls on the wrong side of its margin. Hence, by bounding the sum \( \sum_i \xi_i \), we bound the total amount by which predictions fall on the wrong side of their margins.
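As a concrete illustration (a minimal sketch with made-up numbers, not part of the original text), the snippet below computes the standard slack values \( \xi_i = \max(0,\, 1 - y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)) \) for a few points given an assumed \( \boldsymbol{w} \) and \( b \): \( \xi_i = 0 \) means the point is on the correct side of its margin, \( 0 < \xi_i \leq 1 \) means it lies inside the margin but is still correctly classified, and \( \xi_i > 1 \) signals a misclassification.

```python
import numpy as np

# Hypothetical weight vector and bias, chosen only for illustration
w = np.array([1.0, -1.0])
b = 0.0

# A few labelled points with y_i in {-1, +1}
X = np.array([[ 2.0, -1.0],   # far on the correct side of the margin
              [ 0.5,  0.0],   # inside the margin, still correctly classified
              [-1.0,  1.0]])  # on the wrong side of the hyperplane
y = np.array([1, 1, 1])

# Slack xi_i = max(0, 1 - y_i (w^T x_i + b)); xi_i > 1 signals a misclassification
margins = y * (X @ w + b)
xi = np.maximum(0.0, 1.0 - margins)
print("margins:", margins)   # [ 3.0  0.5 -2.0]
print("slack  :", xi)        # [ 0.0  0.5  3.0]
```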
A misclassification occurs when \( \xi_i > 1 \). Thus, bounding the total sum \( \sum_i\xi_i \) by some value \( C \) bounds in turn the total number of misclassifications.
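In practice the amount of allowed slack is controlled through a regularization parameter, also commonly called \( C \). As a hedged sketch (not from the original text), the example below uses scikit-learn's `SVC` with a linear kernel on two overlapping Gaussian blobs. Note that in scikit-learn's formulation `C` multiplies the total slack in the objective rather than bounding it directly, so a large `C` corresponds to a small tolerated violation (approaching the hard classifier), while a small `C` tolerates more points on the wrong side of their margins.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping Gaussian blobs: a hard classifier cannot separate them
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Small C: lots of slack allowed; large C: slack heavily penalized
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors={clf.n_support_.sum():3d}, "
          f"training accuracy={clf.score(X, y):.2f}")
```

The number of support vectors typically shrinks as `C` grows, since fewer points are allowed to sit inside or beyond the margin.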