Machine Learning : Support Vector Machine

A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems. Once an SVM model has been given sets of labeled training data for each category, it is able to categorize new text.

So you’re working on a text classification problem. You’re refining your training data, and maybe you’ve even tried stuff out using Naive Bayes. But now you’re feeling confident in your dataset, and want to take it one step further. Enter Support Vector Machines (SVM): a fast and dependable classification algorithm that performs very well with a limited amount of data.

Perhaps you have dug a bit deeper and run into terms like linearly separable, kernel trick, and kernel functions. But fear not! The idea behind the SVM algorithm is simple, and applying it to natural language classification doesn’t require most of the complicated stuff.

Before continuing, we recommend reading our guide to the Naive Bayes classifier first, since much of what is said there about text processing is relevant here as well.

Compared to both logistic regression and neural networks, the Support Vector Machine (SVM) sometimes gives a cleaner, and sometimes more powerful, way of learning complex non-linear functions.

This is an overview of the logistic regression cost: the curve for y = 1 is shown on the left, and the curve for y = 0 is shown on the right.

We have dropped the 1/m term from the SVM objective because multiplying the objective by a positive constant does not change the value of theta that minimizes it (the explanation is marked in red pen in the image). Also, instead of the regularization parameter lambda, the SVM uses a parameter C, as shown in the image.

The image above shows two optimization objectives:

---->Logistic Regression (using lambda as a parameter)

---->SVM (using C as a parameter)

These two ways of weighting the terms give the same optimal theta when C = 1/lambda, as shown in the image.
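For reference, the two objectives are usually written as below. This is a sketch in the standard notation, which is assumed here to match the image: $h_\theta$ is the logistic hypothesis, and $\text{cost}_1$ and $\text{cost}_0$ are the SVM's piecewise-linear replacements for the two log terms.

```latex
% Regularized logistic regression (weighting of the form A + lambda * B)
\min_\theta \; \frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\big(-\log h_\theta(x^{(i)})\big)
  + (1-y^{(i)})\big(-\log(1-h_\theta(x^{(i)}))\big) \Big]
  + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

% SVM (weighting of the form C * A + B, with the 1/m factor dropped)
\min_\theta \; C\sum_{i=1}^{m}\Big[ y^{(i)}\,\text{cost}_1(\theta^T x^{(i)})
  + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)}) \Big]
  + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2
```

A large C therefore plays the same role as a small lambda: it puts more weight on fitting the training data and less on keeping theta small.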

Instead of outputting a probability as logistic regression does, the SVM outputs only 0 or 1, depending on theta transpose x: it predicts 1 when theta^T x >= 0 and 0 otherwise.
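A minimal sketch of that prediction rule, assuming the usual convention that 1 is predicted when theta^T x >= 0. The function names and the parameter values are hypothetical and chosen only for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def svm_predict(theta, x):
    """Return 1 if theta^T x >= 0, else 0. No probability is produced."""
    return 1 if dot(theta, x) >= 0 else 0


# Hypothetical parameters; theta[0] multiplies the constant feature x[0] = 1.
theta = [-3.0, 1.0, 1.0]

print(svm_predict(theta, [1.0, 2.0, 2.0]))  # theta^T x = 1  -> prints 1
print(svm_predict(theta, [1.0, 1.0, 1.0]))  # theta^T x = -1 -> prints 0
```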

Large Margin Intuition:

Let's set C to a very large value. If C is very large, then when minimizing this optimization objective we are highly motivated to choose a value of theta such that the first term is equal to zero.
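To see what "the first term is equal to zero" demands, here is a small sketch. It assumes the common hinge-shaped costs, cost1(z) = max(0, 1 - z) and cost0(z) = max(0, 1 + z); the exact slopes are an assumption, and only the points where the costs reach zero matter for the argument.

```python
def cost1(z):
    """Cost for a positive example (y = 1): zero once z >= 1."""
    return max(0.0, 1.0 - z)


def cost0(z):
    """Cost for a negative example (y = 0): zero once z <= -1."""
    return max(0.0, 1.0 + z)


def data_term(scores, labels):
    """Sum of per-example costs; scores[i] stands for theta^T x(i)."""
    return sum(cost1(z) if y == 1 else cost0(z) for z, y in zip(scores, labels))


labels = [1, 1, 0, 0]
print(data_term([2.0, 1.0, -1.0, -3.0], labels))  # every margin met -> 0.0
print(data_term([0.5, 1.0, -1.0, -3.0], labels))  # first score too small -> 0.5
```

The first term vanishes only when theta^T x >= 1 for every positive example and theta^T x <= -1 for every negative one, which is a stricter requirement than merely landing on the correct side of zero.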

Now let's look at a graphical view of the training set to understand this:

The pink and green lines show boundaries that other approaches, such as logistic regression, might produce. They also separate the data, but less satisfactorily: each line passes close to some training examples and far from others, so there is no consistent margin.

The black line shows how the SVM separates the dataset: it keeps a margin on both sides, which makes its predictions more robust.

What the SVM does is find the boundary that best separates the dataset, as shown in the figure.

This example helps to understand how the SVM behaves:

Here, if a single example is an outlier (that is, one positive example lies on the side of the negative examples), then instead of changing the decision boundary, the SVM keeps the boundary as it was, because one example should not change the overall result. This holds as long as C is not set too large.

That's why the SVM is called a LARGE MARGIN CLASSIFIER.

Mathematics behind LMC:

We have just applied the inner product concept in the image above. The bottom-left graph of that image shows that the projection p can be negative: if the angle between u and v is greater than 90 degrees, p is negative.
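A short sketch of the inner product as a signed projection. It assumes the convention that p is the signed length of the projection of v onto u, so that u^T v = p * ||u||; the vectors are made up for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def signed_projection(v, u):
    """Signed length p of the projection of v onto u (u.v = p * ||u||)."""
    return dot(u, v) / norm(u)


u = [4.0, 0.0]
print(signed_projection([3.0, 2.0], u))   # angle below 90 degrees -> 3.0
print(signed_projection([-2.0, 1.0], u))  # angle above 90 degrees -> -2.0
```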

So all the support vector machine is doing in the optimization objective is minimizing the squared norm, that is, the squared length, of the parameter vector theta.
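Written out, the large-C problem takes roughly the following form. This is a sketch under the simplifying assumption that theta_0 = 0, so the boundary passes through the origin; $p^{(i)}$ denotes the signed projection of $x^{(i)}$ onto theta.

```latex
\min_\theta \; \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\lVert\theta\rVert^2
\quad \text{subject to} \quad
\begin{cases}
\theta^T x^{(i)} = p^{(i)}\,\lVert\theta\rVert \ge 1 & \text{if } y^{(i)} = 1 \\
\theta^T x^{(i)} = p^{(i)}\,\lVert\theta\rVert \le -1 & \text{if } y^{(i)} = 0
\end{cases}
```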

The figure below shows why the SVM chooses the black line, with its appropriate margin, over a boundary with an inappropriate margin.

In the bottom-left graph of the image below, let's draw a green line that separates the two classes, and project the training example x(1) onto the vector theta. We see that p(1) is small. Now let's repeat the same process for another training example, x(2), projecting it onto theta: this time the projection is negative, so p(2) is a small negative number, as shown in the figure.

Now we want p(1)·||theta|| >= 1. This is only possible when p(1) is large or ||theta|| is large. We have seen that p(1) is not large, so the only option left is for ||theta|| to be large. But our algorithm is minimizing ||theta||, so a boundary that requires a large ||theta|| is a poor choice. That is why the SVM will not select the green line on the left of the image.

The same argument applies to p(2), with the condition p(2)·||theta|| <= -1.

The right part of the image shows the boundary that the SVM will actually choose.
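The argument can be checked numerically. The sketch below assumes theta_0 = 0 and 2-D inputs, and uses a hypothetical helper, `min_theta_norm`, that returns the smallest ||theta|| along a given direction that still satisfies both conditions. The two data points are made up for illustration.

```python
import math


def min_theta_norm(direction, positives, negatives):
    """Smallest ||theta|| along `direction` such that p * ||theta|| >= 1 for
    positives and p * ||theta|| <= -1 for negatives. Returns None if this
    direction does not separate the data."""
    length = math.hypot(direction[0], direction[1])
    unit = [d / length for d in direction]

    def projection(x):
        # Signed projection of x onto the unit vector along theta.
        return sum(a * b for a, b in zip(unit, x))

    # Smallest projection magnitude, counted on the correct side only.
    smallest = min([projection(x) for x in positives]
                   + [-projection(x) for x in negatives])
    if smallest <= 0:
        return None
    return 1.0 / smallest


positives = [[2.0, 1.0]]
negatives = [[-2.0, -1.0]]

print(min_theta_norm([1.0, 0.0], positives, negatives))  # projections of 2 -> 0.5
print(min_theta_norm([0.0, 1.0], positives, negatives))  # projections of 1 -> 1.0
```

The direction with the larger projections allows the smaller ||theta||, so minimizing ||theta|| favors the boundary with the larger margin.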

KERNELS IN SVM:

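So what do we do if the data needs a non-linear decision boundary? One option is to add more features, such as polynomial terms, but this becomes computationally very costly, so in SVMs we use kernels instead.

Instead of adding polynomial features, we will focus on kernels in SVM. With kernels, each new feature measures how similar an example x is to a chosen point called a landmark (l1, l2, l3, and so on).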

But where do we get these landmarks from? Where do we get l1, l2, l3 from? And it seems, also, that for complex learning problems, maybe we want a lot more landmarks than just three of them that we might choose by hand.

In practice, this is how the landmarks are chosen. Given a machine learning problem, we have a data set of some positive and negative examples. The idea is to take every training example we have and put a landmark at exactly the same location as that training example.

So if I have one training example, x1, then I'm going to choose my first landmark to be at exactly the same location as my first training example.

And if I have a different training example, x2, we're going to set the second landmark to be the location of my second training example.

In the figure on the right, I used red and blue dots just as an illustration; the color of the dots is not significant.

What I end up with using this method is m landmarks, l1, l2, down to l(m), if I have m training examples, with one landmark at the location of each of my training examples.

And this is nice because it is saying that my features are basically going to measure how close an example is to one of the things I saw in my training set. So, just to write this out a little more concretely, given m training examples, I'm going to choose the locations of my landmarks to be exactly the locations of my m training examples.
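Here is a minimal sketch of building features from landmarks placed at the training examples. The similarity function is an assumption: the text above does not name one, so the commonly used Gaussian kernel, exp(-||x - l||^2 / (2 sigma^2)), is used here. The toy data and the `sigma` value are made up for illustration.

```python
import math


def gaussian_similarity(x, landmark, sigma=1.0):
    """Assumed similarity: 1.0 at the landmark, close to 0 far away from it."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_features(x, landmarks, sigma=1.0):
    """Map x to one feature per landmark: f_i = similarity(x, l(i))."""
    return [gaussian_similarity(x, l, sigma) for l in landmarks]


# Toy training set; each landmark l(i) is a copy of training example x(i).
X_train = [[1.0, 1.0], [2.0, 3.0], [5.0, 5.0]]
landmarks = [list(x) for x in X_train]

features = kernel_features(X_train[0], landmarks)
print(len(features))  # one feature per training example
print(features[0])    # similarity of x(1) to its own landmark is 1.0
```

With m training examples, every input is mapped to an m-dimensional feature vector, and each training example has a 1.0 in the position of its own landmark.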