Machine Learning : Optimization


Debugging a Learning Algorithm:
Suppose you have implemented regularized linear regression to predict housing prices.
However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you do next?
  • Get more training examples
  • Try smaller sets of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing lambda
  • Try increasing lambda


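Model Selection:
A low training set error does not necessarily mean that the model is good; the model may simply be overfitting the training data.
Suppose we choose the extra parameter d (the degree of the polynomial) by picking the model with the lowest test set error, say the fifth model with parameters theta 5. The test set error of that model is then not a fair estimate of how well it generalizes, because d has already been fitted to the test set, so the model is likely to do better on the test set than on genuinely new data. To fix this we introduce a new concept called the validation set (or dev set), which is used to choose d, while the test set is kept only for the final evaluation.
We can also define an error on each of these sets (training, validation and test).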
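The procedure above can be sketched in Python. This is only a minimal illustration under assumptions that are not in the original notes: the data is synthetic, the split is 60/20/20, the model is a polynomial fitted with NumPy's polyfit, and the error is the usual squared-error cost J = 1/(2m) * sum((h(x) - y)^2).

```python
import numpy as np

def cost(coeffs, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2) for a polynomial h."""
    residual = np.polyval(coeffs, x) - y
    return float(np.sum(residual ** 2) / (2 * len(x)))

def select_degree(x_train, y_train, x_cv, y_cv, degrees):
    """Fit one polynomial per degree on the training set and return the
    degree (and fitted coefficients) with the lowest validation error."""
    best_d, best_coeffs, best_err = None, None, float("inf")
    for d in degrees:
        coeffs = np.polyfit(x_train, y_train, d)
        err = cost(coeffs, x_cv, y_cv)
        if err < best_err:
            best_d, best_coeffs, best_err = d, coeffs, err
    return best_d, best_coeffs

# Synthetic data (an assumption for illustration): a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(0, 0.5, 300)

# 60% training, 20% validation, 20% test.
x_train, y_train = x[:180], y[:180]
x_cv, y_cv = x[180:240], y[180:240]
x_test, y_test = x[240:], y[240:]

best_d, best_coeffs = select_degree(x_train, y_train, x_cv, y_cv, range(1, 9))

# The test set played no part in choosing d, so this estimate is fair.
j_test = cost(best_coeffs, x_test, y_test)
print(best_d, j_test)
```

The point of the design is that d is chosen only from the validation error, so the test error is computed once, at the very end.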

So now instead of testing the
Diagnosing Bias vs. Variance:
In the graph shown below (bottom center), let's take the plot of error against the degree of the polynomial.
As we move to the left of the graph the model underfits, and as we move to the right it overfits. The training set error is plotted with the pink line: it keeps decreasing as the degree grows. The cross validation error is plotted with the red line, which is roughly convex (U-shaped): it is high on the left, where the model underfits (high bias), and high again on the right, where the training set error is low but the model overfits (high variance) and so does not work well on unseen data.
To control overfitting we add an extra parameter lambda that penalizes large parameter values; this process is called regularization. If lambda is too large, the model suffers from high bias (underfitting); if it is too small, the model suffers from high variance (overfitting).
So we try different values of lambda, measure the error of each on the cross validation set, pick the lambda with the minimum cross validation error, and then check that model on the test set, which gives quite a good result.
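The lambda search described above could look roughly like the following Python sketch. Everything specific here is an assumption made for illustration: the synthetic data, the degree-8 polynomial features, the list of candidate lambda values, and the use of the regularized normal equation to fit the parameters.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Regularized normal equation; the intercept (column 0) is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def cost(X, y, theta):
    """Squared-error cost without the regularization term."""
    return float(np.sum((X @ theta - y) ** 2) / (2 * len(y)))

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 90)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, 90)
X = poly_features(x, degree=8)
X_train, y_train = X[:30], y[:30]
X_cv, y_cv = X[30:60], y[30:60]
X_test, y_test = X[60:], y[60:]

lambdas = [0, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]  # assumed candidate values

# Train with each lambda, but compare the models on the cross validation set.
cv_errors = [cost(X_cv, y_cv, fit_regularized(X_train, y_train, lam))
             for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errors))]
best_theta = fit_regularized(X_train, y_train, best_lam)

# Only the selected model is evaluated on the test set.
j_test = cost(X_test, y_test, best_theta)
print(best_lam, j_test)
```

Note that lambda is used only while fitting; the cross validation and test errors are computed without the regularization term, so that models trained with different lambda values are compared on the same scale.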

Learning Curves:

Training an algorithm on a very small number of data points (such as 1, 2 or 3) will easily give 0 error, because we can always find a quadratic curve that passes exactly through that many points. Hence:

  • As the training set gets larger, the error for a quadratic function increases.
  • The error value will plateau out after a certain m, or training set size.

Experiencing high bias:

Low training set size: causes J_{train}(\Theta) to be low and J_{CV}(\Theta) to be high.

Large training set size: causes both J_{train}(\Theta) and J_{CV}(\Theta) to be high with J_{train}(\Theta) \approx J_{CV}(\Theta).

If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

Experiencing high variance:

Low training set size: J_{train}(\Theta) will be low and J_{CV}(\Theta) will be high.

Large training set size: J_{train}(\Theta) increases with training set size and J_{CV}(\Theta) continues to decrease without leveling off. Also, J_{train}(\Theta) < J_{CV}(\Theta) but the difference between them remains significant.

If a learning algorithm is suffering from high variance, getting more training data is likely to help.
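A learning curve can be computed by training on the first m examples for increasing m and recording both errors each time. The sketch below is one way to do that in Python; the synthetic quadratic data, the helper names and the choice of a straight-line model (to provoke high bias) are all assumptions for illustration.

```python
import numpy as np

def cost(coeffs, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2) / (2 * len(x)))

def learning_curve(x_train, y_train, x_cv, y_cv, degree, sizes):
    """For each m, fit on the first m training examples and record
    (m, J_train on those m examples, J_cv on the whole validation set)."""
    curve = []
    for m in sizes:
        coeffs = np.polyfit(x_train[:m], y_train[:m], degree)
        j_train = cost(coeffs, x_train[:m], y_train[:m])
        j_cv = cost(coeffs, x_cv, y_cv)
        curve.append((m, j_train, j_cv))
    return curve

# Synthetic data, assumed for illustration: a noisy quadratic target.
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 300)
y = x ** 2 + rng.normal(0, 0.3, 300)
x_train, y_train = x[:200], y[:200]
x_cv, y_cv = x[200:], y[200:]

sizes = [2, 5, 10, 20, 50, 100, 200]

# A straight line (degree 1) cannot follow a quadratic target: high bias.
curve = learning_curve(x_train, y_train, x_cv, y_cv, 1, sizes)
for m, j_train, j_cv in curve:
    print(f"m={m:3d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```

Passing a much higher degree instead of 1 gives a quick way to look at the high variance case on the same data.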

Building a Spam Classifier:

Prioritizing What to Work On

System Design Example:

Given a data set of emails, we could construct a vector for each email. Each entry in this vector represents a word. The vector normally contains 10,000 to 50,000 entries gathered by finding the most frequently used words in our data set. If a word is found in the email, we assign its respective entry a 1; otherwise that entry is 0. Once we have all our x vectors ready, we train our algorithm, and finally we can use it to classify whether an email is spam or not.
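As a rough Python sketch of this feature construction: the tokenizer, the function names and the three toy emails below are assumptions made for illustration, and a real vocabulary would hold thousands of words rather than five.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(emails, size):
    """Return the `size` most frequently used words across all emails."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return [word for word, _ in counts.most_common(size)]

def email_to_vector(email, vocabulary):
    """Entry j is 1 if vocabulary word j appears in the email, else 0."""
    present = set(tokenize(email))
    return [1 if word in present else 0 for word in vocabulary]

# Toy data set, made up for this example.
emails = [
    "Buy cheap watches now",
    "Meeting moved to noon",
    "Cheap deals, buy now",
]
vocabulary = build_vocabulary(emails, size=5)
x = email_to_vector("buy now and save", vocabulary)
print(vocabulary)
print(x)
```

Words that are not in the vocabulary (here "and" and "save") are simply ignored, which is why the vector length stays fixed no matter how long the email is.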

So how could you spend your time to improve the accuracy of this classifier?

  • Collect lots of data (for example the "honeypot" project, although this doesn't always work)
  • Develop sophisticated features (for example: using email header data in spam emails)
  • Develop algorithms to process your input in different ways (recognizing misspellings in spam).

It is difficult to tell which of the options will be most helpful.



ERROR ANALYSIS:





The recommended approach to solving machine learning problems is to:

  • Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
  • Plot learning curves to decide if more data, more features, etc. are likely to help.
  • Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

For example, assume that we have 500 emails and our algorithm misclassifies 100 of them. We could manually analyze the 100 emails and categorize them based on what type of emails they are. We could then try to come up with new cues and features that would help us classify these 100 emails correctly. Hence, if most of our misclassified emails are those which try to steal passwords, then we could find some features that are particular to those emails and add them to our model. We could also see how classifying each word according to its root changes our error rate.

It is very important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance. For example if we use stemming, which is the process of treating the same word with different forms (fail/failing/failed) as one word (fail), and get a 3% error rate instead of 5%, then we should definitely add it to our model. However, if we try to distinguish between upper case and lower case letters and end up getting a 3.2% error rate instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get a numerical value for our error rate, and based on our result decide whether we want to keep the new feature or not.
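The decision rule in this paragraph is simple enough to write down directly. The Python sketch below uses hypothetical helper names (error_rate, should_keep) and plugs in only the error rates quoted above.

```python
def error_rate(y_true, y_pred):
    """Fraction of examples that are misclassified: a single number."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def should_keep(baseline_error, candidate_error):
    """Keep a new feature only if it lowers the cross validation error."""
    return candidate_error < baseline_error

# Error rates quoted in the text above.
keep_stemming = should_keep(0.05, 0.03)         # stemming: 5% -> 3%
keep_case_feature = should_keep(0.03, 0.032)    # upper/lower case: 3% -> 3.2%
print(keep_stemming, keep_case_feature)
```

In practice the two error rates would come from error_rate evaluated on the same cross validation set, once without and once with the new feature.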

ERROR METRICS FOR SKEWED CLASSES:

What are skewed classes?

Skewed classes refer to a dataset in which the number of training examples belonging to one class heavily outnumbers the number of training examples belonging to the other.

Consider a binary classification problem, where a cancerous patient is to be detected based on some features, and say only 1% of the data provided is cancer positive. In a setting where having cancer is labelled 1 and not having cancer is labelled 0, a system that naively predicts 0 for everyone will still have a prediction accuracy of 99%.

# Naive prediction that ignores the features and always outputs 0 (no cancer)
def predict_cancer(x):
    return 0

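Therefore, it can be said with conviction that accuracy (or mean-squared error) is not a proper indicator of model performance for skewed classes. Hence, there is a need for different error metrics for skewed classes, such as precision (the fraction of predicted positives that are actually positive) and recall (the fraction of actual positives that are correctly detected).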
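To make this concrete, here is a small Python sketch that scores the naive predictor on a made-up data set of 1000 patients, 1% of whom have cancer. The helper names are hypothetical, and returning 0.0 when a ratio is undefined (no positive predictions at all) is a convention assumed here.

```python
def predict_cancer(x):
    """Naive predictor from above: ignores the features, always predicts 0."""
    return 0

def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall); undefined ratios become 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# Made-up skewed data set: 10 positive patients out of 1000.
y_true = [1] * 10 + [0] * 990
y_pred = [predict_cancer(None) for _ in y_true]

accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)  # accuracy is 0.99, recall is 0.0
```

The 99% accuracy looks excellent, but a recall of 0 shows that the predictor never detects a single cancer case.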

When we raise the prediction threshold above 0.5 (towards 1.0), precision gradually increases and recall gets lower.
Let's understand this with the cancer case.
  • Let's continue our cancer classification example, where y equals 1 if the patient has cancer and y equals 0 otherwise.
  • Let's say we've trained a logistic regression classifier which outputs a probability between 0 and 1. As usual, we predict y = 1 if h(x) is greater than or equal to 0.5, and predict y = 0 if the hypothesis outputs a value less than 0.5. This classifier gives us some value for precision and some value for recall.
  • But now, suppose we want to predict that the patient has cancer only if we're very confident that they really do. If you tell a patient that they have cancer, it's going to give them a huge shock. It is seriously bad news, and they may end up going through a pretty painful treatment process.
  • So maybe we want to tell someone that we think they have cancer only if we are very confident. One way to do this is to modify the algorithm so that, instead of setting the threshold at 0.5, we predict y = 1 only if h(x) is greater than or equal to 0.7.
  • This is like saying we'll tell someone they have cancer only if we think there's at least a 70% chance that they have cancer.
  • If you do this, you're predicting that someone has cancer only when you're more confident, so you end up with a classifier that has higher precision: a higher fraction of the patients that you predict have cancer will actually turn out to have cancer, because we make those predictions only when we're pretty confident.
  • In contrast, this classifier will have lower recall, because we now predict y = 1 on a smaller number of patients. We can even take this further: instead of setting the threshold at 0.7, we can set it at 0.9. Now we predict y = 1 only if we are more than 90% certain that the patient has cancer, and so a large fraction of those patients will turn out to have cancer.
  • So this is a higher precision classifier, but it has lower recall, because it correctly detects a smaller fraction of the patients who actually have cancer.
Now consider a different example.
  • Suppose we want to avoid missing too many actual cases of cancer, so we want to avoid false negatives. In particular, if a patient actually has cancer but we fail to tell them so, that can be really bad, because a patient who is told they don't have cancer is not going to go for treatment.
  • If it turns out that they do have cancer, they may not get treated at all. That would be a really bad outcome, because they may die after we told them that they don't have cancer.
  • So, suppose that, when in doubt, we want to predict y = 1. When in doubt, we predict that they have cancer, so that at least they look further into it and can get treated in case they do turn out to have cancer.
  • In this case, rather than setting a higher probability threshold, we might instead set it to a lower value, maybe 0.3. By doing so, we're saying that if we think there's more than a 30% chance that they have cancer, we had better be more conservative and tell them that they may have cancer, so that they can seek treatment if necessary.
  • In this case we get a higher recall classifier, because we correctly flag a higher fraction of all of the patients that actually do have cancer.
  • But we end up with lower precision, because a higher fraction of the patients that we said have cancer will turn out not to have cancer after all.
  • The reasons for wanting higher precision or higher recall work both ways. The more general principle is that, depending on whether you want higher precision and lower recall or higher recall and lower precision, you can predict y = 1 when h(x) is greater than some threshold.
  • So in general, for most classifiers there is going to be a trade-off between precision and recall, and as you vary the value of this threshold you can plot out a curve that trades off precision and recall.
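The effect of moving the threshold can be seen on a tiny example. In the Python sketch below, the predicted probabilities and labels are made up for illustration, and the function name is hypothetical; the thresholds 0.3, 0.5, 0.7 and 0.9 are the ones discussed above.

```python
def precision_recall_at(threshold, probs, labels):
    """Predict y = 1 when h(x) >= threshold, then return (precision, recall).
    Precision is None when no example is predicted positive."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

# Made-up classifier outputs h(x) and true labels y, for illustration only.
probs = [0.95, 0.92, 0.85, 0.75, 0.72, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

thresholds = [0.3, 0.5, 0.7, 0.9]
curve = [precision_recall_at(t, probs, labels) for t in thresholds]
for t, (precision, recall) in zip(thresholds, curve):
    print(f"threshold={t}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, precision rises and recall falls as the threshold goes up. Real data will not always be this monotonic, but the overall trade-off is the same.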











