Machine Learning: Sequence Model Part 3

Basic Models:

Let's say we have to translate a sentence from one language into another, for example French into English. We take the French sentence as the input and the English sentence as the output.

We can use a similar approach to find the caption for a particular image. Let's say the image shows a cat and we have to find a caption for it. To do this, we train a CNN on the image and feed its output to a sequence model, which generates the caption. Let's see this in the image.

Picking the Most Likely Sentence:

When we do machine translation, we are predicting an output sentence for an input like "Jane visite l'Afrique en septembre".

This is the sentence that we have to translate. Our algorithm can generate different sentences with the same meaning, or sentences that contain the same words but have a different meaning.

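The probability of a particular output sentence given the input x is P(y<1>, y<2>, y<3>, ..., y<Ty> | x), and we want to pick the sentence that makes it as large as possible.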

Now, there is an algorithm called greedy search.

Let's discuss it.

Greedy search builds the sentence word by word: at each step it picks the single most likely next word, given the input and the words it has already picked.

"Jane visite l'Afrique en septembre" translations:

-- Jane is visiting Africa in September

-- Jane is going to be visiting Africa in September

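Both translations are correct, but the first one is the better one, because given the source sentence, sentence 1 has a higher probability than sentence 2. Greedy search can still end up with sentence 2: after "Jane is", the word "going" is a more common next word than "visiting", so picking the best word at each step does not always give the most likely sentence.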
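The effect described above can be sketched with a tiny example. This is only an illustration under stated assumptions: `TOY_MODEL`, `greedy_decode`, and `best_sentence` are hypothetical names, the probabilities are invented, and the sentences are shortened so the table stays small.

```python
# Hypothetical toy model: P(next word | x, words so far). All numbers are invented.
TOY_MODEL = {
    (): {"Jane": 1.0},
    ("Jane",): {"is": 1.0},
    ("Jane", "is"): {"going": 0.6, "visiting": 0.4},
    ("Jane", "is", "going"): {"to": 0.55, "home": 0.45},
    ("Jane", "is", "visiting"): {"Africa": 1.0},
    ("Jane", "is", "going", "to"): {"<eos>": 1.0},
    ("Jane", "is", "going", "home"): {"<eos>": 1.0},
    ("Jane", "is", "visiting", "Africa"): {"<eos>": 1.0},
}


def greedy_decode(model, max_len=10):
    """Pick the single most likely next word at every step."""
    prefix, prob = (), 1.0
    for _ in range(max_len):
        dist = model[prefix]
        word = max(dist, key=dist.get)  # locally best word
        prob *= dist[word]
        if word == "<eos>":
            break
        prefix += (word,)
    return prefix, prob


def best_sentence(model, prefix=(), prob=1.0):
    """Exhaustively find the most likely full sentence (only feasible for toy models)."""
    best = ((), 0.0)
    for word, p in model[prefix].items():
        if word == "<eos>":
            cand = (prefix, prob * p)
        else:
            cand = best_sentence(model, prefix + (word,), prob * p)
        if cand[1] > best[1]:
            best = cand
    return best


print(greedy_decode(TOY_MODEL))  # greedy follows "going" because 0.6 > 0.4
print(best_sentence(TOY_MODEL))  # the sentence with the highest overall probability
```

In this toy table, greedy search commits to "going" and finishes with a sentence of probability about 0.33, while the sentence through "visiting" has probability 0.4.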

Beam Search

Beam search looks for the best sentence for the translation, just like we have shown above, but with much less complexity than trying every possible sentence. Unlike greedy search, beam search keeps more than one candidate at each step. Let's say that for the sentence "Jane visite l'Afrique en septembre" it picks "Jane", "September", and "in" as the candidate first words.

In the second step, beam search looks at what can come next: for each candidate first word, it considers every possible second word.

The probability of a first word and second word together is P(y<1>, y<2> | x) = P(y<1> | x) P(y<2> | x, y<1>)

In the same way we find the remaining words: at each step we extend the kept candidates with the next word, calculate their probabilities, and keep the best ones. Let's understand this with a diagram:

When the beam width is equal to 3, beam search keeps three candidates at every step and only extends those. In the example, the three most likely two-word candidates are "in September", "Jane is", and "Jane visits". All three start with "in" or "Jane", so "September" is no longer needed as a first word and is dropped.
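The procedure above can be sketched in a few lines of Python. This is a minimal sketch, not a production decoder: `TOY_MODEL` and `beam_search` are hypothetical names, the probabilities are invented, and scores are kept as log probabilities so that they can be added instead of multiplied.

```python
import math

# Hypothetical toy model: P(next word | x, words so far). All numbers are invented.
TOY_MODEL = {
    (): {"Jane": 1.0},
    ("Jane",): {"is": 1.0},
    ("Jane", "is"): {"going": 0.6, "visiting": 0.4},
    ("Jane", "is", "going"): {"to": 0.55, "home": 0.45},
    ("Jane", "is", "visiting"): {"Africa": 1.0},
    ("Jane", "is", "going", "to"): {"<eos>": 1.0},
    ("Jane", "is", "going", "home"): {"<eos>": 1.0},
    ("Jane", "is", "visiting", "Africa"): {"<eos>": 1.0},
}


def beam_search(model, beam_width=3, max_len=10, eos="<eos>"):
    """Return (words, log_prob) of the best finished candidate, or None if none finished."""
    beams = [((), 0.0)]  # unfinished candidates: (words, log P(words | x))
    finished = []        # candidates that have produced the end-of-sentence token
    for _ in range(max_len):
        # Extend every kept candidate by every possible next word.
        candidates = [
            (words + (word,), logp + math.log(p))
            for words, logp in beams
            for word, p in model[words].items()
        ]
        # Keep only the beam_width most likely candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, logp in candidates[:beam_width]:
            if words[-1] == eos:
                finished.append((words[:-1], logp))
            else:
                beams.append((words, logp))
        if not beams:
            break
    return max(finished, key=lambda c: c[1], default=None)


print(beam_search(TOY_MODEL, beam_width=1))  # behaves like greedy search
print(beam_search(TOY_MODEL, beam_width=2))  # keeps "visiting" alive and finds the better sentence
```

With a beam width of 1 this reduces to greedy search. A wider beam costs more computation per step but is less likely to throw away a prefix that leads to a better sentence.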

Refining the Beam Search Process:

Length normalization is a small change to the beam search algorithm that can help you get much better results.

The probability that beam search tries to maximize is the product, over t = 1, ..., Ty, of P(y<t> | x, y<1>, ..., y<t-1>).

These probabilities are all numbers less than 1. Often they're much less than 1. And multiplying a lot of numbers less than 1 will result in a tiny, tiny, tiny number, which can result in numerical underflow, meaning that it's too small for the floating-point representation in your computer to store accurately.

So in practice, instead of maximizing the product of these probabilities, we take logs. The log of a product becomes a sum of logs, and maximizing this sum of log probabilities gives you the same result in terms of selecting the most likely sentence y. That is because the logarithm is a strictly monotonically increasing function, so maximizing log P(y | x) is the same as maximizing P(y | x). By taking logs, you end up with a more numerically stable algorithm that is less prone to rounding errors and numerical underflow.
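A quick way to see the underflow problem is to compare the two computations directly. This is a small illustration with made-up numbers: 200 words that each have probability 0.01.

```python
import math

probs = [0.01] * 200  # made-up per-word probabilities

# Multiplying directly: the true value 1e-400 is smaller than any float64 can hold.
product = 1.0
for p in probs:
    product *= p

# Summing logs instead: the result is an ordinary negative number.
log_sum = sum(math.log(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # roughly -921.03
```

Once the product has underflowed to zero, every long sentence looks equally likely, so the comparison between candidates is lost. The log sum keeps them distinguishable.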

Now, there's one other change to this objective function that makes the machine translation algorithm work even better.

If you look back at the original objective, the product of probabilities, a very long sentence is going to have a low probability, because you're multiplying many terms, all of them less than 1, to estimate the probability of that sentence. And if you multiply a lot of numbers that are less than 1 together, you just tend to end up with a smaller probability.

And so this objective function has an undesirable effect: it unnaturally tends to prefer very short translations, that is, very short outputs.

That is because the probability of a short sentence is determined by multiplying fewer of these numbers that are less than 1, so the product is just not quite as small.

And by the way, the same thing is true for the sum of logs. The log of a probability is always less than or equal to 0, so the more terms you add together, the more negative the sum becomes.

So there's one other change to the algorithm that makes it work better: instead of using the sum of log probabilities as the objective you're trying to maximize, you can normalize it by the number of words in your translation, Ty.

This takes the average of the log of the probability of each word, and it significantly reduces the penalty for outputting longer translations. In practice, as a heuristic, instead of dividing by Ty, the number of words in the output sentence, you sometimes use a softer approach and divide by Ty to the power of alpha, where maybe alpha is equal to 0.7. If alpha is equal to 1, you are completely normalizing by length. If alpha is equal to 0, then Ty to the power of 0 is 1, and you are not normalizing at all. A value like 0.7 is somewhere in between full normalization and no normalization, and alpha is another hyperparameter that you can tune to try to get the best results.
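The effect of alpha can be checked on two made-up candidates. This is a minimal sketch: `normalized_score` is a hypothetical helper, and the per-word probabilities are invented for illustration.

```python
import math


def normalized_score(word_probs, alpha=0.7):
    """Sum of log probabilities divided by Ty ** alpha, where Ty is the number of words."""
    ty = len(word_probs)
    return sum(math.log(p) for p in word_probs) / (ty ** alpha)


short = [0.5, 0.5]        # 2 words, each with probability 0.5
longer = [0.6] * 5        # 5 words, each with probability 0.6

for alpha in (0.0, 0.7, 1.0):
    s = normalized_score(short, alpha)
    l = normalized_score(longer, alpha)
    winner = "short" if s > l else "longer"
    print(f"alpha={alpha}: short={s:.3f} longer={l:.3f} -> prefers {winner}")
```

With alpha equal to 0 the short candidate wins purely because it has fewer terms. With alpha equal to 0.7 or 1 the longer candidate, whose individual words are more likely, comes out ahead.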