Sequence Model: Part 1

Machine Learning: Sequence Model Part 1

Sequence modelling is the task of predicting what comes next in a sequence. In sequence modelling, the current output depends on the previous inputs, and the length of the input is not fixed.

As I said earlier, the input length can vary. In this image, the speech recognition inputs have different lengths, and so do the sentiment classification inputs.

Now let's say we have to find the names of people in the sentence "Harry Potter and Hermione Granger invented a new spell". This is a named-entity recognition problem, and it is very useful for finding names in a sentence: people's names, company names, location names, etc.

So what will be the output for the given sequence? Marking each word with 1 if it is part of a name and 0 otherwise, we get:

Y: 1 1 0 1 1 0 0 0 0

That is the output for the given X. So how do we represent X so that the model can produce this output and we can easily find the names? For this we need a vocabulary (or dictionary) that contains all the words we care about. To keep the complexity down, we take a vocabulary of 10,000 words and build a one-hot vector (10,000 x 1) for each word in the sentence, so we end up with 9 one-hot vectors, each of dimension 10,000.

The image above shows the one-hot representation of the words we have to predict from. A question that always arises is: what if a given word is not in the vocabulary we are using? To handle this, we use a special <UNK> (unknown word) token in place of that word, and the problem is solved.
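
As a rough illustration of the one-hot representation and the <UNK> fallback, here is a minimal sketch in Python with NumPy. The tiny vocabulary, the lowercase matching, and the helper name `one_hot` are assumptions made for the example; a real vocabulary would have about 10,000 entries.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration); a real one would have ~10,000 words.
vocab = ["<UNK>", "a", "and", "granger", "harry", "invented", "new", "potter", "spell", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    # Words missing from the vocabulary fall back to the <UNK> index.
    index = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vector = np.zeros(len(vocab))
    vector[index] = 1.0
    return vector

sentence = "Harry Potter and Hermione Granger invented a new spell".split()
X = np.stack([one_hot(w) for w in sentence])  # shape: (9 words, vocabulary size)
```

Here "Hermione" is deliberately left out of the toy vocabulary, so its row in `X` is the one-hot vector for <UNK>.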

To relate X and Y we use a recurrent neural network. Before getting into it, let's first discuss:

Why can't we use a standard model?

The problems with using a standard model are:

Inputs and outputs can have different lengths in different examples
It doesn't share features learned across different positions of the text
The input layer can be very large, which adds complexity

What is a Recurrent Neural Network?

Recurrent Neural Networks (RNN) are very effective for Natural Language Processing and other sequence tasks because they have "memory". They can read inputs x^{⟨t⟩} (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a unidirectional RNN to take information from the past to process later inputs. A bidirectional RNN can take context from both the past and the future.

In the image above we define a one-hot vector for every word, together with the parameters (Waa, Wax, Wya), which are shared across all time steps. Here Tx = Ty, so the input and output lengths are equal.

In the image, the rightmost diagram is also correct: it is the same network drawn in rolled-up form, with a loop that carries the activation forward by one time step, so it computes the same thing as the full unrolled diagram.

Now the problem with this RNN is that it only uses information from earlier in the sequence to make a prediction. In particular, when predicting y3 it doesn't use information about the words x4, x5, x6 and so on. Let's understand this with an example: "Teddy Roosevelt was a great President". Here Teddy is a name, but from the first word alone the network cannot tell it apart from the toy in "Teddy bear". This can be very frustrating, so to solve it we use a BRNN (bidirectional recurrent neural network).

Now let's talk about forward propagation to make predictions:

Above are the equations that we are going to use to calculate forward propagation.
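
A forward pass with the parameters Waa, Wax and Wya can be sketched in NumPy as follows. This assumes the commonly used formulation a<t> = tanh(Waa a<t-1> + Wax x<t> + ba) and ŷ<t> = softmax(Wya a<t> + by); the bias names `ba` and `by`, the choice of activations, and all sizes are assumptions for illustration rather than something fixed by this post.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the rows of a column vector."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_forward(x_seq, a0, Waa, Wax, Wya, ba, by):
    """Run a basic RNN over a list of column vectors, one per time step."""
    a_prev, outputs = a0, []
    for x_t in x_seq:
        # a<t> = tanh(Waa a<t-1> + Wax x<t> + ba)
        a_prev = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
        # y_hat<t> = softmax(Wya a<t> + by)
        outputs.append(softmax(Wya @ a_prev + by))
    return outputs, a_prev

rng = np.random.default_rng(0)
n_x, n_a, n_y, T_x = 10, 5, 2, 9  # illustrative sizes only
x_seq = [rng.standard_normal((n_x, 1)) for _ in range(T_x)]
y_hat, a_last = rnn_forward(
    x_seq,
    a0=np.zeros((n_a, 1)),
    Waa=rng.standard_normal((n_a, n_a)) * 0.1,
    Wax=rng.standard_normal((n_a, n_x)) * 0.1,
    Wya=rng.standard_normal((n_y, n_a)) * 0.1,
    ba=np.zeros((n_a, 1)),
    by=np.zeros((n_y, 1)),
)
```

Note that the same three weight matrices are reused at every time step; only the activation `a_prev` changes as the sequence is read.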

Backpropagation Through Time:

The goal of the backpropagation training algorithm is to modify the weights of a neural network in order to minimize the error of the network outputs compared to some expected output in response to corresponding inputs. Now how do we calculate backpropagation? Let's see in the image.

We calculate a loss function to measure the error between the predicted ŷ and the original y, so that it tells us how accurate our model is.
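
One common choice, assumed here for the 0/1 name labels of the earlier example, is the cross-entropy loss at each time step, summed over the whole sequence:

```latex
\mathcal{L}^{\langle t \rangle}\bigl(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}\bigr)
  = -\,y^{\langle t \rangle}\log \hat{y}^{\langle t \rangle}
    - \bigl(1 - y^{\langle t \rangle}\bigr)\log\bigl(1 - \hat{y}^{\langle t \rangle}\bigr),
\qquad
\mathcal{L} = \sum_{t=1}^{T_y} \mathcal{L}^{\langle t \rangle}
```

The total loss is what gets differentiated with respect to the shared weights during backpropagation.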

Backpropagation Through Time, or BPTT, is the application of the backpropagation training algorithm to a recurrent neural network applied to sequence data such as a time series.

Conceptually, BPTT works by unrolling all input timesteps. Each timestep has one input timestep, one copy of the network, and one output. Errors are then calculated and accumulated for each timestep. The network is rolled back up and the weights are updated.

Now we will define the different types of RNN:

It is not necessary that Tx and Ty have equal length every time, for example in sentiment analysis,

where Tx --> number of words in the movie review

Ty --> length of the output (a single value, such as a 5-star rating)

To solve these types of problems we use different types of network, such as many-to-one, one-to-many, and one-to-one.

An example of one-to-many is music generation.

An example of many-to-one is sentiment classification.

Examples of many-to-many are named-entity recognition (which we covered previously) and machine translation.

Language Modelling:

Let's say you're building a speech recognition system and you hear the sentence

"the apple and pear salad was delicious"

So what did you just hear me say? Did I say:

"the apple and pair salad"

or did I say

"the apple and pear salad".

You probably think the second sentence is much more likely, and in fact that's what a good speech recognition system would output, even though these two sentences sound exactly the same. The way a speech recognition system picks the second sentence is by using a language model, which tells it the probability of each of these two sentences.

For example, a language model might say that the chance of the first sentence is 3.2 x 10^-13 and the chance of the second sentence is 5.7 x 10^-10. With these probabilities, the second sentence is much more likely.

So what a language model does is, given any sentence, its job is to tell you the probability of that particular sentence. And by probability of a sentence I mean: if you pick up a random newspaper, open a random email, pick a random webpage, or listen to the next thing a friend of yours says, what is the chance that the next sentence you encounter somewhere out there in the world will be a particular sentence like "the apple and pear salad"?

Modelling of the Language Model

Now what we do is tokenize the words in the sentence, meaning we map each word to a one-hot vector, and put an <EOS> (end of sentence) token at the end so that we know where the sentence ends.

Now the RNN model for language modelling is:

What we see in this model is that, given the previous words (for example the first two or three), it outputs the probability of each possible next word.

A loss function is defined to measure how far these predicted probabilities are from the words that actually come next.

And P(y1, y2, y3) is the probability of these three words occurring together in that order.
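
The joint probability can be built from the per-step outputs with the chain rule, P(y1, y2, y3) = P(y1) · P(y2 | y1) · P(y3 | y1, y2). The short sketch below shows the arithmetic; the three conditional probabilities are invented numbers, not outputs of a real model.

```python
import math

# Hypothetical conditional probabilities from a language model
# for the first three words of a sentence (values invented for illustration).
p_y1 = 0.02               # P(y1)
p_y2_given_y1 = 0.10      # P(y2 | y1)
p_y3_given_y1_y2 = 0.30   # P(y3 | y1, y2)

# Chain rule: P(y1, y2, y3) = P(y1) * P(y2 | y1) * P(y3 | y1, y2)
p_sentence = p_y1 * p_y2_given_y1 * p_y3_given_y1_y2

# Summing logs is the numerically safer way to combine many small probabilities.
log_p_sentence = (
    math.log(p_y1) + math.log(p_y2_given_y1) + math.log(p_y3_given_y1_y2)
)
```

For long sentences the product becomes extremely small, which is why the log form is usually preferred in practice.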

Sampling Novel Sequences:

This means we take the output for the first word, say "the" (ŷ^1), and feed it in as the input at the second time step (x^2 = ŷ^1), so that we get the probability P(__ | "the") of each word coming next given that the first word is "the".
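
The sampling loop can be sketched like this. The function name `sample_sequence`, the five-word vocabulary, and the uniform stand-in for a trained RNN are all assumptions made so the example runs on its own; a real model would return different probabilities at each step.

```python
import numpy as np

def sample_sequence(next_word_probs, vocab, eos="<EOS>", max_len=20, seed=0):
    """Sample words one at a time, feeding each sampled word back as the next input."""
    rng = np.random.default_rng(seed)
    words = []
    while len(words) < max_len:
        probs = next_word_probs(words)                 # P(next word | words so far)
        word = vocab[rng.choice(len(vocab), p=probs)]  # draw one word at random
        if word == eos:
            break                                      # stop at end of sentence
        words.append(word)                             # x<t+1> = sampled y<t>
    return words

# Stand-in for a trained RNN (an assumption): a fixed uniform distribution.
vocab = ["the", "apple", "pear", "salad", "<EOS>"]
uniform_model = lambda words: np.full(len(vocab), 1.0 / len(vocab))
sample = sample_sequence(uniform_model, vocab)
```

The `max_len` cap is there only to guarantee the loop ends even if <EOS> is never drawn.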

Now we have one more model, character-level modelling. In character-level modelling we use a vocabulary of characters rather than words, so that we get rid of the unknown word <UNK> problem.

But the problem is that it takes a lot of space and the computation is more expensive, because a word of 10 characters ends up as 10 one-hot vectors instead of one.
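
To make the trade-off concrete, here is a small sketch of character-level tokenization. The character vocabulary below (lowercase letters, space, and a few punctuation marks) is an assumption chosen for the example.

```python
import string

# Character vocabulary (an assumption): lowercase letters, space, and some punctuation.
char_vocab = list(string.ascii_lowercase + " .,'")
char_to_index = {ch: i for i, ch in enumerate(char_vocab)}

word = "delicious"
# One index (and therefore one one-hot vector) per character.
char_indices = [char_to_index[ch] for ch in word]

# A word-level model sees 1 token here; a character-level model sees len(word) tokens.
tokens_word_level = 1
tokens_char_level = len(char_indices)
```

The vocabulary is tiny and never produces <UNK> for ordinary text, but every sequence becomes several times longer.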

Sequence Generation:

Vanishing Gradients Problem:

Let's take a sentence

"the cat which already ate the butter chicken or something else was full"

and take one more example

"the cats which already ate the butter chicken or something else were full"

In the first sentence "was" goes with "cat", and in the second "were" goes with "cats"; that's ordinary grammar. But a basic RNN tends to fail here, because in practice its prediction is influenced mostly by the last 3 to 4 words. So it may not give the best result, and the cause is the vanishing gradient problem: the error signal from "was"/"were" shrinks as it is propagated back through many time steps, so the network fails to capture long-range dependencies.

We have already seen a related problem in neural networks: during backpropagation the gradients can get so large that the computation overflows and the weights become "NaN". This is called the exploding gradients problem, where the gradients blow up to very high values, and the solution is "gradient clipping".

But gradient clipping does not fix vanishing gradients in a sequence model. The vanishing gradient makes the gradient very close to zero, so it's difficult to know where to move in the state space; the exploding gradient makes the gradient a very large value, so it makes learning unstable. This problem is more pronounced in recurrent networks since they use the same matrix at each time step.
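
A minimal sketch of element-wise gradient clipping, together with a toy illustration of why clipping does nothing for vanishing gradients. The helper name `clip_gradients`, the gradient name `dWaa`, the threshold, and the factor 0.5 per step are assumptions for the example.

```python
import numpy as np

def clip_gradients(gradients, max_value):
    """Clip every gradient element-wise to the range [-max_value, max_value]."""
    return {name: np.clip(grad, -max_value, max_value) for name, grad in gradients.items()}

# Exploding case: clipping caps the very large values and leaves small ones alone.
grads = {"dWaa": np.array([[250.0, -0.3], [1e6, 0.02]])}
clipped = clip_gradients(grads, max_value=5.0)

# Vanishing case: a signal scaled by 0.5 at each of 50 time steps is almost zero,
# and clipping cannot make it any larger.
vanished = 0.5 ** 50
```

Clipping by the overall gradient norm is another common variant; the element-wise form is shown only because it is the shortest to write down.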

To solve the vanishing gradient problem we use the LSTM and the GRU.

Gated Recurrent Unit (GRU):

Now we are given the sentence

"the cat which already ate the butter chicken or something else was full"

Now we define the variable

c = memory cell

which keeps track of whether "cat" is singular or plural, so that the rest of the sentence can proceed accordingly.

The key part of the GRU is this equation which is that we have come up with a candidate where we're thinking of updating c using c tilde, and then the gate will decide whether or not we actually update it.

And so the way to think about it is maybe this memory cell c is going to be set to either zero or one depending on whether the word you are considering, really the subject of the sentence is singular or plural. So because it's singular, let's say that we set this to one. And if it was plural, maybe we would set this to zero, and then the GRU unit would memorize the value of the c<t> all the way until here, where this is still equal to one and so that tells it, oh, it's singular so use the choice was.

And the job of the gate, of gamma u, is to decide when you update these values. In particular, when you see the phrase "the cat", you know that you're talking about a new concept, specifically the subject of the sentence, cat. So that would be a good time to update this bit, and then maybe when you're done using it, "the cat blah blah blah was full", then you know, okay, I don't need to memorize it anymore, I can just forget that.

So the specific equation we'll use for the GRU is the following. Which is that the actual value of c<t> will be equal to this gate times the candidate value plus one minus the gate times the old value, c<t> minus one.

The full GRU adds one more gate, gamma r. You can think of r as standing for relevance. This gate gamma r tells you how relevant c<t> minus one is to computing the next candidate for c<t>. And this gate gamma r is computed pretty much as you'd expect, with a new parameter matrix Wr applied to the same inputs, c<t> minus one and x<t>, plus br. As you can imagine, there are multiple ways to design these types of neural networks. So why do we have gamma r?

Why not use the simpler version described above? It turns out that over many years researchers have experimented with many different possible versions of how to design these units, to try to have longer-range connections and longer-range effects, and also to address vanishing gradient problems. And the GRU is one of the most commonly used versions that researchers have converged to and found robust and useful for many different problems.
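
The GRU equations described above can be sketched as one time step in NumPy. The candidate is assumed to be c̃<t> = tanh(Wc [gamma_r * c<t-1>, x<t>] + bc), and both gates are assumed to be sigmoids of the stacked vector [c<t-1>, x<t>]; the parameter names (`Wu`, `Wc`, `bu`, `bc`) and the sizes are illustrative choices, not taken from this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wu, bu, Wr, br, Wc, bc):
    """One GRU time step; each W acts on the stacked vector [c<t-1>, x<t>]."""
    stacked = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(Wu @ stacked + bu)   # update gate
    gamma_r = sigmoid(Wr @ stacked + br)   # relevance gate
    # Candidate value, computed from the relevance-scaled old memory and the input.
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)
    # c<t> = gamma_u * c_tilde + (1 - gamma_u) * c<t-1>
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

rng = np.random.default_rng(0)
n_c, n_x = 4, 3  # illustrative sizes only
W = lambda: rng.standard_normal((n_c, n_c + n_x)) * 0.1
b = lambda: np.zeros((n_c, 1))
c_prev = rng.standard_normal((n_c, 1))
x_t = rng.standard_normal((n_x, 1))
c_t = gru_step(c_prev, x_t, W(), b(), W(), b(), W(), b())

# With a very negative update-gate bias the gate is ~0, so the memory is carried over unchanged.
c_kept = gru_step(c_prev, x_t, W(), np.full((n_c, 1), -50.0), W(), b(), W(), b())
```

The last line mirrors the "cat ... was" example: while the update gate stays near zero, the stored value survives across many time steps, which is what helps against vanishing gradients.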

LSTM (Long Short-Term Memory):

The LSTM is more powerful than the GRU. It has three gates instead of two, one of which is a forget gate.

And one new property of the LSTM is that, instead of having one update gate control both of these terms, we're going to have two separate gates. So instead of gamma_u and one minus gamma_u, we're going to use gamma_u only once, for the candidate term.

And for the old value we use a forget gate, which we're going to call gamma_f. This gate, gamma_f, is going to be sigmoid of pretty much what you'd expect: W_f applied to a_t minus one and x_t, plus b_f.

And then we're going to have a new output gate, gamma_o, which is sigmoid of W_o applied to, again, pretty much what you'd expect, a_t minus one and x_t, plus b_o.

And then the update to the memory cell will be c_t equals gamma_u * c tilde of t, where the asterisk denotes element-wise multiplication (a vector-vector element-wise multiplication), plus, instead of one minus gamma_u, a separate forget gate, gamma_f, times c_t minus one. So this gives the memory cell the option of keeping the old value c_t minus one and then just adding to it this new value, c tilde of t. So we use separate update and forget gates.

So u, f, and o stand for the update, forget, and output gates. And then finally, instead of a_t equals c_t, we have a_t equal to the output gate element-wise multiplied by c_t. So these are the equations that govern the LSTM, and you can tell it has three gates instead of two.
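
Put together, one LSTM time step might look like the sketch below. The candidate c̃<t> = tanh(Wc [a<t-1>, x<t>] + bc) and the parameter dictionary layout are assumptions for illustration; the update of c_t and a_t follows the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, params):
    """One LSTM time step; each W acts on the stacked vector [a<t-1>, x<t>]."""
    stacked = np.vstack([a_prev, x_t])
    c_tilde = np.tanh(params["Wc"] @ stacked + params["bc"])  # candidate memory
    gamma_u = sigmoid(params["Wu"] @ stacked + params["bu"])  # update gate
    gamma_f = sigmoid(params["Wf"] @ stacked + params["bf"])  # forget gate
    gamma_o = sigmoid(params["Wo"] @ stacked + params["bo"])  # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev                # element-wise products
    # Follows the text above; many formulations use gamma_o * np.tanh(c_t) instead.
    a_t = gamma_o * c_t
    return a_t, c_t

rng = np.random.default_rng(0)
n_a, n_x = 4, 3  # illustrative sizes only
params = {}
for gate in "cufo":
    params["W" + gate] = rng.standard_normal((n_a, n_a + n_x)) * 0.1
    params["b" + gate] = np.zeros((n_a, 1))
a_t, c_t = lstm_step(
    np.zeros((n_a, 1)), np.ones((n_a, 1)), rng.standard_normal((n_x, 1)), params
)
```

Because the forget gate and the update gate are independent, the cell can keep the old memory and add new information at the same time, which the GRU's single update gate cannot do.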

There are also several variations on the LSTM. Perhaps the most common one is that, instead of having the gate values depend only on a_t minus one and x_t, sometimes people also sneak in the value c_t minus one as well. This is called a peephole connection. Not a great name maybe, but you'll see it: peephole connection. What that means is that the gate values may depend not just on a_t minus one and x_t, but also on the previous memory cell value, and the peephole connection can go into all three of these gates' computations. So that's one common variation you see of LSTMs.

One technical detail is that these are, say, 100-dimensional vectors: if you have a 100-dimensional hidden memory cell, so are the gates. And the, say, fifth element of c_t minus one affects only the fifth element of the corresponding gates, so that relationship is one-to-one: not every element of the 100-dimensional c_t minus one can affect all elements of the gates. Instead, the first element of c_t minus one affects the first element of the gates, the second element affects the second element, and so on.

We have already discussed above what a BRNN does. Its blocks can be LSTM or GRU blocks; people generally use LSTM blocks because they are more powerful than GRU blocks.