Thursday, 3 November 2011

Databases: Multivalued Dependencies

I don't know about anyone else, but multivalued dependencies took me quite a while to get my head around. I think I've got there now, but I thought working through an example might help. This is all new for me so it is possible that I haven't understood correctly, but here is my understanding.

Example

[Image credits: BerndH and Beezhive, Wikipedia Commons]
I have a relation Garden which has attributes location (which part of the garden), plant (which plant is growing there) and colour (colour of the plant). That is, Garden(location, plant, colour). I'll do several examples of different data to illustrate different possible functional and multivalued dependencies.

For the first part, let's look at the colour of the flowers. I always plant mixed colours when they exist, so whenever I plant narcissus, I get some white and some yellow. That means that whenever I have narcissus in a location, I should have an entry in the relation that says they are yellow and another that says they are white. This means I have the multivalued dependency

plant ->> location

since whenever I have narcissus at a location, I have both white and yellow ones there, and whenever I have bluebells, I always have blue and white ones there. (In this relation, plant ->> location is equivalent to plant ->> colour, because colour is exactly the set of remaining attributes; this is the complementation rule for multivalued dependencies.)

location  plant      colour
front     narcissus  yellow
front     narcissus  white
back      narcissus  yellow
back      narcissus  white
left      narcissus  yellow
left      narcissus  white
right     bluebells  blue
right     bluebells  white

[Image credit: MichaelMaggs, Wikipedia Commons]
Suppose I plant some bluebells in the front. As I know that I have the multivalued dependency plant ->> location, to satisfy it I have to plant both white and blue bluebells. If I only planted blue ones, the dependency would no longer hold, because I have to have every combination of the unmentioned attribute colour: if the combination (bluebell, white) occurs in one location, it has to occur everywhere bluebell occurs. I don't have yellow bluebells since they don't occur anywhere in the table, so yellow doesn't suddenly get attached to bluebell just because it is the colour of some other flower. Basically, I just copy and paste the two lines for bluebells with only the location changed.

location  plant      colour
front     narcissus  yellow
front     narcissus  white
front     bluebells  blue
front     bluebells  white
back      narcissus  yellow
back      narcissus  white
left      narcissus  yellow
left      narcissus  white
right     bluebells  blue
right     bluebells  white

Now I'm going to add in another property of the flowers: are they scented or not? Let's suppose we just check the flowers in the front and find that the yellow narcissus are scented, but only some of the white ones are scented and some aren't. With the information so far, can we say whether our multivalued dependency still holds? We don't have unscented yellow narcissus but we do have unscented white ones, so not every combination of white/yellow and scented/unscented is present. This is not a problem: the multivalued dependency only tells us that if a certain combination of the unmentioned attributes occurs with a particular plant, then that combination must occur with that plant in every location where the plant appears.

For our example, we know that in the front we have scented yellow narcissus, and white narcissus that are both scented and unscented. Since we also have narcissus in the back, they must occur there too as yellow and scented; white and scented; white and unscented. We also have narcissus on the left, so all three combinations must be present there as well. Note how the table has got bigger and how easy it would be to miss a combination somewhere.

location  plant      colour  scent
front     narcissus  yellow  scented
front     narcissus  white   scented
front     narcissus  white   unscented
front     bluebells  blue    unscented
front     bluebells  white   unscented
back      narcissus  yellow  scented
back      narcissus  white   scented
back      narcissus  white   unscented
left      narcissus  yellow  scented
left      narcissus  white   scented
left      narcissus  white   unscented
right     bluebells  blue    unscented
right     bluebells  white   unscented

How could we better represent this data without this horrible explosion every time we just want to add something like scented/unscented? We can use the multivalued dependency plant ->> location to decompose the relation Garden(location, plant, colour, scent) into 4th Normal Form. We take the attributes mentioned in the dependency and put them into their own relation Where(plant, location), and then remove the right-hand side (location) from Garden to form Flowers(plant, colour, scent).

Where
location  plant
front     narcissus
front     bluebells
back      narcissus
left      narcissus
right     bluebells

Flowers
plant      colour  scent
narcissus  yellow  scented
narcissus  white   scented
narcissus  white   unscented
bluebells  blue    unscented
bluebells  white   unscented

If we take the natural join of Flowers with Where, we get back exactly the tuples in Garden.
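If you want to check this mechanically, here is a minimal Python sketch (the table contents are just the ones above, and the set comprehension is simply my own quick way of doing a natural join) that rebuilds Garden by joining Where and Flowers on plant.

```python
# Check that the 4NF decomposition loses no information:
# natural-joining Where and Flowers on plant should rebuild Garden.

where = {
    ("front", "narcissus"), ("front", "bluebells"),
    ("back", "narcissus"), ("left", "narcissus"),
    ("right", "bluebells"),
}

flowers = {
    ("narcissus", "yellow", "scented"),
    ("narcissus", "white", "scented"),
    ("narcissus", "white", "unscented"),
    ("bluebells", "blue", "unscented"),
    ("bluebells", "white", "unscented"),
}

# Natural join on the shared attribute, plant.
garden = {
    (location, plant, colour, scent)
    for (location, plant) in where
    for (p, colour, scent) in flowers
    if p == plant
}

print(len(garden))   # 13 tuples, the same rows as the big Garden table
for row in sorted(garden):
    print(row)
```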

Thursday, 27 October 2011

Logistic Regression Cost Function

I'm going to calculate the cost function for two different lines. The first, shown in green, does not correctly classify all the points as some are on the wrong side of the line. The blue line classifies the points correctly but is probably not the best possible line.
The cost function for logistic regression is

J(θ) = (1/m) Σ [ -y log(h(x)) - (1-y) log(1-h(x)) ]

where the sum is taken over the m training examples. Note that when y=1, the part inside the square brackets, which is Cost(h(x),y), is (omitting superscripts),

Cost(h(x),y)=-ylog(h(x)) - (1-y)log(1-h(x))
= -1*log(h(x)) - (1-1)log(1-h(x))  
= -log(h(x))  

and when y=0,

Cost(h(x),y)=-ylog(h(x)) - (1-y)log(1-h(x))
=  -0*log(h(x)) - (1-0)log(1-h(x)) 
=  -log(1-h(x)). 
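As a tiny sketch, here is that per-example cost written as a Python function (the function name is just my own label); it can be used to reproduce any single entry in the tables below.

```python
import math

def cost(h, y):
    """Per-example logistic cost: -y*log(h) - (1-y)*log(1-h).
    For y = 1 this reduces to -log(h); for y = 0 it is -log(1-h)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

print(cost(0.8, 1))   # -log(0.8) ≈ 0.2231
print(cost(0.8, 0))   # -log(0.2) ≈ 1.6094
```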

I'll work out Cost(h(x),y) in the table and then finish off underneath.  For the green line, θ0=-3, θ1=-1 and θ2=1. Why are the values like this and not θ0=3, θ1=1 and θ2=-1?

When we predict y, we use the value of h(x). When h(x) ≥ 0.5, we predict y=1 and when h(x) < 0.5 we predict y=0. In the graph below, which is of the sigmoid function, the red curve is g, so it shows h(x) plotted against z = θTx (note that the horizontal axis is z, not x). Now h(x) = g(θTx) ≥ 0.5 exactly when z = θTx ≥ 0.

From wiki: http://en.wikipedia.org/wiki/File:Logistic-curve.svg
Going back to our line, I wanted y=1 above the line and y=0 below it. I chose it that way; I could have chosen it the other way too, as it's a pretty terrible line! This means we need the coefficients to be such that θTx ≥ 0 where we want y=1, that is, above the green line. Checking, you find that -3 - x1 + x2 ≥ 0 works. (You can verify this by putting in the point (2,5), which is above the line, and checking that it satisfies the inequality. If it doesn't, switch the signs of all the coefficients to make it work!)
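Here is a small Python check of that sign argument (a sketch using the green line's θ values and the point (2,5) mentioned above): it evaluates z = θTx, applies the sigmoid, and uses the 0.5 threshold.

```python
import math

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

theta = [-3.0, -1.0, 1.0]   # theta0, theta1, theta2 for the green line
x = [1.0, 2.0, 5.0]         # x0 = 1, then the point (x1, x2) = (2, 5)

z = sum(t * xi for t, xi in zip(theta, x))   # theta^T x = -3 - 2 + 5 = 0
h = sigmoid(z)                               # g(0) = 0.5

print(z, h, 1 if h >= 0.5 else 0)   # 0.0 0.5 1 -> predict y = 1 above the line
```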

Please note that log in the calculations is natural log, sometimes written ln.

x1  x2  y  θTx = θ0 + θ1x1 + θ2x2 = -3 - x1 + x2  h(x) = g(θTx)             Cost(h(x),y) = -y log(h(x)) - (1-y) log(1-h(x))
1   2   0  -3 - 1*1 + 1*2 = -2                    1/(1+e^2) = 0.119202922   -log(1 - 0.119202922) = 0.126928011
2   3   0  -2                                     0.119202922               0.126928011
2   4   0  -1                                     0.2689414214              0.3132616875
3   5   0  -1                                     1/(1+e^1) = 0.2689414214  0.3132616875
1   4   0   0                                     1/(1+e^0) = 0.5           0.6931471806
5   4   1  -4                                     0.01798621                4.0181499279
5   6   1  -2                                     0.119202922               -log(0.119202922) = 2.126928011
4   6   1  -1                                     0.2689414214              1.3132616875
5   7   1  -1                                     0.2689414214              1.3132616875
3   6   1   0                                     0.5                       0.6931471806

Σ Cost(h(x),y) = 11.0382750722

The largest values of Cost(h(x),y) correspond to the points on the wrong side of the line. Now that we have the sum of the costs over the training data, we can calculate J. As there are 10 training examples, m = 10, so our cost is

J(θ) = (1/m) Σ Cost(h(x),y) = (1/10) * 11.0382750722 ≈ 1.1038.

For the blue line 0 = -29 + 4x1 + 3x2, we want points above the line to give y=1, so we have to write it so that -29 + 4x1 + 3x2 ≥ 0 is true above the line, which it is. This gives us θ0=-29, θ1=4 and θ2=3.

x1  x2  y  θTx = θ0 + θ1x1 + θ2x2 = -29 + 4x1 + 3x2  h(x) = g(θTx)             Cost(h(x),y) = -y log(h(x)) - (1-y) log(1-h(x))
1   2   0  -29 + 4*1 + 3*2 = -19                     1/(1+e^19) ≈ 5.6028E-09   5.6028E-09
2   3   0  -12                                       6.14417E-06               6.14419E-06
2   4   0  -9                                        0.0001233946              0.0001234022
3   5   0  -2                                        0.119202922               0.126928011
1   4   0  -13                                       2.26032E-06               2.26033E-06
5   4   1   3                                        0.9525741268              0.0485873516
5   6   1   9                                        0.9998766054              0.0001234022
4   6   1   5                                        0.9933071491              0.0067153485
5   7   1  12                                        0.9999938558              6.14419E-06
3   6   1   1                                        0.7310585786              0.3132616875

Σ Cost(h(x),y) = 0.4957537573

If you look at the column marked h(x), you can see that for all the training examples where y=0, we have small values of h, that is h(x) < 0.5, and for all examples where y=1 we have large values, that is h(x) ≥ 0.5. This is what we wanted, since we predict y=1 when h(x) ≥ 0.5 and y=0 when h(x) < 0.5. The accuracy of the predictions is reflected in the low sum of costs.

The total cost function for this h(x) is J(θ)= 1/10 * 0.4957537573  ≈ 0.0496.
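If you'd rather not fill in the tables by hand, here is a minimal Python sketch (the data points and θ values are exactly the ones used above) that computes J(θ) for both lines; it should reproduce roughly 1.1038 and 0.0496.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_J(theta, data):
    """J(theta) = (1/m) * sum of per-example logistic costs."""
    m = len(data)
    total = 0.0
    for x1, x2, y in data:
        h = sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)
        total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    return total / m

# Training points (x1, x2, y) from the tables above.
data = [(1, 2, 0), (2, 3, 0), (2, 4, 0), (3, 5, 0), (1, 4, 0),
        (5, 4, 1), (5, 6, 1), (4, 6, 1), (5, 7, 1), (3, 6, 1)]

print(cost_J([-3, -1, 1], data))    # green line, ≈ 1.1038
print(cost_J([-29, 4, 3], data))    # blue line,  ≈ 0.0496
```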

Friday, 14 October 2011

What is the hypothesis function for?

Question: If I already have house sizes and prices for the x and y axes, then why do I need the theta values, and how do I work them out? Is it a trial-and-error rule? The answer might help me to write a practice example in a programming language.

1. You have some data about how much (y) some houses sold for and their size (x). These are pairs (x,y). It could be the sizes and values of all the properties sold in London last year.

 2. You want to predict the price of another house when you only know its size x. Maybe your friend has a house to sell in London and wants to know what he is likely to get for it. To do this you need an equation linking x and y. This equation is called the hypothesis hθ(x).

 3. To find hθ(x) = θ0 + θ1x you need to find θ0 and θ1. There are two methods to do this: gradient descent and the normal equation. They are dealt with in more detail in the most recent set of lectures, Lectures IV. Gradient descent is a type of trial-and-improvement method; the normal equation is a direct calculation.

 4. Once you've found θ0 and θ1, you have hθ(x) and can use it to predict the value of y for your new value of x; that is, you can tell your friend how much he's likely to get for his property in London. (There's a small sketch of this below.)

 (Just as an aside: Try to stick to lower case for the x and y because later on X is used for a matrix and it will get confusing.)
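As a rough illustration of steps 3 and 4, here is a minimal Python sketch. The house sizes and prices are made-up numbers, and it uses the normal equation (the direct calculation mentioned in step 3) rather than gradient descent, simply because it is shorter to write.

```python
import numpy as np

# Made-up training data: size in square feet (x) and sale price in $1000s (y).
sizes  = np.array([1000.0, 1500.0, 2000.0, 2500.0])
prices = np.array([150.0, 200.0, 260.0, 300.0])

# Build the design matrix with a column of ones so theta0 is the intercept.
X = np.column_stack([np.ones_like(sizes), sizes])

# Normal equation: theta = (X^T X)^-1 X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ prices
theta0, theta1 = theta

# Step 4: use h(x) = theta0 + theta1 * x to predict a new house's price.
new_size = 1800.0
predicted = theta0 + theta1 * new_size
print(theta0, theta1, predicted)
```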


Additional comments



Dan Scott - The theta values are the parameters of the straight-line equation that best estimates the values of y given x: from the straight-line formula y = a * x + b, our two theta values correspond to b and a respectively. When we start we don't have a good estimate (we just start from (0, 0)). The purpose of gradient descent / linear regression is to use the training data to derive the best line that fits the data.

 Anuj Shah - If you were to plot the data on an x-y plane, you'd see that there are multiple possible lines that can be drawn through the data, none of them perfect (unless the data all fits on the same straight line). What you're looking for is the mathematically most appropriate line to draw: the one that minimizes the cost function.

Thursday, 13 October 2011

Introduction of New Notation for Multiple Features

Yesterday, someone mentioned some confusion over the notation hθ(x), as x used to be a single variable and is now a vector. I'll go through the general notation and then relate it back to the case where hθ(x) = θ0 + θ1x.

The Hypothesis


When we just had one input variable x (size of house) and output y (cost of house) we had hypothesis

hθ(x) = θ0 + θ1x.

This generalises to

hθ(x1, x2, x3, ..., xn) = θ0 + θ1x1 +  θ2x2 + θ3x3 + ... + θnxn

This notation is rather cumbersome, so we use a more convenient vector form. In order to do this, we put in an additional x0, which we set to one (so that θ0 = θ0*1 = θ0x0).

hθ(x0, x1, x2, x3, ..., xn) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + ... + θnxn


Putting in this extra x0 enables us to use the following notation.

I will use bold font to indicate vectors, so x is the vector (x0, x1, ..., xn) above. Using this notation, and writing θ as a row vector by taking its transpose, θT = (θ0, θ1, θ2, θ3, ..., θn), we can rewrite the hypothesis as hθ(x) = θTx. Note that this x is the vector x.

Relating this to the Single Variable


For the single variable hθ(x) = θ0 + θ1x, we rewrite it as hθ(x0, x1) = θ0x0 + θ1x1 = θTx, that is,

hθ(x) = θTx

where θT = (θ0, θ1) and x = (x0, x1)T is the column vector with entries x0 and x1.
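As a small illustration of the vector form, here is a Python sketch (the θ and x values are made up) showing that once you prepend x0 = 1, the hypothesis is a single dot product θTx.

```python
import numpy as np

theta = np.array([2.0, 0.5, -1.0])   # theta0, theta1, theta2 (made-up values)
features = np.array([3.0, 4.0])      # the original inputs x1, x2

x = np.concatenate(([1.0], features))  # prepend x0 = 1

# h_theta(x) = theta^T x = theta0*1 + theta1*x1 + theta2*x2
h = theta @ x
print(h)   # 2.0 + 0.5*3.0 - 1.0*4.0 = -0.5
```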

Wednesday, 12 October 2011

Drawing the Hypothesis Function and Predicting the Outcome

Question: Once you've minimised the cost function J(θ0, θ1), how do you predict the house prices?


When you've done all the calculations and minimised J(θ0, θ1) over all values of θ0 and θ1, those minimising values of θ0 and θ1 give you the best possible line hθ(x) = θ0 + θ1x through the data. This line can be used to predict the outcome y for some new input x. For the house price example, x is the size of the house and y is the amount it is sold for. The hypothesis hθ(x) predicts what the price will be given a particular size of house.

I'll use the same values as I used in the previous post.

Visually


Visually, we can draw a graph of  y = hθ(x) and then read the answer off the graph. I put the values from the previous post into an applet (linked to by Antony Flores) to obtain  hθ(x).  Note that it gives y = 0.92x + 0.93 which is a different order from the way we're writing  hθ(x) = 0.93 + 0.92x. (It doesn't make any difference, but could lead to confusion over which is θ0 and which is θ1.)

To Draw the Line


There are various different approaches to drawing the line, y = 0.93 + 0.92x. The value given by θ0, here 0.93, is called the y-intercept. It's the point where the  line crosses the y-axis. Plot that point. Next work out another point on the line, say when x= 10, ( y = 0.93 + 0.92*10 = 10.13) and plot that too.  Draw a line through the two points. Ideally, you should draw a third point for error  checking purposes. (If the three points aren't in a straight line, you've made a mistake plotting or calculating one of the points.)


Reading the Answer from the Graph



Once you've got your line on the graph, you can read off the predicted y-value given an x-value. Suppose we want to know what y is when x=6. We draw a line (or visualise it) going vertically up from the x-axis at x=6 up to the line y = 0.93 + 0.92x. When it hits the line, head off horizontally towards the y-axis. Where you meet the y-axis, you can read off the predicted y-value. From my graph, it looks like it's somewhere around 6.5, as the scale is too small (and the line way too thick) to be accurate.

Inaccuracy


The problem with this method is that it is difficult to read accurately. The scale of my graph isn't good enough to get an accurate prediction. A better option is to calculate the value using the function  hθ(x) = 0.93 + 0.92x.

Calculating the Value of y


Instead of trying to read off the value from the graph, we can calculate the predicted value of y from the hypothesis hθ(x) = 0.93 + 0.92x. To do this, we input the value of x we are interested in, in this case, x=6.

hθ(6) = 0.93 + 0.92*6 = 0.93+5.52 = 6.45. The predicted value of y when x=6 is y=6.45.


Examples


House prices (in $1000) are predicted by the hypothesis hθ(x) = 50 + 0.1x where x is the area in square feet.

What is the predicted price of a house which is
(a) 1000 square feet,
(b) 1250 square feet,
(c) 1500 square feet?

Answers: purple on purple background
(a)  hθ(1000) = 50 + 0.1*1000 = $150000
(b) hθ(1250) = 50 + 0.1*1250 = $175000
(c) hθ(1500) = 50 + 0.1*1500 = $200000
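If you'd rather compute these than read them off a graph, here is a tiny Python sketch of the same hypothesis (the function name predict_price is just my own label).

```python
def predict_price(square_feet):
    """h_theta(x) = 50 + 0.1*x, giving the price in $1000s."""
    return 50 + 0.1 * square_feet

for size in (1000, 1250, 1500):
    print(size, "sq ft ->", predict_price(size) * 1000, "dollars")
# 1000 sq ft -> 150000.0, 1250 -> 175000.0, 1500 -> 200000.0
```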

Tuesday, 11 October 2011

Machine Learning: Working out the Cost Function

I'll give an example, and then work out the cost function for three different pairs of values for  θ0 and θ1.
The cost function J(θ0, θ1) is used to measure how good a fit a line is to the data. If it's a good fit, then it's going to give you better predictions. The line we're trying to make as good a fit as possible is hθ(x) = θ0 + θ1x. The idea is to minimise the value of J. (I'm not going to talk about how to minimise it here, just how to calculate it from given values of θ0 and θ1.)

Training examples

x  y
2  1
2  4
5  4
5  8
9  8
9  11

Below I'll work out the cost function for various values of theta0 and theta1. The values are just chosen as illustrations and have not been chosen in any particular way.

Working out the Cost Function



First, we'll take θ0=0 and θ1=0. This gives us h(x)=0, which we draw on the graph as the horizontal line, y=0 along the x-axis. The crosses are the points given by the training examples.

x  y   h(x)=0  h(x)-y       (h(x) - y)^2
2  1   0       0-1 = -1     1
2  4   0       0-4 = -4     16
5  4   0       0-4 = -4     16
5  8   0       0-8 = -8     64
9  8   0       0-8 = -8     64
9  11  0       0-11 = -11   121

Sum of (h(x) - y)^2 = 282

Now the cost function is J(θ0, θ1) = 1/(2m) Σ(h(x) - y)^2. Here m = the number of values in the training set = 6, and Σ(h(x) - y)^2 just means the sum of all the values in the squared column above. Putting the values in gives


J(0, 0)=1/(2*6) (282) = 282/12=23.5.


We want to minimise J so let's try some other values of θ0 and θ1. Next, we'll try θ0=0 and θ1=1. This gives h(x)=x, which is the line y=x.

x  y   h(x)=x  h(x)-y       (h(x) - y)^2
2  1   2       2-1 = 1      1
2  4   2       2-4 = -2     4
5  4   5       5-4 = 1      1
5  8   5       5-8 = -3     9
9  8   9       9-8 = 1      1
9  11  9       9-11 = -2    4

Sum of (h(x) - y)^2 = 20

So J(0, 1) = 1/(2*6) (20) = 20/12 = 1.67 (2 decimal places). This is better (lower) than the answer we had for the previous values of θ0 and θ1, which is what we want, as we are trying to minimise the cost function.

Finally, we'll do θ0=1 and θ1=1. This gives h(x)=x+1 which is the line y=1+x on the graph.

x  y   h(x)=1+x  h(x)-y        (h(x) - y)^2
2  1   3         3-1 = 2       4
2  4   3         3-4 = -1      1
5  4   6         6-4 = 2       4
5  8   6         6-8 = -2      4
9  8   10        10-8 = 2      4
9  11  10        10-11 = -1    1

Sum of (h(x) - y)^2 = 18

So J(1, 1)=1/(2*6) (18) = 18/12 = 1.5, which again is an improvement.
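To check these numbers programmatically, here is a minimal Python sketch (using the same six training examples) that computes J(θ0, θ1) for the three pairs of values tried above.

```python
# Training examples (x, y) from the table above.
data = [(2, 1), (2, 4), (5, 4), (5, 8), (9, 8), (9, 11)]

def J(theta0, theta1):
    """Squared-error cost J = 1/(2m) * sum of (h(x) - y)^2 with h(x) = theta0 + theta1*x."""
    m = len(data)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in data) / (2 * m)

print(J(0, 0))   # 23.5
print(J(0, 1))   # ≈ 1.67
print(J(1, 1))   # 1.5
```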