Thursday, 3 November 2011

Databases: Multivalued Dependencies

I don't know about anyone else, but multivalued dependencies took me quite a while to get my head around. I think I've got there now, but I thought working through an example might help. This is all new for me so it is possible that I haven't understood correctly, but here is my understanding.

Example

[Image credits: BerndH and Beezhive, Wikipedia Commons]
I have a relation Garden which has attributes location (which part of the garden), plant (which plant is growing there) and colour (colour of the plant). That is, Garden(location, plant, colour). I'll do several examples of different data to illustrate different possible functional and multivalued dependencies.

For the first part, let's look at the colour of the flowers. I always plant mixed colours when they exist, so whenever I plant narcissus, I get some white and some yellow. That means that whenever I have narcissus in a location, I should have an entry in the relation that says they are yellow and another that says they are white. This means I have the multivalued dependency

plant ->> location

since whenever I have narcissus at a location, I have both white and yellow ones there, and whenever I have bluebells, I always have blue and white ones there. (In this relation, plant ->> location is equivalent to plant ->> colour, because colour is exactly the set of remaining attributes; this is the complementation rule for multivalued dependencies.)

location  plant      colour
front     narcissus  yellow
front     narcissus  white
back      narcissus  yellow
back      narcissus  white
left      narcissus  yellow
left      narcissus  white
right     bluebells  blue
right     bluebells  white

[Image credit: MichaelMaggs, Wikipedia Commons]
Suppose I plant some bluebells in the front. As I know that I have the multivalued dependency plant ->> location, to satisfy it I have to plant both white and blue bluebells. If I only planted blue ones, the dependency would no longer hold, because I have to have every combination of the unmentioned attribute colour: if the combination (bluebell, white) occurs in one location, it has to occur everywhere bluebell occurs. I don't have yellow bluebells since they don't occur anywhere in the table, so yellow doesn't suddenly get attached to bluebell just because it is the colour of some other flower. Basically, I just copy and paste the two lines for bluebells with only the location changed.

location  plant      colour
front     narcissus  yellow
front     narcissus  white
front     bluebells  blue
front     bluebells  white
back      narcissus  yellow
back      narcissus  white
left      narcissus  yellow
left      narcissus  white
right     bluebells  blue
right     bluebells  white

Now I'm going to add in another property of the flowers: are they scented or not? Let's suppose we just check the flowers in the front and find that the yellow narcissus are scented, but only some of the white ones are scented and some aren't. With the information so far, can we say whether our multivalued dependency still holds? We don't have unscented yellow narcissus but we do have unscented white ones, so not every combination of white/yellow and scented/unscented is present. This is not a problem: the multivalued dependency only tells us that if a certain combination of the unmentioned attributes occurs with a particular plant, then that combination must occur with that plant in every location where the plant appears.

For our example, we know that in the front we have scented yellow narcissus, and white narcissus that are both scented and unscented. Since we also have narcissus in the back, they must occur there too as yellow and scented; white and scented; white and unscented. We also have narcissus on the left, so all three combinations must be present there as well. Note how the table has got bigger and how easy it would be to miss a combination somewhere.

location  plant      colour  scent
front     narcissus  yellow  scented
front     narcissus  white   scented
front     narcissus  white   unscented
front     bluebells  blue    unscented
front     bluebells  white   unscented
back      narcissus  yellow  scented
back      narcissus  white   scented
back      narcissus  white   unscented
left      narcissus  yellow  scented
left      narcissus  white   scented
left      narcissus  white   unscented
right     bluebells  blue    unscented
right     bluebells  white   unscented

How could we better represent this data without this horrible explosion every time we just want to add something like scented/unscented? We can use the multivalued dependency plant ->> location to decompose the relation Garden(location, plant, colour, scent) into 4th Normal Form. We take the attributes mentioned in the dependency and put them into their own relation Where(plant, location), and then remove the right-hand side (location) from Garden to form Flowers(plant, colour, scent).

Where
location  plant
front     narcissus
front     bluebells
back      narcissus
left      narcissus
right     bluebells

Flowers
plant      colour  scent
narcissus  yellow  scented
narcissus  white   scented
narcissus  white   unscented
bluebells  blue    unscented
bluebells  white   unscented

If we take the natural join of Flowers with Where, we get back exactly the tuples in Garden.
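If you want to check this mechanically, here is a minimal Python sketch (the table contents are just the ones above, and the set comprehension is simply my own quick way of doing a natural join) that rebuilds Garden by joining Where and Flowers on plant.

```python
# Check that the 4NF decomposition loses no information:
# natural-joining Where and Flowers on plant should rebuild Garden.

where = {
    ("front", "narcissus"), ("front", "bluebells"),
    ("back", "narcissus"), ("left", "narcissus"),
    ("right", "bluebells"),
}

flowers = {
    ("narcissus", "yellow", "scented"),
    ("narcissus", "white", "scented"),
    ("narcissus", "white", "unscented"),
    ("bluebells", "blue", "unscented"),
    ("bluebells", "white", "unscented"),
}

# Natural join on the shared attribute, plant.
garden = {
    (location, plant, colour, scent)
    for (location, plant) in where
    for (p, colour, scent) in flowers
    if p == plant
}

print(len(garden))   # 13 tuples, the same rows as the big Garden table
for row in sorted(garden):
    print(row)
```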

Thursday, 27 October 2011

Logistic Regression Cost Function

I'm going to calculate the cost function for two different lines. The first, shown in green, does not correctly classify all the points as some are on the wrong side of the line. The blue line classifies the points correctly but is probably not the best possible line.
The cost function for logistic regression is

J(θ) = (1/m) Σ [ -y log(h(x)) - (1-y) log(1-h(x)) ]

where the sum is taken over the m training examples. Note that when y=1, the part inside the square brackets, which is Cost(h(x),y), is (omitting superscripts),

Cost(h(x),y)=-ylog(h(x)) - (1-y)log(1-h(x))
= -1*log(h(x)) - (1-1)log(1-h(x))  
= -log(h(x))  

and when y=0,

Cost(h(x),y)=-ylog(h(x)) - (1-y)log(1-h(x))
=  -0*log(h(x)) - (1-0)log(1-h(x)) 
=  -log(1-h(x)). 
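As a tiny sketch, here is that per-example cost written as a Python function (the function name is just my own label); it can be used to reproduce any single entry in the tables below.

```python
import math

def cost(h, y):
    """Per-example logistic cost: -y*log(h) - (1-y)*log(1-h).
    For y = 1 this reduces to -log(h); for y = 0 it is -log(1-h)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

print(cost(0.8, 1))   # -log(0.8) ≈ 0.2231
print(cost(0.8, 0))   # -log(0.2) ≈ 1.6094
```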

I'll work out Cost(h(x),y) in the table and then finish off underneath.  For the green line, θ0=-3, θ1=-1 and θ2=1. Why are the values like this and not θ0=3, θ1=1 and θ2=-1?

When we predict y, we use the value of h(x). When h(x) ≥ 0.5, we predict y=1 and when h(x) < 0.5 we predict y=0. In the graph below, which is of the sigmoid function, the red curve is g, so it shows h(x) plotted against z = θTx (note that the horizontal axis is z, not x). Now h(x) = g(θTx) ≥ 0.5 exactly when z = θTx ≥ 0.

From wiki: http://en.wikipedia.org/wiki/File:Logistic-curve.svg
Going back to our line, I wanted y=1 above the line and y=0 below it. I chose it that way; I could have chosen it the other way too, as it's a pretty terrible line! This means we need the coefficients to be such that θTx ≥ 0 where we want y=1, that is, above the green line. Checking, you find that -3 - x1 + x2 ≥ 0 works. (You can verify this by putting in the point (2,5), which is above the line, and checking that it satisfies the inequality. If it doesn't, switch the signs of all the coefficients to make it work!)
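Here is a small Python check of that sign argument (a sketch using the green line's θ values and the point (2,5) mentioned above): it evaluates z = θTx, applies the sigmoid, and uses the 0.5 threshold.

```python
import math

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

theta = [-3.0, -1.0, 1.0]   # theta0, theta1, theta2 for the green line
x = [1.0, 2.0, 5.0]         # x0 = 1, then the point (x1, x2) = (2, 5)

z = sum(t * xi for t, xi in zip(theta, x))   # theta^T x = -3 - 2 + 5 = 0
h = sigmoid(z)                               # g(0) = 0.5

print(z, h, 1 if h >= 0.5 else 0)   # 0.0 0.5 1 -> predict y = 1 above the line
```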

Please note that log in the calculations is natural log, sometimes written ln.

x1  x2  y  θTx = θ0 + θ1x1 + θ2x2 = -3 - x1 + x2  h(x) = g(θTx)             Cost(h(x),y) = -y log(h(x)) - (1-y) log(1-h(x))
1   2   0  -3 - 1*1 + 1*2 = -2                    1/(1+e^2) = 0.119202922   -log(1 - 0.119202922) = 0.126928011
2   3   0  -2                                     0.119202922               0.126928011
2   4   0  -1                                     0.2689414214              0.3132616875
3   5   0  -1                                     1/(1+e^1) = 0.2689414214  0.3132616875
1   4   0   0                                     1/(1+e^0) = 0.5           0.6931471806
5   4   1  -4                                     0.01798621                4.0181499279
5   6   1  -2                                     0.119202922               -log(0.119202922) = 2.126928011
4   6   1  -1                                     0.2689414214              1.3132616875
5   7   1  -1                                     0.2689414214              1.3132616875
3   6   1   0                                     0.5                       0.6931471806

Σ Cost(h(x),y) = 11.0382750722

The largest values of Cost(h(x),y) correspond to the points on the wrong side of the line. Now that we have the sum of the costs over the training data, we can calculate J. As there are 10 training examples, m = 10, so our cost is

J(θ) = (1/m) Σ Cost(h(x),y) = (1/10) * 11.0382750722 ≈ 1.1038.

For the blue line 0 = -29 + 4x1 + 3x2, we want points above the line to give y=1, so we have to write it so that -29 + 4x1 + 3x2 ≥ 0 is true above the line, which it is. This gives us θ0=-29, θ1=4 and θ2=3.

x1  x2  y  θTx = θ0 + θ1x1 + θ2x2 = -29 + 4x1 + 3x2  h(x) = g(θTx)             Cost(h(x),y) = -y log(h(x)) - (1-y) log(1-h(x))
1   2   0  -29 + 4*1 + 3*2 = -19                     1/(1+e^19) ≈ 5.6028E-09   5.6028E-09
2   3   0  -12                                       6.14417E-06               6.14419E-06
2   4   0  -9                                        0.0001233946              0.0001234022
3   5   0  -2                                        0.119202922               0.126928011
1   4   0  -13                                       2.26032E-06               2.26033E-06
5   4   1   3                                        0.9525741268              0.0485873516
5   6   1   9                                        0.9998766054              0.0001234022
4   6   1   5                                        0.9933071491              0.0067153485
5   7   1  12                                        0.9999938558              6.14419E-06
3   6   1   1                                        0.7310585786              0.3132616875

Σ Cost(h(x),y) = 0.4957537573

If you look at the column marked h(x), you can see that for all the training examples where y=0, we have small values of h, that is h(x) < 0.5, and for all examples where y=1 we have large values, that is h(x) ≥ 0.5. This is what we wanted, since we predict y=1 when h(x) ≥ 0.5 and y=0 when h(x) < 0.5. The accuracy of the predictions is reflected in the low sum of costs.

The total cost function for this h(x) is J(θ)= 1/10 * 0.4957537573  ≈ 0.0496.
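If you'd rather not fill in the tables by hand, here is a minimal Python sketch (the data points and θ values are exactly the ones used above) that computes J(θ) for both lines; it should reproduce roughly 1.1038 and 0.0496.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_J(theta, data):
    """J(theta) = (1/m) * sum of per-example logistic costs."""
    m = len(data)
    total = 0.0
    for x1, x2, y in data:
        h = sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)
        total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    return total / m

# Training points (x1, x2, y) from the tables above.
data = [(1, 2, 0), (2, 3, 0), (2, 4, 0), (3, 5, 0), (1, 4, 0),
        (5, 4, 1), (5, 6, 1), (4, 6, 1), (5, 7, 1), (3, 6, 1)]

print(cost_J([-3, -1, 1], data))    # green line, ≈ 1.1038
print(cost_J([-29, 4, 3], data))    # blue line,  ≈ 0.0496
```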

Friday, 14 October 2011

What is the hypothesis function for?

Question: If I already have house sizes and prices for the x and y axes, then why do I need the theta values, and how do I work them out? Is it a trial-and-error rule? The answer might help me to write a practice example in a programming language.

1. You have some data about how much (y) some houses sold for and their size (x). These are pairs (x,y). It could be the sizes and values of all the properties sold in London last year.

 2. You want to predict the price of another house when you only know its size x. Maybe your friend has a house to sell in London and wants to know what he is likely to get for it. To do this you need an equation linking x and y. This equation is called the hypothesis hθ(x).

 3. To find hθ(x) = θ0 + θ1x you need to find θ0 and θ1. There are two methods to do this: gradient descent and the normal equation. They are dealt with in more detail in the most recent set of lectures, Lectures IV. Gradient descent is a type of trial-and-improvement method; the normal equation is a direct calculation.

 4. Once you've found θ0 and θ1, you have hθ(x) and can use it to predict the value of y for your new value of x; that is, you can tell your friend how much he's likely to get for his property in London. (There's a small sketch of this below.)

 (Just as an aside: Try to stick to lower case for the x and y because later on X is used for a matrix and it will get confusing.)
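As a rough illustration of steps 3 and 4, here is a minimal Python sketch. The house sizes and prices are made-up numbers, and it uses the normal equation (the direct calculation mentioned in step 3) rather than gradient descent, simply because it is shorter to write.

```python
import numpy as np

# Made-up training data: size in square feet (x) and sale price in $1000s (y).
sizes  = np.array([1000.0, 1500.0, 2000.0, 2500.0])
prices = np.array([150.0, 200.0, 260.0, 300.0])

# Build the design matrix with a column of ones so theta0 is the intercept.
X = np.column_stack([np.ones_like(sizes), sizes])

# Normal equation: theta = (X^T X)^-1 X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ prices
theta0, theta1 = theta

# Step 4: use h(x) = theta0 + theta1 * x to predict a new house's price.
new_size = 1800.0
predicted = theta0 + theta1 * new_size
print(theta0, theta1, predicted)
```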


Additional comments



Dan Scott - The theta values are the parameters of the straight-line equation that best estimates the values of y given x: from the straight-line formula y = a * x + b, our two theta values correspond to b and a respectively. When we start we don't have a good estimate (we just start from (0, 0)). The purpose of gradient descent / linear regression is to use the training data to derive the best line that fits the data.

 Anuj Shah - If you were to plot the data on an x-y plane, you'd see that there are multiple possible lines that can be drawn through the data, none of them perfect (unless the data all fits on the same straight line). What you're looking for is the mathematically most appropriate line to draw: the one that minimizes the cost function.

Thursday, 13 October 2011

Introduction of New Notation for Multiple Features

Yesterday, someone mentioned some confusion over the notation hθ(x), as x used to be a single variable and is now a vector. I'll go through the general notation and then relate it back to the case where hθ(x) = θ0 + θ1x.

The Hypothesis


When we just had one input variable x (size of house) and output y (cost of house) we had hypothesis

hθ(x) = θ0 + θ1x.

This generalises to

hθ(x1, x2, x3, ..., xn) = θ0 + θ1x1 +  θ2x2 + θ3x3 + ... + θnxn

This notation is rather cumbersome, so we use a more convenient vector form. In order to do this, we put in an additional x0, which we set to one (so that θ0 = θ0*1 = θ0x0).

hθ(x0, x1, x2, x3, ..., xn) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + ... + θnxn


Putting in this extra x0 enables us to use the following notation.

I will use bold font to indicate vectors, so x is the vector (x0, x1, ..., xn) above. Using this notation, and writing θ as a row vector by taking its transpose, θT = (θ0, θ1, θ2, θ3, ..., θn), we can rewrite the hypothesis as hθ(x) = θTx. Note that this x is the vector x.

Relating this to the Single Variable


For the single variable hθ(x) = θ0 + θ1x, we rewrite it as hθ(x0, x1) = θ0x0 + θ1x1 = θTx, that is,

hθ(x) = θTx

where θT = (θ0, θ1) and x = (x0, x1)T is the column vector with entries x0 and x1.
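As a small illustration of the vector form, here is a Python sketch (the θ and x values are made up) showing that once you prepend x0 = 1, the hypothesis is a single dot product θTx.

```python
import numpy as np

theta = np.array([2.0, 0.5, -1.0])   # theta0, theta1, theta2 (made-up values)
features = np.array([3.0, 4.0])      # the original inputs x1, x2

x = np.concatenate(([1.0], features))  # prepend x0 = 1

# h_theta(x) = theta^T x = theta0*1 + theta1*x1 + theta2*x2
h = theta @ x
print(h)   # 2.0 + 0.5*3.0 - 1.0*4.0 = -0.5
```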

Wednesday, 12 October 2011

Drawing the Hypothesis Function and Predicting the Outcome

Question: Once you've minimised the cost function J(θ0, θ1), how do you predict the house prices?


When you've done all the calculations and minimised J(θ0, θ1) over all values of θ0 and θ1, those minimising values of θ0 and θ1 give you the best possible line hθ(x) = θ0 + θ1x through the data. This line can be used to predict the outcome y for some new input x. For the house price example, x is the size of the house and y is the amount it is sold for. The hypothesis hθ(x) predicts what the price will be given a particular size of house.

I'll use the same values as I used in the previous post.

Visually


Visually, we can draw a graph of  y = hθ(x) and then read the answer off the graph. I put the values from the previous post into an applet (linked to by Antony Flores) to obtain  hθ(x).  Note that it gives y = 0.92x + 0.93 which is a different order from the way we're writing  hθ(x) = 0.93 + 0.92x. (It doesn't make any difference, but could lead to confusion over which is θ0 and which is θ1.)

To Draw the Line


There are various different approaches to drawing the line, y = 0.93 + 0.92x. The value given by θ0, here 0.93, is called the y-intercept. It's the point where the  line crosses the y-axis. Plot that point. Next work out another point on the line, say when x= 10, ( y = 0.93 + 0.92*10 = 10.13) and plot that too.  Draw a line through the two points. Ideally, you should draw a third point for error  checking purposes. (If the three points aren't in a straight line, you've made a mistake plotting or calculating one of the points.)


Reading the Answer from the Graph



Once you've got your line on the graph, you can read off the predicted y-value given an x-value. Suppose we want to know what y is when x=6. We draw a line (or visualise it) going vertically up from the x-axis at x=6 up to the line y = 0.93 + 0.92x. When it hits the line, head off horizontally towards the y-axis. Where you meet the y-axis, you can read off the predicted y-value. From my graph, it looks like it's somewhere around 6.5, as the scale is too small (and the line way too thick) to be accurate.

Inaccuracy


The problem with this method is that it is difficult to read accurately. The scale of my graph isn't good enough to get an accurate prediction. A better option is to calculate the value using the function  hθ(x) = 0.93 + 0.92x.

Calculating the Value of y


Instead of trying to read off the value from the graph, we can calculate the predicted value of y from the hypothesis hθ(x) = 0.93 + 0.92x. To do this, we input the value of x we are interested in, in this case, x=6.

hθ(6) = 0.93 + 0.92*6 = 0.93+5.52 = 6.45. The predicted value of y when x=6 is y=6.45.


Examples


House prices (in $1000) are predicted by the hypothesis hθ(x) = 50 + 0.1x where x is the area in square feet.

What is the predicted price of a house which is
(a) 1000 square feet,
(b) 1250 square feet,
(c) 1500 square feet?

Answers: purple on purple background
(a)  hθ(1000) = 50 + 0.1*1000 = $150000
(b) hθ(1250) = 50 + 0.1*1250 = $175000
(c) hθ(1500) = 50 + 0.1*1500 = $200000
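If you'd rather compute these than read them off a graph, here is a tiny Python sketch of the same hypothesis (the function name predict_price is just my own label).

```python
def predict_price(square_feet):
    """h_theta(x) = 50 + 0.1*x, giving the price in $1000s."""
    return 50 + 0.1 * square_feet

for size in (1000, 1250, 1500):
    print(size, "sq ft ->", predict_price(size) * 1000, "dollars")
# 1000 sq ft -> 150000.0, 1250 -> 175000.0, 1500 -> 200000.0
```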

Tuesday, 11 October 2011

Machine Learning: Working out the Cost Function

I'll give an example, and then work out the cost function for three different pairs of values for  θ0 and θ1.
The cost function J(θ0, θ1) is used to measure how good a fit a line is to the data. If it's a good fit, then it's going to give you better predictions. The line we're trying to make as good a fit as possible is hθ(x) = θ0 + θ1x. The idea is to minimise the value of J. (I'm not going to talk about how to minimise it here, just how to calculate it from given values of θ0 and θ1.)

Training examples

x  y
2  1
2  4
5  4
5  8
9  8
9  11

Below I'll work out the cost function for various values of theta0 and theta1. The values are just chosen as illustrations and have not been chosen in any particular way.

Working out the Cost Function



First, we'll take θ0=0 and θ1=0. This gives us h(x)=0, which we draw on the graph as the horizontal line, y=0 along the x-axis. The crosses are the points given by the training examples.

x  y   h(x)=0  h(x)-y       (h(x) - y)^2
2  1   0       0-1 = -1     1
2  4   0       0-4 = -4     16
5  4   0       0-4 = -4     16
5  8   0       0-8 = -8     64
9  8   0       0-8 = -8     64
9  11  0       0-11 = -11   121

Sum of (h(x) - y)^2 = 282

Now the cost function is J(θ0, θ1) = 1/(2m) Σ(h(x) - y)^2. Here m = the number of values in the training set = 6, and Σ(h(x) - y)^2 just means the sum of all the values in the squared column above. Putting the values in gives


J(0, 0)=1/(2*6) (282) = 282/12=23.5.


We want to minimise J so let's try some other values of θ0 and θ1. Next, we'll try θ0=0 and θ1=1. This gives h(x)=x, which is the line y=x.

x  y   h(x)=x  h(x)-y       (h(x) - y)^2
2  1   2       2-1 = 1      1
2  4   2       2-4 = -2     4
5  4   5       5-4 = 1      1
5  8   5       5-8 = -3     9
9  8   9       9-8 = 1      1
9  11  9       9-11 = -2    4

Sum of (h(x) - y)^2 = 20

So J(0, 1) = 1/(2*6) (20) = 20/12 = 1.67 (2 decimal places). This is better (lower) than the answer we had for the previous values of θ0 and θ1, which is what we want, as we are trying to minimise the cost function.

Finally, we'll do θ0=1 and θ1=1. This gives h(x)=x+1 which is the line y=1+x on the graph.

x  y   h(x)=1+x  h(x)-y        (h(x) - y)^2
2  1   3         3-1 = 2       4
2  4   3         3-4 = -1      1
5  4   6         6-4 = 2       4
5  8   6         6-8 = -2      4
9  8   10        10-8 = 2      4
9  11  10        10-11 = -1    1

Sum of (h(x) - y)^2 = 18

So J(1, 1)=1/(2*6) (18) = 18/12 = 1.5, which again is an improvement.
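To check these numbers programmatically, here is a minimal Python sketch (using the same six training examples) that computes J(θ0, θ1) for the three pairs of values tried above.

```python
# Training examples (x, y) from the table above.
data = [(2, 1), (2, 4), (5, 4), (5, 8), (9, 8), (9, 11)]

def J(theta0, theta1):
    """Squared-error cost J = 1/(2m) * sum of (h(x) - y)^2 with h(x) = theta0 + theta1*x."""
    m = len(data)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in data) / (2 * m)

print(J(0, 0))   # 23.5
print(J(0, 1))   # ≈ 1.67
print(J(1, 1))   # 1.5
```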