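The cost function for logistic regression is
J(θ) = (1/m) Σ [ -y(i) log(h(x(i))) - (1-y(i)) log(1-h(x(i))) ], where the sum runs over the m training examples i = 1, ..., m.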
Note that when y=1, the part inside the square brackets, which is Cost(h(x),y), is (omitting superscripts)
Cost(h(x),y) = -y log(h(x)) - (1-y) log(1-h(x))
= -1*log(h(x)) - (1-1) log(1-h(x))
= -log(h(x))
and when y=0,
Cost(h(x),y) = -y log(h(x)) - (1-y) log(1-h(x))
= -0*log(h(x)) - (1-0) log(1-h(x))
= -log(1-h(x)).
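The two cases above can be checked numerically. The following is a minimal sketch, assuming a hypothetical helper named `cost` (not from the original text) that takes an already computed value of h(x) and the label y:

```python
import math

def cost(h, y):
    """Per-example cost: -y*log(h) - (1-y)*log(1-h), with natural log."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

h = 0.25  # an arbitrary illustrative value of h(x), chosen for this sketch

# When y = 1 only the -log(h) term survives.
print(math.isclose(cost(h, 1), -math.log(h)))      # True

# When y = 0 only the -log(1 - h) term survives.
print(math.isclose(cost(h, 0), -math.log(1 - h)))  # True
```

The value 0.25 is only a placeholder; any h strictly between 0 and 1 behaves the same way.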
I'll work out Cost(h(x),y) in the table and then finish off underneath. For the green line, θ0=-3, θ1=-1 and θ2=1. Why are the values like this and not θ0=3, θ1=1 and θ2=-1?
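When we predict y, we use the value of h(x). When h(x) ≥ 0.5, we predict y=1, and when h(x) < 0.5, we predict y=0. In the graph below, which is of the sigmoid function, the red line is the function h(x). The horizontal axis is z=θTx (note that the axis is not x). Now h(x) = g(θTx) ≥ 0.5 exactly when z=θTx ≥ 0. Both sets of values describe the same line, but the signs decide which side of the line has θTx ≥ 0 and is therefore predicted as y=1. With θ0=-3, θ1=-1 and θ2=1, θTx ≥ 0 holds when x2 ≥ x1+3, so points above the line are predicted as y=1.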
From wiki: http://en.wikipedia.org/wiki/File:Logistic-curve.svg
Please note that log in the calculations is natural log, sometimes written ln.
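The decision rule described above can be sketched in a few lines. This is an illustrative sketch only: the function names `g` and `predict` are assumptions made here, and θ is passed as a tuple (θ0, θ1, θ2).

```python
import math

def g(z):
    """Sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x1, x2):
    """Predict y=1 when h(x) = g(θTx) >= 0.5, otherwise y=0."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if g(z) >= 0.5 else 0

print(g(0))                       # 0.5, the threshold sits at z = 0
print(g(-2) < 0.5, g(2) >= 0.5)   # True True

green = (-3, -1, 1)               # the green line from the text
print(predict(green, 1, 4))       # z = 0, so h = 0.5 and we predict 1
print(predict(green, 5, 4))       # z = -4, so h < 0.5 and we predict 0
```

Note that `math.log` and `math.exp` in Python use base e, which matches the natural log used in the calculations.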
x1 | x2 | y | θTx = θ0+θ1x1+θ2x2 = -3-x1+x2 | h(x)=g(θTx) | Cost(h(x),y) = -ylog(h(x)) - (1-y)log(1-h(x)) |
---|---|---|---|---|---|
1 | 2 | 0 | -3-1*1+1*2=-2 | 1/(1+e^2) = 0.119202922 | -log(1-0.119202922) = 0.126928011 |
2 | 3 | 0 | -2 | 0.119202922 | 0.126928011 |
2 | 4 | 0 | -1 | 0.2689414214 | 0.3132616875 |
3 | 5 | 0 | -1 | 1/(1+e^1) = 0.2689414214 | 0.3132616875 |
1 | 4 | 0 | 0 | 1/(1+e^0) = 0.5 | 0.6931471806 |
5 | 4 | 1 | -4 | 0.01798621 | 4.0181499279 |
5 | 6 | 1 | -2 | 0.119202922 | - log(0.119202922) = 2.126928011 |
4 | 6 | 1 | -1 | 0.2689414214 | 1.3132616875 |
5 | 7 | 1 | -1 | 0.2689414214 | 1.3132616875 |
3 | 6 | 1 | 0 | 0.5 | 0.6931471806 |
| | | | | ΣCost(h(x),y) | 11.0382750722 |
The values marked in red correspond to the points on the wrong side of the line. Now that we have the cost for each training example and their sum, we can calculate J. There are 10 training examples, so m=10 and our cost is
J(θ) = 1/10 * 11.0382750722 ≈ 1.1038.
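The table and the final division can be reproduced with a short script. This is a sketch under the assumption that the training data are exactly the ten (x1, x2, y) rows of the table; the names `data`, `g`, `cost` and `J` are chosen here for illustration.

```python
import math

# Training examples (x1, x2, y), copied from the table above.
data = [(1, 2, 0), (2, 3, 0), (2, 4, 0), (3, 5, 0), (1, 4, 0),
        (5, 4, 1), (5, 6, 1), (4, 6, 1), (5, 7, 1), (3, 6, 1)]

def g(z):
    """Sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))

def cost(h, y):
    """Per-example cost with natural log."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def J(theta, examples):
    """Average of the per-example costs over all m examples."""
    total = 0.0
    for x1, x2, y in examples:
        z = theta[0] + theta[1] * x1 + theta[2] * x2
        total += cost(g(z), y)
    return total / len(examples)

green = (-3, -1, 1)  # θ0, θ1, θ2 for the green line
print(round(J(green, data), 4))  # 1.1038
```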
For the line 0 = -29 + 4x1 + 3x2, we want the region above the line to give y=1, so we have to write the equation so that -29 + 4x1 + 3x2 ≥ 0 is true above the line, which it is. This gives us θ0=-29, θ1=4 and θ2=3.
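To see why the signs matter, one can evaluate θTx at a point above the line with both sign choices. This is a small sketch; the helper `z` and the test point (5, 7), taken from the training data, are choices made here for illustration.

```python
def z(theta, x1, x2):
    """θTx = θ0 + θ1*x1 + θ2*x2."""
    return theta[0] + theta[1] * x1 + theta[2] * x2

theta = (-29, 4, 3)
flipped = tuple(-t for t in theta)  # (29, -4, -3): same line, opposite signs

point_above = (5, 7)  # a training point lying above the line

print(z(theta, *point_above))         # 12
print(z(theta, *point_above) >= 0)    # True, so this side is predicted y=1
print(z(flipped, *point_above) >= 0)  # False, the flipped signs would predict y=0
```

Both parameter sets draw the same boundary, but only the first puts θTx ≥ 0 on the side where we want y=1.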
x1 | x2 | y | θTx = θ0+θ1x1+θ2x2 = -29+4x1+3x2 | h(x)=g(θTx) | Cost(h(x),y) = -ylog(h(x)) - (1-y)log(1-h(x)) |
---|---|---|---|---|---|
1 | 2 | 0 | -29+4*1+3*2=-19 | 1/(1+e^19) = 5.60279640614594E-009 ≈ 0.0000000056 | 5.60279646470696E-009 ≈ 0.0000000056 |
2 | 3 | 0 | -12 | 6.14417460221472E-006 | 6.14419347772537E-006 |
2 | 4 | 0 | -9 | 0.0001233946 | 0.0001234022 |
3 | 5 | 0 | -2 | 0.119202922 | 0.126928011 |
1 | 4 | 0 | -13 | 2.26032429790357E-006 | 2.26032685249035E-006 |
5 | 4 | 1 | 3 | 0.9525741268 | 0.0485873516 |
5 | 6 | 1 | 9 | 0.9998766054 | 0.0001234022 |
4 | 6 | 1 | 5 | 0.9933071491 | 0.0067153485 |
5 | 7 | 1 | 12 | 0.9999938558 | 6.14419347772537E-006 |
3 | 6 | 1 | 1 | 0.7310585786 | 0.3132616875 |
| | | | | ΣCost(h(x),y) | 0.4957537573 |
If you look at the column marked h(x), you can see that for all the training examples where y=0 we have small values of h(x), that is h(x) < 0.5, and for all examples where y=1 we have large values of h(x), that is h(x) ≥ 0.5. This is what we wanted, since we predict y=1 when h(x) ≥ 0.5 and y=0 when h(x) < 0.5. The accuracy of the predictions is reflected in the low sum of costs.
The total cost for this h(x) is J(θ) = 1/10 * 0.4957537573 ≈ 0.0496, much lower than the 1.1038 obtained for the green line.
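Both observations, that every prediction matches its label and that the cost is low, can be checked with a short script. This is a sketch assuming the same ten training examples as in the tables; the names `data`, `h`, `predictions`, `labels` and `J_theta` are illustrative.

```python
import math

# Training examples (x1, x2, y), copied from the table above.
data = [(1, 2, 0), (2, 3, 0), (2, 4, 0), (3, 5, 0), (1, 4, 0),
        (5, 4, 1), (5, 6, 1), (4, 6, 1), (5, 7, 1), (3, 6, 1)]

theta = (-29, 4, 3)  # θ0, θ1, θ2 for the line 0 = -29 + 4x1 + 3x2

def h(theta, x1, x2):
    """Hypothesis h(x) = g(θTx) with the sigmoid g."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1.0 / (1.0 + math.exp(-z))

# Predict y=1 when h(x) >= 0.5, otherwise y=0.
predictions = [1 if h(theta, x1, x2) >= 0.5 else 0 for x1, x2, _ in data]
labels = [y for _, _, y in data]
print(predictions == labels)  # True, every example is on the correct side

# Average the per-example costs (natural log) to get J(θ).
costs = [-y * math.log(h(theta, x1, x2)) - (1 - y) * math.log(1 - h(theta, x1, x2))
         for x1, x2, y in data]
J_theta = sum(costs) / len(data)
print(round(J_theta, 4))  # 0.0496
```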