R-ACST8040

ACST8040 Quantitative Research Methods
Solution to Exercise 2

Question 1

(a) Calculate $Z_i = Y_i - X_i$ to obtain

$$(Z_1, \dots, Z_{10}) = (6, -18, 12, -3, -25, -18, 8, -15, -21, -12).$$

Their absolute values

$$(|Z_1|, \dots, |Z_{10}|) = (6, 18, 12, 3, 25, 18, 8, 15, 21, 12)$$

are ordered as

$$(|Z_4|, |Z_1|, |Z_7|, |Z_3|, |Z_{10}|, |Z_8|, |Z_2|, |Z_6|, |Z_9|, |Z_5|) = (3, 6, 8, 12, 12, 15, 18, 18, 21, 25).$$

Hence the ordered ranks are $(1, 2, 3, 4.5, 4.5, 6, 7.5, 7.5, 9, 10)$, with average ranks for ties. Thus the ranks of $(|Z_1|, \dots, |Z_{10}|)$ are

$$(R_1, \dots, R_{10}) = (2, 7.5, 4.5, 1, 10, 7.5, 3, 6, 9, 4.5).$$

Only $Z_1$, $Z_3$ and $Z_7$ are positive, so the observed value of the Wilcoxon signed rank test statistic is

$$T^+ = R_1 + R_3 + R_7 = 2 + 4.5 + 3 = 9.5.$$

(b) To assess the effect of the new measure in lowering the cost, we can test $H_0: \theta = 0$ against $H_1: \theta < 0$. By part (a), the exact p-value of the test based on the data is $\Pr(T^+ \le 9.5)$, conditional on ties. Count the number of outcomes such that $T^+ \le 9.5$ as follows:

$T^+ = 0, 1, 2, 3$ have $1 + 1 + 1 + 2 = 5$ outcomes, as in the case of no ties.

$4 \le T^+ \le 9.5$ have $1 \times 4 + 2 \times 4 + 4 \times 3 + 6 = 30$ outcomes, as listed below ("×2" marks an outcome that can be formed in two ways because of a tied rank):

T+     Outcomes                                   Number
4      (1,3)                                      1
4.5    (4.5) ×2                                   2
5      (2,3)                                      1
5.5    (1,4.5) ×2                                 2
6      (6), (1,2,3)                               2
6.5    (2,4.5) ×2                                 2
7      (1,6)                                      1
7.5    (7.5) ×2, (3,4.5) ×2, (1,2,4.5) ×2         6
8      (2,6)                                      1
8.5    (1,7.5) ×2, (1,3,4.5) ×2                   4
9      (9), (1,2,6), (3,6), (4.5,4.5)             4
9.5    (2,7.5) ×2, (2,3,4.5) ×2                   4

It follows that the exact p-value of the Wilcoxon signed rank test of $H_0: \theta = 0$ against $H_1: \theta < 0$ is

$$\Pr(T^+ \le 9.5) = \frac{5 + 30}{2^{10}} = \frac{35}{1024} = 0.03418 < 0.05.$$

Thus $H_0$ is rejected in favour of $H_1: \theta < 0$ at the 5% level of significance. This provides sufficient evidence that the new measure is effective in reducing the cost.
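The counting in parts (a) and (b) can be checked by brute force. The following is a minimal base-R sketch, not part of the original solution; it assumes the signed differences are those implied by $T^+ = R_1 + R_3 + R_7$ (positive entries $Z_1$, $Z_3$, $Z_7$), and the variable names are illustrative only.

```r
# Differences Z_i = Y_i - X_i (signs are an assumption reconstructed from part (a))
z <- c(6, -18, 12, -3, -25, -18, 8, -15, -21, -12)

r     <- rank(abs(z))        # midranks; "average" is rank()'s default for ties
t_obs <- sum(r[z > 0])       # observed signed rank statistic T+

# Under H0 each of the 2^10 sign patterns is equally likely, with the tied ranks held fixed
signs <- as.matrix(expand.grid(rep(list(0:1), length(z))))
t_all <- as.vector(signs %*% r)

n_le    <- sum(t_all <= t_obs)      # number of outcomes with T+ <= observed value
p_exact <- n_le / length(t_all)     # exact conditional p-value

print(c(T_plus = t_obs, outcomes = n_le, p_value = p_exact))
```

Enumerating all sign patterns is feasible here because $n = 10$; for larger samples the normal approximation is the practical route.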
(c) To test $H_0$ by the normal approximation, calculate

$$E_0[T^+] = \frac{10(11)}{4} = 27.5, \qquad \mathrm{Var}_0(T^+) = \frac{10(11)(21)}{24} - \frac{2(2)(1)(3)}{48} = \frac{385}{4} - \frac{1}{4} = 96,$$

$$T^* = \frac{T^+ - E_0[T^+]}{\sqrt{\mathrm{Var}_0(T^+)}} = \frac{9.5 - 27.5}{\sqrt{96}} = -1.837 < -1.645 = -z_{0.05}.$$

This also shows sufficient evidence for $\theta < 0$ at the 5% level, confirming the effect of the new measure in reducing the cost.

(d) Calculate the Walsh averages to obtain their ordered values $W_{(1)} \le \cdots \le W_{(55)}$ below:

 i   W(i)      i   W(i)      i   W(i)      i   W(i)      i   W(i)
 1   -25      12   -18      23   -12      34   -6       45   -1.5
 2   -23      13   -18      24   -10.5    35   -5       46    0
 3   -21.5    14   -16.5    25   -10.5    36   -5       47    1.5
 4   -21.5    15   -16.5    26   -9.5     37   -4.5     48    2.5
 5   -21      16   -16.5    27   -9       38   -4.5     49    4.5
 6   -20      17   -15      28   -8.5     39   -3.5     50    6
 7   -19.5    18   -15      29   -7.5     40   -3       51    7
 8   -19.5    19   -15      30   -7.5     41   -3       52    8
 9   -18.5    20   -14      31   -6.5     42   -3       53    9
10   -18      21   -13.5    32   -6.5     43   -3       54   10
11   -18      22   -12      33   -6       44   -2       55   12

Then $\theta$ is estimated by $\hat\theta = W_{((55+1)/2)} = W_{(28)} = -8.5$. By the numbers of outcomes counted in part (b),

$$\Pr(T^+ \ge 46) = \Pr(T^+ \le 9) = \frac{35 - 4}{1024} = \frac{31}{1024} = 0.03027 > 0.025$$

and

$$\Pr(T^+ \ge 47) = \Pr(T^+ \le 8) = \frac{31 - 4 - 4}{1024} = \frac{23}{1024} = 0.02246 < 0.025.$$

Thus $t_{\alpha/2} = 47$ and $C_\alpha = 55 + 1 - 47 = 9$ for $\alpha = 2(0.02246) = 0.04492$. Then the exact $100(1 - \alpha)\% = 95.51\%$ confidence interval of $\theta$ is given by

$$\left(W_{(C_\alpha)}, W_{(t_{\alpha/2})}\right) = \left(W_{(9)}, W_{(47)}\right) = (-18.5, 1.5).$$

Question 2

(a) Let $f_i(x)$ denote the density and $F_i(x)$ the cdf of $X_i$, $i = 1, 2$. Since $X_1$ and $X_2$ are continuous with median 0, $F_i(0) = \Pr(X_i < 0) = 0.5$ and $\Pr(X_i > 0) = 0.5$, $i = 1, 2$.
Then the assumptions of $X_1 \sim -X_2$ and independent $X_1$, $X_2$ imply

$$\Pr(X_1 > 0, R_1 = 1, X_2 < 0) = \Pr(X_1 > 0, X_1 < -X_2, X_2 < 0) = \Pr(0 < X_1 < -X_2)$$

$$= \int_0^\infty \Pr(-X_2 > x) f_1(x)\,dx = \int_0^\infty \Pr(X_1 > x) f_1(x)\,dx$$

$$= \int_0^\infty [1 - F_1(x)] f_1(x)\,dx = \left[-\tfrac{1}{2}[1 - F_1(x)]^2\right]_0^\infty$$

$$= \tfrac{1}{2}[1 - F_1(0)]^2 = \tfrac{1}{2}(1 - 0.5)^2 = \tfrac{1}{8} = 0.125.$$

Similarly, by interchanging $X_1$ and $X_2$ in the above equations,

$$\Pr(X_1 < 0, R_2 = 1, X_2 > 0) = \Pr(0 < X_2 < -X_1) = \int_0^\infty \Pr(-X_1 > x) f_2(x)\,dx$$

$$= \int_0^\infty \Pr(X_2 > x) f_2(x)\,dx = \int_0^\infty [1 - F_2(x)] f_2(x)\,dx$$

$$= \tfrac{1}{2}[1 - F_2(0)]^2 = \tfrac{1}{2}(1 - 0.5)^2 = 0.125.$$

It then follows from the independence of $X_1$ and $X_2$ that

$$\Pr(T^+ = 1) = \Pr(X_1 > 0, R_1 = 1, X_2 < 0) + \Pr(X_1 < 0, R_2 = 1, X_2 > 0)$$

$$= 0.125 + 0.125 = 0.25 = \Pr(X_1 > 0)\Pr(X_2 < 0) = \Pr(X_1 > 0, X_2 < 0) = \Pr(S = 1).$$

The range of $T^+$ is $\{0, 1, 2, 3\}$. It is obvious that

$$\Pr(T^+ = 0) = \Pr(X_1 < 0, X_2 < 0) = \Pr(X_1 < 0)\Pr(X_2 < 0) = 0.25 = \Pr(S = 0)$$

and

$$\Pr(T^+ = 3) = \Pr(X_1 > 0, X_2 > 0) = 0.25 = \Pr(S = 3).$$

Since the probabilities over the range sum to one, $\Pr(T^+ = 2) = \Pr(S = 2)$. These together prove $T^+ \sim S$.

(b) If $N$ is even, then $a_j = a_{N+1-j} = j$ for $j = 1, 2, \dots, N/2$, and $a_j = a_{N+1-j} = N + 1 - j$ for $j = N/2 + 1, \dots, N$. For each outcome $(r_1, \dots, r_n) = (a_{j(1)}, \dots, a_{j(n)})$ of Y-scores drawn from $\{a_1, \dots, a_N\}$ with $1 \le j(1) < \cdots < j(n) \le N$, take

$$r'_i = N/2 + 1 - r_i = N/2 + 1 - a_{j(i)}, \quad i = 1, \dots, n.$$

If $j(i) \le N/2$, then $a_{j(i)} = j(i)$ and $1 \le r'_i = N/2 + 1 - j(i) \le N/2$, so $r'_i \in \{a_1, \dots, a_N\}$.

If $j(i) > N/2$, then $a_{j(i)} = N + 1 - j(i)$ and $1 \le r'_i = j(i) - N/2 \le N - N/2 = N/2$.

Thus $r'_i \in \{a_1, \dots, a_N\}$ for all $i = 1, \dots, n$. Rearrange $(r'_1, \dots, r'_n)$ in the order of $\{a_1, \dots, a_N\}$ if needed. Then there is a one-to-one mapping between $(r_1, \dots, r_n)$ and $(r'_1, \dots, r'_n)$.
Therefore, for each outcome $(r_1, \dots, r_n)$ of Y-scores with $r_1 + \cdots + r_n = c$, there is one corresponding outcome $(r'_1, \dots, r'_n)$ such that

$$r'_1 + \cdots + r'_n = n(N/2 + 1) - (r_1 + \cdots + r_n) = n(N/2 + 1) - c.$$

It follows that under $H_0: \gamma^2 = 1$,

$$\Pr\{(r_1, \dots, r_n)\} = \Pr\{(r'_1, \dots, r'_n)\} = 1 \Big/ \binom{N}{n} \;\Rightarrow\; \Pr(C = c) = \Pr\big(C = n(N/2 + 1) - c\big)$$

for every value $c$ of $C$. Consequently, $C$ is symmetric about

$$\frac{1}{2}\, n\left(\frac{N}{2} + 1\right) = \frac{n(N + 2)}{4} = E_0[C].$$

Question 3 [25 marks]

(a) The ranks of the observations in the combined sample are given by:

X      35  57  39  30  52  42  38  49  24  36  32  44
Rank    6  19   9   4  17  11   8  15   2   7   5  12

Y      47  40  61  80  28  89  54  74  45  50  21
Rank   14  10  20  22   3  23  18  21  13  16   1

The observed value of $W$ is

$$w = 14 + 10 + 20 + 22 + 3 + 23 + 18 + 21 + 13 + 16 + 1 = 161.$$

Since $m = 12$, $n = 11$ and $N = 12 + 11 = 23$, the mean and variance of $W$ under the null hypothesis $H_0: \Delta = 0$ are

$$E_0[W] = \frac{11(23 + 1)}{2} = 132 \quad \text{and} \quad \mathrm{Var}_0(W) = \frac{12(11)(23 + 1)}{12} = 264.$$

Hence

$$W^* = \frac{W - E_0[W]}{\sqrt{\mathrm{Var}_0(W)}} = \frac{161 - 132}{\sqrt{264}} = 1.785 > 1.645 = z_{0.05}.$$

This result shows sufficient evidence for $\Delta > 0$, i.e., $Y$ has a greater location parameter than $X$, at the 5% level of significance.

(b) The rank scores for the Ansari-Bradley rank test are $(1, 2, \dots, 10, 11, 12, 11, 10, \dots, 2, 1)$ for ranks $(1, 2, \dots, 23)$. Hence, from the Y-ranks obtained in part (a), the Y-scores are given by

Y      47  40  61  80  28  89  54  74  45  50  21
Rank   14  10  20  22   3  23  18  21  13  16   1
Score  10  10   4   2   3   1   6   3  11   8   1

Thus the Ansari-Bradley rank test statistic is

$$C = 10 + 10 + 4 + 2 + 3 + 1 + 6 + 3 + 11 + 8 + 1 = 59.$$

Since $N = 23$ is odd,

$$E_0[C] = \frac{11(23 + 1)^2}{4(23)} = 68.87, \qquad \mathrm{Var}_0(C) = \frac{12(11)(23 + 1)(23^2 + 3)}{48(23^2)} = 66.374.$$

Hence

$$C^* = \frac{C - E_0[C]}{\sqrt{\mathrm{Var}_0(C)}} = \frac{59 - 68.87}{\sqrt{66.374}} = -1.211 > -1.282 = -z_{0.1}.$$

This shows insufficient evidence at the 10% level of significance for $\mathrm{Var}(X) < \mathrm{Var}(Y)$.
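The arithmetic in parts (a) and (b) can be reproduced directly from the raw data. The sketch below uses base R only and is offered as a check rather than as the method of the original solution; the variable names are illustrative, and the Ansari-Bradley scores are computed as $\min(r, N + 1 - r)$, which matches the score vector quoted in part (b).

```r
x <- c(35, 57, 39, 30, 52, 42, 38, 49, 24, 36, 32, 44)
y <- c(47, 40, 61, 80, 28, 89, 54, 74, 45, 50, 21)
m <- length(x); n <- length(y); N <- m + n

rk <- rank(c(x, y))            # joint ranks (these data contain no ties)
ry <- rk[(m + 1):N]            # ranks belonging to the Y sample

# (a) Wilcoxon rank sum statistic and its large-sample standardisation
w      <- sum(ry)
w_star <- (w - n * (N + 1) / 2) / sqrt(m * n * (N + 1) / 12)

# (b) Ansari-Bradley scores and statistic, with the odd-N null mean and variance
score  <- pmin(ry, N + 1 - ry)
c_stat <- sum(score)
e_c    <- n * (N + 1)^2 / (4 * N)
v_c    <- m * n * (N + 1) * (N^2 + 3) / (48 * N^2)
c_star <- (c_stat - e_c) / sqrt(v_c)

print(round(c(W = w, W_star = w_star, C = c_stat, C_star = c_star), 3))
```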
(c) The values of $A = (A_1, \dots, A_{12})$ and $B = (B_1, \dots, B_{11})$ for Miller's jackknife test are:

A   3.75  7.73  3.47  4.69  5.40  3.52  3.50  4.52  6.98  3.64  4.22  3.67
B   5.15  5.51  5.18  6.94  6.80  8.76  5.05  6.14  5.23  5.08  8.08

Calculate

$$\bar{A} = \frac{3.75 + 7.73 + \cdots + 3.67}{12} = 4.591, \qquad \bar{B} = \frac{5.15 + 5.51 + \cdots + 8.08}{11} = 6.175,$$

$$V_1 = \sum_{i=1}^{12} \frac{(A_i - \bar{A})^2}{12(11)} = 0.170 \quad \text{and} \quad V_2 = \sum_{j=1}^{11} \frac{(B_j - \bar{B})^2}{11(10)} = 0.156.$$

Then

$$Q = \frac{\bar{A} - \bar{B}}{\sqrt{V_1 + V_2}} = \frac{4.591 - 6.175}{\sqrt{0.170 + 0.156}} = -2.776.$$

Thus the approximate p-value for $\mathrm{Var}(X) < \mathrm{Var}(Y)$ is $\Pr(Z < -2.776) = 0.00275$ by Miller's jackknife test. This provides very strong evidence for $\mathrm{Var}(X) < \mathrm{Var}(Y)$.

(d) The values of $X^* = 3X$ and $Y^* = Y + 66$ with their ranks are shown below:

X*     105  171  117   90  156  126  114  147   72  108   96  132
Rank     6   23   13    3   22   15   11   20    1    8    5   17

Y*     113  106  127  146   94  155  120  140  111  116   87
Rank    10    7   16   19    4   21   14   18    9   12    2

The empirical distribution functions $F^*_{12}(t)$ of $X^*$ and $G^*_{11}(t)$ of $Y^*$ at the ordered values $Z_{(1)} \le \cdots \le Z_{(23)}$ of $(X^*, Y^*)$ are given by

i         1     2     3     4     5     6     7     8     9     10    11    12
X*/Y*     X*    Y*    X*    Y*    X*    X*    Y*    X*    Y*    Y*    X*    Y*
F*12(t)   1/12  1/12  2/12  2/12  3/12  4/12  4/12  5/12  5/12  5/12  6/12  6/12
G*11(t)   0     1/11  1/11  2/11  2/11  2/11  3/11  3/11  4/11  5/11  5/11  6/11

i         13    14    15    16    17    18    19     20     21     22     23
X*/Y*     X*    Y*    X*    Y*    X*    Y*    Y*     X*     Y*     X*     X*
F*12(t)   7/12  7/12  8/12  8/12  9/12  9/12  9/12   10/12  10/12  11/12  1
G*11(t)   6/11  7/11  7/11  8/11  8/11  9/11  10/11  10/11  1      1      1

It follows that

$$J = \frac{mn}{d} \max_{1 \le i \le 23} \left| F^*_{12}(Z_{(i)}) - G^*_{11}(Z_{(i)}) \right| = \frac{12(11)}{1} \left| \frac{10}{12} - 1 \right| = 22.$$

Run the following R code to obtain the output below:

    x <- c(35,57,39,30,52,42,38,49,24,36,32,44)
    y <- c(47,40,61,80,28,89,54,74,45,50,21)
    ks.test(3*x, y+66)

    Two-sample Kolmogorov-Smirnov test
    data: 3*x and y + 66
    D = 0.16667, p-value = 0.98
    alternative hypothesis: two-sided

It shows $D = 0.16667$ and hence verifies $J = (mn/d)D = 12(11)(0.16667) = 22$. The p-value of the test is 0.98.
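As a cross-check of parts (c) and (d) that does not rely on `ks.test`, the statistics can be computed from their definitions. This is a minimal base-R sketch with illustrative names; it takes the tabulated $A_i$ and $B_j$ as given (they are rounded to two decimals, so $Q$ agrees with the value above only to about two decimals), and `gcd` is a small helper defined here, not a built-in.

```r
x <- c(35, 57, 39, 30, 52, 42, 38, 49, 24, 36, 32, 44)
y <- c(47, 40, 61, 80, 28, 89, 54, 74, 45, 50, 21)
m <- length(x); n <- length(y)

# (c) Miller's jackknife Q from the tabulated A_i and B_j
A <- c(3.75, 7.73, 3.47, 4.69, 5.40, 3.52, 3.50, 4.52, 6.98, 3.64, 4.22, 3.67)
B <- c(5.15, 5.51, 5.18, 6.94, 6.80, 8.76, 5.05, 6.14, 5.23, 5.08, 8.08)
V1  <- var(A) / length(A)          # sum((A - mean(A))^2) / (12 * 11)
V2  <- var(B) / length(B)          # sum((B - mean(B))^2) / (11 * 10)
Q   <- (mean(A) - mean(B)) / sqrt(V1 + V2)
p_Q <- pnorm(Q)                    # lower-tail normal approximation

# (d) Kolmogorov-Smirnov J from the two empirical distribution functions
xs <- 3 * x; ys <- y + 66
z  <- sort(c(xs, ys))
D  <- max(abs(ecdf(xs)(z) - ecdf(ys)(z)))
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)   # helper: greatest common divisor
J  <- m * n * D / gcd(m, n)

print(round(c(Q = Q, p_Q = p_Q, D = D, J = J), 5))
```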
(e) Based on the results in parts (a) to (d), we can draw the following answers:

(i) The test in part (d) shows a very large p-value of 0.98, which provides no evidence against the hypothesis of equal distributions for $X^* = 3X$ and $Y^* = Y + 66$.

(ii) Let $\theta_1$ and $\theta_2$ denote the medians, and $\eta_1$ and $\eta_2$ the dispersion parameters, of $X$ and $Y$, respectively. Then the test in part (d) accepts $3X \sim Y + 66$. As a result,

$$\frac{X - \theta_1}{\eta_1} \sim \frac{\tfrac{1}{3}(Y + 66) - \theta_1}{\eta_1} = \frac{Y - (3\theta_1 - 66)}{3\eta_1} \;\Rightarrow\; \theta_2 = 3\theta_1 - 66, \quad \eta_2 = 3\eta_1.$$

(iii) The location-shift model in part (a) is not appropriate, since Miller's jackknife test in part (c) shows very strong evidence for $\mathrm{Var}(X) < \mathrm{Var}(Y)$ and part (d) shows $3X \sim Y + 66$, which contradicts $Y \sim X + \Delta$ under the location-shift model.

(iv) The location-scale parameter model is not justified in part (b), as $X$ and $Y$ do not have an equal location parameter $\theta$ by part (c). It is, however, justified in part (c), which allows $\theta_1 \ne \theta_2$, by part (d) as shown in item (ii) above.

(v) The result of part (a) is not justified because the location-shift model is not right. The result of part (b) is not justified since the Ansari-Bradley test requires $\theta_1 = \theta_2$, which has no support from the analyses. The result of part (c) is justified because the location-scale parameter model is justified and $\theta_1 = \theta_2$ is not required for Miller's jackknife test.

(vi) There is insufficient evidence of a difference ($\theta_1 \ne \theta_2$) in the profitability of the two banks, although the total profit in sample $Y$ is greater than in $X$. On the other hand, the result of part (c) and the relation $3\eta_1 = \eta_2$ indicate that profits are more stable at the bank with profits $X$ than at the bank with profits $Y$.
Question 4

(a) First find the ranks $\{r_{ij}\}$ of all 15 observations $\{X_{ij}\}$ as follows:

Treatment j        1          2          3          4          5
X_ij (r_ij)      17 (2)     22 (5)     20 (4)     86 (15)    68 (13)
                 28 (7)     15 (1)     39 (9)     54 (11)    73 (14)
                 18 (3)     43 (10)    61 (12)    32 (8)     25 (6)
Rank sum R_j       12         16         25         34         33

Since $k = 5$, $n_1 = \cdots = n_5 = 3$ and $N = 15$, the Kruskal-Wallis test statistic from the data is calculated by

$$H = \frac{12}{N(N + 1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N + 1) = \frac{12}{15(16)} \cdot \frac{12^2 + 16^2 + 25^2 + 34^2 + 33^2}{3} - 3(16)$$

$$= \frac{3270}{60} - 48 = \frac{109}{2} - \frac{96}{2} = \frac{13}{2} = 6.5 < 7.78 = \chi^2_{4, 0.10} = \chi^2_{k-1, 0.10}.$$

Thus $H_0$ is accepted at the 10% level of significance. This shows insufficient evidence against $H_0$ for general alternatives at the 10% level.

(b) The counts $U_{uv} = \text{No.}\{(i, j): X_{iu} < X_{jv}\}$ for $1 \le u < v \le 5$ are calculated below:

U_uv     v = 2   v = 3   v = 4   v = 5
u = 1      5       8       9       8
u = 2              6       8       8
u = 3                      6       7
u = 4                              4

Hence the Jonckheere-Terpstra test statistic is

$$J = \sum_{u < v} U_{uv} = 4 + 5 + 2(6) + 7 + 4(8) + 9 = 69.$$

Calculate

$$E_0[J] = \frac{1}{4}\left(N^2 - \sum_{u=1}^{k} n_u^2\right) = \frac{1}{4}\left(15^2 - \sum_{u=1}^{5} 3^2\right) = \frac{225 - 45}{4} = 45$$

and

$$\mathrm{Var}_0(J) = \frac{1}{72}\left[N^2(2N + 3) - \sum_{u=1}^{k} n_u^2(2n_u + 3)\right] = \frac{15^2(30 + 3) - 5(9)(6 + 3)}{72} = 97.5.$$

It follows that

$$J^* = \frac{J - E_0[J]}{\sqrt{\mathrm{Var}_0(J)}} = \frac{69 - 45}{\sqrt{97.5}} = 2.431 > 2.326 = z_{0.01}.$$

Thus there is very strong evidence for ordered alternatives $\tau_1 \le \cdots \le \tau_5$ with at least one strict inequality.

(c) Since $n_1 = \cdots = n_5 = 3$, $N^* = 3$ and $N^* \bar{R}_u = 3(R_u / 3) = R_u$, $u = 1, \dots, 5$. Therefore, the Nemenyi-Damico-Wolfe one-sided treatments-versus-control multiple comparison procedure is given by:

Decide $\tau_u > \tau_1$ if $R_u - R_1 \ge y^*_\alpha$; otherwise accept $\tau_u = \tau_1$, $u = 2, \dots, 5$.
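The rank sums, $H$, $J$ and the differences $R_u - R_1$ that enter the decision rule above can be reproduced from the raw data. The following base-R sketch is an illustration with assumed variable names, not the original solution's code; it applies the formulas exactly as written in parts (a) and (b) and relies on these data having no ties.

```r
# Observations by treatment (columns of the table in part (a))
x <- list(c(17, 28, 18), c(22, 15, 43), c(20, 39, 61),
          c(86, 54, 32), c(68, 73, 25))
k <- length(x); n <- lengths(x); N <- sum(n)

g <- rep(seq_len(k), times = n)             # treatment label for each observation
r <- rank(unlist(x))                        # joint ranks
R <- as.vector(tapply(r, g, sum))           # rank sums R_1, ..., R_5

# (a) Kruskal-Wallis statistic
H <- 12 / (N * (N + 1)) * sum(R^2 / n) - 3 * (N + 1)

# (b) Jonckheere-Terpstra statistic from the Mann-Whitney counts U_uv
U <- function(u, v) sum(outer(x[[u]], x[[v]], "<"))
J <- 0
for (u in 1:(k - 1)) for (v in (u + 1):k) J <- J + U(u, v)
EJ     <- (N^2 - sum(n^2)) / 4
VJ     <- (N^2 * (2 * N + 3) - sum(n^2 * (2 * n + 3))) / 72
J_star <- (J - EJ) / sqrt(VJ)

# (c) differences compared with the cutoff y* in the treatments-versus-control rule
diff_vs_control <- R[-1] - R[1]

print(R); print(c(H = H, J = J, J_star = round(J_star, 3))); print(diff_vs_control)
```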
By R, we get $y^*_{0.0919} = 21$ as shown below:

    cNDWol(0.1,c(3,3,3,3,3))

    Monte Carlo Approximation (with 10000 Iterations) used:
    Control group size: 3
    Treatment group size(s): 3 3 3 3
    For the given experimentwise alpha=0.1, the upper cutoff value is
    Nemenyi, Damico-Wolfe Y*=21, with true experimentwise alpha level=0.0919

Thus at $\alpha = 10\%$, the decisions are

$$R_2 - R_1 = 16 - 12 = 4 < 21 \;\Rightarrow\; \tau_2 = \tau_1, \qquad R_3 - R_1 = 25 - 12 = 13 < 21 \;\Rightarrow\; \tau_3 = \tau_1,$$

$$R_4 - R_1 = 34 - 12 = 22 > 21 \;\Rightarrow\; \tau_4 > \tau_1, \qquad R_5 - R_1 = 33 - 12 = 21 \;\Rightarrow\; \tau_5 > \tau_1.$$

(d) The test in (a) accepts $H_0: \tau_1 = \cdots = \tau_5$ at the 10% level, the test in (b) rejects $H_0$ and concludes $\tau_1 \le \cdots \le \tau_5$ with at least one "<" at the 1% level, and the procedure in (c) decides $\tau_2 = \tau_3 = \tau_1 < \tau_4, \tau_5$ at the 10% level. While these results appear quite different, they are not contradictory because the alternatives differ. Specifically, (a) looks at the evidence for any difference in $\tau_1, \dots, \tau_5$, whereas (b) only considers $H_1: \tau_1 \le \cdots \le \tau_5$, which attracts stronger evidence when the data match $H_1$, with $R_1, \dots, R_k$ mostly in the same order as $H_1$ (as in this question). Similarly, the restrictions to $\tau_u \ge \tau_1$ for $u = 2, 3, 4, 5$ in (c) lead to stronger evidence against $\tau_1 = \cdots = \tau_5$ than in (a), even on a multiple-comparison basis, because the data match it with $R_u > R_1$ for $u = 2, 3, 4, 5$. Moreover, the results of (b) and (c) are consistent in the sense that they both include cases such as $\tau_1 = \tau_2 = \tau_3 < \tau_4 = \tau_5$.

Question 5

(a) Based on the sample data $\{X_{ij}\}$, the ranks $r_{ij}$ of $X_{ij}$ and the counts $U_{uv} = \text{No.}\{(i, j): X_{iu} < X_{jv}\}$ for $1 \le u < v \le 4$ are calculated in the following tables:

i     r_i1   r_i2   r_i3   r_i4
1      11     14     10      2
2       4     18     17      8
3      13      3     20      6
4       1     19      7     16
5       9     12     15      5

U_uv     v = 2   v = 3   v = 4
u = 1     20      20      12
u = 2             13       6
u = 3                      4

With $U_{vu} = n_u n_v - U_{uv} = 25 - U_{uv}$ for $u < v$, the Mack-Wolfe test statistic for known peak $p = 2$ is calculated by

$$A_2 = U_{12} + U_{32} + U_{42} + U_{43} = 20 + 12 + 19 + 21 = 72.$$

Then $(n_1, \dots, n_4) = (5, 5, 5, 5)$ gives $N = 20$, $N_1 = 5 + 5 = 10$ and $N_2 = 3(5) = 15$, so that

$$E_0[A_2] = \frac{N_1^2 + N_2^2 - n_1^2 - \cdots - n_4^2 - n_2^2}{4} = \frac{10^2 + 15^2 - 5 \times 5^2}{4} = 50$$

and

$$\mathrm{Var}_0(A_2) = \frac{2(10^3 + 15^3) + 3(10^2 + 15^2) - 5(5^2)(10 + 3)}{72} + \frac{5(10)(15) - 5^2(20)}{6} = 154.17.$$

Hence the normalized Mack-Wolfe statistic is

$$A_2^* = \frac{72 - 50}{\sqrt{154.17}} = 1.772 > 1.645 = z_{0.05} \;\Rightarrow\; \text{Reject } H_0 \text{ at } \alpha = 0.05.$$

Thus there is sufficient evidence for $\tau_1 \le \tau_2 \ge \tau_3 \ge \tau_4$ at the 5% level.

(b) Calculate $U_{vu} = n_u n_v - U_{uv} = 25 - U_{uv}$ for $1 \le u < v \le 4$ to obtain $U_{uv}$ for all $u \ne v$:

U_uv     v = 1   v = 2   v = 3   v = 4
u = 1      -      20      20      12
u = 2      5       -      13       6
u = 3      5      12       -       4
u = 4     13      19      21       -

Then $U_{\cdot q}$ are calculated by

$$U_{\cdot 1} = U_{21} + U_{31} + U_{41} = 5 + 5 + 13 = 23, \qquad U_{\cdot 2} = U_{12} + U_{32} + U_{42} = 20 + 12 + 19 = 51,$$

$$U_{\cdot 3} = U_{13} + U_{23} + U_{43} = 20 + 13 + 21 = 54, \qquad U_{\cdot 4} = U_{14} + U_{24} + U_{34} = 12 + 6 + 4 = 22.$$

Next, as $(n_1, \dots, n_4) = (5, 5, 5, 5)$ are all equal, $E_0[U_{\cdot q}]$ and $\mathrm{Var}_0(U_{\cdot q})$ are invariant in $q \in \{1, 2, 3, 4\}$. Hence

$$\{U_{\cdot 1}, U_{\cdot 2}, U_{\cdot 3}, U_{\cdot 4}\} = \{23, 51, 54, 22\} \;\Rightarrow\; U_{\cdot 3} > U_{\cdot q}, \quad q \in \{1, 2, 4\}$$

$$\Rightarrow\; U^*_{\cdot 3} = \frac{U_{\cdot 3} - E_0[U_{\cdot 3}]}{\sqrt{\mathrm{Var}_0(U_{\cdot 3})}} > \frac{U_{\cdot q} - E_0[U_{\cdot q}]}{\sqrt{\mathrm{Var}_0(U_{\cdot q})}} = U^*_{\cdot q} \text{ for } q \in \{1, 2, 4\} \;\Rightarrow\; \hat{p} = 3.$$

For $\hat{p} = 3$,

$$A_3 = U_{12} + U_{13} + U_{23} + U_{43} = 20 + 20 + 13 + 21 = 74$$

and $N_1 = 15$, $N_2 = 10$, so that $E_0[A_3] = E_0[A_2] = 50$ and $\mathrm{Var}_0(A_3) = \mathrm{Var}_0(A_2) = 154.17$. It follows that

$$A^*_{\hat{p}} = A^*_3 = \frac{74 - 50}{\sqrt{154.17}} = 1.933.$$

(c) The R command cUmbrPU with $\alpha = 0.06$ produces the following output:

    > cUmbrPU(0.06,c(5,5,5,5))
    Monte Carlo Approximation (with 10000 Iterations) used:
    Group sizes: 5 5 5 5
    For the given alpha=0.06, the upper cutoff value is
    Mack-Wolfe Peak Unknown A*(p-hat)= 2.1533650717, with true alpha level=0.0526

The R output shows $a^*_{\hat{p}, 0.0526} = 2.153$.
As $A^*_{\hat{p}} = 1.933 < 2.153 = a^*_{\hat{p}, 0.0526}$, $H_0$ is accepted at the 5% level of significance. As a result, there is insufficient evidence for umbrella alternatives with unknown peak at the 5% level.

(d) If $p = 3$ in part (b) is known, then

$$A^*_3 = 1.933 > 1.645 = z_{0.05} \;\Rightarrow\; \text{Reject } H_0 \text{ at } \alpha = 0.05.$$

Thus there is sufficient evidence for umbrella alternatives with known peak $p = 3$ at the 5% level of significance.

The main reason for the difference in the test results between known peak $p = 3$ and unknown $p$ with estimate $\hat{p} = 3$ is that $A^*_3$ only takes the value with $p = 3$, whereas $A^*_{\hat{p}}$ can also take the values of $A^*_1$, $A^*_2$ and $A^*_4$ with positive probabilities. This leads to different distributions of $A^*_{\hat{p}}$ and $A^*_3$, and hence to different critical points and p-values, which produce different test results.
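The Mack-Wolfe quantities in parts (a), (b) and (d) can be recomputed from the tabulated ranks alone, since the counts $U_{uv}$ depend only on the ordering. The sketch below is a base-R illustration with hypothetical helper names (`mack_wolfe` is not a library function); it uses the null mean and variance formulas as applied in part (a), and the shortcut of choosing $\hat{p}$ as the largest $U_{\cdot q}$ is valid only because all sample sizes are equal here.

```r
# Joint ranks by treatment, read column-wise from the table in part (a)
r <- list(c(11, 4, 13, 1, 9), c(14, 18, 3, 19, 12),
          c(10, 17, 20, 7, 15), c(2, 8, 6, 16, 5))
k <- length(r); n <- lengths(r); N <- sum(n)

# U[u, v] = number of pairs (i, j) with X_iu < X_jv
U <- matrix(0, k, k)
for (u in 1:k) for (v in 1:k)
  if (u != v) U[u, v] <- sum(outer(r[[u]], r[[v]], "<"))

# Mack-Wolfe statistic with peak p, plus its null mean, variance and standardised value
mack_wolfe <- function(p) {
  a <- 0
  for (u in 1:(k - 1)) for (v in (u + 1):k) {
    if (v <= p) a <- a + U[u, v]     # rising side:  u < v <= p
    if (u >= p) a <- a + U[v, u]     # falling side: p <= u < v
  }
  n1 <- sum(n[1:p]); n2 <- sum(n[p:k])
  e  <- (n1^2 + n2^2 - sum(n^2) - n[p]^2) / 4
  v0 <- (2 * (n1^3 + n2^3) + 3 * (n1^2 + n2^2) - sum(n^2 * (2 * n + 3)) -
         n[p]^2 * (2 * n[p] + 3) + 12 * n[p] * n1 * n2 - 12 * n[p]^2 * N) / 72
  c(A = a, E = e, Var = v0, A_star = (a - e) / sqrt(v0))
}

U_dot <- colSums(U)                  # U_.q for q = 1, ..., 4
p_hat <- which.max(U_dot)            # equal sample sizes only (assumption stated above)

print(U_dot)
print(round(rbind(p2 = mack_wolfe(2), p3 = mack_wolfe(3)), 3))
```

The peak-unknown critical value itself still has to come from `cUmbrPU`, as in part (c); this sketch only reproduces the observed statistics.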