Rからデータをエクスポート
まずはデータをつくるところから。
library(AER) data(CPS1985) library(dplyr) CPS1985 <- CPS1985 %>% mutate(married = case_when( married == "yes" ~ 1, married == "no" ~ 0)) library(rio) export(CPS1985, "CPS1985.dta")
今回は従属変数を結婚の有無にする。Stataでは従属変数はFactor型はダメなようなので、marriedは数値に置き換えてある。
Sataのdoファイル
// データの読み込み use "CPS1985.dta", clear // OLSの推定値を計算し、その結果をols_modelとしてメモリに格納する logit married gender age wage education experience ethnicity region occupation sector union estimates store ols_model // MSEの標本内推定値を計算する lassogof ols_model // 最適なペナルティパラメータをクロスバリデーションで選択 lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(cv, folds(10)) nolog rseed(123) // クロスバリデーションの結果を表示 di "Best lambda (penalty parameter) is: " e(lambda) // CV 関数をプロットする cvplot, minmax // 最適なペナルティパラメータを使用してLassoロジスティック回帰を行う lasso logit married gender age wage education experience ethnicity region occupation sector union, lambda(`e(lambda)') rseed(123) estimates store CV // Lassoの結果を表示 lassocoef CV, sort(coef,standardized) display(coef,standardized) // Adaptive Lasso lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(adaptive) rseed(123) estimates store adaptive lassocoef adaptive, sort(coef,standardized) display(coef,standardized) // Plugin Lasso lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(plugin) rseed(123) estimates store plugin lassocoef plugin, sort(coef,standardized) display(coef,standardized) // モデルの比較 // すべての推定量の予測のサンプル外MSEを推定する。予測のサンプル外MSEが最も低いものを選択する。 lassogof CV adaptive plugin
分析
通常の最小二乗法の重回帰。
OLSの推定値を計算し、その結果をols_modelとしてメモリに格納する
結果。
. logit married gender age wage education experience ethnicity region occupation sector union Logistic regression Number of obs = 534 LR chi2(9) = 51.33 Prob > chi2 = 0.0000 Log likelihood = -318.24263 Pseudo R2 = 0.0746 ------------------------------------------------------------------------------ married | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- gender | .0357108 .2041047 0.17 0.861 -.364327 .4357486 age | 3.049329 122.7877 0.02 0.980 -237.6102 243.7088 wage | .0202904 .0223395 0.91 0.364 -.0234941 .064075 education | -3.011563 122.7877 -0.02 0.980 -243.671 237.6479 experience | -2.996334 122.7877 -0.02 0.981 -243.6558 237.6632 ethnicity | -.1758906 .1411926 -1.25 0.213 -.452623 .1008418 region | -.1791324 .2164169 -0.83 0.408 -.6033018 .245037 occupation | .0250792 .0657441 0.38 0.703 -.1037768 .1539352 sector | -.0921213 .1340814 -0.69 0.492 -.354916 .1706733 union | .4018988 .2790936 1.44 0.150 -.1451147 .9489122 _cons | -19.01227 736.7268 -0.03 0.979 -1462.97 1424.946 ------------------------------------------------------------------------------
回帰分析は、全部変数を入れると、すべてが有意ではなくなる、の典型。
逸脱度の推定値を計算する
. lassogof ols_model Penalized coefficients ------------------------------------------------- | Deviance Name | Deviance ratio Obs ------------+------------------------------------ ols_model | 1.19192 0.0746 534 -------------------------------------------------
n-fold Cross Validation
CVを用いたLassoを行う。CVの種類はn-fold Cross Validation。まずは最適なλを計算する。
// 最適なペナルティパラメータをクロスバリデーションで選択 lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(cv, folds(10)) nolog rseed(123) // クロスバリデーションの結果を表示 di "Best lambda (penalty parameter) is: " e(lambda)
CV 関数をプロットする
cvplot, minmax
結果。
Lasso logit model No. of obs = 534 No. of covariates = 10 Selection: Cross-validation No. of CV folds = 10 -------------------------------------------------------------------------- | No. of Out-of- | nonzero sample CV mean ID | Description lambda coef. dev. ratio deviance ---------+---------------------------------------------------------------- 1 | first lambda .1325634 0 -0.0009 1.289243 15 | lambda before .0360385 1 0.0532 1.219509 * 16 | selected lambda .032837 1 0.0532 1.219486 17 | lambda after .0299198 2 0.0529 1.219896 21 | last lambda .0206226 4 0.0509 1.222458 -------------------------------------------------------------------------- * lambda selected by cross-validation.
最適なλが計算できたので、λを利用してLasso重回帰を行う。
. lasso logit married gender age wage education experience ethnicity region occupation sector union, lambda(`e(lambda)') rseed(123) Lasso logit model No. of obs = 534 No. of covariates = 10 Selection: Cross-validation No. of CV folds = 10 -------------------------------------------------------------------------- | No. of Out-of- | nonzero sample CV mean ID | Description lambda coef. dev. ratio deviance ---------+---------------------------------------------------------------- 1 | first lambda .1325634 0 -0.0009 1.289243 15 | lambda before .0360385 1 0.0532 1.219509 * 16 | selected lambda .032837 1 0.0532 1.219486 17 | lambda after .0299198 2 0.0529 1.219896 21 | last lambda .0206226 4 0.0509 1.222458 -------------------------------------------------------------------------- * lambda selected by cross-validation. . lassocoef CV, sort(coef,standardized) display(coef,standardized) ------------------------ | CV -------------+---------- _cons | .6809119 age | .478971 ------------------------
Adaptive Lasso
次に、Adaptive Lassoを行う。
結果。
. lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(adaptive) rseed(123) Lasso logit model No. of obs = 534 No. of covariates = 10 Selection: Adaptive No. of lasso steps = 2 Final adaptive step results -------------------------------------------------------------------------- | No. of Out-of- | nonzero sample CV mean ID | Description lambda coef. dev. ratio deviance ---------+---------------------------------------------------------------- 22 | first lambda .1325634 0 -0.0009 1.289243 76 | lambda before .0008722 1 0.0624 1.207617 * 77 | selected lambda .0007947 1 0.0624 1.207617 -------------------------------------------------------------------------- * lambda selected by cross-validation in final adaptive step. . lassocoef adaptive, sort(coef,standardized) display(coef,standardized) ------------------------ | adaptive -------------+---------- education | 2.133081 age | 1.18055 gender | -1.002929 union | .4719569 sector | -.2850779 ethnicity | -.1704469 region | .1324732 _cons | -3.55e-15 ------------------------
Plugin Lasso
次に行うのはPlugin Lasso。
. lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(plugin) rseed(123) Lasso logit model No. of obs = 534 No. of covariates = 10 Selection: Plugin homoskedastic -------------------------------------------------------------------------- | No. of | nonzero In-sample ID | Description lambda coef. dev. ratio BIC ---------+---------------------------------------------------------------- * 1 | selected lambda .0955425 1 0.0296 680.0375 -------------------------------------------------------------------------- * lambda selected by plugin formula assuming homoskedastic errors. . lassocoef plugin, sort(coef,standardized) display(coef,standardized) ------------------------ | plugin -------------+---------- _cons | .6474732 age | .1671559 ------------------------
モデルの比較
3つのLassoを行ったので、モデル比較を行う。 逸脱の低いモデルを選択する。
. lassogof CV adaptive plugin Penalized coefficients ------------------------------------------------- | Deviance Name | Deviance ratio Obs ------------+------------------------------------ CV | 1.210448 0.0602 534 adaptive | 1.204337 0.0650 534 plugin | 1.249956 0.0296 534 -------------------------------------------------
すべての推定量の予測のサンプル外MSEが推定される。予測のサンプル外MSEが最も低いものを選択すると最も良いモデルが選択できる。このケースでいえば、CVが最も低い値なので、CVで選ばれた推定が最も良いモデルだと判断できる。