井出草平の研究ノート

二項ロジスティック回帰のLasso、Adaptive Lasso、Plugin Lassoを実行し比較する[Stata]

Rからデータをエクスポート

まずはデータをつくるところから。

library(AER)
data(CPS1985)

library(dplyr)
CPS1985 <- CPS1985 %>%
  mutate(married = case_when(
      married == "yes" ~ 1,
      married == "no" ~ 0))

library(rio)
export(CPS1985, "CPS1985.dta")

今回は従属変数を結婚の有無にする。Stataでは従属変数はFactor型はダメなようなので、marriedは数値に置き換えてある。

Sataのdoファイル

// データの読み込み
use "CPS1985.dta", clear

// OLSの推定値を計算し、その結果をols_modelとしてメモリに格納する
logit married gender age wage education experience ethnicity region occupation sector union
estimates store ols_model

// MSEの標本内推定値を計算する
lassogof ols_model

// 最適なペナルティパラメータをクロスバリデーションで選択
lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(cv, folds(10)) nolog rseed(123)

// クロスバリデーションの結果を表示
di "Best lambda (penalty parameter) is: " e(lambda)

// CV 関数をプロットする
cvplot, minmax

// 最適なペナルティパラメータを使用してLassoロジスティック回帰を行う
lasso logit married gender age wage education experience ethnicity region occupation sector union, lambda(`e(lambda)') rseed(123)
estimates store CV

// Lassoの結果を表示
lassocoef CV, sort(coef,standardized) display(coef,standardized)

// Adaptive Lasso
lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(adaptive) rseed(123)
estimates store adaptive
lassocoef adaptive, sort(coef,standardized) display(coef,standardized)

// Plugin Lasso
lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(plugin) rseed(123)
estimates store plugin
lassocoef plugin, sort(coef,standardized) display(coef,standardized)

// モデルの比較
// すべての推定量の予測のサンプル外MSEを推定する。予測のサンプル外MSEが最も低いものを選択する。
lassogof CV adaptive plugin

分析

通常の最小二乗法の重回帰。

OLSの推定値を計算し、その結果をols_modelとしてメモリに格納する

結果。

. logit married gender age wage education experience ethnicity region occupation sector union
Logistic regression                                     Number of obs =    534
                                                        LR chi2(9)    =  51.33
                                                        Prob > chi2   = 0.0000
Log likelihood = -318.24263                             Pseudo R2     = 0.0746

------------------------------------------------------------------------------
     married | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      gender |   .0357108   .2041047     0.17   0.861     -.364327    .4357486
         age |   3.049329   122.7877     0.02   0.980    -237.6102    243.7088
        wage |   .0202904   .0223395     0.91   0.364    -.0234941     .064075
   education |  -3.011563   122.7877    -0.02   0.980     -243.671    237.6479
  experience |  -2.996334   122.7877    -0.02   0.981    -243.6558    237.6632
   ethnicity |  -.1758906   .1411926    -1.25   0.213     -.452623    .1008418
      region |  -.1791324   .2164169    -0.83   0.408    -.6033018     .245037
  occupation |   .0250792   .0657441     0.38   0.703    -.1037768    .1539352
      sector |  -.0921213   .1340814    -0.69   0.492     -.354916    .1706733
       union |   .4018988   .2790936     1.44   0.150    -.1451147    .9489122
       _cons |  -19.01227   736.7268    -0.03   0.979     -1462.97    1424.946
------------------------------------------------------------------------------

回帰分析は、全部変数を入れると、すべてが有意ではなくなる、の典型。

逸脱度の推定値を計算する

. lassogof ols_model

Penalized coefficients
-------------------------------------------------
            |                 Deviance
       Name |    Deviance        ratio        Obs
------------+------------------------------------
  ols_model |     1.19192       0.0746        534
-------------------------------------------------

n-fold Cross Validation

CVを用いたLassoを行う。CVの種類はn-fold Cross Validation。まずは最適なλを計算する。

// 最適なペナルティパラメータをクロスバリデーションで選択
lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(cv, folds(10)) nolog rseed(123)

// クロスバリデーションの結果を表示
di "Best lambda (penalty parameter) is: " e(lambda)

CV 関数をプロットする

cvplot, minmax

結果。

Lasso logit model                           No. of obs        =        534
                                            No. of covariates =         10
Selection: Cross-validation                 No. of CV folds   =         10

--------------------------------------------------------------------------
         |                                No. of      Out-of-
         |                               nonzero       sample      CV mean
      ID |     Description      lambda     coef.   dev. ratio     deviance
---------+----------------------------------------------------------------
       1 |    first lambda    .1325634         0      -0.0009     1.289243
      15 |   lambda before    .0360385         1       0.0532     1.219509
    * 16 | selected lambda     .032837         1       0.0532     1.219486
      17 |    lambda after    .0299198         2       0.0529     1.219896
      21 |     last lambda    .0206226         4       0.0509     1.222458
--------------------------------------------------------------------------
* lambda selected by cross-validation.

最適なλが計算できたので、λを利用してLasso重回帰を行う。

. lasso logit married gender age wage education experience ethnicity region occupation sector union, lambda(`e(lambda)') rseed(123)



Lasso logit model                           No. of obs        =        534
                                            No. of covariates =         10
Selection: Cross-validation                 No. of CV folds   =         10

--------------------------------------------------------------------------
         |                                No. of      Out-of-
         |                               nonzero       sample      CV mean
      ID |     Description      lambda     coef.   dev. ratio     deviance
---------+----------------------------------------------------------------
       1 |    first lambda    .1325634         0      -0.0009     1.289243
      15 |   lambda before    .0360385         1       0.0532     1.219509
    * 16 | selected lambda     .032837         1       0.0532     1.219486
      17 |    lambda after    .0299198         2       0.0529     1.219896
      21 |     last lambda    .0206226         4       0.0509     1.222458
--------------------------------------------------------------------------
* lambda selected by cross-validation.

. lassocoef CV, sort(coef,standardized) display(coef,standardized)

------------------------
             |        CV
-------------+----------
       _cons |  .6809119
         age |   .478971
------------------------

Adaptive Lasso

次に、Adaptive Lassoを行う。

結果。

. lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(adaptive) rseed(123)


Lasso logit model                          No. of obs         =        534
                                           No. of covariates  =         10
Selection: Adaptive                        No. of lasso steps =          2

Final adaptive step results
--------------------------------------------------------------------------
         |                                No. of      Out-of-
         |                               nonzero       sample      CV mean
      ID |     Description      lambda     coef.   dev. ratio     deviance
---------+----------------------------------------------------------------
      22 |    first lambda    .1325634         0      -0.0009     1.289243
      76 |   lambda before    .0008722         1       0.0624     1.207617
    * 77 | selected lambda    .0007947         1       0.0624     1.207617
--------------------------------------------------------------------------
* lambda selected by cross-validation in final adaptive step.


. lassocoef adaptive, sort(coef,standardized) display(coef,standardized)

------------------------
             |  adaptive
-------------+----------
   education |  2.133081
         age |   1.18055
      gender | -1.002929
       union |  .4719569
      sector | -.2850779
   ethnicity | -.1704469
      region |  .1324732
       _cons | -3.55e-15
------------------------

Plugin Lasso

次に行うのはPlugin Lasso。

. lasso logit married gender age wage education experience ethnicity region occupation sector union, selection(plugin) rseed(123)


Lasso logit model                           No. of obs        =        534
                                            No. of covariates =         10
Selection: Plugin homoskedastic

--------------------------------------------------------------------------
         |                                No. of
         |                               nonzero    In-sample
      ID |     Description      lambda     coef.   dev. ratio          BIC
---------+----------------------------------------------------------------
     * 1 | selected lambda    .0955425         1       0.0296     680.0375
--------------------------------------------------------------------------
* lambda selected by plugin formula assuming homoskedastic errors.

. lassocoef plugin, sort(coef,standardized) display(coef,standardized)

------------------------
             |    plugin
-------------+----------
       _cons |  .6474732
         age |  .1671559
------------------------

モデルの比較

3つのLassoを行ったので、モデル比較を行う。 逸脱の低いモデルを選択する。

. lassogof CV adaptive plugin

Penalized coefficients
-------------------------------------------------
            |                 Deviance
       Name |    Deviance        ratio        Obs
------------+------------------------------------
         CV |    1.210448       0.0602        534
   adaptive |    1.204337       0.0650        534
     plugin |    1.249956       0.0296        534
-------------------------------------------------

すべての推定量の予測のサンプル外MSEが推定される。予測のサンプル外MSEが最も低いものを選択すると最も良いモデルが選択できる。このケースでいえば、CVが最も低い値なので、CVで選ばれた推定が最も良いモデルだと判断できる。

参照:Lasso関連のエントリ

ides.hatenablog.com

ides.hatenablog.com

ides.hatenablog.com

ides.hatenablog.com

ides.hatenablog.com

ides.hatenablog.com

ides.hatenablog.com

ides.hatenablog.com