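Exporting the data from R
We start by creating the data.
# Load the CPS1985 wage data that ships with the AER package
library(AER)
data(CPS1985)
# Write it out as a Stata .dta file with rio
library(rio)
export(CPS1985, "CPS1985.dta")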
Stata do-file
// Load the data
use "CPS1985.dta", clear

// Fit OLS and store the results in memory as ols_model
quietly regress wage education experience age ethnicity region gender occupation sector union married
estimates store ols_model

// Compute the in-sample MSE
lassogof ols_model

// Select the penalty parameter by 10-fold cross-validation
lasso linear wage education experience age ethnicity region gender occupation sector union married, selection(cv, folds(10)) nolog rseed(123)

// Display the cross-validation result
di "Best lambda (penalty parameter) is: " e(lambda)

// Plot the CV function
cvplot, minmax

// Run the lasso linear regression with the selected penalty parameter
lasso linear wage education experience age ethnicity region gender occupation sector union married, lambda(`e(lambda)') rseed(123)
estimates store CV

// Show the lasso coefficients
lassocoef CV, sort(coef,standardized) display(coef,standardized)

// Adaptive Lasso
lasso linear wage education experience age ethnicity region gender occupation sector union married, selection(adaptive) rseed(123)
estimates store adaptive
lassocoef adaptive, sort(coef,standardized) display(coef,standardized)

// Plugin Lasso
lasso linear wage education experience age ethnicity region gender occupation sector union married, selection(plugin) rseed(123)
estimates store plugin
lassocoef plugin, sort(coef,standardized) display(coef,standardized)

// Model comparison
// Compute the prediction MSE of every estimator and prefer the one with the lowest MSE.
// No holdout sample is defined here, so the MSE is evaluated on the estimation sample.
lassogof CV adaptive plugin
Analysis
The first block goes as far as ordinary OLS. It is a standard multiple regression, so it needs no explanation.
// Load the data
use "CPS1985.dta", clear

// Fit OLS and store the results in memory as ols_model
quietly regress wage education experience age ethnicity region gender occupation sector union married
estimates store ols_model

// Compute the in-sample MSE
lassogof ols_model
n-fold Cross Validation
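Next we run a lasso with the penalty chosen by cross-validation. The type of CV is n-fold cross-validation, here with 10 folds. The first step is to compute the optimal λ.
// Select the penalty parameter by 10-fold cross-validation
lasso linear wage education experience age ethnicity region gender occupation sector union married, selection(cv, folds(10)) nolog rseed(123)

// Display the cross-validation result
di "Best lambda (penalty parameter) is: " e(lambda)

// Plot the CV function
cvplot, minmax
Results: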
Lasso linear model                          No. of obs        =        534
                                            No. of covariates =         10
Selection: Cross-validation                 No. of CV folds   =         10
--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
       1 |    first lambda    1.960896         0      -0.0011     26.38862
      61 |   lambda before    .0073826         9       0.2530     19.69139
    * 62 | selected lambda    .0067268         9       0.2530     19.69136
      63 |     last lambda    .0061292         9       0.2530     19.69153
--------------------------------------------------------------------------
* lambda selected by cross-validation.
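As a reminder of what λ controls, the linear lasso is usually written as the penalized least-squares problem below. This is the standard textbook form, given only as a sketch: the symbols N, p, β₀ and β are notation assumed here, and the exact scaling and per-coefficient penalty loadings that Stata applies internally are not visible in the output above.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i'\beta\bigr)^2
    \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

A larger λ pushes more coefficients to exactly zero. This matches the table: at the first λ (1.960896) no coefficient is nonzero, while at the selected λ (.0067268) nine are.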

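With the optimal λ computed, we rerun the lasso linear regression using that λ. The selection table it prints is the same as in the cross-validation run above.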
. lasso linear wage education experience age ethnicity region gender occupation sector union marrie, lambda(`e(lambda)') rseed(123)
Lasso linear model                          No. of obs        =        534
                                            No. of covariates =         10
Selection: Cross-validation                 No. of CV folds   =         10
--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
       1 |    first lambda    1.960896         0      -0.0011     26.38862
      61 |   lambda before    .0073826         9       0.2530     19.69139
    * 62 | selected lambda    .0067268         9       0.2530     19.69136
      63 |     last lambda    .0061292         9       0.2530     19.69153
--------------------------------------------------------------------------
* lambda selected by cross-validation.
. lassocoef CV, sort(coef,standardized) display(coef,standardized)
------------------------
             |        CV
-------------+----------
   education |   2.08908
         age |  1.122071
      gender | -1.076161
       union |  .5882659
      sector |  -.460368
   ethnicity | -.3305098
      region |  .2912564
  occupation |  .2659422
     married |  .1929359
       _cons |         0
------------------------
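The table above shows coefficients on the standardized scale, sorted by size. As a possible follow-up, the sketch below refits the same CV lasso and then looks at two other views of it: `lassoknots` for how the set of selected variables changes along the λ grid, and `lassocoef` with the `postselection` display for coefficients on the original scale of the variables. The file name and variable list are taken from the do-file above; the rest is an illustration, not part of the original analysis.

```stata
// Sketch: refit the CV lasso from the do-file, then inspect it in two more ways
use "CPS1985.dta", clear

lasso linear wage education experience age ethnicity region gender ///
    occupation sector union married, selection(cv, folds(10)) nolog rseed(123)
estimates store CV

// Table of knots: the lambda values at which variables enter or leave the model
lassoknots

// Postselection coefficients (unpenalized refit on the selected variables,
// reported for the unstandardized variables)
lassocoef CV, display(coef, postselection)
```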
Adaptive Lasso
Next we run the adaptive lasso. The only difference in the code is selection(adaptive).
lasso linear wage education experience age ethnicity region gender occupation sector union married, selection(adaptive) rseed(123)
estimates store adaptive
lassocoef adaptive, sort(coef,standardized) display(coef,standardized)
Results:
. lasso linear wage education experience age ethnicity region gender occupation sector union marri, selection(adaptive) rseed(123)
Lasso linear model                          No. of obs         =       534
                                            No. of covariates  =        10
Selection: Adaptive                         No. of lasso steps =         2
Final adaptive step results
--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
      64 |    first lambda    9.819472         0      -0.0011     26.38862
     109 |   lambda before    .1492472         7       0.2565     19.60032
   * 110 | selected lambda    .1359885         7       0.2565      19.5987
     111 |    lambda after    .1239077         7       0.2564     19.60266
     146 |     last lambda    .0047748         9       0.2530     19.69093
--------------------------------------------------------------------------
* lambda selected by cross-validation in final adaptive step.
. lassocoef adaptive, sort(coef,standardized) display(coef,standardized)
------------------------
             |  adaptive
-------------+----------
   education |  2.133081
         age |   1.18055
      gender | -1.002929
       union |  .4719569
      sector | -.2850779
   ethnicity | -.1704469
      region |  .1324732
       _cons | -3.55e-15
------------------------
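The adaptive lasso differs from the plain CV lasso in that it reweights the penalty using an earlier fit, so variables with small first-step coefficients are penalized more heavily in the next step. A common way to write this is shown below. The first-step estimate and the exponent δ are notation assumed here for illustration; the output above does not report which exponent was used.

```latex
\hat{\beta}^{\text{adaptive}}(\lambda)
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i'\beta\bigr)^2
    \;+\; \lambda \sum_{j=1}^{p} w_j \lvert \beta_j \rvert ,
\qquad
w_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\delta}}
```

Here \tilde{\beta}_j is the coefficient from the preceding lasso step, and a variable whose first-step coefficient is exactly zero drops out of later steps. This is consistent with the output above, which reports two lasso steps and 7 selected variables, compared with 9 for the CV lasso.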
Plugin Lasso
Next is the plugin lasso. Again, the only difference in the code is selection(plugin).
lasso linear wage education experience age ethnicity region gender occupation sector union married, selection(plugin) rseed(123)
estimates store plugin
lassocoef plugin, sort(coef,standardized) display(coef,standardized)
Results:
. lasso linear wage education experience age ethnicity region gender occupation sector union marri, selection(plugin) rseed(123)
Computing plugin lambda ...
Iteration 1:     lambda = .1502937   no. of nonzero coef. =       4
Iteration 2:     lambda = .1502937   no. of nonzero coef. =       4
Lasso linear model                          No. of obs        =        534
                                            No. of covariates =         10
Selection: Plugin heteroskedastic
--------------------------------------------------------------------------
         |                                No. of
         |                               nonzero    In-sample
      ID |     Description      lambda     coef.    R-squared          BIC
---------+----------------------------------------------------------------
     * 1 | selected lambda    .1502937         4       0.1910     3180.827
--------------------------------------------------------------------------
* lambda selected by plugin formula assuming heteroskedastic errors.
. lassocoef plugin, sort(coef,standardized) display(coef,standardized)
------------------------
             |    plugin
-------------+----------
   education |  1.301289
         age |  .4256132
      gender | -.4170235
       union |  .0935011
       _cons |         0
------------------------
Model comparison
Having fitted three lassos, we now compare the models.
. lassogof CV adaptive plugin
Penalized coefficients
-------------------------------------------------
       Name |         MSE    R-squared        Obs
------------+------------------------------------
         CV |    18.90849       0.2827        534
   adaptive |    19.08944       0.2758        534
     plugin |     21.3259       0.1910        534
-------------------------------------------------
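lassogof reports the MSE and R-squared of each estimator's predictions, and the model with the lowest MSE is preferred. Here the CV lasso has the lowest MSE (18.90849), so the CV-selected fit is judged the best of the three. Note that no holdout sample was set aside: all three rows use the same 534 observations that were used for estimation, and the plugin R-squared of 0.1910 equals the in-sample R-squared in its estimation output. These figures are therefore in-sample measures, not out-of-sample ones.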
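One way to make the comparison explicitly out-of-sample is to hold out part of the data before fitting. The following is a minimal sketch of that workflow, not the analysis run above: the 75/25 split, the seeds, the variable name sample, and the stored-estimate names ending in _train are all assumptions chosen for illustration.

```stata
// Sketch only: split proportions, seeds, and names below are illustrative assumptions
use "CPS1985.dta", clear

// Randomly assign each observation to group 1 (training, 75%) or group 2 (testing, 25%)
splitsample, generate(sample) split(.75 .25) rseed(12345)

// Covariate list copied from the do-file above
local covars education experience age ethnicity region gender ///
    occupation sector union married

// Fit each lasso on the training observations only
lasso linear wage `covars' if sample == 1, selection(cv, folds(10)) nolog rseed(123)
estimates store cv_train

lasso linear wage `covars' if sample == 1, selection(adaptive) nolog rseed(123)
estimates store adaptive_train

lasso linear wage `covars' if sample == 1, selection(plugin) nolog rseed(123)
estimates store plugin_train

// Report MSE and R-squared separately for the training and testing groups
lassogof cv_train adaptive_train plugin_train, over(sample)
```

In the resulting table, the rows for sample 2 are the out-of-sample figures, and the estimator with the lowest MSE there would be the one to prefer.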