Thursday, 24 January 2013

Working with R2MLwiN Part 1



Getting started with the R2MLwiN package

With the release of the R2MLwiN package late 2012, R users have access to another software package for running Bayesian models using Markov chain Monte Carlo (MCMC) methods. R2MLwiN is an R command interface to MLwiN, allowing users to fit multilevel models using MLwiN from within the R environment.

To use R2MLwiN, MLwiN needs to be installed. MLwiN is available from the Centre for Multilevel Modelling at the University of Bristol. See the MLwiN site for availability: MLwiN is free for UK researchers; there is a 30-day fully functional free trail version available; or MLwiN can be purchased from the Centre for Multilevel Modelling. As the name suggests, MLwiN is available for the Windows platform only.

R2MLwiN is available from CRAN.

Some Preliminaries

First, load the R2MLwiN package. If you need to install R2MLwiN, then type: install.packages("R2MLwiN", dependencies = TRUE) at the R prompt.

library(R2MLwiN)
The examples in these notes use the ALNT data file available from github. Assuming the data file is located in your working directory, the following command loads the data, and lists the variables.

alnt = read.csv("ALNT.csv", header = TRUE, sep = ",")
names(alnt)
But see the end of this blog for an easy way to get GitHub data into R.

The ALNT data file contains test data for 3097 students located in 70 primary school. The data are the results for a writing test administered to Year 7 students. Also, there are the results of a writing test obtained when the students were in Year 3.

The data are:
  • student - a unique id for each student
  • school - an id for each student's school
  • gender - a numeric code for gender: 0 = male; 1 = female
  • write3 - Year 3 writing scores
  • write7 - Year 7 writing scores
  • location - rural-urban variable for the location of the school (0 = metropolitan school; 1 = provincial city school; 2 = rural school; 3 = remote school)
  • dsi - Disadvantaged Schools Index - a measure of disadvantage for a school as a whole. Small DSIs mean greater disadvantage.

The ALNT data file needs some modification before it can be used by MLwiN. It needs a new variable, a column of 1s to be used as the 'intercept' variable.

alnt$cons = 1

Collecting the information to be passed to MLwiN

The information required by MLwiN includes:
    1. Path to MLwiN executable (mlwin.exe);
    2. The variables containing IDs for all levels - highest level first;
    3. Distribution to be modelled;
    4. Method of estimation (1 = MCMC);
    5. The data file; and
    6. The model.
These commands collect the information. The path in point 1 is the path to the MLwiN executable on my machine. It will probably need to be modified to suit different machines.

# 1 Path to MLwiN executable
mlwin = "C:/Program Files/MLwiN v2.26/i386"

# 2 variables containing IDs
levID = c("school", "student")

# 3 Distribution to be modelled
D = "Normal"

# 4 Method of estimation (1 = MCMC)
estoptions = list(EstM = 1)

# 5 The data file
indata = alnt

# 6 The model
formula = "write7 ~ (0 | cons) + (1 | cons) + (2 | cons)"
The first analysis is a variance components model. The analysis returns, among other things, estimates for the mean write7 score, and variances for the residuals at the student level and at the school level. The formulation of the model uses the ~ notation with the independent variable (write7) to the left, and predictor variables to the right. Except in this case, there are no predictor variables other than the constant term, the predictor "1". There is variation around the constant at the student level (represented by "(1 | cons)" and variation at the school level (represented by "(2 | cons)"). Thus, there are two parts to the model:
  • the fixed part: (0 | cons), and
  • the parts representing variation: (1 | cons) and (2 | cons).
The analysis returns an estimate for the fixed part, that is, the mean for write7, and estimates for variances of the parts that vary; that is, variance of the student level residuals and variance of the school level residuals.

An algebraic representation of the model is as follows:
\[ \begin{align} write7_{ij} &=\beta _{0} + u_{0j} + e_{0ij}\\\ \text{where } u_{0j} &\sim N(0, \hspace{.5em} \sigma_{u0}^{2})\\\ \text{and } e_{0ij} &\sim N(0, \hspace{.5em} \sigma _{e}^{2}) \end{align} \]
The analysis returns estimates for \( \beta_0 \), the mean write7 score, and the two variances, \( \sigma_{u0}^{2} \) and \( \sigma_{e}^{2} \).

Run the model

The model is run with a call to the runMLwiN() function. The function passes the relevant information to MLwiN, runs the model in MLwiN, then output is passed back to R for further processing. The model here is run using a number of default options: burn-in length (500), iterations (5000), priors (uninformative priors - see the MLWiN MCMC manual pages 4 and 62 for details on the default uninformative priors), the Gibbs sampler, and IGLS-generated starting values. (In MLwiN, the burn-in iterations are discarded, then the 5000 iterations to be monitored are stored.)

mod1 = runMLwiN(formula, levID = levID, D = D, indata = indata, estoptions = estoptions, 
    MLwiNPath = mlwin)

MLwiN is running, please wait......

-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- 
MLwiN multilevel model (Normal) 
Estimation algorithm:  MCMC      Elapsed time : 7.16s 
Number of obs:  3097             Number of iter.: 5000             Burn-in: 500 
Bayesian Deviance Information Criterion (DIC)
Dbar      D(thetabar)    pD      DIC
38153.164  38093.660  59.505     38212.668  
--------------------------------------------------------------------------------------------------- 
The model formula:
write7~(0|cons)+(1|cons)+(2|cons) 
Level 2: school     Level 1: student      
--------------------------------------------------------------------------------------------------- 
The fixed part estimates:  
               Coef.   Std. Err.        z   Pr(>|z|)         [95% Conf.   Interval]   ESS 
cons       742.85260     6.75865   109.90          0   ***    729.48333   755.99434   252 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  
--------------------------------------------------------------------------------------------------- 
The random part estimates at the school level: 
                Coef.   Std. Err.   [95% Conf.    Interval]    ESS 
var_cons   2534.62402   550.03020   1629.75209   3786.49343   2411 
--------------------------------------------------------------------------------------------------- 
The random part estimates at the student level: 
                 Coef.   Std. Err.    [95% Conf.     Interval]    ESS 
var_cons   13122.38574   343.77850   12466.03779   13828.37180   4600 
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- 
The output is split into a number of sections. The first section gives information about the number of observations, the method of estimation, iterations and burn-in length, and DIC statistics. The second section gives the model, and the names of the variables containing IDs for each level. The third section gives the estimates for the fixed part of the model; here, the mean of the Year 7 write scores. The coefficient is the mean of the posterior distribution; Std Err is the standard deviation of the posterior distribution; "95% Conf. Interval" gives the locations of the 2.5% and 97.5% quantiles of the posterior distribution; and ESS is the Estimated Sample Size upon which these estimates are based. The fourth and fifth sections give the estimates for the variances of the residuals at each level in the model.

Obtaining diagnostics on the MCMC chains

The 5000 iterations to be monitored are returned to R in the mod1 object, and they are stored in mod1["chains"]. Here, the iterations are extracted from mod1 and stored in a data frame called estimates.

estimates = mod1["chains"]
head(estimates)
  iteration deviance FP_cons RP2_var_cons RP1_var_cons
1         1    38147   742.6         2253        13573
2         2    38137   741.6         2924        13245
3         3    38148   741.9         2207        13198
4         4    38176   735.5         2204        13390
5         5    38152   739.9         2383        13523
6         6    38167   738.2         2821        13596
The variable names in the estimates data frame are generic names: FP stands for "Fixed Part"; RP2_var stands for "Random Part variance at level 2"; and RP1_var stands for "Random Part variance at level 1". R2MLwiN provides two functions for further processing of the estimates. The first function, trajectories(), plots the trajectories for each parameter and deviance. The Range argument allows specific estimates to be selected for plotting - here, the estimates from the last 500 iterations are plotted.

trajectories(estimates, Range = c(4501, 5000))

plot of chunk trajectories_1

The second function, sixway(), provides detailed diagnostic information for the parameters. The following code requests diagnostics for the mean year 7 score, and is given the label "beta_0".

sixway(estimates[, "FP_cons"], "beta_0")

plot of chunk sixway_1

The upper-left panel shows the trajectory for the full chain. The upper-right panel shows the density of the posterior distribution. The two panels in the second row plot the autocorrelations and partial autocorrelations. The left panel in the third row plots the Monte Carlo standard error of posterior distribution against the number of iterations. The right panel in the third row gives the Raftery-Lewis and and Brooks-Draper diagnostics. The bottom panel provides summary statistics of the posterior. (The MLWiN MCMC manual (pp. 38-40) contains further details.)

Obtaining diagnostics for a derived quantity

Quantities such as the variance partition coefficient (or ICC) are not estimated directly as part of the model but many can be derived from estimates returned by the analysis. The vpc is a function of level 1 and level 2 variance parameters:
\[ vpc = \frac{\sigma _{u}^{2}}{\sigma _{e}^{2} + \sigma _{u}^{2}} \]
The point estimate for vpc can be calculated from the point estimates of these parameters, but given that the chains of the variance parameters are available in the estimates data frame, a chain of vpc values can be calculated and viewed. Using the names of the variance parameters in the estimates data frame (RP2_var_cons and RP1_var_cons), a chain for the vpc variable is calculated and added to estimates.

estimates$vpc = estimates$RP2_var_cons/(estimates$RP1_var_cons + estimates$RP2_var_cons)
And now the trajectory and the diagnostics are called as before.

trajectories(estimates[, "vpc"], Range = c(4501, 5000))
sixway(estimates[, "vpc"], "vpc")

plot of chunk trajectories_vpc

plot of chunk sixway_vpc


Getting GitHub data into R

Christopher Gandrud's Source_GitHubData function makes it easy to get GitHub data into R. Click on this link and you will find the function near the end of the post. Run the function in an R session. The URL for the ALNT data file is https://raw.github.com/SandyMuspratt/Blog-files/master/R2MLwiN%20I/ALNT.csv. Or, the shortened bitly URL is: http://bit.ly/10A5LBY

The following commands should get the ALNT data into your R session.

# Christopher Gandrud's source_GitHubData function
source_GitHubData <-function(url, sep = ",", header = TRUE)
{
   require(httr)
   request <- GET(url)
   stop_for_status(request)
   handle <- textConnection(content(request, as = 'text'))
   on.exit(close(handle))
   read.table(handle, sep = sep, header = header)
}

url = "http://bit.ly/10A5LBY"
alnt = source_GitHubData(url = url)
names(alnt)
[1] "student"  "school"   "gender"   "write3"   "write7"   "location" "dsi"



1 comment:

  1. Hello,

    When I run the sixway(estimates[, "FP_cons"], "beta_0") command
    I got the following error

    Error in plot.new() : figure margins too large
    In addition: Warning message:
    In par(new = TRUE) : calling par(new=TRUE) with no plot
    Error in par(mypar) : object 'mypar' not found

    What should I do in this situation?
    Thank you

    ReplyDelete