## Thursday, 24 January 2013

### Getting started with the R2MLwiN package

With the release of the R2MLwiN package late 2012, R users have access to another software package for running Bayesian models using Markov chain Monte Carlo (MCMC) methods. R2MLwiN is an R command interface to MLwiN, allowing users to fit multilevel models using MLwiN from within the R environment.

To use R2MLwiN, MLwiN needs to be installed. MLwiN is available from the Centre for Multilevel Modelling at the University of Bristol. See the MLwiN site for availability: MLwiN is free for UK researchers; there is a 30-day fully functional free trail version available; or MLwiN can be purchased from the Centre for Multilevel Modelling. As the name suggests, MLwiN is available for the Windows platform only.

R2MLwiN is available from CRAN.

### Some Preliminaries

First, load the R2MLwiN package. If you need to install R2MLwiN, then type: install.packages("R2MLwiN", dependencies = TRUE) at the R prompt.

library(R2MLwiN)
The examples in these notes use the ALNT data file available from github. Assuming the data file is located in your working directory, the following command loads the data, and lists the variables.

alnt = read.csv("ALNT.csv", header = TRUE, sep = ",")
names(alnt)

But see the end of this blog for an easy way to get GitHub data into R.

The ALNT data file contains test data for 3097 students located in 70 primary school. The data are the results for a writing test administered to Year 7 students. Also, there are the results of a writing test obtained when the students were in Year 3.

The data are:
• student - a unique id for each student
• school - an id for each student's school
• gender - a numeric code for gender: 0 = male; 1 = female
• write3 - Year 3 writing scores
• write7 - Year 7 writing scores
• location - rural-urban variable for the location of the school (0 = metropolitan school; 1 = provincial city school; 2 = rural school; 3 = remote school)
• dsi - Disadvantaged Schools Index - a measure of disadvantage for a school as a whole. Small DSIs mean greater disadvantage.

The ALNT data file needs some modification before it can be used by MLwiN. It needs a new variable, a column of 1s to be used as the 'intercept' variable.

alntcons = 1  ### Collecting the information to be passed to MLwiN The information required by MLwiN includes: 1. Path to MLwiN executable (mlwin.exe); 2. The variables containing IDs for all levels - highest level first; 3. Distribution to be modelled; 4. Method of estimation (1 = MCMC); 5. The data file; and 6. The model. These commands collect the information. The path in point 1 is the path to the MLwiN executable on my machine. It will probably need to be modified to suit different machines. # 1 Path to MLwiN executable mlwin = "C:/Program Files/MLwiN v2.26/i386" # 2 variables containing IDs levID = c("school", "student") # 3 Distribution to be modelled D = "Normal" # 4 Method of estimation (1 = MCMC) estoptions = list(EstM = 1) # 5 The data file indata = alnt # 6 The model formula = "write7 ~ (0 | cons) + (1 | cons) + (2 | cons)"  The first analysis is a variance components model. The analysis returns, among other things, estimates for the mean write7 score, and variances for the residuals at the student level and at the school level. The formulation of the model uses the ~ notation with the independent variable (write7) to the left, and predictor variables to the right. Except in this case, there are no predictor variables other than the constant term, the predictor "1". There is variation around the constant at the student level (represented by "(1 | cons)" and variation at the school level (represented by "(2 | cons)"). Thus, there are two parts to the model: • the fixed part: (0 | cons), and • the parts representing variation: (1 | cons) and (2 | cons). The analysis returns an estimate for the fixed part, that is, the mean for write7, and estimates for variances of the parts that vary; that is, variance of the student level residuals and variance of the school level residuals. An algebraic representation of the model is as follows: \begin{align} write7_{ij} &=\beta _{0} + u_{0j} + e_{0ij}\\\ \text{where } u_{0j} &\sim N(0, \hspace{.5em} \sigma_{u0}^{2})\\\ \text{and } e_{0ij} &\sim N(0, \hspace{.5em} \sigma _{e}^{2}) \end{align} The analysis returns estimates for $$\beta_0$$, the mean write7 score, and the two variances, $$\sigma_{u0}^{2}$$ and $$\sigma_{e}^{2}$$. ### Run the model The model is run with a call to the runMLwiN() function. The function passes the relevant information to MLwiN, runs the model in MLwiN, then output is passed back to R for further processing. The model here is run using a number of default options: burn-in length (500), iterations (5000), priors (uninformative priors - see the MLWiN MCMC manual pages 4 and 62 for details on the default uninformative priors), the Gibbs sampler, and IGLS-generated starting values. (In MLwiN, the burn-in iterations are discarded, then the 5000 iterations to be monitored are stored.) mod1 = runMLwiN(formula, levID = levID, D = D, indata = indata, estoptions = estoptions, MLwiNPath = mlwin)  MLwiN is running, please wait...... -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- MLwiN multilevel model (Normal) Estimation algorithm: MCMC Elapsed time : 7.16s Number of obs: 3097 Number of iter.: 5000 Burn-in: 500 Bayesian Deviance Information Criterion (DIC) Dbar D(thetabar) pD DIC 38153.164 38093.660 59.505 38212.668 --------------------------------------------------------------------------------------------------- The model formula: write7~(0|cons)+(1|cons)+(2|cons) Level 2: school Level 1: student --------------------------------------------------------------------------------------------------- The fixed part estimates: Coef. Std. Err. z Pr(>|z|) [95% Conf. Interval] ESS cons 742.85260 6.75865 109.90 0 *** 729.48333 755.99434 252 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 --------------------------------------------------------------------------------------------------- The random part estimates at the school level: Coef. Std. Err. [95% Conf. Interval] ESS var_cons 2534.62402 550.03020 1629.75209 3786.49343 2411 --------------------------------------------------------------------------------------------------- The random part estimates at the student level: Coef. Std. Err. [95% Conf. Interval] ESS var_cons 13122.38574 343.77850 12466.03779 13828.37180 4600 -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-  The output is split into a number of sections. The first section gives information about the number of observations, the method of estimation, iterations and burn-in length, and DIC statistics. The second section gives the model, and the names of the variables containing IDs for each level. The third section gives the estimates for the fixed part of the model; here, the mean of the Year 7 write scores. The coefficient is the mean of the posterior distribution; Std Err is the standard deviation of the posterior distribution; "95% Conf. Interval" gives the locations of the 2.5% and 97.5% quantiles of the posterior distribution; and ESS is the Estimated Sample Size upon which these estimates are based. The fourth and fifth sections give the estimates for the variances of the residuals at each level in the model. ### Obtaining diagnostics on the MCMC chains The 5000 iterations to be monitored are returned to R in the mod1 object, and they are stored in mod1["chains"]. Here, the iterations are extracted from mod1 and stored in a data frame called estimates. estimates = mod1["chains"] head(estimates)   iteration deviance FP_cons RP2_var_cons RP1_var_cons 1 1 38147 742.6 2253 13573 2 2 38137 741.6 2924 13245 3 3 38148 741.9 2207 13198 4 4 38176 735.5 2204 13390 5 5 38152 739.9 2383 13523 6 6 38167 738.2 2821 13596 The variable names in the estimates data frame are generic names: FP stands for "Fixed Part"; RP2_var stands for "Random Part variance at level 2"; and RP1_var stands for "Random Part variance at level 1". R2MLwiN provides two functions for further processing of the estimates. The first function, trajectories(), plots the trajectories for each parameter and deviance. The Range argument allows specific estimates to be selected for plotting - here, the estimates from the last 500 iterations are plotted. trajectories(estimates, Range = c(4501, 5000))  The second function, sixway(), provides detailed diagnostic information for the parameters. The following code requests diagnostics for the mean year 7 score, and is given the label "beta_0". sixway(estimates[, "FP_cons"], "beta_0")  The upper-left panel shows the trajectory for the full chain. The upper-right panel shows the density of the posterior distribution. The two panels in the second row plot the autocorrelations and partial autocorrelations. The left panel in the third row plots the Monte Carlo standard error of posterior distribution against the number of iterations. The right panel in the third row gives the Raftery-Lewis and and Brooks-Draper diagnostics. The bottom panel provides summary statistics of the posterior. (The MLWiN MCMC manual (pp. 38-40) contains further details.) ### Obtaining diagnostics for a derived quantity Quantities such as the variance partition coefficient (or ICC) are not estimated directly as part of the model but many can be derived from estimates returned by the analysis. The vpc is a function of level 1 and level 2 variance parameters: $vpc = \frac{\sigma _{u}^{2}}{\sigma _{e}^{2} + \sigma _{u}^{2}}$ The point estimate for vpc can be calculated from the point estimates of these parameters, but given that the chains of the variance parameters are available in the estimates data frame, a chain of vpc values can be calculated and viewed. Using the names of the variance parameters in the estimates data frame (RP2_var_cons and RP1_var_cons), a chain for the vpc variable is calculated and added to estimates. estimatesvpc = estimates$RP2_var_cons/(estimates$RP1_var_cons + estimates\$RP2_var_cons)
And now the trajectory and the diagnostics are called as before.

trajectories(estimates[, "vpc"], Range = c(4501, 5000))
sixway(estimates[, "vpc"], "vpc")


#### Getting GitHub data into R

Christopher Gandrud's Source_GitHubData function makes it easy to get GitHub data into R. Click on this link and you will find the function near the end of the post. Run the function in an R session. The URL for the ALNT data file is https://raw.github.com/SandyMuspratt/Blog-files/master/R2MLwiN%20I/ALNT.csv. Or, the shortened bitly URL is: http://bit.ly/10A5LBY

The following commands should get the ALNT data into your R session.

# Christopher Gandrud's source_GitHubData function
source_GitHubData <-function(url, sep = ",", header = TRUE)
{
require(httr)
request <- GET(url)
stop_for_status(request)
handle <- textConnection(content(request, as = 'text'))
on.exit(close(handle))
}

url = "http://bit.ly/10A5LBY"
alnt = source_GitHubData(url = url)
names(alnt)
[1] "student"  "school"   "gender"   "write3"   "write7"   "location" "dsi"

#### 1 comment:

1. Hello,

When I run the sixway(estimates[, "FP_cons"], "beta_0") command
I got the following error

Error in plot.new() : figure margins too large