Lecture 12 ======================================================== author: John S date: October 21, 2015 More on *apply family ======================================================== - Continued... - Often things that can be done in a loop can often be done with an "apply type" function. - Sometimes this is faster. - It can be easier to program and read. - **split** the data - **apply** the function - **combine** the results - how's midterm going? Today ===== - Extended example with strike data Strike Data ======================================================== From Bruce Western, Sociology, Harvard. - Data frame of 8 columns: country, year, days on strike per 1000 workers, unemployment, inflation, left-wing share of gov’t, centralization of unions, union density - 625 observations from 18 countries, 1951–1985 - Since 18 × 35 = 630 > 625, some years missing from some countries Strike Data ======================================================== ```{r, tidy=T, cache=T} strike <- read.csv("~/strikes.csv") names(strike) strike[1,] ``` Question ========== Does having a friendlier government make labor action more or less likely? More specific: Is there a relationship between a country's ruling party alignment (left vs. right) and the volume of strikes? One way to address: linear regression by country. Covariate: left.parliament Response: strike.volume Why is it important to do it by country? ======== ```{r,fig.width=14,tidy=T} par(cex=2) plot(strike$left.parliament,strike$strike.volume, xlab="% Left Parliament",ylab="Strike Volume") ``` First step: look at all the data! ======== ```{r,echo=F,fig.width=14,tidy=T} plot(strike$left.parliament[strike$country=="Australia"],strike$strike.volume[strike$country=="Australia"], xlab="% Left Parliament",ylab="Strike Volume",main="Australia") ``` Use par(mfrow=c(6,3)) to make for all 18 countries. Could use a for loop: ===== ```{r,tidy=T,cache=T} coefs <- data.frame(country=unique(strike$country),int=NA,slope=NA) for (i in 1:dim(coefs)[1]){ temp <- strike[strike$country==coefs$country[i],] coefs[i,2:3] <- lm(strike.volume~left.parliament,data=temp)$coef } head(coefs) ``` Another way: sapply() === - sapply(x,fun,...) - write a function that takes one country's data, fits model, and return coefficients. ```{r,cache=T,tidy=T} fit.for.one.country <- function(one.country.data){ coef <- lm(strike.volume~left.parliament,data=one.country.data)$coef return(coef) } ``` sapply ===== Make a list that has datasets for each country in each element ```{r,cache=T,tidy=T} datasets <- split(strike,strike$country) names(datasets) ``` sapply ===== Make a list that has datasets for each country in each element ```{r,cache=T,tidy=T} fits <- sapply(datasets,fit.for.one.country) fits[1:2,1:3] ``` plot results: not too conlusive... === ```{r,echo=F,fig.width=14} par(cex=2) plot(fits[2,],xaxt="n",xlab="",ylab="Regression coefficient", main="Countrywise Labor Activity By Left-Wing Score") axis(side=1,at=seq(along=colnames(fits)),labels=colnames(fits), las=2,cex.axis=0.5) abline(h=0,col="grey") ``` Other ways: ===== * by() (Please see book.) * lapply(): same as sapply, but it gives a list of outcomes Summary: ========== * **split** - **apply**- **combine** can be done with a loop * *apply family is another option + sometimes much more efficient + easier to code and read (after you learn it!) + but, output can have a format that is hard to control * Next: plyr library will address that problem