lecture13
========================================================
author: 
date: 

split-apply-combine
========================================================

A common data analysis task has three general steps:

- **Split** the data object into meaningful chunks
- **Apply** the function of interest to each chunk
- **Combine** the results into a new object
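
The three steps above can be sketched in base R. The `toy` data frame and its values are invented for illustration and are not the lecture's data set.

```{r,tidy=T}
# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  Country = c("A", "A", "B", "B", "C"),
  growth  = c(1, 3, 2, 5, 4)
)

# Split: one vector of growth values per country
chunks <- split(toy$growth, toy$Country)

# Apply: take the maximum of each chunk
maxima <- sapply(chunks, max)

# Combine: assemble the results into a new data frame
result <- data.frame(Country = names(maxima), max.growth = unname(maxima))
result
```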

*apply() and related functions
========================================================

- apply()
- tapply()
- sapply()
- split()

All are useful, but the type and shape of the output can be hard to control.

Example: tapply()
=============
```{r,cache=T,tidy=T}
data <- read.csv("~/debt.csv")
max.growths <- tapply(data$growth,data$Country,max)
max.growths[1:3]
is.vector(max.growths)
```
Huh?
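
A likely source of the surprise: `tapply()` returns a one-dimensional array that carries `dim` and `dimnames` attributes, and `is.vector()` is `FALSE` for any object with attributes other than names. A minimal sketch on invented toy data (not `debt.csv`):

```{r,tidy=T}
# Hypothetical toy data, invented for this sketch
toy <- data.frame(Country = c("A", "A", "B"), growth = c(1, 3, 2))
toy.max <- tapply(toy$growth, toy$Country, max)

is.vector(toy.max)            # not a plain vector
is.array(toy.max)             # a 1-d array instead
names(attributes(toy.max))    # the attributes that get in the way

# as.vector() strips the attributes and leaves a plain numeric vector
plain <- as.vector(toy.max)
is.vector(plain)
```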

Example can be cleaned up...
======
```{r,cache=T,tidy=T}
names(max.growths)[1:3]
data.frame(country=names(max.growths),max.growth=as.vector(max.growths))[1:5,]
```

The plyr package is an easier way
=======
- Install the package (Tools menu in RStudio, or `install.packages("plyr")`), then load it
```{r,cache=T,tidy=T}
library(plyr)
```
- Main function syntax: `?*ply()`
- `?` = input data type: `d`, `l`, `a` (data frame, list, or array)
- `*` = output data type: `d`, `l`, `a`, `_` (data frame, list, array, or nothing)
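
As a rough illustration of the naming scheme, the sketch below feeds the same data frame to three `d*ply` functions and checks what comes back. The `toy` data and the `out.*` names are invented for this example.

```{r,tidy=T}
library(plyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(Country = c("A", "A", "B", "B"), growth = c(1, 3, 2, 5))

# d -> d: data frame in, data frame out
out.df <- ddply(toy, .(Country), summarize, max.growth = max(growth))
class(out.df)

# d -> l: data frame in, list out (one element per country)
out.list <- dlply(toy, .(Country), function(d) max(d$growth))
is.list(out.list)

# d -> _: data frame in, nothing out (the function is run for its side effects)
out.none <- d_ply(toy, .(Country), function(d) NULL)
is.null(out.none)
```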

Author: Hadley Wickham
===

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/.


today: d*ply
=======

Syntax: `output <- d*ply(.data, .(splitvariable), .fun, ...)`


- .data – data frame
- .(...) – variables to split by
- .fun – the function to be applied
- ... – additional arguments to function
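
To show how the `...` slot works, here is a minimal sketch in which an extra argument is forwarded to the applied function. The helper `growth.quantile`, its argument `p`, and the `toy` data are hypothetical names invented for this example.

```{r,tidy=T}
library(plyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(Country = c("A", "A", "A", "B", "B", "B"),
                  growth  = c(1, 2, 9, 4, 5, 6))

# Hypothetical helper: receives one country's rows plus an extra argument p
growth.quantile <- function(one.c.data, p)
  data.frame(q = quantile(one.c.data$growth, probs = p, names = FALSE))

# p = 0.5 is not an argument of ddply itself, so it is passed on to .fun
medians <- ddply(toy, .(Country), growth.quantile, p = 0.5)
medians
```

Any named argument that `ddply()` does not recognize as its own is handed to the function unchanged for every chunk.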


ddply
=======
```{r,cache=T,tidy=T}
ddply(data,.(Country),summarize,max.growth=max(growth))[1:5,]
```

dlply
=======
```{r,cache=T,tidy=T}
dlply(data,.(Country),summarize,max.growth=max(growth))
```

What does this do?
=======
```{r,cache=T,tidy=T}
ddply(data,.(Country),c("nrow", "ncol"))[1:5,]
```

What does this do?
=======
```{r,cache=T,tidy=T}
cor.by.country <- function(one.c.data)
  cor(one.c.data$growth,one.c.data$ratio)

ddply(data,.(Country),"cor.by.country")[1:5,]
```

example with no output
=======
```{r,cache=T,tidy=T,echo=T}
plot.by.country <- function(one.c.data,xlims,ylims)
{
  # Set up an empty frame (type="n") with shared axis limits and a correlation subtitle
  plot(one.c.data$ratio,one.c.data$growth,ylab="Growth",xlab="Ratio",
       main=paste(one.c.data$Country[1]),type="n",xlim=xlims,ylim=ylims,
       sub=paste("Correlation=",round(cor(one.c.data$growth,one.c.data$ratio),2)))
  # Mark each observation with its year instead of a point symbol
  text(one.c.data$ratio,one.c.data$growth,one.c.data$Year,cex=.5)
}
```

Try it on one country's data
=====
```{r,fig.width=10,fig.height=8,tidy=T}
par(cex=2)
plot.by.country(data[data$Country=="Italy",],xlims=range(data$ratio),ylims=range(data$growth))
```

Make a PDF book of plots
====
```{r,cache=T,tidy=T}
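# Open one multi-page PDF device
pdf("~/plots.pdf",onefile=TRUE)
# One country per page
d_ply(data,.(Country),plot.by.country,xlims=range(data$ratio),ylims=range(data$growth))

# Same plots again, now up to 20 per page on a 5 x 4 grid
par(mfrow=c(5,4),cex=.5)
d_ply(data,.(Country),plot.by.country,xlims=range(data$ratio),ylims=range(data$growth))
# Close the device so the file is written
dev.off()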
```