--- title: "Lab 5: Pareto and the 1 percent" author: "Stat 597A" date: "Friday, 9 October 2015" output: pdf_document --- We continue to look at the very rich by turning to a more systematic data source than Forbes magazine, the World Top Incomes Database hosted by the Paris School of Economics [http://topincomes.g-mond.parisschoolofeconomics.eu]. This is derived from income tax reports, and compiles information about the very highest incomes in various countries over time, trying as hard as possible to produce numbers that are comparable across time and space. For most countries in most time periods, the upper end of the income distribution roughly follows a Pareto distribution, with probability density function \begin{equation} f(x,a) = \frac{(a-1)}{x_{min}}\left(\frac{x}{x_{min}} \right)^{-a}. \label{eqn:pareto-density} \end{equation} for incomes $x>x_{min}.$ (Typically, $x_min$ is large enough that only the richest 3-4% of the population falls above it.) As a (called the Pareto exponent) gets smaller, the distribution of income becomes more unequal, that is, more of the population’s total income is concentrated among the very richest people. The proportion of people whose income is at least xmin whose income is also at or above any level $w \ge x_{min}$ is thus \begin{equation} Pr(X\ge w) = \int_{w}^{\infty}f(x,a)dx = \left(\frac{w}{x_{min}} \right)^{-a+1}. \end{equation} We will use this to estimate how income inequality changed in the US over the last hundred years or so. (Whether the trends are good or bad or a mix is beyond our scope here.) WTID exports its data sets as .xlsx spreadsheets. For this lab session, we have extracted the relevant data and saved it as wtid-report.csv (please see course website). Part I 1. Open the file and make a new variable containing only the year, “P99”, “P99.5” and “P99.9” variables; these are the income levels which put one at the 99th, 99.5th, and 99.9th, percentile of income. What was P99 in 1972? P99.5 in 1942? P99.9 in 1922? You must identify these using your code rather than looking up the values manually. (You may want to modify the column names to make some of them shorter.) 2. Plot the three percentile levels against time. Make sure the axes are labeled appropriately, and in particular that the horizontal axis is labeled with years between 1913 and 2012, not just numbers from 1 to 100. 3. One can show from the earlier equations that one can estimate the exponent by the formula \begin{equation} a = 1 - \frac{\log{(10)}}{\log(P99/P99.9)}. \end{equation} Write a function, exponent.est_ratio() which takes in values for P99 and P99.9, and returns the value of $a$ implied by (3). Check that if P99=1e6 and P99.9=1e7, your function returns an $a$ of 2. Part II 4. Estimate $a$ for each year in the data set, using your exponent.est_ratio() function. If the function was written properly, you should not need to use a loop. Plot your estimate of a over time. Do the results look reasonable? (Remember that smaller exponents mean more income inequality.) 5. There were 160,681 households in the top 0.1\% in the US in 2012. Using your estimated value of $a$ for 2012, calculate approximately how many households had an income of over \$50 million. 6. The logic leading to (3) also implies that \begin{equation} \left(\frac{P99.5}{P99.9} \right)^{-a+1}=5. \end{equation} Write a function which takes P99.5, P99.9 and $a$, and calculates equation (4). Plot the values for each year, using the data and your estimates of the exponent. Add a horizontal line with vertical coordinate 5. How good is the fit? (Note: the formula in (3) is not the best way to estimate $a,$ but it is one of the simplest.)