Introduction To Strings ======================================================== author: John S date: 21 September 2015 font-family: 'Garamond' transition: none Agenda === - Introduction to strings and string operations - Extracting and manipulating string objects - Introduction to general search Why Characters? ================ Most data we deal with is in character form! - web pages can be scraped - email can be analyzed for network properties - survey responses must be processed and compared Even if you only care about numbers, it helps to be able to extract them from text and manipulate them easily. The Simplest Distinction ======================== - ***Character***: a symbol in a written language, specifically what you can enter at a keyboard: letters, numerals, punctuation, space, newlines, etc. ``` 'L', 'i', 'n', 'c', 'o', 'l' ``` - ***String***: a sequence of characters bound together ``` Lincoln ``` Note: R does not have a separate type for characters and strings ```{r} mode("L") mode("Lincoln") class("Lincoln") ``` Making Strings ============== Use single or double quotes to construct a string; use `nchar()` to get the length of a single string. Why do we prefer double quotes? ```{r} "Lincoln" "Abraham Lincoln" "Abraham Lincoln's Hat" "As Lincoln never said, \"Four score and seven beers ago\"" ``` Whitespace ========== The space, `" "` is a character; so are multiple spaces `" "` and the empty string, `""`. Some characters are special, so we have "escape characters" to specify them in strings. - quotes within strings: `\"` - tab: `\t` - new line `\n` and carriage return `\r` -- use the former rather than the latter when possible The character data type ========================= One of the atomic data types, like `numeric` or `logical` Can go into scalars, vectors, arrays, lists, or be the type of a column in a data frame. ```{r} length("Abraham Lincoln's beard") length(c("Abraham", "Lincoln's", "beard")) nchar("Abraham Lincoln's beard") nchar(c("Abraham", "Lincoln's", "beard")) ``` Character-Valued Variables ========================== They work just like others, e.g., with vectors: ```{r} president <- "Lincoln" nchar(president) # NOT 9 presidents <- c("Fillmore","Pierce","Buchanan","Davis","Johnson") presidents[3] presidents[-(1:3)] ``` Displaying Characters ===================== We know `print()`, of course; `cat()` writes the string directly to the console. If you're debugging, `message()` is R's preferred syntax. ```{r} print("Abraham Lincoln") cat("Abraham Lincoln") cat(presidents) message(presidents) ``` Substring Operations ================ ***Substring***: a smaller string from the big string, but still a string in its own right. A string is not a vector or a list, so we ***cannot*** use subscripts like `[[ ]]` or `[ ]` to extract substrings; we use `substr()` instead. ```{r} phrase <- "Christmas Bonus" substr (phrase, start=8, stop=12) ``` We can also use `substr` to replace elements: ```{r} substr(phrase, 13, 13) <- "g" phrase ``` substr() for String Vectors ============================ `substr()` vectorizes over all its arguments: ```{r} presidents substr(presidents,1,2) # First two characters substr(presidents,nchar(presidents)-1,nchar(presidents)) # Last two substr(presidents,20,21) # No such substrings so return the null string substr(presidents,7,7) # Explain! ``` Dividing Strings into Vectors ============================= `strsplit()` divides a string according to key characters, by splitting each element of the character vector `x` at appearances of the pattern `split`. ```{r} scarborough.fair <- "parsley, sage, rosemary, thyme" strsplit (scarborough.fair, ",") strsplit (scarborough.fair, ", ") ``` Pattern is recycled over elements of the input vector: ```{r} strsplit (c(scarborough.fair, "Garfunkel, Oates", "Clement, McKenzie"), ", ") ``` Note that it outputs a `list` of character vectors -- why should this be the default? Combining Vectors into Strings ============================== Converting one variable type to another is called ***casting***: ```{r} as.character(7.2) # Obvious as.character(7.2e12) # Obvious as.character(c(7.2,7.2e12)) # Obvious as.character(7.2e5) # Not quite so obvious ``` Building strings from multiple parts ==================== The `paste()` function is very flexible! With one vector argument, works like `as.character()`: ```{r} paste(41:45) ``` Building strings from multiple parts ==================== With 2 or more vector arguments, combines them with recycling: ```{r} paste(presidents,41:45) paste(presidents,c("R","D")) # Not historically accurate! paste(presidents,"(",c("R","D"),41:45,")") ``` Building strings from multiple parts ==================== Changing the separator between pasted-together terms: ```{r} paste(presidents, " (", 41:45, ")", sep="_") paste(presidents, " (", 41:45, ")", sep="") ``` Exercise: what happens if you give `sep` a vector? A More Complicated Example of Recycling ======================================= Exercise: Convince yourself of why this works as it does ```{r} paste(c("HW","Lab"),rep(1:11,times=rep(2,11))) ``` Condensing Multiple Strings =========================== Producing one big string: ```{r} paste(presidents, " (", 41:45, ")", sep="", collapse="; ") ``` Default value of `collapse` is `NULL` -- that is, it won't use it A function for writing regression formulas =========================================== R has a standard syntax for models: outcome and predictors. ```{r} my.formula <- function(dep,indeps,df) { rhs <- paste(colnames(df)[indeps], collapse="+") return(paste(colnames(df)[dep], " ~ ", rhs, collapse="")) } my.formula(2,c(3,5,7),df=state.x77) ``` Text of Some Importance ========= If we shall suppose that American slavery is one of those offenses which, in the providence of God, must needs come, but which, having continued through His appointed time, He now wills to remove, and that He gives to both North and South this terrible war as the woe due to those by whom the offense came, shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away. Yet, if God wills that it continue until all the wealth piled by the bondsman's two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said "the judgments of the Lord are true and righteous altogether." More text ========= ```{r} al2 <- readLines("al2.txt") length(al2) head(al2) ``` `al2` is a vector, one element per line of text A Hint Of The Future: Search ============================ Narrowing down entries: use `grep()` to find which strings have a matching search term ```{r} grep("God", al2) grepl("God", al2) al2[grep("God", al2)] ``` Reconstituting ============== Make one long string, then split the words ```{r} al2 <- paste(al2, collapse=" ") al2.words <- strsplit(al2, split=" ")[[1]] head(al2.words) ``` Counting Words with table() ============================= Tabulate how often each word appears, put in order: ```{r} wc <- table(al2.words) wc <- sort(wc,decreasing=TRUE) head(wc,20) ``` `names(wc)` gives all the distinct words in `al2.words` (***types***); `wc` counts how often they appear (***tokens***) Unexpected ========== The null string is the third-most-common word: ```{r} names(wc)[3] ``` ```{r} wc["years"] wc["years,"] ``` Unexpected ========== Capitalization: ```{r} wc["that"] wc["That"] ``` All of this can be fixed if we learn how to work with text patterns and not just constants. Summary ======= - Text is data, just like everything else - `substr()` extracts and substitutes - `strsplit()` turns strings into vectors - `paste()` turns vectors into strings - `table()` for counting how many tokens belong to each type Next time: searching for text patterns using regular expressions Summary =======