
Jonathan Callahan received his Ph.D. in Physical Chemistry from the University of Washington in 1993. After two years as a post-doc in a magnetic resonance imaging laboratory, Jonathan joined NOAA's Pacific Marine Environmental Laboratory to work on analysis and visualization software for oceanographic climate and model data. Since 2007 Jonathan has worked as an independent consultant for NOAA, NASA and the US EPA. His areas of expertise include: data management; data visualization; statistical analysis using R; interface design; data mining; web services architecture. Jonathan writes occasional articles on data management at Working With Data.

Learning R — The R Linksheet

06.17.2012
This entry is part 1 of 5 in the series Using R

We use R a lot.  R takes care of many of our basic data management needs.  R is an awesome statistical analysis package.  R allows you to produce exceptional data graphics.  The only problem is … R has a wicked learning curve.  In this post we provide tips on learning R for the first time and pointers to some of the most useful books and documentation we’ve come across.

R’s wicked learning curve is probably not surprising given that R was written by and for academic statisticians in a loosely coupled, open source environment.  It is helpful to understand a little of the history of R and how it relates to both S and S-PLUS.

History

R, the statistical software environment, is an implementation of S, the statistical programming language.  The S language was developed at AT&T Bell Labs by John Chambers and others in the late 1970s and early 1980s.  The goal of the language was “to turn ideas into software, quickly and faithfully.”  Back in the days when coding statistical analyses involved making calls to Fortran subroutines, the S language provided a way for statisticians to harness numerical analyses without becoming full-time programmers.

The two main implementations of the S programming language are open source R and commercial S-PLUS (S+).  The most obvious difference between them, besides price, is the more integrated nature of S+, which comes complete with an IDE (Integrated Development Environment).  In R’s favor is the large and growing community of R developers writing packages that continually enhance R’s functionality.  For scientific applications, where the data and analyses are less routine than in the business world, we favor R.

Learning R for the First Time

When approaching R for the first time it is important to let go of some of what you know about programming.  To many programmers, R has annoyingly unexpected behavior:  R has several different object types that behave differently in different situations; R remembers things you wouldn’t expect; R package methods, being developed by individuals, don’t always agree on argument names or behavior.  All of this is frustrating if you are only concerned about writing code.
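As a small illustration of object types behaving differently in different situations, here is a sketch using only base R.  The variable names are ours and purely illustrative; the three cases (factors, list indexing, vector recycling) are just common examples that tend to surprise programmers, not an exhaustive list.

```r
# A factor stores integer codes behind its labels, so converting it
# directly to numeric returns the codes, not the original values.
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # the internal codes: 1 2 1
values <- as.numeric(as.character(f))   # the labels as numbers: 10 20 10

# On a list, single brackets return a sub-list while double brackets
# return the element itself.
lst <- list(a = 1:3, b = "text")
single_class <- class(lst["a"])     # "list"
double_class <- class(lst[["a"]])   # "integer"

# Vectors of different lengths are recycled silently when the longer
# length is a multiple of the shorter one.
recycled <- 1:6 + c(10, 20)         # 11 22 13 24 15 26
```

None of this is a bug: each behavior is documented, which is one more reason to read the help pages as you go.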

However, R has an incredible amount of statistical and data visualization smarts that make doing statistics easy.  So it is important to learn about R from the point of view of a statistician rather than the point of view of a programmer.  Our favorite introduction to statistics with R is John Verzani’s “Using R for Introductory Statistics” which is available in print or as a PDF.

We recommend going through the entire book a chapter at a time.  It is important to understand the statistical concepts built into R before attempting to harness them to do work.  For many tasks, there is an R function that already does what you want.  Those who refuse to read up and learn about this powerful tool will end up writing hundreds of lines of ‘programmer code’ where only a line or two of ‘R code’ is needed.
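To make the contrast between ‘programmer code’ and ‘R code’ concrete, here is a minimal sketch that computes the mean fuel economy for each cylinder count in mtcars, a data set that ships with R.  The choice of data set and the variable names are our own assumptions for illustration.

```r
# 'Programmer code': loop over the groups and accumulate totals by hand.
cyl_values <- sort(unique(mtcars$cyl))
loop_means <- numeric(length(cyl_values))
for (i in seq_along(cyl_values)) {
  total <- 0
  n <- 0
  for (j in seq_len(nrow(mtcars))) {
    if (mtcars$cyl[j] == cyl_values[i]) {
      total <- total + mtcars$mpg[j]
      n <- n + 1
    }
  }
  loop_means[i] <- total / n
}

# 'R code': a single call that applies mean() to mpg within each cyl group.
r_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
```

Both versions produce the same group means; the second also labels each result with its group, which the loop version would need extra code to do.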

While you are going through Verzani’s examples you should take extra time to examine R’s built-in documentation.  Like Unix man pages, help in R is easily accessible, ASCII formatted and informationally dense.  Reading the help for each function you use will soon get you familiar with the full power of R’s functions.  You can open R’s help facility by typing ?help at the R prompt, and the help page for any function with ?functionname (for example, ?mean).
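A few ways into the built-in documentation are sketched below.  All of these functions are part of base R; mean and the search term "regression" are arbitrary examples chosen for illustration.

```r
?mean                        # open the help page for mean()
help("mean")                 # the same page, in function-call form
help.search("regression")    # keyword search of installed documentation (shorthand: ??regression)
example(mean)                # run the examples from the bottom of the help page
mean_args <- args(mean)      # a quick look at a function's arguments
matches <- apropos("^mean")  # names of visible objects that start with "mean"
```

Running example() on each new function is a quick way to see it in action before reading the details.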

The last thing you’ll need while getting started is a list of the most important commands.  Our favorite list is the R Reference Card described below.  Print out all four pages and tape them to your desk while working through Verzani’s book.  If you’re really intent on learning R, the payoff will be commensurate with the time you invest in studying.

Quick Summary:

  1. Buy or download Verzani’s “Using R for Introductory Statistics.”
  2. Download and print out the R Reference Card
  3. Work through the examples in Verzani’s book, using ?help to learn more.
  4. Explore the functions listed in the reference card.

The rest of this post will be a compilation of R resources that we have used and heartily recommend.  If you have your own favorites, please add them as a comment.

R Books

Using R for Introductory Statistics (pdf)
John Verzani’s primer is an excellent place to begin learning about both statistics and R.

Applied Spatial Data Analysis with R
Authors Bivand, Pebesma and Gómez-Rubio are key developers of R’s spatial capabilities. This book, released in 2008, explains many of the newer developments that enable R to take on some of the spatial analyses that had previously been the domain of GIS systems. If you work with spatial data, this book is a must.


Official R Web Pages

R Home page
The starting point for documentation, downloads, packages, etc.

Unofficial Web Sites

Quick R
The documentation on this site, maintained by Rob Kabacoff, is the easiest to navigate and understand of any compilation of R documentation we have come across.  If you’re looking for web-based documentation, come here first.

R Graph Gallery
Romain Francois maintains this site of amazing data visualizations created with R.  Each visualization comes with the code that generated it, so this is an excellent way to get inspired about data graphics.  Romain also maintains a related blog highlighting his R development efforts.

Using Color in R
Earl Glynn’s slide presentation is an excellent review of the use of color in scientific graphics in general and R in particular. Color blindness and palettes for dichromats are covered on slides 33-37.

Cheat Sheets

R Reference Card
This reference card is one that you will eventually want to have memorized.  In the meantime, print it out and post it on your wall or have it open in another window while you’re learning R.

R Color Chart
If you are a data-vis perfectionist you will want to have this chart handy.


Published at DZone with permission of Jonathan Callahan, author and DZone MVB. (source)
