Exploratory data analysis in R

You have some data to play with! Now what? The first step in any data project is to explore the data: what is the structure of your data? What variables do you have? What patterns jump out at you? This mini-tutorial covers the basics of exploratory data analysis in R.

Intro - why EDA?

- Exploring the data can suggest new hypotheses.
- We look for patterns: variation within a variable and covariation between variables (see https://r4ds.had.co.nz/exploratory-data-analysis.html).
- We check each variable's distribution, skew, and outliers.

Parts of EDA

As Hadley Wickham writes in R for Data Science, the main goal of EDA is to look at variation and covariation in the data. In other words, what patterns do we see, and how do the patterns in different variables relate to each other?

The first step is to understand the structure of the data. Once we understand the structure of our data we can look at patterns in individual variables and finally look for covariation among variables.

Data structure:
- variable type
- summarize variables
Patterns:
- visualize summaries
- histograms
Covariance:
- correlations
- boxplots
- scatter plots
- dimension reduction (PCA, NLDR)

Get some data!

To see these ideas in practice let’s load a dataset to play with. Here we will use 2016 election poll data compiled by Rafael Irizarry in the dslabs package.

Start by installing the dslabs package by running install.packages("dslabs"). In this tutorial we will use and explore a number of additional packages. To make sure that they are installed, run: install.packages(c("dplyr","ggplot2","skimr","SmartEDA","GGally","Hmisc","psych","summarytools","corrplot")). If the GGally package does not load, try installing the development version with remotes::install_github("ggobi/ggally") (this requires the remotes package).

library(dslabs)
library(dplyr)
library(ggplot2)

polls <- polls_us_election_2016

1. Data Structure

dim(polls)

## [1] 4208   15

dim gives the dimensions of the data: the number of rows followed by the number of columns. Here we have 4208 x 15, which means 4208 rows of entries and 15 columns of variables.

glimpse(polls)

## Rows: 4,208
## Columns: 15
## $ state            <fct> U.S., U.S., U.S., U.S., U.S., U.S., U.S., U.S., Ne...
## $ startdate        <date> 2016-11-03, 2016-11-01, 2016-11-02, 2016-11-04, 2...
## $ enddate          <date> 2016-11-06, 2016-11-07, 2016-11-06, 2016-11-07, 2...
## $ pollster         <fct> ABC News/Washington Post, Google Consumer Surveys,...
## $ grade            <fct> A+, B, A-, B, B-, A, A-, A-, NA, A-, A+, A-, A+, B...
## $ samplesize       <int> 2220, 26574, 2195, 3677, 16639, 1295, 1426, 1282, ...
## $ population       <chr> "lv", "lv", "lv", "lv", "rv", "lv", "lv", "lv", "l...
## $ rawpoll_clinton  <dbl> 47.00, 38.03, 42.00, 45.00, 47.00, 48.00, 45.00, 4...
## $ rawpoll_trump    <dbl> 43.00, 35.69, 39.00, 41.00, 43.00, 44.00, 41.00, 4...
## $ rawpoll_johnson  <dbl> 4.00, 5.46, 6.00, 5.00, 3.00, 3.00, 5.00, 6.00, 6....
## $ rawpoll_mcmullin <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ adjpoll_clinton  <dbl> 45.20163, 43.34557, 42.02638, 45.65676, 46.84089, ...
## $ adjpoll_trump    <dbl> 41.72430, 41.21439, 38.81620, 40.92004, 42.33184, ...
## $ adjpoll_johnson  <dbl> 4.626221, 5.175792, 6.844734, 6.069454, 3.726098, ...
## $ adjpoll_mcmullin <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse summarizes several characteristics of the data: the number of rows and columns, the names of the variables, the variable types, and the first few values of each column. The variable type tells us how the data was entered. For instance, here state is stored as a factor, but it could have been a character vector, and knowing which of those we are working with helps us anticipate how the data will behave. Similarly, startdate is stored as a date, but it could have been a character or a factor. Seeing the first few values of each variable gives us an idea of the kind of entries to expect. glimpse is from the dplyr package.
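If dplyr is not loaded, the same type information can be recovered with base R. The sketch below uses a small made-up data frame, toy_polls, that only imitates a few columns of polls (its values are illustrative assumptions), and shows how a factor and a character column behave differently.

```r
# Hypothetical toy data standing in for a few columns of polls
toy_polls <- data.frame(
  state      = factor(c("U.S.", "Florida", "Ohio")),
  startdate  = as.Date(c("2016-11-03", "2016-11-01", "2016-11-02")),
  samplesize = c(2220L, 26574L, 2195L),
  population = c("lv", "lv", "rv"),
  stringsAsFactors = FALSE
)

# One class per column, comparable to the <fct>/<date>/<int>/<chr> tags in glimpse()
col_types <- sapply(toy_polls, class)
print(col_types)

# A factor carries a fixed set of levels; a character vector does not
print(levels(toy_polls$state))
print(levels(toy_polls$population))  # NULL, because population is character
```

With the real data, sapply(polls, class) should report the same types that glimpse shows above.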

head(polls)

##   state  startdate    enddate
## 1  U.S. 2016-11-03 2016-11-06
## 2  U.S. 2016-11-01 2016-11-07
## 3  U.S. 2016-11-02 2016-11-06
## 4  U.S. 2016-11-04 2016-11-07
## 5  U.S. 2016-11-03 2016-11-06
## 6  U.S. 2016-11-03 2016-11-06
##                                                     pollster grade samplesize
## 1                                   ABC News/Washington Post    A+       2220
## 2                                    Google Consumer Surveys     B      26574
## 3                                                      Ipsos    A-       2195
## 4                                                     YouGov     B       3677
## 5                                           Gravis Marketing    B-      16639
## 6 Fox News/Anderson Robbins Research/Shaw & Company Research     A       1295
##   population rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin
## 1         lv           47.00         43.00            4.00               NA
## 2         lv           38.03         35.69            5.46               NA
## 3         lv           42.00         39.00            6.00               NA
## 4         lv           45.00         41.00            5.00               NA
## 5         rv           47.00         43.00            3.00               NA
## 6         lv           48.00         44.00            3.00               NA
##   adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin
## 1        45.20163      41.72430        4.626221               NA
## 2        43.34557      41.21439        5.175792               NA
## 3        42.02638      38.81620        6.844734               NA
## 4        45.65676      40.92004        6.069454               NA
## 5        46.84089      42.33184        3.726098               NA
## 6        49.02208      43.95631        3.057876               NA

head shows us the first 6 rows of the data. This is helpful for understanding the variable names, how the data is entered, and what example entries look like.

Sometimes we want to know what the unique values are for a variable. For instance, let’s look at which population types were polled in this dataset.

unique(polls$population)

## [1] "lv" "rv" "a"  "v"

There were 4 population types: lv, rv, a, and v.

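For numeric variables we want summary statistics. These can be computed using the summary command, which shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum for each numeric variable. For factors it shows the counts of the most frequent levels, and for any column with missing values it also reports the number of NAs.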

summary(polls)

##             state        startdate             enddate          
##  U.S.          :1106   Min.   :2015-11-06   Min.   :2015-11-08  
##  Florida       : 148   1st Qu.:2016-08-10   1st Qu.:2016-08-21  
##  North Carolina: 125   Median :2016-09-23   Median :2016-09-30  
##  Pennsylvania  : 125   Mean   :2016-08-31   Mean   :2016-09-06  
##  Ohio          : 115   3rd Qu.:2016-10-20   3rd Qu.:2016-10-28  
##  New Hampshire : 112   Max.   :2016-11-06   Max.   :2016-11-07  
##  (Other)       :2477                                            
##                                      pollster        grade     
##  Ipsos                                   : 919   A-     :1085  
##  Google Consumer Surveys                 : 743   B      :1011  
##  SurveyMonkey                            : 660   C-     : 693  
##  YouGov                                  : 130   C+     : 329  
##  Rasmussen Reports/Pulse Opinion Research: 125   B+     : 204  
##  USC Dornsife/LA Times                   : 121   (Other): 457  
##  (Other)                                 :1510   NA's   : 429  
##    samplesize       population        rawpoll_clinton rawpoll_trump  
##  Min.   :   35.0   Length:4208        Min.   :11.04   Min.   : 4.00  
##  1st Qu.:  447.5   Class :character   1st Qu.:38.00   1st Qu.:35.00  
##  Median :  772.0   Mode  :character   Median :43.00   Median :40.00  
##  Mean   : 1148.2                      Mean   :41.99   Mean   :39.83  
##  3rd Qu.: 1236.5                      3rd Qu.:46.20   3rd Qu.:45.00  
##  Max.   :84292.0                      Max.   :88.00   Max.   :68.00  
##  NA's   :1                                                           
##  rawpoll_johnson  rawpoll_mcmullin adjpoll_clinton adjpoll_trump   
##  Min.   : 0.000   Min.   : 9.0     Min.   :17.06   Min.   : 4.373  
##  1st Qu.: 5.400   1st Qu.:22.5     1st Qu.:40.21   1st Qu.:38.429  
##  Median : 7.000   Median :25.0     Median :44.15   Median :42.765  
##  Mean   : 7.382   Mean   :24.0     Mean   :43.32   Mean   :42.674  
##  3rd Qu.: 9.000   3rd Qu.:27.9     3rd Qu.:46.92   3rd Qu.:46.290  
##  Max.   :25.000   Max.   :31.0     Max.   :86.77   Max.   :72.433  
##  NA's   :1409     NA's   :4178                                     
##  adjpoll_johnson  adjpoll_mcmullin
##  Min.   :-3.668   Min.   :11.03   
##  1st Qu.: 3.145   1st Qu.:23.11   
##  Median : 4.384   Median :25.14   
##  Mean   : 4.660   Mean   :24.51   
##  3rd Qu.: 5.756   3rd Qu.:27.98   
##  Max.   :20.367   Max.   :31.57   
##  NA's   :1409     NA's   :4178
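The NA counts in the summary output are easy to miss among the other statistics. As a rough sketch, missingness can be tabulated on its own with is.na(); the toy_polls data frame here is a made-up stand-in for polls, so the numbers are illustrative only.

```r
# Hypothetical toy data with missing values, imitating a few polls columns
toy_polls <- data.frame(
  samplesize       = c(2220L, NA, 2195L, 3677L),
  rawpoll_clinton  = c(47, 38.03, 42, 45),
  rawpoll_mcmullin = c(NA, NA, NA, 25)
)

# is.na() gives a logical matrix; column sums count NAs, column means give the share
na_count <- colSums(is.na(toy_polls))
na_share <- round(colMeans(is.na(toy_polls)), 2)

print(data.frame(na_count, na_share))
```

Running colSums(is.na(polls)) on the real data should reproduce the NA counts listed by summary.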

For categorical variables we can summarize the number of observations in each category using xtabs.
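Here is a minimal sketch of xtabs, using a small made-up data frame (toy_polls, an assumption for illustration) so that the counts are easy to verify by eye. The formula interface takes the grouping variables on the right-hand side of ~.

```r
# Hypothetical toy data imitating two categorical columns of polls
toy_polls <- data.frame(
  population = c("lv", "lv", "rv", "a", "lv", "v"),
  grade      = c("A", "B", "A", "B", "B", "A")
)

# One-way table: number of rows in each population type
pop_counts <- xtabs(~ population, data = toy_polls)
print(pop_counts)

# Two-way table: population type crossed with pollster grade
pop_by_grade <- xtabs(~ population + grade, data = toy_polls)
print(pop_by_grade)
```

On the real data the equivalent call would be xtabs(~ population, data = polls).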

2. Patterns

To assess patterns in the data we can visualize the summaries using histograms and boxplots. Here we will make plots using the ggplot2 package.

ggplot(data = polls, aes(x = samplesize)) + 
  geom_histogram()

ggplot(data = polls, aes(x = rawpoll_clinton)) +
  geom_histogram()
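By default geom_histogram uses 30 bins and prints a message suggesting you choose a better value. The summary above also shows that samplesize is strongly right-skewed (median 772, maximum 84292), so most of the bars get squeezed to the left. One possible remedy, sketched here on simulated data rather than on polls itself, is to set the number of bins explicitly and plot on a log scale. The toy_polls data frame and its log-normal parameters are assumptions chosen only to mimic a skewed variable.

```r
library(ggplot2)

# Hypothetical right-skewed sample sizes standing in for polls$samplesize
set.seed(1)
toy_polls <- data.frame(
  samplesize = round(rlnorm(500, meanlog = 6.6, sdlog = 0.8))
)

# Explicit bin count plus a log10 x-axis spreads out the long right tail
p_log <- ggplot(toy_polls, aes(x = samplesize)) +
  geom_histogram(bins = 40) +
  scale_x_log10() +
  labs(x = "Sample size (log scale)", y = "Number of polls")

print(p_log)
```

To try this on the real data, replace toy_polls with polls.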

3. Covariance

Looking at histograms by group can tell us whether certain groups tend to be higher or lower, and how large the groups are relative to each other. For instance, in the histogram below, the population type lv has much more data (higher counts), but the Clinton percentages for this group are not noticeably higher or lower than for the others.

ggplot(polls, aes(x=adjpoll_clinton, fill=factor(population))) +
  geom_histogram(color="#e9ecef", alpha=0.6, position = 'identity') +
  labs(title = "Histogram of Clinton percentages by population group",
       x = "Adjusted Clinton percentage",
       fill = "Population type")

Boxplots can also show the differences …

ggplot(polls, aes(x=factor(population), y=adjpoll_clinton)) + 
  geom_boxplot() +
  labs(y = "Percent for Clinton",
       x = "Population type",
       title = "Boxplot of Clinton polls by population type")
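A boxplot draws the median and quartiles of each group, so a grouped numeric summary is a natural companion to it. The following sketch uses dplyr on a small made-up data frame (toy_polls and its values are assumptions for illustration) to compute the group size, median, and interquartile range.

```r
library(dplyr)

# Hypothetical toy data imitating population and adjpoll_clinton from polls
toy_polls <- data.frame(
  population      = c("lv", "lv", "lv", "rv", "rv", "a"),
  adjpoll_clinton = c(45.2, 43.3, 42.0, 46.8, 44.0, 40.5)
)

# One row per group: count, median, and interquartile range
group_summary <- toy_polls %>%
  group_by(population) %>%
  summarise(
    n_polls    = n(),
    median_pct = median(adjpoll_clinton),
    iqr_pct    = IQR(adjpoll_clinton),
    .groups    = "drop"
  )

print(group_summary)
```

The same pipeline should work on polls directly, since it has columns with these names.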

Correlations show which variables are related. Correlation is only calculated for numeric variables, so we need to select the numeric variables first. Here we show two ways to calculate the correlations among variables. First we use the cor() function from base R.

polls_numeric <- polls %>% dplyr::select(where(is.numeric))

# Base R
corvar <- round(cor(polls_numeric, use = "complete.obs"), 2)
print(corvar)

##                  samplesize rawpoll_clinton rawpoll_trump rawpoll_johnson
## samplesize             1.00            0.43          0.23           -0.22
## rawpoll_clinton        0.43            1.00          0.03           -0.19
## rawpoll_trump          0.23            0.03          1.00           -0.18
## rawpoll_johnson       -0.22           -0.19         -0.18            1.00
## rawpoll_mcmullin       0.18            0.25         -0.31           -0.71
## adjpoll_clinton        0.21            0.86         -0.13           -0.02
## adjpoll_trump         -0.17           -0.28          0.79            0.13
## adjpoll_johnson       -0.36           -0.33          0.05            0.85
## adjpoll_mcmullin       0.16            0.22         -0.33           -0.69
##                  rawpoll_mcmullin adjpoll_clinton adjpoll_trump adjpoll_johnson
## samplesize                   0.18            0.21         -0.17           -0.36
## rawpoll_clinton              0.25            0.86         -0.28           -0.33
## rawpoll_trump               -0.31           -0.13          0.79            0.05
## rawpoll_johnson             -0.71           -0.02          0.13            0.85
## rawpoll_mcmullin             1.00           -0.02         -0.65           -0.75
## adjpoll_clinton             -0.02            1.00         -0.17           -0.12
## adjpoll_trump               -0.65           -0.17          1.00            0.35
## adjpoll_johnson             -0.75           -0.12          0.35            1.00
## adjpoll_mcmullin             1.00           -0.05         -0.67           -0.75
##                  adjpoll_mcmullin
## samplesize                   0.16
## rawpoll_clinton              0.22
## rawpoll_trump               -0.33
## rawpoll_johnson             -0.69
## rawpoll_mcmullin             1.00
## adjpoll_clinton             -0.05
## adjpoll_trump               -0.67
## adjpoll_johnson             -0.75
## adjpoll_mcmullin             1.00

#Hmisc
library(Hmisc)
rcorr(as.matrix(polls_numeric))

##                  samplesize rawpoll_clinton rawpoll_trump rawpoll_johnson
## samplesize             1.00            0.06         -0.01           -0.01
## rawpoll_clinton        0.06            1.00         -0.44           -0.30
## rawpoll_trump         -0.01           -0.44          1.00           -0.14
## rawpoll_johnson       -0.01           -0.30         -0.14            1.00
## rawpoll_mcmullin       0.18            0.25         -0.31           -0.71
## adjpoll_clinton        0.06            0.92         -0.64           -0.26
## adjpoll_trump         -0.02           -0.66          0.90           -0.04
## adjpoll_johnson       -0.05           -0.26         -0.04            0.84
## adjpoll_mcmullin       0.16            0.22         -0.33           -0.69
##                  rawpoll_mcmullin adjpoll_clinton adjpoll_trump adjpoll_johnson
## samplesize                   0.18            0.06         -0.02           -0.05
## rawpoll_clinton              0.25            0.92         -0.66           -0.26
## rawpoll_trump               -0.31           -0.64          0.90           -0.04
## rawpoll_johnson             -0.71           -0.26         -0.04            0.84
## rawpoll_mcmullin             1.00           -0.02         -0.65           -0.75
## adjpoll_clinton             -0.02            1.00         -0.73           -0.29
## adjpoll_trump               -0.65           -0.73          1.00           -0.05
## adjpoll_johnson             -0.75           -0.29         -0.05            1.00
## adjpoll_mcmullin             1.00           -0.05         -0.67           -0.75
##                  adjpoll_mcmullin
## samplesize                   0.16
## rawpoll_clinton              0.22
## rawpoll_trump               -0.33
## rawpoll_johnson             -0.69
## rawpoll_mcmullin             1.00
## adjpoll_clinton             -0.05
## adjpoll_trump               -0.67
## adjpoll_johnson             -0.75
## adjpoll_mcmullin             1.00
## 
## n
##                  samplesize rawpoll_clinton rawpoll_trump rawpoll_johnson
## samplesize             4207            4207          4207            2798
## rawpoll_clinton        4207            4208          4208            2799
## rawpoll_trump          4207            4208          4208            2799
## rawpoll_johnson        2798            2799          2799            2799
## rawpoll_mcmullin         30              30            30              30
## adjpoll_clinton        4207            4208          4208            2799
## adjpoll_trump          4207            4208          4208            2799
## adjpoll_johnson        2798            2799          2799            2799
## adjpoll_mcmullin         30              30            30              30
##                  rawpoll_mcmullin adjpoll_clinton adjpoll_trump adjpoll_johnson
## samplesize                     30            4207          4207            2798
## rawpoll_clinton                30            4208          4208            2799
## rawpoll_trump                  30            4208          4208            2799
## rawpoll_johnson                30            2799          2799            2799
## rawpoll_mcmullin               30              30            30              30
## adjpoll_clinton                30            4208          4208            2799
## adjpoll_trump                  30            4208          4208            2799
## adjpoll_johnson                30            2799          2799            2799
## adjpoll_mcmullin               30              30            30              30
##                  adjpoll_mcmullin
## samplesize                     30
## rawpoll_clinton                30
## rawpoll_trump                  30
## rawpoll_johnson                30
## rawpoll_mcmullin               30
## adjpoll_clinton                30
## adjpoll_trump                  30
## adjpoll_johnson                30
## adjpoll_mcmullin               30
## 
## P
##                  samplesize rawpoll_clinton rawpoll_trump rawpoll_johnson
## samplesize                  0.0000          0.5954        0.4689         
## rawpoll_clinton  0.0000                     0.0000        0.0000         
## rawpoll_trump    0.5954     0.0000                        0.0000         
## rawpoll_johnson  0.4689     0.0000          0.0000                       
## rawpoll_mcmullin 0.3356     0.1888          0.0984        0.0000         
## adjpoll_clinton  0.0002     0.0000          0.0000        0.0000         
## adjpoll_trump    0.1590     0.0000          0.0000        0.0339         
## adjpoll_johnson  0.0053     0.0000          0.0275        0.0000         
## adjpoll_mcmullin 0.3984     0.2434          0.0706        0.0000         
##                  rawpoll_mcmullin adjpoll_clinton adjpoll_trump adjpoll_johnson
## samplesize       0.3356           0.0002          0.1590        0.0053         
## rawpoll_clinton  0.1888           0.0000          0.0000        0.0000         
## rawpoll_trump    0.0984           0.0000          0.0000        0.0275         
## rawpoll_johnson  0.0000           0.0000          0.0339        0.0000         
## rawpoll_mcmullin                  0.9226          0.0000        0.0000         
## adjpoll_clinton  0.9226                           0.0000        0.0000         
## adjpoll_trump    0.0000           0.0000                        0.0088         
## adjpoll_johnson  0.0000           0.0000          0.0088                       
## adjpoll_mcmullin 0.0000           0.8085          0.0000        0.0000         
##                  adjpoll_mcmullin
## samplesize       0.3984          
## rawpoll_clinton  0.2434          
## rawpoll_trump    0.0706          
## rawpoll_johnson  0.0000          
## rawpoll_mcmullin 0.0000          
## adjpoll_clinton  0.8085          
## adjpoll_trump    0.0000          
## adjpoll_johnson  0.0000          
## adjpoll_mcmullin
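The two correlation matrices above do not agree (for example, samplesize and rawpoll_clinton correlate at 0.43 in the first and 0.06 in the second). The reason is how missing values are handled. cor() with use = "complete.obs" drops every row that has an NA in any column, and the n matrix shows that only 30 rows have McMullin values, so the first matrix is based on at most 30 polls. rcorr() instead deletes missing values pair by pair. The sketch below reproduces the effect on a tiny made-up data frame; toy and its columns a, b, and c are hypothetical.

```r
# Hypothetical toy data: column c is missing for half the rows,
# much like rawpoll_mcmullin is missing for most polls
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(NA, NA, NA, 1, 3, 2)
)

# Listwise deletion: only rows 4 to 6 (no NA anywhere) enter every correlation
cor_complete <- cor(toy, use = "complete.obs")

# Pairwise deletion: each pair uses all rows where both columns are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# The a-b correlation changes even though neither a nor b has missing values
print(round(cor_complete["a", "b"], 2))
print(round(cor_pairwise["a", "b"], 2))
```

For the polls data, cor(polls_numeric, use = "pairwise.complete.obs") should give values in line with the rcorr() output. Which option is appropriate depends on whether you want every correlation computed on the same set of rows.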

#Corrplot
library(corrplot)
corrplot(corvar, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

Scatter plots are useful to look at bivariate relationships: how two variables are related to each other.

ggplot(polls, aes(x=rawpoll_clinton, y=rawpoll_trump)) + geom_point()

ggplot(polls, aes(x=rawpoll_clinton, y=adjpoll_clinton)) + geom_point()

ggplot(polls, aes(x=rawpoll_trump, y=adjpoll_trump)) + geom_point()
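With several thousand points, scatter plots like these tend to overplot. A possible refinement, sketched here on simulated data, is to make the points semi-transparent and add a y = x reference line so that it is easy to see where the adjusted value sits above or below the raw value. The toy_polls data frame and the size of the simulated adjustment are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical raw and adjusted percentages; the adjustment is simulated noise
set.seed(2)
toy_polls <- data.frame(rawpoll_clinton = runif(300, min = 30, max = 55))
toy_polls$adjpoll_clinton <- toy_polls$rawpoll_clinton + rnorm(300, mean = 1, sd = 2)

p_scatter <- ggplot(toy_polls, aes(x = rawpoll_clinton, y = adjpoll_clinton)) +
  geom_point(alpha = 0.3) +                                    # transparency reveals dense regions
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") + # y = x reference line
  labs(x = "Raw Clinton percentage", y = "Adjusted Clinton percentage")

print(p_scatter)
```

To apply this to the real data, replace toy_polls with polls.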

Helpful packages

For this section we will use a variety of packages. To make sure that these packages are installed run the install.packages() command introduced above (if you have not already done so).

skimr:
Sometimes you need a quick look at the data and want to view many of the EDA summaries and visualizations together. The skimr package automates summaries for variables by giving an overview of the data, then summarizing character, Date, factor, and numeric data.

library(skimr)
skim(polls)

Data summary

Name	polls
Number of rows	4208
Number of columns	15
Column type frequency:
character	1
Date	2
factor	3
numeric	9
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
population	0	1	1	2	0	4	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
startdate	0	1	2015-11-06	2016-11-06	2016-09-23	352
enddate	0	1	2015-11-08	2016-11-07	2016-09-30	345

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1.0	FALSE	57	U.S: 1106, Flo: 148, Nor: 125, Pen: 125
pollster	0	1.0	FALSE	196	Ips: 919, Goo: 743, Sur: 660, You: 130
grade	429	0.9	FALSE	10	A-: 1085, B: 1011, C-: 693, C+: 329

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
samplesize	1	1.00	1148.22	2630.86	35.00	447.50	772.00	1236.50	84292.00	▇▁▁▁▁
rawpoll_clinton	0	1.00	41.99	7.73	11.04	38.00	43.00	46.20	88.00	▁▅▇▁▁
rawpoll_trump	0	1.00	39.83	7.88	4.00	35.00	40.00	45.00	68.00	▁▁▇▅▁
rawpoll_johnson	1409	0.67	7.38	2.96	0.00	5.40	7.00	9.00	25.00	▃▇▂▁▁
rawpoll_mcmullin	4178	0.01	24.00	5.70	9.00	22.50	25.00	27.90	31.00	▂▁▃▇▇
adjpoll_clinton	0	1.00	43.32	7.09	17.06	40.21	44.15	46.92	86.77	▁▇▆▁▁
adjpoll_trump	0	1.00	42.67	6.95	4.37	38.43	42.76	46.29	72.43	▁▁▇▃▁
adjpoll_johnson	1409	0.67	4.66	2.47	-3.67	3.15	4.38	5.76	20.37	▁▇▂▁▁
adjpoll_mcmullin	4178	0.01	24.51	5.24	11.03	23.11	25.14	27.98	31.57	▂▁▂▇▆

SmartEDA:
The SmartEDA package has an impressive range of capabilities. It provides variable summaries, density plots, the distribution of categorical variables, scatter plots of each variable against a target variable, chi-squared statistics and p-values for each variable as a predictor of a target variable, Q-Q plots for continuous variables, parallel coordinate plots, and more. The package vignette gives a full demonstration of the power of SmartEDA. Here are a few small examples of what it can do.

library(SmartEDA)
ExpData(data = polls, type = 2)

##    Index    Variable_Name Variable_Type Per_of_Missing No_of_distinct_values
## 1      1            state        factor        0.00000                    57
## 2      2        startdate          Date        0.00000                   352
## 3      3          enddate          Date        0.00000                   345
## 4      4         pollster        factor        0.00000                   196
## 5      5            grade        factor        0.10195                    11
## 6      6       samplesize       integer        0.00024                  1767
## 7      7       population     character        0.00000                     4
## 8      8  rawpoll_clinton       numeric        0.00000                  1312
## 9      9    rawpoll_trump       numeric        0.00000                  1385
## 10    10  rawpoll_johnson       numeric        0.33484                   585
## 11    11 rawpoll_mcmullin       numeric        0.99287                    17
## 12    12  adjpoll_clinton       numeric        0.00000                  4200
## 13    13    adjpoll_trump       numeric        0.00000                  4204
## 14    14  adjpoll_johnson       numeric        0.33484                  2211
## 15    15 adjpoll_mcmullin       numeric        0.99287                    31

ExpNumViz(polls, 
          target = NULL, 
          nlim = 10, 
          Page = c(2,2))[[1]]

GGally:
The ggpairs() function from the GGally package creates plots of each variable against all others. For categorical variables, ggpairs is limited by default to 15 categories. The variables state and pollster (columns 1 and 4) have more than 15 unique values, so we remove them from the polls data before plotting.

library('GGally')
ggpairs(polls[,-c(1,4)])

Hmisc:

library(Hmisc)
Hmisc::describe(polls)

## polls 
## 
##  15  Variables      4208  Observations
## --------------------------------------------------------------------------------
## state 
##        n  missing distinct 
##     4208        0       57 
## 
## lowest : Alabama       Alaska        Arizona       Arkansas      California   
## highest: Virginia      Washington    West Virginia Wisconsin     Wyoming      
## --------------------------------------------------------------------------------
## startdate 
##          n    missing   distinct       Info       Mean        Gmd        .05 
##       4208          0        352          1 2016-08-31      70.27 2016-03-14 
##        .10        .25        .50        .75        .90        .95 
## 2016-05-23 2016-08-10 2016-09-23 2016-10-20 2016-10-29 2016-11-01 
## 
## lowest : 2015-11-06 2015-11-07 2015-11-09 2015-11-10 2015-11-11
## highest: 2016-11-02 2016-11-03 2016-11-04 2016-11-05 2016-11-06
## --------------------------------------------------------------------------------
## enddate 
##          n    missing   distinct       Info       Mean        Gmd        .05 
##       4208          0        345          1 2016-09-06      70.47 2016-03-18 
##        .10        .25        .50        .75        .90        .95 
## 2016-05-26 2016-08-21 2016-09-30 2016-10-28 2016-11-03 2016-11-06 
## 
## lowest : 2015-11-08 2015-11-13 2015-11-15 2015-11-16 2015-11-17
## highest: 2016-11-03 2016-11-04 2016-11-05 2016-11-06 2016-11-07
## --------------------------------------------------------------------------------
## pollster 
##        n  missing distinct 
##     4208        0      196 
## 
## lowest : ABC News/Washington Post       American Research Group        American Strategies            Angus Reid Global              Anzalone Liszt Grove Research 
## highest: Winthrop University            Y2 Analytics                   YouGov                         Zia Poll                       Zogby Interactive/JZ Analytics
## --------------------------------------------------------------------------------
## grade 
##        n  missing distinct 
##     3779      429       10 
## 
## lowest : D  C- C  C+ B-, highest: B  B+ A- A  A+
##                                                                       
## Value          D    C-     C    C+    B-     B    B+    A-     A    A+
## Frequency     14   693    58   329   142  1011   204  1085   159    84
## Proportion 0.004 0.183 0.015 0.087 0.038 0.268 0.054 0.287 0.042 0.022
## --------------------------------------------------------------------------------
## samplesize 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4207        1     1766        1     1148     1142    137.0    244.0 
##      .25      .50      .75      .90      .95 
##    447.5    772.0   1236.5   1951.8   2562.4 
## 
## lowest :    35    37    39    42    43, highest: 32225 32226 40816 70194 84292
## --------------------------------------------------------------------------------
## population 
##        n  missing distinct 
##     4208        0        4 
##                                   
## Value          a    lv    rv     v
## Frequency     21  3727   418    42
## Proportion 0.005 0.886 0.099 0.010
## --------------------------------------------------------------------------------
## rawpoll_clinton 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4208        0     1312        1    41.99    8.237    28.08    31.50 
##      .25      .50      .75      .90      .95 
##    38.00    43.00    46.20    49.94    52.71 
## 
## lowest : 11.04 11.78 13.34 15.11 16.57, highest: 66.53 79.80 85.00 87.00 88.00
## --------------------------------------------------------------------------------
## rawpoll_trump 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4208        0     1385        1    39.83    8.722    27.00    30.00 
##      .25      .50      .75      .90      .95 
##    35.00    40.00    45.00    49.09    53.00 
## 
## lowest :  4.00  5.00  6.00  6.80  7.00, highest: 61.00 62.00 63.29 65.00 68.00
## --------------------------------------------------------------------------------
## rawpoll_johnson 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2799     1409      584    0.997    7.382     3.19     3.00     4.00 
##      .25      .50      .75      .90      .95 
##     5.40     7.00     9.00    11.00    12.95 
## 
## lowest :  0.00  0.98  1.00  1.32  1.49, highest: 21.00 22.00 23.00 24.00 25.00
## --------------------------------------------------------------------------------
## rawpoll_mcmullin 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       30     4178       16     0.99       24    5.943    10.35    17.40 
##      .25      .50      .75      .90      .95 
##    22.50    25.00    27.90    29.00    29.55 
## 
## lowest :  9.0 12.0 18.0 20.0 21.0, highest: 27.6 28.0 29.0 30.0 31.0
##                                                                             
## Value       9.00 12.00 18.00 20.00 21.00 22.00 24.00 24.52 25.00 26.00 27.00
## Frequency      2     1     1     2     1     1     3     1     4     4     1
## Proportion 0.067 0.033 0.033 0.067 0.033 0.033 0.100 0.033 0.133 0.133 0.033
##                                         
## Value      27.60 28.00 29.00 30.00 31.00
## Frequency      1     1     5     1     1
## Proportion 0.033 0.033 0.167 0.033 0.033
## --------------------------------------------------------------------------------
## adjpoll_clinton 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4208        0     4200        1    43.32    7.465    30.03    33.62 
##      .25      .50      .75      .90      .95 
##    40.21    44.15    46.92    50.43    53.38 
## 
## lowest : 17.06495 18.62685 19.41599 19.60153 19.61379
## highest: 85.77880 86.64585 86.70544 86.76118 86.77218
## --------------------------------------------------------------------------------
## adjpoll_trump 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4208        0     4204        1    42.67    7.524    32.12    34.68 
##      .25      .50      .75      .90      .95 
##    38.43    42.76    46.29    51.33    54.35 
## 
## lowest :  4.372936  4.622556  4.862200  4.879020  5.103538
## highest: 66.183890 67.167230 67.314490 67.607800 72.433030
## --------------------------------------------------------------------------------
## adjpoll_johnson 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2799     1409     2210        1     4.66    2.612    1.283    1.996 
##      .25      .50      .75      .90      .95 
##    3.145    4.384    5.756    7.648    8.988 
## 
## lowest : -3.667890 -3.011773 -2.928394 -1.361062 -1.168077
## highest: 16.342580 17.234500 17.988220 19.364130 20.366840
## --------------------------------------------------------------------------------
## adjpoll_mcmullin 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       30     4178       30        1    24.51    5.613    12.51    18.36 
##      .25      .50      .75      .90      .95 
##    23.11    25.14    27.98    29.65    30.06 
## 
## lowest : 11.02832 11.56920 13.65646 18.87894 20.74372
## highest: 29.47327 29.64230 29.67611 30.37186 31.57469
## --------------------------------------------------------------------------------

psych:

library(psych)
psych::describe(polls)

##                  vars    n    mean      sd median trimmed    mad   min      max
## state*              1 4208   34.47   17.03  39.00   35.84  16.31  1.00    57.00
## startdate           2 4208     NaN      NA     NA     NaN     NA   Inf     -Inf
## enddate             3 4208     NaN      NA     NA     NaN     NA   Inf     -Inf
## pollster*           4 4208  104.10   48.61  81.00  103.27  41.51  1.00   196.00
## grade*              5 3779    5.83    2.34   6.00    5.94   2.97  1.00    10.00
## samplesize          6 4207 1148.22 2630.86 772.00  837.95 548.56 35.00 84292.00
## population*         7 4208    2.11    0.36   2.00    2.01   0.00  1.00     4.00
## rawpoll_clinton     8 4208   41.99    7.73  43.00   42.29   5.93 11.04    88.00
## rawpoll_trump       9 4208   39.83    7.88  40.00   39.92   7.41  4.00    68.00
## rawpoll_johnson    10 2799    7.38    2.96   7.00    7.16   2.97  0.00    25.00
## rawpoll_mcmullin   11   30   24.00    5.70  25.00   25.00   4.45  9.00    31.00
## adjpoll_clinton    12 4208   43.32    7.09  44.15   43.55   4.79 17.06    86.77
## adjpoll_trump      13 4208   42.67    6.95  42.76   42.60   5.73  4.37    72.43
## adjpoll_johnson    14 2799    4.66    2.47   4.38    4.47   1.91 -3.67    20.37
## adjpoll_mcmullin   15   30   24.51    5.24  25.14   25.31   4.08 11.03    31.57
##                     range  skew kurtosis    se
## state*              56.00 -0.54    -1.18  0.26
## startdate            -Inf    NA       NA    NA
## enddate              -Inf    NA       NA    NA
## pollster*          195.00  0.24    -1.12  0.75
## grade*               9.00 -0.45    -0.96  0.04
## samplesize       84257.00 16.34   391.23 40.56
## population*          3.00  2.69     8.34  0.01
## rawpoll_clinton     76.96  0.06     3.41  0.12
## rawpoll_trump       64.00 -0.28     1.10  0.12
## rawpoll_johnson     25.00  1.03     2.43  0.06
## rawpoll_mcmullin    22.00 -1.35     1.14  1.04
## adjpoll_clinton     69.71  0.20     3.98  0.11
## adjpoll_trump       68.06 -0.20     2.52  0.11
## adjpoll_johnson     24.03  1.19     3.59  0.05
## adjpoll_mcmullin    20.55 -1.14     0.71  0.96
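The trimmed, mad, and se columns are less familiar than the mean and median. The sketch below recomputes them in base R on a small made-up vector (the values are hypothetical, chosen to mimic the one very large samplesize outlier). It assumes describe()'s defaults: a 10% trimmed mean and the scaled median absolute deviation returned by mad().

```r
# Hypothetical sample sizes: nine ordinary polls and one very large one.
x <- c(35, 300, 447, 600, 772, 900, 1237, 1500, 2000, 84292)

stats <- c(
  mean    = mean(x),                  # pulled far upward by the outlier
  median  = median(x),                # robust to the outlier
  trimmed = mean(x, trim = 0.1),      # drops the lowest and highest 10% first
  mad     = mad(x),                   # median absolute deviation, scaled by 1.4826
  se      = sd(x) / sqrt(length(x))   # standard error of the mean
)
print(round(stats, 2))
```

When the mean sits far from the median and the trimmed mean, as it does for samplesize in the table above, a few extreme values are likely driving it.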

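pairs.panels() draws a scatter plot matrix of the selected columns (here samplesize and the raw and adjusted Clinton and Trump polls), with histograms on the diagonal and correlations in the upper panels.

pairs.panels(polls[,c(6,8,9,12,13)])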
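Selecting columns by position is fragile if the column order ever changes. As a sketch, the same five columns can be selected by name (the names are taken from the describe() output above), and base R's cor() returns the correlations as numbers. The choice of use = "pairwise.complete.obs" is an assumption about how to handle the one missing samplesize value.

```r
library(dslabs)

polls <- polls_us_election_2016

# Columns 6, 8, 9, 12 and 13, selected by name instead of position.
poll_vars <- c("samplesize", "rawpoll_clinton", "rawpoll_trump",
               "adjpoll_clinton", "adjpoll_trump")

# Pearson correlations; each pair uses the rows where both values are present.
cors <- cor(polls[, poll_vars], use = "pairwise.complete.obs")
print(round(cors, 2))

# Base R scatter plot matrix of the same columns.
pairs(polls[, poll_vars])
```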

summarytools:
The summarytools package gives descriptive statistics for the numeric variables using descr(); factors and dates are left out of its output. The dfSummary() function summarizes any variable, and for a factor it lists the frequency of each category.

library(summarytools)
summarytools::descr(polls)

## Descriptive Statistics  
## polls  
## N: 4208  
## 
##                     adjpoll_clinton   adjpoll_johnson   adjpoll_mcmullin   adjpoll_trump
## ----------------- ----------------- ----------------- ------------------ ---------------
##              Mean             43.32              4.66              24.51           42.67
##           Std.Dev              7.09              2.47               5.24            6.95
##               Min             17.06             -3.67              11.03            4.37
##                Q1             40.21              3.15              22.81           38.43
##            Median             44.15              4.38              25.14           42.76
##                Q3             46.92              5.76              28.07           46.29
##               Max             86.77             20.37              31.57           72.43
##               MAD              4.79              1.91               4.08            5.73
##               IQR              6.71              2.61               4.87            7.86
##                CV              0.16              0.53               0.21            0.16
##          Skewness              0.20              1.19              -1.14           -0.20
##       SE.Skewness              0.04              0.05               0.43            0.04
##          Kurtosis              3.98              3.59               0.71            2.52
##           N.Valid           4208.00           2799.00              30.00         4208.00
##         Pct.Valid            100.00             66.52               0.71          100.00
## 
## Table: Table continues below
## 
##  
## 
##                     rawpoll_clinton   rawpoll_johnson   rawpoll_mcmullin   rawpoll_trump   samplesize
## ----------------- ----------------- ----------------- ------------------ --------------- ------------
##              Mean             41.99              7.38              24.00           39.83      1148.22
##           Std.Dev              7.73              2.96               5.70            7.88      2630.86
##               Min             11.04              0.00               9.00            4.00        35.00
##                Q1             38.00              5.40              22.00           35.00       447.00
##            Median             43.00              7.00              25.00           40.00       772.00
##                Q3             46.20              9.00              28.00           45.00      1237.00
##               Max             88.00             25.00              31.00           68.00     84292.00
##               MAD              5.93              2.97               4.45            7.41       548.56
##               IQR              8.20              3.60               5.40           10.00       789.00
##                CV              0.18              0.40               0.24            0.20         2.29
##          Skewness              0.06              1.03              -1.35           -0.28        16.34
##       SE.Skewness              0.04              0.05               0.43            0.04         0.04
##          Kurtosis              3.41              2.43               1.14            1.10       391.23
##           N.Valid           4208.00           2799.00              30.00         4208.00      4207.00
##         Pct.Valid            100.00             66.52               0.71          100.00        99.98
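Two of the less familiar rows, CV and Pct.Valid, can be reproduced from the other rows. The sketch below copies the reported values for samplesize and adjpoll_johnson from the output above; the variable names are illustrative and not part of summarytools.

```r
# Values copied from the descr() output for samplesize and adjpoll_johnson.
std_dev <- c(samplesize = 2630.86, adjpoll_johnson = 2.47)
avg     <- c(samplesize = 1148.22, adjpoll_johnson = 4.66)
n_valid <- c(samplesize = 4207,    adjpoll_johnson = 2799)
n_rows  <- 4208

# Coefficient of variation: standard deviation relative to the mean.
cv <- std_dev / avg

# Percentage of rows with a non-missing value.
pct_valid <- 100 * n_valid / n_rows

print(round(cv, 2))
print(round(pct_valid, 2))
```

Because the CV is unitless, it lets you compare spread across variables on very different scales: samplesize varies far more relative to its mean than any of the poll percentages.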

summarytools::dfSummary(polls$grade)

## Data Frame Summary  
## polls  
## Dimensions: 4208 x 1  
## Duplicates: 4197  
## 
## --------------------------------------------------------------------------------------
## No   Variable   Stats / Values   Freqs (% of Valid)   Graph      Valid      Missing   
## ---- ---------- ---------------- -------------------- ---------- ---------- ----------
## 1    grade      1. D               14 ( 0.4%)                    3779       429       
##      [factor]   2. C-             693 (18.3%)         III        (89.81%)   (10.19%)  
##                 3. C               58 ( 1.5%)                                         
##                 4. C+             329 ( 8.7%)         I                               
##                 5. B-             142 ( 3.8%)                                         
##                 6. B             1011 (26.8%)         IIIII                           
##                 7. B+             204 ( 5.4%)         I                               
##                 8. A-            1085 (28.7%)         IIIII                           
##                 9. A              159 ( 4.2%)                                         
##                 10. A+             84 ( 2.2%)                                         
## --------------------------------------------------------------------------------------
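For a quick check without an extra package, a similar frequency table can be sketched in base R. This is an alternative, not what dfSummary() does internally.

```r
library(dslabs)

grade <- polls_us_election_2016$grade

# Counts per grade level; useNA = "ifany" adds an entry for missing grades.
counts <- table(grade, useNA = "ifany")

# Percentages of valid (non-missing) values, as in the dfSummary() output.
valid_counts <- table(grade)
pct_valid <- round(100 * prop.table(valid_counts), 1)

print(counts)
print(pct_valid)
```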

Two other packages to check out are RtutoR and DataExplorer; see this article.

Conclusion

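EDA is a necessary first step in any data project. Understanding the structure of your data, the patterns in individual variables, and the covariation among them can surprise you and lead to new hypotheses.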


Anika Staccone