# Descriptive Statistics

---

One of the first things a data analyst does after reading in a dataset is to get a feel for the data.  A data analyst should check the size of the data, the variables and their names, and get summary statistics.  These help the analyst understand the data and ask the data generator any clarifying questions.  The analyst may also want to investigate or take corrective measures if anomalies turn up in the data.  Finally, descriptive statistics can help set the analytical strategy.  For example, if a binary variable (e.g. sex) falls predominantly in one category, it is of limited use in the data analysis.

## Checking data size

Checking data size can reveal whether the data has been loaded correctly.  For example, if you were expecting 100 data points, and instead got 53, then you may want to find out if the data file was corrupted or there was a mistake in reading the data into R.  Let us see this via an example by reading in the Agren Arabidopsis dataset.

In [1]:
# Filename in string
agrenURL <- "https://raw.githubusercontent.com/sens/smalldata/master/arabidopsis/agren2013.csv"
agren <- read.csv(agrenURL)
#Print top of data frame
round(head(agren),2)

Unnamed: 0_level_0,it09,it10,it11,sw09,sw10,sw11,id,flc
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,19.76,24.18,15.68,5.66,21.48,4.48,1,1
2,6.29,1.77,3.24,9.54,22.12,7.77,2,2
3,12.03,12.46,10.6,11.8,23.05,14.87,3,2
4,20.13,14.12,12.9,7.44,22.58,8.19,6,1
5,15.13,13.29,15.06,6.9,22.14,8.49,7,1
6,19.21,13.95,12.99,8.4,25.56,8.08,8,1


To see the dimensions of this data frame (spreadsheet type data) we can use `dim`.  It returns the number of rows and columns in a data frame.

In [2]:
dim(agren)

We can also get the number of rows and the number of columns separately using `nrow` and `ncol`.

In [3]:
nrow(agren)

In [4]:
ncol(agren)

To get the variable names of a data frame (or the "names" attribute of any object) use `names`.  The variable names are explained in the README accompanying the data.

In [5]:
names(agren)

To get the size of a data vector, use `length`.  For example, each data column is a vector; so we can use the following.  The dollar sign is used to get a specific variable in a data frame.

In [6]:
length(agren$it09)
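
One way to act on these size checks is to compare what was loaded against what was expected.  The following is a minimal sketch, assuming a small made-up data frame `toy` (not the Agren data) and hypothetical expected counts such as might come from a README; `stopifnot()` raises an error if any check fails.

```r
# Hypothetical toy data frame standing in for a freshly loaded dataset
toy <- data.frame(it09 = c(19.8, 6.3, 12.0),
                  sw09 = c(5.7, 9.5, 11.8),
                  id   = c(1L, 2L, 3L))

# Assumed expectations, e.g. taken from the documentation of the data
expectedRows <- 3
expectedCols <- c("it09", "sw09", "id")

# Every condition must be TRUE, otherwise an error is raised
stopifnot(nrow(toy) == expectedRows,
          identical(names(toy), expectedCols))

# dim() carries the same information as nrow() and ncol() together
sizeMatches <- identical(dim(toy), c(nrow(toy), ncol(toy)))
sizeMatches
```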

## Getting type

The next thing one will want to do is to get the types of the data fields.  You can get the type of a variable using `typeof`; you can apply the function to all data columns using `lapply` as follows.

In [7]:
lapply(agren,typeof)

This looks a bit ugly, so you may want to prettify it.

In [8]:
as.matrix(sapply(agren,typeof))

0,1
it09,double
it10,double
it11,double
sw09,double
sw10,double
sw11,double
id,integer
flc,integer
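
Note that `typeof` reports how R stores a value, which is not always how R treats it.  The sketch below, using a made-up data frame `toy` (an assumption for illustration), puts `typeof` next to `class`; a factor column, for instance, is stored as integer codes.

```r
# Hypothetical toy data frame with a factor and a character column
toy <- data.frame(height = c(1.5, 2.0, 1.8),
                  count  = c(3L, 5L, 2L),
                  site   = factor(c("it", "sw", "it")),
                  label  = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Storage type (typeof) and class, side by side for every column
typeInfo <- data.frame(type  = sapply(toy, typeof),
                       class = sapply(toy, class))
typeInfo
```

Here the `site` column has type `integer` but class `factor`, so checking both can avoid surprises when a categorical variable is summarized as if it were numeric.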


## Summary function

After this, one might want a closer look at the actual numbers.
The `summary()` function is a handy multipurpose function in R that
summarizes a dataset. Its output depends on the type of object passed
as an argument, and it usually reports what is needed in terms of
summary statistics, which makes it a good starting point for deciding
how to analyze a data set.  For a data frame, `summary()` returns six
values for each numeric variable: the minimum, first quartile (25th
percentile), median, mean, third quartile (75th percentile), and
maximum.  It also reports the number of missing values, if any.

For example, let us get the summary statistics with the function `summary()`.

In the case of a data frame, the function `summary()` is automatically applied to each variable (*i.e.*, column).

In [9]:
summary(agren)

      it09             it10             it11             sw09       
 Min.   : 6.288   Min.   : 1.774   Min.   : 3.239   Min.   : 5.664  
 1st Qu.:10.365   1st Qu.: 5.888   1st Qu.: 7.177   1st Qu.: 9.990  
 Median :12.058   Median : 7.540   Median : 8.524   Median :10.832  
 Mean   :12.211   Mean   : 8.083   Mean   : 8.921   Mean   :10.850  
 3rd Qu.:13.865   3rd Qu.: 9.462   3rd Qu.:10.370   3rd Qu.:11.702  
 Max.   :20.592   Max.   :24.176   Max.   :19.311   Max.   :15.001  
 NA's   :6        NA's   :2                         NA's   :6       
      sw10            sw11              id             flc       
 Min.   :19.23   Min.   : 4.478   Min.   :  1.0   Min.   :1.000  
 1st Qu.:22.27   1st Qu.:10.303   1st Qu.:162.2   1st Qu.:1.000  
 Median :23.09   Median :12.810   Median :313.5   Median :2.000  
 Mean   :23.37   Mean   :13.493   Mean   :325.6   Mean   :1.535  
 3rd Qu.:24.28   3rd Qu.:15.742   3rd Qu.:486.8   3rd Qu.:2.000  
 Max.   :34.13   Max.   :32.189   Max.   :700.0   Ma

We can calculate descriptive statistics for a specific variable (vector) as follows.

In [10]:
summary(agren$it09)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  6.288  10.365  12.058  12.211  13.865  20.592       6 
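
The result of `summary()` on a numeric vector is a named vector, so individual statistics can be pulled out by name.  A minimal sketch, using a made-up vector `x` rather than the Agren data:

```r
# Hypothetical vector with one missing value
x <- c(12, 7, NA, 17, 30, 23)

s <- summary(x)

# Names of the reported statistics
names(s)

# Extract single entries by name
s[["Median"]]
s[["NA's"]]
```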

Notice that the summary function returns the number of `NA's`, or missing values.  This is important because missing data can be informative.  To check whether there are any missing values in a dataset, we use the `anyNA` function, which just tells us if there is _any_ missing data at all.

In [11]:
anyNA(agren)
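
If `anyNA` returns `TRUE`, a natural follow-up is to ask which variables are affected.  The sketch below is one way to do this, assuming a made-up data frame `toy`; it relies on `is.na()`, which marks each missing entry with `TRUE`, and on `colSums()`, which adds up those marks column by column.

```r
# Hypothetical toy data frame with some missing values
toy <- data.frame(it09 = c(19.8, NA, 12.0, NA),
                  sw09 = c(5.7, 9.5, NA, 7.4),
                  id   = 1:4)

# Number of missing values in each column
naPerColumn <- colSums(is.na(toy))
naPerColumn

# Proportion of missing values in each column
naShare <- colMeans(is.na(toy))
naShare
```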

## Basic descriptive statistics functions

R offers a number of functions to get descriptive statistics. The following table gives a list of several descriptive statistics commonly used.

|   Function name    |  Description                                                                                              |
|:-------------------|:----------------------------------------------------------------------------------------------------------|
|     `mean(x)`      |  Returns the average of the values in x                                                                    |
|     `sd(x)`        |  Returns the standard deviation of the values in x                                                        |
|     `var(x)`       |  Returns the variance of the values in x                                                                  |
|     `median(x)`    |  Returns the median of the values in x                                                                    |
|     `min(x)`       |  Returns the minimum value in x                                                                            |
|     `max(x)`       |  Returns the maximum value in x                                                                            |
|     `sum(x)`       |  Returns the total sum of all values in x                                                                  |
|     `quantile(x)`  |  Returns by default the minimum, the three quartiles (the 0.25, 0.50 and 0.75 quantiles) and the maximum of x |
|     `IQR(x)`       |  Returns the interquartile range, the difference between the upper and lower quartiles of x                 |
|     `range(x)`     |  Returns the minimum and the maximum values in x                                                           |
|     `table(x)`     |  Returns a frequency table, representing the number of occurrences of every unique value in x               |


Many functions in R have a `na.rm` option. If we set this option to `TRUE`, observations with **NA** values are excluded from the calculation. If it is `FALSE`, which is the default for functions such as `mean()` and `quantile()`, the function returns **NA** or an error if there are any missing values in the data.

In [12]:
mean(agren$it09, na.rm = FALSE)

In [13]:
mean(agren$it09, na.rm = TRUE)

In [14]:
quantile(agren$it09, na.rm = TRUE)
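
`quantile()` is not limited to quartiles: its `probs` argument accepts any probabilities between 0 and 1.  The sketch below uses a made-up vector `x` (an assumption for illustration) and also checks that `IQR()` matches the difference between the third and first quartiles.

```r
# Hypothetical vector with one missing value
x <- c(12, 7, NA, 17, 30, 23)

# Selected quantiles other than the default quartiles
tails <- quantile(x, probs = c(0.1, 0.5, 0.9), na.rm = TRUE)
tails

# The interquartile range computed by hand and with IQR()
quartiles <- quantile(x, na.rm = TRUE)
iqrByHand <- unname(quartiles["75%"] - quartiles["25%"])
iqrBuiltIn <- IQR(x, na.rm = TRUE)
all.equal(iqrByHand, iqrBuiltIn)
```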

The `table()` function in R is especially useful for cross-tabulation, which summarizes categorical data.

Let us load the English Premier League dataset (2015-2016 season), which contains categorical variables such as FTR (Full Time Result) and HTR (Half Time Result) whose values belong to the set {H=Home Win, D=Draw, A=Away Win}.

In [15]:
## URL of the file
eplURL <- "https://raw.githubusercontent.com/sens/smalldata/master/soccer/E0.csv"
## read in data frame
epl <- read.csv(eplURL)
## examine head
head(epl)

Unnamed: 0_level_0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,⋯,BbAv.2.5.1,BbAH,BbAHh,BbMxAHH,BbAvAHH,BbMxAHA,BbAvAHA,PSCH,PSCD,PSCA
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<int>,<int>,<chr>,⋯,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,E0,08/08/15,Bournemouth,Aston Villa,0,1,A,0,0,D,⋯,1.79,26,-0.5,1.98,1.93,1.99,1.92,1.82,3.88,4.7
2,E0,08/08/15,Chelsea,Swansea,2,2,D,2,1,H,⋯,1.99,27,-1.5,2.24,2.16,1.8,1.73,1.37,5.04,10.88
3,E0,08/08/15,Everton,Watford,2,2,D,0,1,A,⋯,1.96,26,-1.0,2.28,2.18,1.76,1.71,1.75,3.76,5.44
4,E0,08/08/15,Leicester,Sunderland,4,2,H,3,0,H,⋯,1.67,26,-0.5,2.0,1.95,1.96,1.9,1.79,3.74,5.1
5,E0,08/08/15,Man United,Tottenham,1,0,H,1,0,H,⋯,2.01,26,-1.0,2.2,2.09,1.82,1.78,1.64,4.07,6.04
6,E0,08/08/15,Norwich,Crystal Palace,1,3,A,0,1,A,⋯,1.67,27,0.0,1.83,1.78,2.17,2.08,2.46,3.39,3.14


In [16]:
# Get the number of occurrences of each value
table(epl$FTR)


  A   D   H 
116 107 157 

The `table()` function is also helpful in generating a 2-way cross table.

In [17]:
table(epl$FTR, epl$HTR, dnn = c("FTR", "HTR"))

   HTR
FTR  A  D  H
  A 64 43  9
  D 15 71 21
  H  8 54 95
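
Counts are often easier to read with totals or as proportions.  The following sketch assumes two short made-up result vectors, `ftr` and `htr`, in place of the full dataset; `addmargins()` appends row and column totals, and `prop.table()` with `margin = 1` converts each row to proportions.

```r
# Hypothetical full-time and half-time results for eight matches
ftr <- c("H", "H", "D", "A", "H", "A", "D", "H")
htr <- c("H", "D", "D", "A", "H", "D", "H", "H")

tab <- table(ftr, htr, dnn = c("FTR", "HTR"))

# Add row and column totals
withTotals <- addmargins(tab)
withTotals

# Proportion of each half-time result within each full-time result
rowProps <- prop.table(tab, margin = 1)
round(rowProps, 2)
```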

## Manipulating datasets

After performing basic checks you may want to manipulate the data based on certain criteria.  For example, you may want to remove observations with impossible values.  Sometimes, it is convenient to restrict the data to individuals with complete data (no missing data).  These manipulations should be done with due care and diligence using domain-specific knowledge and input from experts.  To understand how this can be done, we need to look closer into two concepts -- logical operations and missing data representation.

## Logical (Boolean)

Logical operations are common in data analysis and programming.  For example, we may include or exclude data points based on certain criteria.  To best perform these tasks, it is helpful to consider the logical or Boolean data type, which takes one of two possible values: `TRUE` and `FALSE`. In R, we can also write `T` for `TRUE` and `F` for `FALSE`, although `T` and `F` are ordinary variables that can be reassigned, so the full names are safer in scripts.

In [18]:
typeof(TRUE)

In [19]:
typeof(T)

It is important to realize that `TRUE` and `FALSE` are not strings.

In [20]:
typeof(FALSE)
typeof("FALSE")

A logical value is generally created by a boolean expression such as a comparison between variables, which returns a boolean value. The next table describes typical comparison operators.

|  Operator   | Operation                                                                                                                       |
|:------------|:--------------------------------------------------------------------------------------------------------------------------------|
|      ==     | returns `TRUE` if the value on the left is equal to the value on the right, otherwise it returns `FALSE`.                      |
|      !=     | returns `TRUE` if the value on the left is different from the value on the right, otherwise it returns `FALSE`.              |
|      >      | returns `TRUE` if the value on the left is greater than the value on the right, otherwise it returns `FALSE`.              |
|      >=     | returns `TRUE` if the value on the left is greater than or equal to the value on the right, otherwise it returns `FALSE`. |
|      <      | returns `TRUE` if the value on the left is less than the value on the right, otherwise it returns `FALSE`.                      |
|      <=     | returns `TRUE` if the value on the left is less than or equal to the value on the right, otherwise it returns `FALSE`.           |



In [21]:
22 < 24

Logical operators combine multiple boolean expressions into a single expression that returns a single logical value. The following table shows the standard logical operators.

|  Operator   | Operation                                                                                                  |
|:------------|:-----------------------------------------------------------------------------------------------------------|
|      !      | NOT returns `TRUE` if the value is `FALSE`, and returns `FALSE` if the value is `TRUE`                      |
|      &&     | AND returns `TRUE` if and only if the expressions on both sides are `TRUE`, otherwise it returns `FALSE`              |
|      &      | applies AND operation element-wise                                                                          |
|      \|\|     | OR returns `TRUE` if the expression on at least one side is `TRUE`, otherwise it returns `FALSE`         |
|      \|      | applies OR operation element-wise                                                                          |


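The operators `&&` and `||` are meant for single logical values.  In older versions of R, when `&&` was given two vectors, only the first element of each vector was compared; since R 4.3.0, passing a vector of length greater than one to `&&` or `||` is an error.  Use `&` and `|` for element-wise operations.

In [22]:
A <- c(T, F, T, T)
B <- c(T, T, T, T)

In [23]:
# Compare only the first elements; A && B is an error in R >= 4.3.0
A[1] && B[1]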

In [24]:
# AND acts as "Multiplication" where TRUE is 1 and FALSE is 0
A
B
A&B

In [25]:
A==B

In [26]:
all(A==B)

In [27]:
# Compare A and !A
A
!A

In [28]:
# OR acts as "Addition" where TRUE is 1 and FALSE is 0
A
B
A|B
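
Because `TRUE` counts as 1 and `FALSE` as 0, logical vectors can be summarized with ordinary arithmetic functions.  A short sketch using the same vectors `A` and `B` as above:

```r
A <- c(TRUE, FALSE, TRUE, TRUE)
B <- c(TRUE, TRUE, TRUE, TRUE)

# Number and proportion of TRUE values
nTrue <- sum(A)
shareTrue <- mean(A)

# Is at least one element of A FALSE?  Are all elements of A | B TRUE?
anyFalse <- any(!A)
allEither <- all(A | B)

# Exclusive OR: TRUE where exactly one of the two is TRUE
exclusive <- xor(A, B)

nTrue
shareTrue
exclusive
```

`any()` and `all()` reduce a logical vector to a single value, which makes them suitable for use with `&&` and `||`.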

Logical expressions are often necessary for subsetting data. For example, in the Arabidopsis dataset, let us subset the plants whose average number of seeds in Italy in 2009 is greater than 19.

In [29]:
# Subset the plants whose average number of seeds
# is greater than 19 in Italy in 2009.
agrenSub <- agren$it09[agren$it09>19]
round(head(agrenSub), 2)

In [30]:
# Subset the plants whose average number of seeds
# is greater than 19 in Italy in 2009, for all variables.
agrenSub <- agren[agren$it09>19,]
round(head(agrenSub), 2)

Unnamed: 0_level_0,it09,it10,it11,sw09,sw10,sw11,id,flc
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1.0,19.76,24.18,15.68,5.66,21.48,4.48,1.0,1.0
4.0,20.13,14.12,12.9,7.44,22.58,8.19,6.0,1.0
6.0,19.21,13.95,12.99,8.4,25.56,8.08,8.0,1.0
9.0,20.59,14.87,11.18,6.52,21.51,8.04,14.0,1.0
,,,,,,,,
125.0,19.31,8.91,11.54,11.74,24.12,8.58,197.0,2.0


We will see in the missing values section how logical values can be used to handle missing data.
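
The empty row in the output above is worth a closer look: where `it09` is missing, the comparison `agren$it09>19` is itself **NA**, and indexing a data frame with **NA** produces a row of missing values.  The sketch below reproduces this on a made-up data frame `toy` (an assumption for illustration) and shows two common ways to avoid it.

```r
# Hypothetical toy data frame with a missing value in it09
toy <- data.frame(it09 = c(19.8, NA, 12.0, 20.1),
                  id   = 1:4)

# The comparison is NA wherever it09 is NA
keep <- toy$it09 > 19
keep

# Indexing with NA yields an all-NA row
withNA <- toy[keep, ]
nrow(withNA)

# which() keeps only the positions that are TRUE, dropping NA
viaWhich <- toy[which(keep), ]

# Alternatively, require the value to be present explicitly
viaIsNA <- toy[!is.na(toy$it09) & toy$it09 > 19, ]
viaWhich
```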

## Missing Data

One of the most common challenges in data analysis is how to handle missing values. In R, the symbol **NA** (Not Available) represents missing values. We will present various functions and options associated with **NA** for dealing with missing data in the following sections:
* [How do we test for missing values?](#test-NA)
* [How to exclude missing values?](#exclude-NA)

Before going further, it is important to understand what the **NA** symbol is and what results from operations on missing data.
**NA** is a reserved word, defined in R as a logical constant of length 1. It can arise from importing data with missing values, and it can be coerced to any other vector type except raw.
Most expressions that include **NA** also evaluate to **NA**, with a few exceptions.

In [31]:
1 + NA

In [32]:
NA^0

Here, the result is 1 because any number raised to the power zero equals 1, whatever value the **NA** stands for.

In [33]:
FALSE || NA

In [34]:
TRUE || NA

In this case, **NA** is a missing logical value, so it can only stand for `TRUE` or `FALSE`.  Whichever of the two it stands for, `TRUE || NA` evaluates to `TRUE`, so R returns `TRUE`.  By contrast, the result of `FALSE || NA` depends on the missing value, so R returns **NA**.
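
The same reasoning applies to the element-wise operators and to comparisons.  The following sketch collects a few cases; in particular, comparing with `==` against **NA** never returns `TRUE`, which is why a dedicated function, `is.na()`, is needed to detect missing values.

```r
# AND with FALSE is FALSE whatever the missing value is
andFalse <- NA & FALSE

# AND with TRUE depends on the missing value, so the result is NA
andTrue <- NA & TRUE

# Two missing values cannot be said to be equal
naEqual <- NA == NA

# Testing for NA with == does not work; is.na() does
x <- c(1, NA, 3)
viaEquals <- x == NA
viaIsNA <- is.na(x)

# NA takes on the type of the vector it belongs to
naTypes <- c(typeof(NA), typeof(NA_real_), typeof(NA_character_))

andFalse
viaEquals
viaIsNA
naTypes
```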

### How do we test for missing values? <a id='test-NA'></a>

We can use the two functions `anyNA()` and `is.na()` to test for missing values.
The function `anyNA()` returns `TRUE` if a data set contains at least one missing value, but it does not indicate where the missing values are located.

For example purposes, we will create a data frame with missing values.

In [35]:
# Create a data frame containing NA values
df <- data.frame("ID" = c("002", "004", "006", "007", "008"),
                 "Name" = c("Bill Fairbanks", NA, "Alex Trevelyan", "James Bond", NA),
                 stringsAsFactors = FALSE)
df

ID,Name
<chr>,<chr>
2,Bill Fairbanks
4,
6,Alex Trevelyan
7,James Bond
8,


In [36]:
anyNA(df)

If we are interested in identifying the positions of the missing values, we can use `is.na()`. The function `is.na()` returns `TRUE` for the elements of a vector (or data frame) that are **NA**, and `FALSE` otherwise.

In [37]:
is.na(df)

ID,Name
False,False
False,True
False,False
False,False
False,True


To get the total number of **NA** we can use the function `sum()`, which treats each `TRUE` as 1 and so counts the number of `TRUE` values in a data frame (or a vector).

In [38]:
# Get the total number of NA in a data frame
sum(is.na(df))

To obtain the exact positions of the **NA** elements, we use the function `which()` that returns the locations of all `TRUE` in a data frame (or a vector).

In [39]:
# Get the positions where the elements are NA
which(is.na(df))

In [40]:
# Get the coordinates of NA elements in a dataframe
which(is.na(df), arr.ind = TRUE)

row,col
2,2
5,2


Another useful function for testing missing values is `complete.cases()`. This function returns `FALSE` if a row contains **NA**, and `TRUE` otherwise.

In [41]:
# Identify the rows which are complete cases (i.e. with no NA)
complete.cases(df)
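
Completeness depends on which variables are considered.  When an analysis uses only some of the columns, `complete.cases()` can be applied to just those columns so that fewer rows are discarded.  A minimal sketch, assuming a made-up data frame `toy`:

```r
# Hypothetical toy data frame with missing values in different columns
toy <- data.frame(it09 = c(19.8, NA, 12.0, 20.1),
                  sw09 = c(5.7, 9.5, NA, 7.4),
                  flc  = c(1L, 2L, 2L, NA))

# Rows complete across all three variables
allVars <- complete.cases(toy)

# Rows complete for the two variables of interest only
twoVars <- complete.cases(toy[, c("it09", "sw09")])

# Number of usable rows under each definition
sum(allVars)
sum(twoVars)
```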

### How to exclude missing values? <a id='exclude-NA'></a>

There are different ways to exclude missing values. First, we can use the `na.rm = TRUE` argument to leave missing values out of an operation.

In [42]:
# Create a vector containing NA values
vec <- c(12, 7, NA, 17, 30, 23)
vec

In [43]:
# Get the sum of the vector vec without na.rm argument
sum(vec)

In [44]:
# Get the sum of the vector vec with na.rm argument
sum(vec, na.rm = TRUE)

We can also remove missing values by creating a subset of our data with only complete observations. For example, we can use the function `complete.cases()` to obtain only the full rows of a data frame.

In [45]:
# Original data frame
df
# Subset the rows
df[complete.cases(df),]

ID,Name
<chr>,<chr>
2,Bill Fairbanks
4,
6,Alex Trevelyan
7,James Bond
8,


Unnamed: 0_level_0,ID,Name
Unnamed: 0_level_1,<chr>,<chr>
1,2,Bill Fairbanks
3,6,Alex Trevelyan
4,7,James Bond


Finally, another simple way to deal with missing values is to eliminate all rows containing **NA** with the function `na.omit()` or `na.exclude()`.

In [46]:
# Original data frame
df
# Subset the rows
na.omit(df)

ID,Name
<chr>,<chr>
2,Bill Fairbanks
4,
6,Alex Trevelyan
7,James Bond
8,


Unnamed: 0_level_0,ID,Name
Unnamed: 0_level_1,<chr>,<chr>
1,2,Bill Fairbanks
3,6,Alex Trevelyan
4,7,James Bond


In [47]:
na.exclude(df)

Unnamed: 0_level_0,ID,Name
Unnamed: 0_level_1,<chr>,<chr>
1,2,Bill Fairbanks
3,6,Alex Trevelyan
4,7,James Bond
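
On a data frame, `na.omit()` and `na.exclude()` return the same rows.  They differ when passed as the `na.action` of a modeling function: with `na.exclude`, residuals and fitted values are padded with **NA** so that they line up with the original rows.  The sketch below illustrates this with `lm()` on a made-up data frame `toy`; the data and the model are assumptions for illustration only.

```r
# Hypothetical toy data with two missing responses
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2.1, NA, 6.2, 7.9, NA, 12.1))

fitOmit <- lm(y ~ x, data = toy, na.action = na.omit)
fitExclude <- lm(y ~ x, data = toy, na.action = na.exclude)

# na.omit: one residual per row actually used in the fit
resOmit <- residuals(fitOmit)
length(resOmit)

# na.exclude: one residual per original row, NA where y was missing
resExclude <- residuals(fitExclude)
length(resExclude)
resExclude
```

The padded version is convenient when residuals are to be added back to the original data frame as a new column.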


It is always essential to deal with missing values since most functions will return **NA** if they are not excluded.  The existence and pattern of missing data can also reveal information about the data generating process, and possible sources of bias.