# Datasets

In this notebook we list and read in a variety of datasets.

In [1]:
library(tidyverse)
library(readxl)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.1
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## English Premier League results

In [2]:
## URL of the file
eplURL <- "https://raw.githubusercontent.com/sens/smalldata/master/soccer/E0.csv"
## read in data frame
epl <- read.csv(eplURL)
## examine head
head(epl)

,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,⋯,BbAv.2.5.1,BbAH,BbAHh,BbMxAHH,BbAvAHH,BbMxAHA,BbAvAHA,PSCH,PSCD,PSCA
,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<int>,<int>,<chr>,⋯,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,E0,08/08/15,Bournemouth,Aston Villa,0,1,A,0,0,D,⋯,1.79,26,-0.5,1.98,1.93,1.99,1.92,1.82,3.88,4.7
2,E0,08/08/15,Chelsea,Swansea,2,2,D,2,1,H,⋯,1.99,27,-1.5,2.24,2.16,1.8,1.73,1.37,5.04,10.88
3,E0,08/08/15,Everton,Watford,2,2,D,0,1,A,⋯,1.96,26,-1.0,2.28,2.18,1.76,1.71,1.75,3.76,5.44
4,E0,08/08/15,Leicester,Sunderland,4,2,H,3,0,H,⋯,1.67,26,-0.5,2.0,1.95,1.96,1.9,1.79,3.74,5.1
5,E0,08/08/15,Man United,Tottenham,1,0,H,1,0,H,⋯,2.01,26,-1.0,2.2,2.09,1.82,1.78,1.64,4.07,6.04
6,E0,08/08/15,Norwich,Crystal Palace,1,3,A,0,1,A,⋯,1.67,27,0.0,1.83,1.78,2.17,2.08,2.46,3.39,3.14


## Flowering time

We will use data from [Burghardt et al. 2015](https://nph.onlinelibrary.wiley.com/doi/epdf/10.1111/nph.13799), who studied the effect of temperature fluctuations on flowering time in _Arabidopsis thaliana_.  We will use a small subset of the data for the purposes of this notebook.  They performed [three experiments](https://datadryad.org/resource/doi:10.5061/dryad.65d76); we will use data from the first experiment, in which a number of flowering time mutants were studied.

In [3]:
vernURL <- "https://datadryad.org/stash/downloads/file_stream/23001"
tmpfile <- tempfile(fileext=".txt")
download.file(vernURL,tmpfile)
vern <- read.table(tmpfile)
head(vern)

,Genotype,Background,Background.simple,Treatment,Treatment.V,Chamber.ID,Chamber.Irradiance,Vernalization,Daylength,Temperature,⋯,Survival.Bolt,Bolt,Days.to.Bolt,Days.to.Flower,Rosette.leaf.num,Cauline.leaf.num,Blade.length.mm,Total.leaf.length.mm,Blade.ratio,Notes
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,⋯,<chr>,<chr>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<chr>
1,agl24-1,Col,Col,12ConLD,12ConLDNV,7B,Low,NV,16,12,⋯,Y,Y,35,48,18,7,15.6,29.7,0.5252525,
2,agl24-1,Col,Col,12ConLD,12ConLDNV,7B,Low,NV,16,12,⋯,Y,Y,35,48,17,5,16.4,32.9,0.4984802,
3,agl24-1,Col,Col,12ConLD,12ConLDNV,4T,High,NV,16,12,⋯,Y,Y,36,48,22,6,10.6,20.4,0.5196078,
4,agl24-1,Col,Col,12ConLD,12ConLDNV,4T,High,NV,16,12,⋯,Y,Y,36,47,18,5,15.9,25.6,0.6210938,
5,agl24-1,Col,Col,12ConLD,12ConLDNV,4T,High,NV,16,12,⋯,Y,Y,38,50,24,8,14.4,26.4,0.5454545,
6,agl24-1,Col,Col,12ConLD,12ConLDNV,7B,Low,NV,16,12,⋯,Y,Y,39,53,20,7,15.9,29.7,0.5353535,
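
The download-then-read steps in the cell above could be wrapped in a small helper. The sketch below is only an illustration: `read_remote` is a hypothetical name, not a function from any package, and the choice to always pass `mode = "wb"` is an assumption made so that binary files also survive on Windows. The demonstration uses a local `file://` URL on a Unix-like system so that it runs without network access.

```r
## download a remote file to a temporary location, then read it
## (sketch; read_remote is a hypothetical helper, not part of any package)
read_remote <- function(url, reader = read.csv, ext = ".csv", ...) {
  tmp <- tempfile(fileext = ext)
  ## mode = "wb" keeps binary files intact on Windows and is harmless elsewhere
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  reader(tmp, ...)
}

## self-contained demonstration: write a tiny CSV and fetch it via file://
src <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), src, row.names = FALSE)
demo <- read_remote(paste0("file://", normalizePath(src, winslash = "/")))
demo
```

With this helper, a call along the lines of `read_remote(vernURL, reader = read.table, ext = ".txt")` should reproduce the steps of the cell above.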


## Tree ring dataset

We will use data from [Bigler et al. 2016](https://doi.org/10.1890/15-1402.1), who collected [tree ring data](https://doi.org/10.5061/dryad.1bv6n) from three datasets.  The main data file has the width of tree rings for each tree in each year; a separate metadata file has information about each tree (age, whether it is alive, species, and location).  Three long-lived tree species were studied: _Abies alba_ (silver fir), _Nothofagus dombeyi_ (coihue), and _Quercus petraea_ (sessile oak).  This is an example of a longitudinal observational study.

In [4]:
## two URLs
ringURL <- "https://datadryad.org/stash/downloads/file_stream/19701"
ringmetaURL <- "https://datadryad.org/stash/downloads/file_stream/19702"
## read in data
tmpfile <- tempfile(fileext=".txt")
download.file(ringURL,tmpfile)
ring <- read.delim(tmpfile)
download.file(ringmetaURL,tmpfile)
ringmeta <- read.delim(tmpfile)
head(ring)
head(ringmeta)

,Year,BIS84003,BIS84004,BIS84006,BIS84007,BIS84009,BIS84010,BIS84016,BIS84019,BIS84021,⋯,QP_K6,QP_K7,QP_K8,QP_K9,QP_K10,QP_K11,QP_K12,QP_K13,QP_K14,QP_K15
,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1793,,,,,,,,,,⋯,,,,,,,,,,
2,1794,,,,,,,,,,⋯,,,,,,,,,,
3,1795,,,,,,,,,,⋯,,,,,,,,,,
4,1796,,,,,,,,,,⋯,,,,,,,,,,
5,1797,,,,,,,,,,⋯,,,,,,,,,,
6,1798,,,,,,,,,,⋯,,,,,,,,,,


,tree,site,species,article,contact,status,age,DBH
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<dbl>
1,BIS84003,Bistra,Abies_alba,Bigler_et_al._2004,Cufar_K,LIVING,154,46.932
2,BIS84004,Bistra,Abies_alba,Bigler_et_al._2004,Cufar_K,LIVING,113,36.994
3,BIS84006,Bistra,Abies_alba,Bigler_et_al._2004,Cufar_K,LIVING,129,45.332
4,BIS84007,Bistra,Abies_alba,Bigler_et_al._2004,Cufar_K,LIVING,158,52.522
5,BIS84009,Bistra,Abies_alba,Bigler_et_al._2004,Cufar_K,LIVING,148,39.54
6,BIS84010,Bistra,Abies_alba,Bigler_et_al._2004,Cufar_K,LIVING,147,41.034
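
The two tables are linked by tree identifier: each column of the ring-width table (after `Year`) corresponds to one row of the metadata. A minimal sketch of how they might be combined is given below. It assumes that the first column is `Year`, that every other column is a tree, and that the metadata has a `tree` column, as the output above suggests. The function name `ring_long`, the column name `width`, and the toy values are made up for illustration.

```r
## reshape a wide ring-width table (one column per tree) to long form and
## attach the per-tree metadata
## (sketch; assumes Year is the first column and all other columns are trees)
ring_long <- function(ring, ringmeta) {
  long <- data.frame(
    Year  = rep(ring$Year, times = ncol(ring) - 1),
    tree  = rep(names(ring)[-1], each = nrow(ring)),
    width = unlist(ring[-1], use.names = FALSE)
  )
  ## drop years in which a tree has no recorded ring
  long <- long[!is.na(long$width), ]
  merge(long, ringmeta, by = "tree")
}

## made-up toy values, only to show the shape of the result
toy_ring <- data.frame(Year = 2000:2002,
                       T1 = c(NA, 1.2, 1.1),
                       T2 = c(0.8, 0.9, 1.0))
toy_meta <- data.frame(tree = c("T1", "T2"),
                       species = c("Abies_alba", "Quercus_petraea"))
toy_long <- ring_long(toy_ring, toy_meta)
toy_long
```

With the real data the call would be `ring_long(ring, ringmeta)`, giving one row per tree and year.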


## Frog abnormalities data

These data on [frog abnormalities](https://www.datadryad.org/resource/doi:10.5061/dryad.sq72d.2) come from [Reeves et al.](https://doi.org/10.1890/09-0879.1) and cover 21 wetland sites in Alaska; abnormality is recorded as a dichotomous outcome.  We read in the abnormalities file, which has information about the individual frogs studied.

In [5]:
frogURL <- "https://datadryad.org/stash/downloads/file_stream/98621"
tmpfile <- tempfile(fileext=".csv")
download.file(frogURL,tmpfile)
frog <- read.csv(tmpfile)
head(frog)

,COLLECTION_ID,FROG_ID,GOSNER_STAGE,SVL,TAIL_LENGTH,FROG_COMMENTS,ABNORMAL,BLEEDING_INJ,SKEL_AB,EYE_AB,SURF_AB,Perkensus,SITE,DATE,YEAR
,<chr>,<chr>,<int>,<int>,<int>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>
1,KNA1021-RASY-080712,15,45,17,2,,0,0,0,0,0,0,KNA1021,8/7/2012,2012
2,KNA1024-RASY-080812,13,44,22,20,,0,0,0,0,0,0,KNA1024,8/8/2012,2012
3,KNA1069-RASY-080612,24,45,18,1,,0,0,0,0,0,0,DNR1069,8/6/2012,2012
4,KNA1090-RASY-080612,5,45,17,3,,0,0,0,0,0,0,KEN1090,8/6/2012,2012
5,KNA11119-RASY-081612,26,44,20,17,,0,0,0,0,0,0,KNA111-19,8/16/2012,2012
6,KNA1024-RASY-080812,47,44,21,33,"~ 3mm of right thigh is comparable to left, remainder of thigh/calf are underdeveloped and foot is not fully developed, digits are not differentiated.",1,0,1,0,0,0,KNA1024,8/8/2012,2012


## Bird infection data

These data from [Clark et al.](https://doi.org/10.1111/1365-2656.12578) record infections
in wild birds by capture session in New Caledonia, a subtropical Pacific archipelago.  We read in
the data first.  Then we calculate the mean and variance of infection by capture session.

In [6]:
birdURL <- "https://datadryad.org/stash/downloads/file_stream/71521"
tmpfile <- tempfile(fileext=".csv")
download.file(birdURL,tmpfile)
bird <- read.csv(tmpfile)
head(bird)

,Bird,Data.source,Genus,Species,Island,Habitat,Capture.session,Infected,Haem,H.zosteropis,H.killangoi,Plas,Microfilaria,Heterophil,Lymphocyte,Basophil,Monocyte,Eosinophil,H.L.Ratio,Time
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
1,GT9,This study,Zosterops,Green-backed white-eye,Grand Terre,Montane rainforest,Julien's place,0,0,0,0,0,0,6.0,89.0,0.0,5.0,0.0,0.06741573,0.3055556
2,GT218,This study,Zosterops,Green-backed white-eye,Grand Terre,Montane rainforest,Stefen's driveway,0,0,0,0,0,0,,,,,,,0.75
3,GT166,This study,Zosterops,Green-backed white-eye,Grand Terre,Open lowland,Water station,0,0,0,0,0,0,,,,,,,0.6666667
4,GT49,This study,Zosterops,Green-backed white-eye,Grand Terre,Open lowland,Flo's place,0,0,0,0,0,0,1.0,88.0,8.0,0.0,3.0,0.01136364,0.25
5,GT147,This study,Zosterops,Green-backed white-eye,Grand Terre,Open lowland,Poe,0,0,0,0,0,0,6.0,88.0,4.0,2.0,0.0,0.06818182,0.7291667
6,L903,This study,Zosterops,Large Lifou white-eye,Lifou,Lowland rainforest,Ngoni forest,0,0,0,0,0,0,,,,,,,
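
The mean and variance of infection by capture session mentioned above could be computed along the following lines. This is a sketch in base R, assuming the columns `Capture.session` and `Infected` (a 0/1 indicator) seen in the output above. The function name `infection_by_session` and the toy rows are made up for illustration.

```r
## summarise a 0/1 infection indicator by capture session
## (sketch; assumes columns named Capture.session and Infected)
infection_by_session <- function(df) {
  n <- tapply(df$Infected, df$Capture.session, length)
  m <- tapply(df$Infected, df$Capture.session, mean, na.rm = TRUE)
  v <- tapply(df$Infected, df$Capture.session, var, na.rm = TRUE)
  data.frame(Capture.session = names(m),
             n = as.vector(n),      # number of birds captured in the session
             mean = as.vector(m),   # proportion infected
             var = as.vector(v),    # sample variance of the indicator
             row.names = NULL)
}

## made-up toy rows, only to show the shape of the result
toy <- data.frame(Capture.session = c("A", "A", "A", "A", "B", "B"),
                  Infected = c(0, 1, 1, 0, 1, 1))
toy_summary <- infection_by_session(toy)
toy_summary
```

With the real data the call would be `infection_by_session(bird)`. Sessions with a single bird get a variance of `NA`, since the sample variance is undefined for one observation.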


## Emergency department triage

This dataset records [short-term (30-day) mortality](https://datadryad.org/resource/doi:10.5061/dryad.m2bq5) from [Kristensen et al.](https://doi.org/10.1186/s13049-017-0458-x), along with a number of blood tests, age, and sex as possible predictors.  We will just look at mortality as a function of age.

In [1]:
## make a temporary filename with an .rda extension
tmpfile <- tempfile(fileext=".rda")
## download that file
## mode="wb" keeps the binary file intact on Windows
download.file("https://datadryad.org/stash/downloads/file_stream/27482",tmpfile,mode="wb")
## load it
load(tmpfile)
## assign it to a different name as it has an unhelpful name
emergency <- data
rm(data)
head(emergency)

,triage,age,sex,crp,k,na,hb,crea,leu,alb,⋯,mort30,icutime,icustatus,inddage,genindl.1,saturation,respirationsfrekvens,puls,systoliskblodtryk,gcs
,<fct>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
2430,yellow,75,female,281.1013,3.266996,135.5883,8.4,34.50495,24.61,22.25447,⋯,0,999999,0,23,0,97.0,20.0,124.0,120.0,15
4250,red,21,female,3.470628,4.014938,137.3804,7.9,61.14286,18.69,,⋯,0,999999,0,2,0,99.0,18.0,102.0,164.0,15
5002,green,83,female,19.91354,4.254919,132.2926,7.5,61.46023,12.25,43.55455,⋯,0,999999,0,1,0,97.0,16.0,87.0,156.0,15
5375,orange,71,male,32.09275,4.9,135.0,9.9,89.0,16.39,39.1746,⋯,1,0,1,1,0,,,,,14
1424,yellow,86,female,76.25685,3.6,138.0,7.6,92.0,14.53,40.15953,⋯,0,999999,0,1,0,97.0,20.0,69.0,150.0,14
1057,yellow,84,female,2.324502,3.3,140.0,7.2,76.0,5.17,44.07696,⋯,0,999999,0,3,0,97.0,20.0,70.0,167.0,15


In [2]:
nrow(emergency)

In [3]:
dim(emergency)
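
One simple way to look at mortality as a function of age is to tabulate the proportion of deaths within age bands. The sketch below assumes the numeric `age` column and the 0/1 `mort30` column seen in the output above. The function name `mortality_by_age`, the band boundaries, and the toy rows are arbitrary choices made for illustration.

```r
## proportion dying within 30 days, by age band
## (sketch; assumes a numeric age column and a 0/1 mort30 column)
mortality_by_age <- function(df, breaks = c(0, 40, 60, 80, Inf)) {
  ## left-closed bands such as [0,40), [40,60), ...
  band <- cut(df$age, breaks = breaks, right = FALSE)
  data.frame(
    age.band = levels(band),
    n = as.vector(table(band)),   # rows with a recorded age in the band
    mort30 = as.vector(tapply(df$mort30, band, mean, na.rm = TRUE))
  )
}

## made-up toy rows, only to show the shape of the result
toy <- data.frame(age = c(25, 35, 55, 65, 75, 85, 90),
                  mort30 = c(0, 0, 0, 1, 0, 1, 1))
toy_mort <- mortality_by_age(toy)
toy_mort
```

With the real data the call would be `mortality_by_age(emergency)`. A logistic regression such as `glm(mort30 ~ age, family = binomial, data = emergency)` would be one model-based alternative to this banded summary.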

## Tuberculosis transmission data

We will use the [tuberculosis transmission data](https://www.datadryad.org/resource/doi:10.5061/dryad.br760) from [Grandjean et al.](https://doi.org/10.1371/journal.pmed.1001843), where individuals in contact with a TB patient in Peru were followed up for about three years.  The event of interest was development of TB, and the primary predictor of interest was whether the index individual had drug-resistant or drug-sensitive TB.

In [8]:
## make temporary file
tmpfile <- tempfile(fileext=".xlsx")
## download file
## mode="wb" keeps the binary file intact on Windows
download.file("https://datadryad.org/stash/downloads/file_stream/26803",tmpfile,mode="wb")
## read file
tb <- read_excel(tmpfile)
## make names "legal" for R
names(tb) <- make.names(names(tb))
head(tb)

Family.Code,Individual.Code,Incident.TB.Disease,MDR.or.Sensitive.Household,Follow.Up.Time,Index.Sex,Index.Education,Index.Sputum.Smear.Grade,Index.Diabetes,Index.Incarceration,⋯,Socio.Economic.Tertile,Contact.Age,Contact.Sex,Contact.Chemotherapy,Contact.Work,Contact.HIV,Index.Strain.Genotype,Contact.Diabetes,Contact.Index.Roomshare,Contact.Previous.TB.History
<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,0,MDR,545,1,2,0,0,0,⋯,2,,,,,,4,,,
1,2,0,MDR,118,1,2,0,0,0,⋯,2,2.0,1.0,1.0,3.0,0.0,4,0.0,0.0,0.0
1,3,0,MDR,118,1,2,0,0,0,⋯,2,3.0,0.0,1.0,3.0,0.0,4,0.0,0.0,0.0
1,4,0,MDR,118,1,2,0,0,0,⋯,2,3.0,1.0,0.0,3.0,0.0,4,0.0,0.0,0.0
1,5,0,MDR,545,1,2,0,0,0,⋯,2,8.0,0.0,0.0,1.0,0.0,4,0.0,0.0,0.0
1,6,0,MDR,545,1,2,0,0,0,⋯,2,5.0,0.0,0.0,2.0,0.0,4,0.0,0.0,0.0
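
To see what `make.names` does in the cell above, it can be tried on a few spreadsheet-style column names. The names below are hypothetical examples, not the actual headers of the Excel file.

```r
## hypothetical raw column names of the kind found in spreadsheets
raw_names <- c("Family Code", "Follow-Up Time", "2nd visit",
               "Contact Age", "Contact Age")

## characters that are not valid in R names become dots,
## and a name starting with a digit gets an "X" prefix
legal <- make.names(raw_names)
legal

## unique = TRUE also disambiguates repeated names with a numeric suffix
legal_unique <- make.names(raw_names, unique = TRUE)
legal_unique
```

Legal names allow columns to be accessed with `$` without backticks, for example `tb$Follow.Up.Time`.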
