Final

Due: 2022-12-16, 11:59pm

This work must be submitted via Blackboard. The answers must be in MS Word (DOCX) or PDF format. Your submitted document should have sections corresponding to those in this assignment.

Include graphs as images in your document. Support all answers to questions using the Stata output, figures and tables. Include all code used to complete the exam so that your results can be reproduced (you may edit out irrelevant portions). Use the lecture notes as a guide.

In this exam you will analyze a dataset you have not seen before and submit the report, the data, and code in a single ZIP archive. The workflow is similar to what you might do for an actual data analysis project.

1. Gupte pulmonary function data

Gupte et al. (2017) studied lung function in 730 HIV-infected Black South African adults. Pre-bronchodilator spirometry was performed at enrollment and repeated annually for three years. They wanted to see if high viral load, low CD4 counts and ART were associated with development of OLD (obstructive lung disease).

Gupte AN, Wong ML, Msandiwa R, Barnes GL, Golub J, Chaisson RE, Hoffmann CJ, Martinson NA. Factors associated with pulmonary impairment in HIV-infected South African adults. PLoS One. 2017 Sep 13;12(9):e0184530. doi: 10.1371/journal.pone.0184530.

The data may be accessed using the following link: https://doi.org/10.5061/dryad.st5rk

We will use these data to develop models for predicting lung function, working with two measures: FEV1 (forced expiratory volume in one second) and FVC (forced vital capacity, the effective capacity of the lung).

  • (10%) Organize your work as follows:

     final/
     ├── code.txt
     ├── data
     │   ├── datafile.xlsx
     │   └── README
     └── report.pdf
    

The names of the files can be different from what they are above, but the structure should be maintained.

Zip the final folder and name the archive yourlastname-final.zip (replace yourlastname with your last name).
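The packaging step above can be sketched as follows. This is a minimal illustration in Python, not a required tool: the helper name build_archive, the temporary working directory and the empty placeholder files are assumptions made for the example, and any zip utility that preserves the final/ folder structure works equally well.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def build_archive(workdir: Path, lastname: str) -> Path:
    """Create the final/ layout with placeholder files and zip it."""
    final = workdir / "final"
    (final / "data").mkdir(parents=True, exist_ok=True)
    # Placeholder files only: replace with your real code, data, README and report.
    for rel in ("code.txt", "data/datafile.xlsx", "data/README", "report.pdf"):
        (final / rel).touch()
    # root_dir/base_dir keep "final/" as the top-level folder inside the archive.
    archive = shutil.make_archive(
        str(workdir / f"{lastname}-final"), "zip", root_dir=workdir, base_dir="final"
    )
    return Path(archive)


with tempfile.TemporaryDirectory() as tmp:
    zip_path = build_archive(Path(tmp), "yourlastname")
    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist())
    print(zip_path.name)
    print("\n".join(names))
```

Listing the archive contents before submitting is a quick way to confirm that the data folder and its README sit inside final/ rather than at the top level.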

  • (10%) Document the data. Download the data and put the data file in a separate directory. Make a README file describing the variables to the best of your ability. Note any ambiguities or gaps in information. You may consult the Gupte et al. (2017) paper to fill in gaps. Note the target population, study population and sampling design.

  • (5%) Data overview. How many unique subjects are in this dataset? What fraction returned for their 12-month, 24-month and 36-month visits?

  • (5%) Consider data from the first visit only. This means you only consider data from visit==0. How many data rows do you have?

  • (5%) Data types. Consider the following variables only, and list the data type of each.

    ca_sex ca_evrsmk ca_hadtb
    ce_prefev1 ce_prefvc crp cd4 vl age bmi
    
  • (5%) Missing data. How much data is missing for each variable? What fraction of the subjects have no missing data in any of the variables we are considering?

  • (5%) Data summaries. Make numerical and graphical summaries for all variables. Do they match what is reported in the paper?

  • (5%) Data transformations. Transform all appreciably skewed variables using logarithms (base 10) and make histograms of the transformed data. Comment on the transformed data distributions.

  • (10%) Pairwise associations and plots. Calculate appropriate association statistics of FEV1 with sex, smoking, TB, CRP, CD4, viral load, age and BMI. Use transformed versions of variables if they were transformed. Make appropriate plots for each association.

  • (5%) Linear regression. Build a linear regression model for FEV1 using predictor variables associated with it with a p-value of at most 0.2. Keep sex and age in the model regardless of association. How much variation in FEV1 is explained by the predictors?

  • (5%) Regression diagnostics. Make a residual vs fitted value plot for the model above. Do you suspect any deviations from the assumptions of linear regression?

  • (10%) Model interpretation. Write down the prediction equation for FEV1 based on the regression model fitted above. Comment on the nature of the association of each variable with FEV1 based on the model. What is the proportion of variance explained by the model?

  • (5%) Bootstrap. Use the bootstrap to estimate the confidence intervals for the coefficients in the model. Do your conclusions change?

  • (5%) Using FVC. Make a scatterplot of FEV1 and FVC. Build a regression model for predicting FEV1 from FVC.

  • (5%) Comparing models. What are the pros and cons of the two models (one considering all variables except FVC as predictors, and one considering only FVC)?

  • (5%) Conclusion. In non-technical language, summarize your findings from the analyses above.
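For the Bootstrap item above, the idea of a percentile bootstrap confidence interval can be sketched as follows. This is an illustration only, written in Python with the standard library, and it is not a substitute for the Stata output the exam asks for. The data are synthetic, the names ols_slope and bootstrap_ci are hypothetical, and the model has a single predictor to keep the example short; the exam model has several.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_ci(xs, ys, reps=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for the slope, resampling rows with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap sample of row indices
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    slopes.sort()
    lower = slopes[int((1 - level) / 2 * reps)]
    upper = slopes[int((1 + level) / 2 * reps) - 1]
    return lower, upper


# Synthetic stand-in for (age, FEV1); these are NOT the study data.
gen = random.Random(0)
age = [gen.uniform(20, 60) for _ in range(200)]
fev1 = [4.0 - 0.03 * a + gen.gauss(0, 0.5) for a in age]

slope_hat = ols_slope(age, fev1)
ci_low, ci_high = bootstrap_ci(age, fev1)
print(f"slope = {slope_hat:.4f}, 95% bootstrap CI = ({ci_low:.4f}, {ci_high:.4f})")
```

Whole rows are resampled here, so each bootstrap sample keeps every subject's predictor and outcome together. Comparing the resulting interval with the model-based interval from the regression output is one way to judge whether the conclusions change.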

2. Acknowledgements

Please acknowledge individuals who helped you or resources that were helpful in completing this exam.