Homework 2

Due: 2025-02-04, 11:59pm

Submission instructions

For this homework you should submit a ZIP archive containing:

  • A single document with the answers to all of the following items, in HTML format only. Include plain-English text between the code blocks and their output to interpret what R is giving you.
  • Code file used to generate the answers (RMD format). There should be comments in the code blocks.

  • Jupyter notebook (IPYNB) is okay.
  • Please remember to mix your comments with code and output.
  • Do not forget acknowledgements.

In this homework we will practice working with factors.

1. Final project (50%)

Outline your idea for the final project. State the following.

  • Goals: Outline what you want to investigate and questions you want answered.
  • Data: Describe the dataset, whether you already have it in hand, and whether the instructor can have access to it. Mention the size of the data, and evaluate whether the data allow you to answer the question.
  • Analysis: Sketch the analysis steps you expect to take.

The answers are not intended to be definitive, and are the first step in initiating a discussion about the project idea.

2. Creating a factor from a numeric variable (50%)

We will revisit the Southwestern Chinese cohort study.

Creating factors from numeric variables is helpful in some situations when you expect non-linear effects. There is some evidence that the relationship between systolic and diastolic blood pressure is non-linear, and we will look at it by dividing up the range of diastolic blood pressure into bins.
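As a rough illustration of why binning can expose a non-linear effect, the sketch below uses simulated data (not the cohort data). The variables x and y, the U-shaped relationship, and the bin edges are all assumptions made up for this example.

```r
# Simulated data with a deliberately non-linear (U-shaped) relationship.
# x, y and the bin edges are hypothetical, chosen only for illustration.
set.seed(42)
x <- runif(300, min = 0, max = 10)
y <- (x - 5)^2 + rnorm(300, sd = 1)

# Turn the numeric x into a factor with five bins of width 2.
# include.lowest = TRUE keeps a value equal to 0 in the first bin.
x_bin <- cut(x, breaks = c(0, 2, 4, 6, 8, 10), include.lowest = TRUE)

# Mean of y within each bin: the means fall and then rise again,
# a pattern that a single straight-line slope in x could not describe.
bin_means <- tapply(y, x_bin, mean)
bin_means
```

Treating the binned variable as a factor lets each bin have its own mean, so no particular shape is imposed on the relationship.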

  • Calculate the quintiles of DBP using the quantile command. The quintile cut points are the 20th, 40th, 60th and 80th percentiles.
  • Use the cut command to make a factor variable from DBP. Label them q1 through q5. There should be five categories. Use the table command to tabulate this variable.
  • Perform linear regression of SBP on the quintile categories of DBP. Is the association positive or negative?
  • Using the relevel command, make the middle quintile (q3) the reference category. In many situations, it is more natural to have the middle category as the reference. Repeat the linear regression, and explain the new regression coefficients using the previous regression output.
  • (Optional) Repeat the above steps by dividing the range of DBP into equal-width bins (instead of quintiles). The width of the bins will be the same, but the number of individuals in the bins will vary.
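The mechanics of the commands named above can be sketched on simulated data. This is only a minimal sketch under stated assumptions: the data frame dat, its simulated SBP and DBP values, and the helper column names DBP_cat, DBP_cat_ref and DBP_width are hypothetical, and the cohort data will need its own variable names and interpretation.

```r
# Simulated stand-in data; this is NOT the cohort study data.
# The column names SBP and DBP are assumed for illustration.
set.seed(1)
n <- 500
dat <- data.frame(DBP = rnorm(n, mean = 80, sd = 10))
dat$SBP <- 50 + 0.9 * dat$DBP + rnorm(n, sd = 8)

# Break points at 0%, 20%, ..., 100%. The four inner values are the
# quintile cut points; the minimum and maximum close the outer bins.
brks <- quantile(dat$DBP, probs = seq(0, 1, by = 0.2))

# Five categories labelled q1..q5. include.lowest = TRUE keeps the
# smallest DBP value in q1 instead of turning it into NA.
dat$DBP_cat <- cut(dat$DBP, breaks = brks,
                   labels = paste0("q", 1:5), include.lowest = TRUE)
table(dat$DBP_cat)

# Regression with the default reference category (the first level, q1).
fit_q1 <- lm(SBP ~ DBP_cat, data = dat)
coef(fit_q1)

# Same model after making q3 the reference category.
dat$DBP_cat_ref <- relevel(dat$DBP_cat, ref = "q3")
fit_q3 <- lm(SBP ~ DBP_cat_ref, data = dat)
coef(fit_q3)

# Equal-width alternative: a single number for breaks asks cut() for
# that many intervals of equal width across the range of DBP.
dat$DBP_width <- cut(dat$DBP, breaks = 5)
table(dat$DBP_width)
```

Comparing the two calls to table shows the trade-off described in the optional item: quintile bins hold roughly equal counts but differ in width, while equal-width bins hold unequal counts.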

3. Acknowledgements

Cite resources or individuals helping you.