ABSTRACT

What might affect the chance of getting heart disease? One of the earliest studies addressing this issue started in 1960 and used 3154 healthy men, aged from 39 to 59, from the San Francisco area. At the start of the study, all were free of heart disease. Eight and a half years later, the study recorded whether these men now suffered from heart disease along with many other variables that might be related to the chance of developing this disease. We load a subset of this data from the Western Collaborative Group Study described in Rosenman et al. (1975):

    data(wcgs, package="faraway")
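As a quick check on the loaded data, the following sketch summarizes the three variables discussed here and counts the cases of heart disease. It assumes the faraway package is installed; the object names chd_counts and prop_chd are introduced only for illustration.

```r
# Assumes the faraway package is installed: install.packages("faraway")
data(wcgs, package = "faraway")

# Restrict attention to the three variables discussed in the text
summary(wcgs[, c("chd", "height", "cigs")])

# Number of men without and with heart disease at follow-up
chd_counts <- table(wcgs$chd)
chd_counts

# Proportion of men who developed heart disease
prop_chd <- mean(wcgs$chd == "yes")
prop_chd
```

The count for "yes" should agree with the number of cases quoted in the text, and the proportion shows how rare the outcome is in this sample.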

We see that only 257 men developed heart disease, as given by the factor variable chd. The men vary in height (in inches) and in the number of cigarettes (cigs) smoked per day. We can plot these data using R base graphics:

    plot(height ~ chd, wcgs)
    wcgs$y <- ifelse(wcgs$chd == "no", 0, 1)
    plot(jitter(y, 0.1) ~ jitter(height), wcgs, xlab="Height", ylab="Heart Disease", pch=".")

The first panel in Figure 2.1 shows a boxplot. It shows that the distributions of height are similar for the two groups of men, those with and those without heart disease. But heart disease is the response variable, so we might prefer a plot that treats it as such. We convert the absence/presence of disease into a numerical 0/1 variable and plot this in the second panel of Figure 2.1. Because heights are reported as round numbers of inches and the response can take only two values, many points would coincide exactly. It is therefore sensible to add a small amount of noise to each point, called jittering, so that we can distinguish them. Again we can see the similarity in the distributions. We might think about fitting a line to this plot.
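The idea of fitting a line can be tried directly. The sketch below is one possible way to do it, using ordinary least squares via lm() on the 0/1 response and overlaying the fitted line on the jittered plot. The name lmod is an illustrative choice, and the fit is shown only to explore the idea, not as a recommended model.

```r
# Assumes the faraway package is installed
data(wcgs, package = "faraway")

# Numerical 0/1 version of the response, as in the text
wcgs$y <- ifelse(wcgs$chd == "no", 0, 1)

# Ordinary least squares fit of the 0/1 response on height (illustration only)
lmod <- lm(y ~ height, data = wcgs)
coef(lmod)

# Jittered scatterplot with the fitted line overlaid
plot(jitter(y, 0.1) ~ jitter(height), wcgs,
     xlab = "Height", ylab = "Heart Disease", pch = ".")
abline(lmod)

# A straight line is not constrained to stay within [0, 1],
# so it is worth inspecting the range of the fitted values
range(fitted(lmod))
```

Because the response takes only the values 0 and 1, the fitted line is most naturally read as an estimated proportion at each height. Checking the range of the fitted values is a simple way to see whether that reading is plausible across the observed heights.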