Casestudy: Predicting Doctor Visits

A quick tutorial on analyzing and presenting count data
statistical analysis
presentation
Author

Paw Hansen

Published

March 25, 2024

Today I’m sitting at a cafe and feel like doing some quick data analysis. We’ll be working with the DoctorVisits data from the AER package. Our goal is first to learn what predicts number of doctor visits. Second, to practice the workflow advocated in Alexander (2023). That means, we’ll:

  1. Plan the kind of data and results we would like to end up with
  2. Use simulation to produce a data set with similar features as the one we would like to acquire
  3. Actually, acquire that data
  4. Explore the data set using graphs, models, and anything else that might make sense for our specific analysis
  5. Share what we found with others in a way that is compelling not only to stats aficionados but to stakeholders who care about the substance of what we’re studying.

Let’s go!

Packages used in this post
library(tidyverse)
library(rstanarm)
library(broom)
library(modelsummary)
library(marginaleffects)
library(AER)
library(ggsci)

theme_set(theme_minimal(base_size = 12))

Plan

For this example, we want to try to estimate what makes people go see their doctor. To plan this, we need to consider two things. First, what will a useful data set look like? And second, what might a final graph look like?

Concerning our first criterion, a useful data set should include a variable indicating how many times the respondent went to see their doctor for a given period of time. This will be our outcome variable when we get to modeling.

In addition, our data set should include one or more predictors. These could be a the individual (i.e. respondent) level or perhaps we might want to consider some form of multilevel modeling - perhaps seeing your doctor or not also depends on things like doctor supply at the state level, your family’s total financial resources, and several other factors on primarily found at the individual level.

In our example here, though, we’ll stick with what can be measured at the respondent level. Thus, a useful dataset might look a in [crossref].

Second, a useful graph could simply include a plot of the relevant predictors as well as their corresponding uncertainty intervals. [crossref] is one example.

[Make sketches]

Simulate

Next, we should try to simulate what our dataset might look like as this will prepare us for any challenges we might end up facing once we have acquired the real dataset.

Code
n_obs <- 1000

dat_sim <- 
  tibble(
    num_visits = rnorm(n = n_obs, 2) |> round(), 
    gender = sample(c("female", "male"), n_obs, replace = T),
    age = 20
  )

Aquire

Now we’re ready to acquire a real dataset. Today, that will be the easy part since a useful one is contained in the AER package. Let’s download and have a peak:

Code
data("DoctorVisits")

head(DoctorVisits)
  visits gender  age income illness reduced health private freepoor freerepat
1      1 female 0.19   0.55       1       4      1     yes       no        no
2      1 female 0.19   0.45       1       2      1     yes       no        no
3      1   male 0.19   0.90       3       0      0      no       no        no
4      1   male 0.19   0.15       1       0      0      no       no        no
5      1   male 0.19   0.45       2       5      1      no       no        no
6      1 female 0.19   0.35       5       1      9      no       no        no
  nchronic lchronic
1       no       no
2       no       no
3       no       no
4       no       no
5      yes       no
6      yes       no

Explore

Next, let’s start exploring our dataset. We should probably start by having a look at some basic descriptives as seen in Table 1.

Code
datasummary_balance(~1, DoctorVisits)
Table 1: Sample descriptives for the DoctorVisits dataset
Mean Std. Dev.
visits 0.3 0.8
age 0.4 0.2
income 0.6 0.4
illness 1.4 1.4
reduced 0.9 2.9
health 1.2 2.1
N Pct.
gender male 2488 47.9
female 2702 52.1
private no 2892 55.7
yes 2298 44.3
freepoor no 4968 95.7
yes 222 4.3
freerepat no 4099 79.0
yes 1091 21.0
nchronic no 3098 59.7
yes 2092 40.3
lchronic no 4585 88.3
yes 605 11.7

And of course, we should plot our dependent variable to see how it looks (Figure 1):

Code
DoctorVisits |> 
  ggplot(aes(visits)) + 
  geom_histogram(color = "white", fill = "lightblue") +
  scale_x_continuous(labels = 0:9, breaks = 0:9) + 
  labs(x = "Number of visits",
       y = "Count")

Figure 1: Histogram of the number of doctor visits found in the AER dataset

We could also try to explore patterns in the data using a correlation matrix. To do so, we first need to recode any factor variables into dummies:

Code
doctor_visits <- 
  DoctorVisits |> 
  mutate(across(where(is.factor), ~case_when(
    . == "no" ~ 0,
    . == "yes" ~ 1)))

datasummary_correlation(doctor_visits)
visits gender age income illness reduced health private freepoor freerepat nchronic lchronic
visits 1 . . . . . . . . . . .
gender 1 . . . . . . . . . .
age .12 1 . . . . . . . . .
income −.08 −.27 1 . . . . . . . .
illness .22 .20 −.15 1 . . . . . . .
reduced .42 .09 −.05 .22 1 . . . . . .
health .19 .02 −.09 .36 .28 1 . . . . .
private −.01 −.06 .28 −.05 −.03 −.05 1 . . . .
freepoor −.04 −.16 −.16 −.03 −.01 .06 −.19 1 . . .
freerepat .11 .60 −.40 .20 .08 .07 −.46 −.11 1 . .
nchronic .05 .30 −.08 .25 .00 .02 .03 −.07 .16 1 .
lchronic .14 .12 −.07 .22 .22 .19 −.04 .00 .16 −.30 1

Perhaps more useful is to model the thing:

Code
mod_pois <- 
  glm(visits ~ ., 
    data = DoctorVisits,
    family = "poisson")

mod_lm <- 
  lm(visits ~ ., 
    data = DoctorVisits)
Code
modelsummary(models = 
               list("Poisson Model" = mod_pois,
                    "Normal Model" = mod_lm))
Table 2: Predicting the number of doctor visits found in the AER dataset
Poisson Model Normal Model
(Intercept) −2.098 0.036
(0.102) (0.036)
genderfemale 0.156 0.034
(0.056) (0.022)
age 0.279 0.148
(0.166) (0.067)
income −0.187 −0.056
(0.085) (0.031)
illness 0.186 0.060
(0.018) (0.008)
reduced 0.127 0.103
(0.005) (0.004)
health 0.031 0.017
(0.010) (0.005)
privateyes 0.126 0.035
(0.072) (0.025)
freepooryes −0.438 −0.103
(0.180) (0.052)
freerepatyes 0.084 0.033
(0.092) (0.038)
nchronicyes 0.117 0.005
(0.067) (0.024)
lchronicyes 0.151 0.042
(0.082) (0.036)
Num.Obs. 5190 5190
R2 0.202
R2 Adj. 0.200
AIC 6735.7 11243.1
BIC 6814.4 11328.3
Log.Lik. −3355.850 −5608.565
F 152.175 119.029
RMSE 0.73 0.71

Same with rstanarm

Code
# library(rstanarm)
# 
# mod_pois_stan <- 
#   stan_glm(visits ~ ., 
#     data = DoctorVisits,
#     family = "poisson",
#     refresh=0)
# 
# mod_lm_stan <- 
#   stan_glm(visits ~ ., 
#     data = DoctorVisits,
#     family = "gaussian",
#     refresh=0)
# 
# mod_negbinomial_stan <- 
#   stan_glm(visits ~ ., 
#     data = DoctorVisits,
#     family = "neg_binomial_2",
#     refresh=0)
# 
# pp_check(mod_pois_stan)
# 
# pp_check(mod_lm_stan)
# 
# pp_check(mod_negbinomial_stan)

Share

Alexander, Rohan. 2023. “Telling Stories with Data,” June. https://doi.org/10.1201/9781003229407.