Paw Hansen - Casestudy: Predicting Doctor Visits

Today I’m sitting at a cafe and feel like doing some quick data analysis. We’ll be working with the DoctorVisits data from the AER package. Our goal is first to learn what predicts number of doctor visits. Second, to practice the workflow advocated in Alexander (2023). That means, we’ll:

Plan the kind of data and results we would like to end up with
Use simulation to produce a data set with similar features as the one we would like to acquire
Actually, acquire that data
Explore the data set using graphs, models, and anything else that might make sense for our specific analysis
Share what we found with others in a way that is compelling not only to stats aficionados but to stakeholders who care about the substance of what we’re studying.

Let’s go!

Packages used in this post

library(tidyverse)
library(rstanarm)
library(broom)
library(modelsummary)
library(marginaleffects)
library(AER)
library(ggsci)

theme_set(theme_minimal(base_size = 12))

Plan

For this example, we want to try to estimate what makes people go see their doctor. To plan this, we need to consider two things. First, what will a useful data set look like? And second, what might a final graph look like?

Concerning our first criterion, a useful data set should include a variable indicating how many times the respondent went to see their doctor for a given period of time. This will be our outcome variable when we get to modeling.

In addition, our data set should include one or more predictors. These could be a the individual (i.e. respondent) level or perhaps we might want to consider some form of multilevel modeling - perhaps seeing your doctor or not also depends on things like doctor supply at the state level, your family’s total financial resources, and several other factors on primarily found at the individual level.

In our example here, though, we’ll stick with what can be measured at the respondent level. Thus, a useful dataset might look a in [crossref].

Second, a useful graph could simply include a plot of the relevant predictors as well as their corresponding uncertainty intervals. [crossref] is one example.

[Make sketches]

Simulate

Next, we should try to simulate what our dataset might look like as this will prepare us for any challenges we might end up facing once we have acquired the real dataset.

Code

n_obs <- 1000

dat_sim <- 
  tibble(
    num_visits = rnorm(n = n_obs, 2) |> round(), 
    gender = sample(c("female", "male"), n_obs, replace = T),
    age = 20
  )

Aquire

Now we’re ready to acquire a real dataset. Today, that will be the easy part since a useful one is contained in the AER package. Let’s download and have a peak:

Code

data("DoctorVisits")

head(DoctorVisits)

  visits gender  age income illness reduced health private freepoor freerepat
1      1 female 0.19   0.55       1       4      1     yes       no        no
2      1 female 0.19   0.45       1       2      1     yes       no        no
3      1   male 0.19   0.90       3       0      0      no       no        no
4      1   male 0.19   0.15       1       0      0      no       no        no
5      1   male 0.19   0.45       2       5      1      no       no        no
6      1 female 0.19   0.35       5       1      9      no       no        no
  nchronic lchronic
1       no       no
2       no       no
3       no       no
4       no       no
5      yes       no
6      yes       no

Explore

Next, let’s start exploring our dataset. We should probably start by having a look at some basic descriptives as seen in Table 1.

Code

datasummary_balance(~1, DoctorVisits)

Table 1: Sample descriptives for the DoctorVisits dataset
		Mean	Std. Dev.
visits		0.3	0.8
age		0.4	0.2
income		0.6	0.4
illness		1.4	1.4
reduced		0.9	2.9
health		1.2	2.1
		N	Pct.
gender	male	2488	47.9
	female	2702	52.1
private	no	2892	55.7
	yes	2298	44.3
freepoor	no	4968	95.7
	yes	222	4.3
freerepat	no	4099	79.0
	yes	1091	21.0
nchronic	no	3098	59.7
	yes	2092	40.3
lchronic	no	4585	88.3
	yes	605	11.7

And of course, we should plot our dependent variable to see how it looks (Figure 1):

Code

DoctorVisits |> 
  ggplot(aes(visits)) + 
  geom_histogram(color = "white", fill = "lightblue") +
  scale_x_continuous(labels = 0:9, breaks = 0:9) + 
  labs(x = "Number of visits",
       y = "Count")

Figure 1: Histogram of the number of doctor visits found in the AER dataset

We could also try to explore patterns in the data using a correlation matrix. To do so, we first need to recode any factor variables into dummies:

Code

doctor_visits <- 
  DoctorVisits |> 
  mutate(across(where(is.factor), ~case_when(
    . == "no" ~ 0,
    . == "yes" ~ 1)))

datasummary_correlation(doctor_visits)

	visits	gender	age	income	illness	reduced	health	private	freepoor	freerepat	nchronic	lchronic
visits	1	.	.	.	.	.	.	.	.	.	.	.
gender		1	.	.	.	.	.	.	.	.	.	.
age	.12		1	.	.	.	.	.	.	.	.	.
income	−.08		−.27	1	.	.	.	.	.	.	.	.
illness	.22		.20	−.15	1	.	.	.	.	.	.	.
reduced	.42		.09	−.05	.22	1	.	.	.	.	.	.
health	.19		.02	−.09	.36	.28	1	.	.	.	.	.
private	−.01		−.06	.28	−.05	−.03	−.05	1	.	.	.	.
freepoor	−.04		−.16	−.16	−.03	−.01	.06	−.19	1	.	.	.
freerepat	.11		.60	−.40	.20	.08	.07	−.46	−.11	1	.	.
nchronic	.05		.30	−.08	.25	.00	.02	.03	−.07	.16	1	.
lchronic	.14		.12	−.07	.22	.22	.19	−.04	.00	.16	−.30	1

Perhaps more useful is to model the thing:

Code

mod_pois <- 
  glm(visits ~ ., 
    data = DoctorVisits,
    family = "poisson")

mod_lm <- 
  lm(visits ~ ., 
    data = DoctorVisits)

Code

modelsummary(models = 
               list("Poisson Model" = mod_pois,
                    "Normal Model" = mod_lm))

Table 2: Predicting the number of doctor visits found in the AER dataset
	Poisson Model	Normal Model
(Intercept)	−2.098	0.036
	(0.102)	(0.036)
genderfemale	0.156	0.034
	(0.056)	(0.022)
age	0.279	0.148
	(0.166)	(0.067)
income	−0.187	−0.056
	(0.085)	(0.031)
illness	0.186	0.060
	(0.018)	(0.008)
reduced	0.127	0.103
	(0.005)	(0.004)
health	0.031	0.017
	(0.010)	(0.005)
privateyes	0.126	0.035
	(0.072)	(0.025)
freepooryes	−0.438	−0.103
	(0.180)	(0.052)
freerepatyes	0.084	0.033
	(0.092)	(0.038)
nchronicyes	0.117	0.005
	(0.067)	(0.024)
lchronicyes	0.151	0.042
	(0.082)	(0.036)
Num.Obs.	5190	5190
R2		0.202
R2 Adj.		0.200
AIC	6735.7	11243.1
BIC	6814.4	11328.3
Log.Lik.	−3355.850	−5608.565
F	152.175	119.029
RMSE	0.73	0.71

Same with rstanarm

Code

# library(rstanarm)
# 
# mod_pois_stan <- 
#   stan_glm(visits ~ ., 
#     data = DoctorVisits,
#     family = "poisson",
#     refresh=0)
# 
# mod_lm_stan <- 
#   stan_glm(visits ~ ., 
#     data = DoctorVisits,
#     family = "gaussian",
#     refresh=0)
# 
# mod_negbinomial_stan <- 
#   stan_glm(visits ~ ., 
#     data = DoctorVisits,
#     family = "neg_binomial_2",
#     refresh=0)
# 
# pp_check(mod_pois_stan)
# 
# pp_check(mod_lm_stan)
# 
# pp_check(mod_negbinomial_stan)

Alexander, Rohan. 2023. “Telling Stories with Data,” June. https://doi.org/10.1201/9781003229407.

Plan

Simulate

Aquire

Explore

Share