---
output: html_document
editor_options:
chunk_output_type: console
---
# Model Specification Refinement: San Francisco Bay Area Work Mode Choice {#specification-chapter}
```{r setup, include = FALSE, cache = FALSE}
library(mlogit)
library(tidyverse)
library(modelsummary)
library(haven)
library(knitr)
library(kableExtra)
knitr::opts_chunk$set(cache = TRUE)
theme_set(theme_bw())
```
```{r loaddata}
# work trips data frame constructed in chapter 5
sf_mlogit <- read_rds("data/worktrips.rds")
```
## Introduction
This chapter describes and demonstrates the refinement of the utility function specification for
the multinomial logit (MNL) model for work mode choice in the San Francisco Bay Area. The
process combines the use of intuition, statistical analysis and testing, and judgment. The
intuition and judgment components of the model refinement process are based on theory,
anecdotal evidence, logical analysis, and the accumulated empirical experience of the model
developer. This empirical experience can be and often is enhanced through the advice of others
or through review of reports and published papers documenting previous modeling studies for
similar choice problems and contexts.
We explore a variety of different specifications of the utility functions to demonstrate
some of the most common specifications and testing methods. These tests include both formal
statistical tests and informal judgments about the signs, magnitudes, or relative magnitudes of
parameters based on our knowledge about the underlying behavioral relationships that influence
mode choice. The use of judgment and experience is an essential element of successful model
development since it is almost impossible to determine the “best” model specification solely on
the basis of statistical tests. A model that fits the data well may not necessarily describe the
causal relationships and may not produce the most reasonable predictions. Also, it is not
uncommon to find several model specifications that, for all practical purposes, fit the data
equally well, but which have very different specifications and forecast implications. Therefore,
practical model building involves considerable use of subjective judgment and is as much an art
as it is a science.
Different modelers have different styles and approaches to the model development
process. One of the most common approaches is to start with a minimal specification which
includes those variables that are considered essential to any reasonable model. In the case of
mode choice, such a specification might include travel time, travel cost and departure frequency
where appropriate for each alternative. Working from this minimal specification, incremental
changes are proposed and tested in an effort to improve the model in terms of its behavioral
realism and/or its empirical fit to the data while avoiding excessive complexity of the model.
Another common approach is to start with a richer specification which represents the model
developer’s judgment about the set of variables that is likely to be included in the final model
specification. For example, such a model might include travel time (separated into in-vehicle
and out-of-vehicle time, with out-of-vehicle time possibly adjusted to take account of the total
distance traveled), out-of-pocket travel cost (possibly adjusted by household income), frequency
of departure for carrier modes, household automobile ownership or availability, household
income, and size of the travel party.
We adopt the first of these methods in the following section for refinement of the
specification of a model of work mode choice as it is the most appropriate approach for those
who are new to discrete choice modeling. At each stage in the model development process, we
introduce incremental changes to the modal utility functions and re-estimate the model with the
objective of finding a more refined model specification that performs better statistically and is
consistent with theory and our *a priori* expectations about mode choice behavior. We introduce
small changes at each step as the estimation results for each stage provide useful insights which
may be helpful in further refining the model. The appropriateness of each specification change
is evaluated at each step using both judgmental and statistical tests.
In the rest of this chapter, we describe and demonstrate this process for work mode
choice in the San Francisco Bay Area.
## Alternative Specifications
The basic multinomial logit mode choice model for work commute in the San Francisco Bay
Area was reported in Table \@ref(tab:basic-estimation-table) in [CHAPTER 5](#chapter5). The refinements we consider include:
- Different specifications of the income effects,
- Different specifications of travel time,
- Additional decision maker related variables such as gender and automobiles owned,
- Additional variables that represent the interaction of decision maker related variables
with mode related variables (*e.g.*, interaction of income with cost), and
- Additional trip context variables (*e.g.*, dummy variable indicating if the trip
origin/destination is in a Central Business District).
### Refinement of Specification for Alternative Specific Income Effects
The estimation results for the base model in [CHAPTER 5](#chapter5) yielded time and cost parameter
estimates that had the expected (negative) sign and were statistically significant. The parameters
for the alternative specific income variables were significant and had the expected sign (negative
relative to drive alone), except for the shared ride income variables (shared ride 2 and
shared ride 3+), which were not significant; the sign on the shared ride 3+ income variable
was also counter-intuitive. All else being equal, we expect the preference for shared ride 2 to be
negative relative to drive alone and for shared ride 3+ to be more negative than shared ride 2
because of the increasing inconvenience of coordinating with other travelers as the number of
persons in the ride sharing group increases. However, the empirical results provide only limited
support for the first expectation and are inconsistent with the second expectation. This suggests
that the effect of income on choice is not necessarily different among the automobile modes.
We approach this inconsistency between expectation and empirical results by thinking of
other plausible relationships for the effect of income on shared ride choice and developing
alternative specifications which represent these relationships. Options for consideration include:
- The effect of income relative to drive alone is the same for the two shared ride modes (shared
ride 2 and shared ride 3+) but is different from drive alone and different from the other
modes. This relationship is represented by constraining the income coefficients in the two
shared alternatives to be equal as follows:
\begin{equation}
H_0 : \beta_{IncomeSR2} = \beta_{IncomeSR3+}
(\#eq:incomeandsharedrides)
\end{equation}
- The effect of income relative to drive alone is the same for both shared ride modes and
transit but is different for the other modes. This is represented in the model by constraining
the income coefficients in both shared ride modes and the transit mode to be equal as:
\begin{equation}
H_0 : \beta_{IncomeSR2} = \beta_{IncomeSR3+} = \beta_{IncomeTransit}
(\#eq:incomeonsharedrideandtransit)
\end{equation}
- The effect of income on all the automobile modes (drive alone, shared ride 2, and shared
ride 3+) is the same, but the effect is different for the other modes. We include this
constraint by setting the income coefficients in the utilities of the automobile modes to be
equal. In this case, we set them to zero since drive alone is the reference mode.
\begin{equation}
H_0 : \beta_{IncomeSR2} = \beta_{IncomeSR3+} = 0
(\#eq:automotivemodessame)
\end{equation}
The estimation results for the base model (from [CHAPTER 5](#chapter5)) and for these three alternative
models are reported in Table \@ref(tab:incspec-models). The parameter estimates for all three models are consistent
with expectations. That is, the effect of increasing income is neutral or negative for the shared
ride modes relative to drive alone and equal to or more negative for transit, bike and walk than
for shared ride. Further, all the parameters are significant except for the shared ride income
parameters in Model 1W.
Selection of one of these four models to represent the effect of income should consider
the statistical relationships among them and the reasonableness of the resultant models. Since
Models 1W, 2W and 3W are constrained versions of the Base Model and Models 2W and 3W
are constrained versions of Model 1W, we can use the likelihood ratio test to evaluate the
hypotheses implied by each of these models (see [section 5.7.3.2](#section5-7-3-2)). We use this test to determine
if the hypothesis that each of these models is the true model is or is not rejected by the less
restricted model. The likelihood ratio statistics (Equation 5.16), the degrees of freedom or
number of restrictions and the level of significance for each test are reported relative to the Base
Model and to Model 1W in Table \@ref(tab:incspec-goftest). The Base
Model cannot reject any of the subsequent models at a reasonable level of significance. Further,
the Base Model has a counter-intuitive relationship between the parameters for shared ride 2 and
shared ride 3+. Thus, Model 1W or Model 3W can represent the effect of income on mode
choice in this case. We choose Model 1W because it is most consistent with our prior
hypotheses about the effect of income on preference between drive alone and shared ride and
other modes. However, the differences among these models are small both statistically and
behaviorally so the decision should be subject to a review before adoption of the final
specification [^statbasis].
```{r incspec-models}
model_base <- mlogit(chosen ~ tvtt + cost | hhinc, data = sf_mlogit)
#Having issues setting Shared Ride 2 and 3+ to be equal to each other
model_1w <- mlogit(chosen ~ tvtt + cost | hhinc, data = sf_mlogit)
#Having issues setting Shared Ride 2, 3+, and Transit to be equal to each other
model_2w <- mlogit(chosen ~ tvtt + cost | hhinc, data = sf_mlogit)
model_3w <- mlogit(chosen ~ tvtt + cost | hhinc, data = sf_mlogit, constPar = c('hhinc:Share ride 2' = 0, 'hhinc:Share ride 3++' = 0))
altspecinc_estimation <- list(
"Base Model " = model_base,
"Model 1W" = model_1w,
"Model 2W" = model_2w,
"Model 3W" = model_3w
)
modelsummary(
altspecinc_estimation, fmt = "%.4f",
title = "Alternative Specifications of Income Variable"
)
```
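The comments in the chunk above note difficulty in imposing the equality restrictions of Models 1W and 2W. One common way to impose such a restriction is to construct a pooled income variable that equals income on the rows of the constrained alternatives and zero elsewhere, so that a single coefficient is shared by those alternatives. The following is a minimal sketch on toy data, not the estimation data; the alternative labels, column names, and income values are all hypothetical.

```r
# Toy long-format data: one row per (traveler, alternative).
# Alternative labels and column names are hypothetical.
alts <- c("Drive alone", "Share ride 2", "Share ride 3+", "Transit", "Bike", "Walk")
toy <- expand.grid(alt = alts, id = 1:2, stringsAsFactors = FALSE)
toy$hhinc <- c(40, 75)[toy$id]  # income is constant within a traveler

# Pooled income variable for the two shared ride modes (the Model 1W restriction):
# equal to income on shared ride rows and zero on all other rows.
shared <- c("Share ride 2", "Share ride 3+")
toy$hhinc_sr <- ifelse(toy$alt %in% shared, toy$hhinc, 0)

# Separate income variables for the remaining non-reference modes.
for (a in c("Transit", "Bike", "Walk")) {
  toy[[paste0("hhinc_", tolower(a))]] <- ifelse(toy$alt == a, toy$hhinc, 0)
}

toy[toy$id == 1, c("alt", "hhinc_sr", "hhinc_transit", "hhinc_bike", "hhinc_walk")]
```

Under these assumptions, entering `hhinc_sr` and the other constructed variables in the first (generic coefficient) part of the `mlogit` formula, in place of `| hhinc`, should estimate one income coefficient shared by both shared ride modes. For the Model 2W restriction, the pooled set would also include the transit alternative.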
```{r incspec-goftest, echo = F}
#This table is incomplete because Models 1-3 are not being restricted in the same way as in the original table.
lrtest_compare <- function(m){
lrtest(m, model_base)[2, 4]
}
lrtest_p <- function(m){
lrtest(m, model_base)[2, 5]
}
tibble(
model = list(model_1w, model_2w, model_3w)
) %>%
mutate(
Model = c("Model 1W","Model 2W","Model 3W"),
loglik = map_dbl(model, logLik),
lrtest = map_dbl(model, lrtest_compare),
p_val = map_dbl(model, lrtest_p)
) %>%
select(-model)%>%
kbl(align = 'c', caption = "Likelihood Ratio Tests between Models 1W, 2W, 3W and Base Model") %>%
kable_styling()
```
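The likelihood ratio comparison used in this section can also be computed by hand from the log-likelihoods of a restricted and an unrestricted model. The sketch below uses base R only; the function name `lr_test` and the log-likelihood values are hypothetical and are not estimation results from this chapter.

```r
# Likelihood ratio test of a restricted model against an unrestricted model.
# The statistic -2 * (LL_restricted - LL_unrestricted) is compared with a
# chi-squared distribution whose degrees of freedom equal the number of restrictions.
lr_test <- function(ll_restricted, ll_unrestricted, n_restrictions) {
  stat <- -2 * (ll_restricted - ll_unrestricted)
  p_value <- pchisq(stat, df = n_restrictions, lower.tail = FALSE)
  c(statistic = stat, df = n_restrictions, p_value = p_value)
}

# Hypothetical log-likelihoods, for illustration only.
lr_result <- lr_test(ll_restricted = -1000.0, ll_unrestricted = -997.5,
                     n_restrictions = 2)
lr_result
```

A small p-value indicates that the restrictions are rejected by the less restricted model; a large one indicates that the restricted model cannot be rejected.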
### Different Specifications of Travel Time
The specification for travel time in the above models implies that the utility value of time is
equal for all the alternatives and between in-vehicle and out-of-vehicle time. However, we
expect travelers in non-motorized modes to be more sensitive to travel time than travelers in
motorized modes (since walking or biking is physically more demanding than traveling in a car)
and we expect that travelers are more sensitive to out-of-vehicle travel time (OVT) than to
in-vehicle travel time (IVT).
The estimation results for two specifications of travel time that relax these constraints are
reported with those for Model 1W in Table \@ref(tab:timespec-models). Model 5W relaxes the time constraints in Model
1W by specifying distinct time variables for the motorized and non-motorized modes based on
our expectation that travelers are likely to be more sensitive to travel time by non-motorized
modes. Model 6W relaxes the constraint further by disaggregating the travel time for motorized
modes into distinct components for IVT and OVT. This specification allows the two
components of travel time for motorized travel to have different effects on utility with the
expectation that travelers are more sensitive to out-of-vehicle time than in-vehicle time.
The estimation results for Model 5W reject the hypothesis of equal value of travel time
across modes implied in Model 1W, and Model 6W rejects the hypothesis of equal value of in-vehicle
and out-of-vehicle travel time for the motorized modes at a very high level of significance $(0.001)$. The
estimated parameters associated with travel time in Model 6W have the correct signs, and the
parameters for OVT for motorized modes and for time for non-motorized
modes are larger in magnitude than the parameter for IVT, as expected; however, the parameter
for IVT is very small and not statistically significant. Further, the ratio of the OVT parameter to the
IVT parameter for motorized modes, a factor of 30, is far greater than expected. Nonetheless, since Model 6W rejects
the constraints imposed by both Models 1W and 5W at a very high level of significance, we
cannot discard this model without further exploration.
Another perspective on the suitability of these models can be obtained by calculating the
relative importance of each component of travel time and cost which gives us the implied value
of each component of time. The implied value of in-vehicle-time for motorized modes is computed
for each model using the estimated motorized in-vehicle-time and cost parameters and
similarly for the other time components:
\begin{equation}
\text{Value of motorized IVTT (\$/hour)} = \frac{\beta_{\text{motorized ivtt (1/min.)}}}{\beta_{\text{cost (1/cents)}}} \times \frac{60 \text{ min./hour}}{100 \text{ cents/\$}}
(\#eq:valueofivtt)
\end{equation}
The implied values of in- and out-of-vehicle times for motorized modes in Models 1W, 5W, and
6W are reported in Table \@ref(tab:timespec-vot). The values of motorized out-of-vehicle time and non-motorized time
are somewhat low but not unreasonable compared to the average wage rate of \$21.20 per hour in
the region (1990 dollars); however, the value of in-vehicle time is unreasonably low.
Nevertheless, the likelihood ratio tests reject both Model 5W and Model 1W at very high levels
of significance. This raises doubt about the suitability of those models and suggests the need to
consider other specifications to evaluate the influence of travel time components on the utilities
of the different alternatives.
Two approaches are commonly taken to identify a specification which is not statistically
rejected by other models and has good behavioral relationships among variables. The first is to
examine a range of different specifications in an attempt to find one which is both behaviorally
sound and statistically supported. The other is to constrain the relationships between or among
parameter values to ratios which are considered reasonable. The formulation of these
constraints is based on the judgment and prior empirical experience of the analyst. Therefore,
the use of such constraints imposes a responsibility on the analyst to provide a sound basis for
his/her decision. The advice of other more experienced analysts is often enlisted to expand
and/or support these judgments.
```{r timespec-models}
#Having issues setting Shared Ride 2 and 3+ to be equal to each other
model_5w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + cost | hhinc, data = sf_mlogit)
#Having issues setting Shared Ride 2 and 3+ to be equal to each other
model_6w <- mlogit(chosen ~ nm_tvtt + mot_ovtt + mot_ivtt + cost | hhinc, data = sf_mlogit)
altspectvtt_estimation <- list(
"Model 1W" = model_1w,
"Model 5W" = model_5w,
"Model 6W" = model_6w
)
modelsummary(
altspectvtt_estimation, fmt = "%.4f",
title = "Estimation Results for Alternative Specifications of Travel Time[^trumodel], [^valuesoftime]"
)
```
```{r VOTfunctions, echo = FALSE}
#These functions are based off the equations given in the text above Table 6-6
VOTsimple <- function(model, timevar, costvar) {
coef(model)[timevar]*0.6/coef(model)[costvar]
}
VOTdistance <- function(model, timevar, timedistvar, dist_value, costvar) {
(coef(model)[timevar]+(coef(model)[timedistvar]/dist_value))*0.6/coef(model)[costvar]
}
```
```{r timespec-vot, echo = FALSE}
#Table 6-4 based off coefficients from models 1w, 5w, 6w
#These values are different from those in the text because the model coefficients are as well
tibble(
"Value of Time ($/hr)" = c("Value of Non-Motorized Time", "Value of Out-of-vehicle Time", "Value of In-vehicle Time"),
"Model 1W" = round(c(VOTsimple(model_1w, "tvtt", "cost"), VOTsimple(model_1w, "tvtt", "cost"), VOTsimple(model_1w, "tvtt", "cost")),2),
"Model 5W" = round(c(VOTsimple(model_5w, "nm_tvtt", "cost"), VOTsimple(model_5w, "mot_tvtt", "cost"), VOTsimple(model_5w, "mot_tvtt", "cost")),2),
"Model 6W" = round(c(VOTsimple(model_6w, "nm_tvtt", "cost"), VOTsimple(model_6w, "mot_ovtt", "cost"), VOTsimple(model_6w, "mot_ivtt", "cost")),2)
) %>%
kbl(align = 'c', caption = "Implied Values of Time in Models 1W, 5W, and 6W") %>%
kable_styling()
```
The primary shortcoming of the specification in Model 6W is that the estimated value of
IVT is unrealistically small. At least two alternatives can be considered for getting an improved
estimate of the value of in-vehicle time. One is to use an approach that has been effective in
other contexts; that is, to assume that the sensitivity of travelers to OVT diminishes with the trip
distance. The idea behind this is that travelers are more willing to tolerate higher out-of-vehicle
time for a long trip rather than for a short trip. We still expect that travelers will be more
sensitive to OVT than IVT for any travel distance. A formulation which ensures this result is to
include total travel time (the sum of in-vehicle and out-of-vehicle time) and out-of-vehicle time
divided by distance in place of in- and out-of-vehicle travel time. This specification, as shown
below, is consistent with our expectations provided that $\beta_1$ and $\beta_2$ are negative:
\begin{equation}
\begin{split}
V_{m} &= \gamma_{0,m} + \beta_{1} \times TTT_{m} + \beta_{2} \times \Big(\frac{OVT_{m}}{Dist}\Big) + \ldots\\
&= \gamma_{0,m} + \beta_{1} \times (IVT_{m} + OVT_{m}) + \frac{\beta_{2}}{Dist} \times OVT_{m} + \ldots\\
&= \gamma_{0,m} + \beta_{1} \times IVT_{m} + \Big(\beta_{1} + \frac{\beta_{2}}{Dist}\Big) \times OVT_{m} + \ldots
\end{split}
(\#eq:IVT-OVT)
\end{equation}
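As a short check on this claim, the relative importance of OVT and IVT implied by Equation \@ref(eq:IVT-OVT) can be written as the ratio of the two marginal utilities. This is a sketch that assumes $Dist > 0$ and uses only the symbols defined above:

\begin{equation}
\frac{\partial V_{m} / \partial OVT_{m}}{\partial V_{m} / \partial IVT_{m}} = \frac{\beta_{1} + \frac{\beta_{2}}{Dist}}{\beta_{1}} = 1 + \frac{\beta_{2}}{\beta_{1} \times Dist}
\end{equation}

When $\beta_1$ and $\beta_2$ are both negative, the term $\beta_{2} / (\beta_{1} \times Dist)$ is positive, so the ratio exceeds one at every distance and declines toward one as distance increases.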
An alternative approach is to impose a constraint on the relative importance of OVT and IVT.
This is achieved by replacing the travel time variables in the modal utility equations with a
weighted travel time (WTT) variable defined as in-vehicle time plus the appropriate travel time
importance ratio (TIR) times out-of-vehicle time (IVT + TIR×OVT). The mechanics of how this
constraint works is illustrated as follows:
\begin{equation}
\begin{split}
V_{m} &= \gamma_{0,m} + \beta_{1} \times IVT + (\beta_{1} \times TIR) \times OVT + \ldots \\
&= \gamma_{0,m} + \beta_{1} \times (IVT + TIR \times OVT) + \ldots\\
&= \gamma_{0,m} + \beta_{1} \times WTT + \ldots
\end{split}
(\#eq:IVT-TIRxOVT)
\end{equation}
so that the parameter for out-of-vehicle time is equal to the parameter for in-vehicle time
multiplied by the selected travel time importance ratio (TIR). In Models 8W and 9W, we use travel
time importance ratios of 2.5 and 4.0, respectively; Model 7W uses the distance-adjusted
specification in Equation \@ref(eq:IVT-OVT). The estimation results for these models
compared to Model 6W are reported in Table 6-5. The parameter estimates obtained for the travel
time, cost, and income variables in all four models have the correct signs and are statistically
significant. Model 7W has substantially better goodness-of-fit than Models 6W, 8W and 9W. Since
none of the other models are constrained versions of Model 7W, we use the non-nested hypothesis
test (see [Section 5.7.3.2](#section5-7-3-2), Equation 5.21) to compare it with Models 6W, 8W, and 9W.
We illustrate the non-nested hypothesis test by applying it to the hypothesis
that Model 6W is the true model given that Model 7W has a higher
$\bar{\rho}^{2}$. Since both models have the same number of parameters, the
term $(K_{7} - K_{6})$ drops out, and the equation becomes
\begin{equation}
\begin{split}
\text{Level of Rejection} &= \Phi[-(-2(\bar{\rho}_{7}^{2}-\bar{\rho}_{6}^{2})\ l(0))^{1/2}]\\
&= \Phi[-(-2(0.5129-0.5074)(-7309.6))^{1/2}]\\
&= \Phi(-8.97) \ll 0.001
\end{split}
(\#eq:non-nestedhypothesistest)
\end{equation}
That is, the null hypothesis that Model 6W is the true model is rejected at a significance level
far smaller than $0.001$. Models 8W and 9W are rejected as the true model even more
strongly.
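The arithmetic of the non-nested test can be reproduced in a few lines of base R. The sketch below uses the adjusted likelihood ratio indices and the zero-coefficient log-likelihood quoted in the equation above. The function name `nonnested_level` is hypothetical, and the placement of the parameter-count difference inside the square root is an assumption based on the description of Equation 5.21 in the text; it has no effect here because the two models have the same number of parameters.

```r
# Level of rejection for the non-nested hypothesis test.
# rho2_high, rho2_low: adjusted likelihood ratio indices of the two models
# ll0: log-likelihood with all coefficients equal to zero
# k_high, k_low: number of parameters in each model (assumed additive term)
nonnested_level <- function(rho2_high, rho2_low, ll0, k_high = 0, k_low = 0) {
  z <- -sqrt(-2 * (rho2_high - rho2_low) * ll0 + (k_high - k_low))
  c(z = z, level = pnorm(z))
}

# Model 7W against Model 6W, using the values quoted in the text.
res <- nonnested_level(rho2_high = 0.5129, rho2_low = 0.5074, ll0 = -7309.6)
round(res[["z"]], 2)
```

The resulting level of rejection is the standard normal cumulative probability at `z`, which is many orders of magnitude below 0.001 for these inputs.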
```{r tirspec-models}
sf_mlogit_trip_estimates <- sf_mlogit %>%
mutate(
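# weighted travel time with the travel time importance ratios given in the text (2.5 and 4.0)
TIR8 = (mot_ivtt + 2.5 * mot_ovtt),
TIR9 = (mot_ivtt + 4 * mot_ovtt),
# out-of-vehicle time divided by trip distance
ovtd = mot_ovtt/dist,
scalemot = 2.5 * mot_ovtt,
scalemot2 = 4 * mot_ovtt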
)
model_7w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + ovtd + cost | hhinc,
data = sf_mlogit_trip_estimates)
model_8w <- mlogit(chosen ~ nm_tvtt + TIR8 + cost | hhinc,
data = sf_mlogit_trip_estimates)
model_9w <- mlogit(chosen ~ nm_tvtt + TIR9 + cost | hhinc,
data = sf_mlogit_trip_estimates)
model_8a <- mlogit(chosen ~ nm_tvtt + (mot_ivtt + scalemot) + cost | hhinc,
data = sf_mlogit_trip_estimates)
model_9a <- mlogit(chosen ~ nm_tvtt + (mot_ivtt + scalemot2) + cost | hhinc,
data = sf_mlogit_trip_estimates)
```
```{r tirspec-models-tab, echo = FALSE}
trip_1 <- list(
"Model 6W" = model_6w,
"Model 7W" = model_7w,
"Model 8W" = model_8w,
"Model 9W" = model_9w
)
modelsummary(
trip_1, fmt = "%.4f",
title = "Estimation Results for Additional Travel Time Specification Testing"
)
```
Before adopting Model 7W, it is a good idea to evaluate and interpret the relative
importance of in-vehicle and out-of-vehicle time, and of each component of time relative to cost.
Despite the difference in the specification, this analysis is undertaken the same way as earlier;
that is, each time parameter is divided by the cost parameter to obtain the value of time.
The values of IVT and OVT in cents-per-minute (and dollars-per-hour) are shown in Table \@ref(tab:model7w-vot)
as a function of distance. The time values are obtained as described earlier by dividing each of
the time parameters (in utils-per-minute) by the cost parameter in utils per cent. For example,
the values for Model 7W are:
\begin{equation}
\begin{split}
\text{Value of IVTT} &= \frac{\beta_{mot\ tvtt}}{\beta_{cost}} = \frac{-0.0415}{-0.0041} = 10.1 \text{ cents/min} = \$6.07 \text{/hr}\\
\text{Value of OVT (5 mile trip)} &= \frac{\beta_{mot\ tvtt}+ \frac{\beta_{OVT/Dist}}{Dist}}{\beta_{cost}} = \frac{-0.0415+ \frac{-0.1812}{5}}{-0.0041} = 19.0 \text{ cents/min} = \$11.38 \text{/hr}
\end{split}
\end{equation}
These values of time are fixed for IVT but vary with distance for OVT[^costbyinc] as reported in Table \@ref(tab:model7w-vot)
for Model 7W. The corresponding values of time for Models 6W, 8W and 9W are shown in
Table \@ref(tab:timespec-vot2).
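The worked values above can be checked with a short base R calculation. This sketch hard-codes the three Model 7W coefficients quoted in the text rather than reading them from an estimated model object; the object and function names are illustrative only.

```r
# Coefficients quoted in the text for Model 7W.
beta_time     <- -0.0415  # motorized total travel time (utils per minute)
beta_ovt_dist <- -0.1812  # OVT divided by distance (utils per minute-per-mile)
beta_cost     <- -0.0041  # travel cost (utils per cent)

# Values of time in cents per minute.
vot_ivt <- beta_time / beta_cost
vot_ovt <- function(dist) (beta_time + beta_ovt_dist / dist) / beta_cost

# Convert cents per minute to dollars per hour.
to_dollars_per_hour <- function(cents_per_min) cents_per_min * 60 / 100

dist <- c(5, 10, 20)
vot_table <- data.frame(
  distance_miles = dist,
  ivt_dollars_hr = round(to_dollars_per_hour(vot_ivt), 2),
  ovt_dollars_hr = round(to_dollars_per_hour(vot_ovt(dist)), 2)
)
vot_table
```

The value of IVT is the same in every row, while the value of OVT falls as trip distance increases, which is the behavior the distance-adjusted specification is designed to produce.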
```{r model7w-vot, echo = F}
# Implied value of time (cents per minute) for Model 7W.
# The OVT-by-distance coefficient is named "ovtd" in the model_7w formula.
model7_vot <- function(model, timepar, costpar, distance){
  if(timepar == "ovtd"){
    # OVT value combines the total time coefficient and the distance-scaled OVT coefficient
    ( coef(model)["mot_tvtt"] + coef(model)[timepar]/ distance) / coef(model)[costpar]
  } else {
    coef(model)[timepar] / coef(model)[costpar]
  }
}
tibble(distance = c(5, 10, 20)) %>%
  rowwise() %>%
  mutate(
    "Value of Motorized Out-of-Vehicle Time" = model7_vot(model_7w, "ovtd", "cost", distance),
    "Value of Motorized Total Time" = model7_vot(model_7w, "mot_tvtt", "cost", distance),
    "Value of Non-Motorized Time" = model7_vot(model_7w, "nm_tvtt", "cost", distance),
  ) %>%
kbl(caption ="Model 7W Implied Values of Time as a Function of Trip Distance" ) %>%
kable_styling()
```
```{r timespec-vot2, echo = F}
tibble(
"Value of Time ($/hr)" = c("Value of Out-of-vehicle Time", "Value of In-vehicle Time"),
"Model 6W" = round(c(VOTsimple(model_6w, "mot_ovtt", "cost"), VOTsimple(model_6w, "mot_ivtt", "cost")),2),
# Models 8W and 9W have a single weighted-time coefficient; OVT is valued at the TIR (2.5 or 4.0) times IVT
"Model 8W" = round(VOTsimple(model_8w, "TIR8", "cost") * c(2.5, 1), 2),
"Model 9W" = round(VOTsimple(model_9w, "TIR9", "cost") * c(4, 1), 2)
) %>%
kbl(align = 'c', caption = "Implied Values of Time in Models 6W, 8W, and 9W") %>%
kable_styling()
```
The prevailing wage rate in the San Francisco Bay Area is $21.20 per hour[^refsfmodel]. In
comparison, the values of in-vehicle time implied by Models 6W, 8W, and 9W are very low and
the values of out-of-vehicle time are somewhat low. Model 7W produces higher, but still low,
values of time. Finally, we can examine the ratio of time values of OVT relative to IVT for all
four models as shown in Figure \@ref(fig:vottable). The ratio for Model 6W is
unacceptably high. Those for Models 7W, 8W and 9W are more reasonable.
```{r vottable, fig.cap = "Ratio of Out-of-Vehicle and In-Vehicle Time Coefficients for Work Models 6, 7, 8, and 9"}
vottable <- tibble(
model = c("Model 6W", "Model 7W", "Model 8W", "Model 9W"),
ovtt = c( VOTsimple(model_6w, "mot_ovtt", "cost"), VOTsimple(model_7w, "mot_tvtt", "cost"),
VOTsimple(model_8a, "scalemot", "cost"), VOTsimple(model_9a, "scalemot2", "cost")),
ivtt = c(VOTsimple(model_6w, "mot_ivtt", "cost"), VOTsimple(model_7w, "mot_tvtt", "cost"),
VOTsimple(model_8a, "mot_ivtt", "cost"), VOTsimple(model_9a, "mot_ivtt", "cost")),
ratio = ovtt / ivtt
)
tibble(
distance = .8:10,
`Model 6W` = vottable$ratio[1],
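# OVT/IVT ratio for Model 7W: (beta_1 + beta_2 / Dist) / beta_1
`Model 7W` = 1 + coef(model_7w)["ovtd"] / (coef(model_7w)["mot_tvtt"] * distance),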
`Model 8W` = vottable$ratio[3],
`Model 9W` = vottable$ratio[4],
) %>%
gather(model, vot, -distance) %>%
ggplot(aes(x = distance, y = vot, color = model)) +
scale_color_discrete("Model") +
geom_line() + xlab("Distance") + ylab("Value of Time Ratio") +
scale_y_log10()
```
The selection of a preferred travel time specification among the four alternative specifications
tested is relatively straightforward in this case. Model 7W outperforms the other models in all
the evaluations undertaken; it has the best goodness-of-fit, the most intuitive relationship
between the IVT and OVT variables and the most acceptable values of time[^imposedconstraints]. Consequently,
Model 7W is our preferred travel time specification. We can still consider imposing a constraint
between the time and cost variables to force the value of time to more reasonable levels.
However, we defer this until we explore other specification improvements.
### Including Additional Decision Maker Related Variables
There are strong theoretical and empirical reasons to expect that a variety of decision maker
related variables such as income, car availability, residential location, number of workers in the
household and others, influence workers’ choice of travel mode. The models reported to this
point include income as the only decision maker related explanatory variable. To the extent that
these variables influence the mode choice decision of travelers, their inclusion in the model will
increase the explanatory power and predictive accuracy of the model.
There are two general approaches to including decision maker related variables in
models. One is to include such variables as specific to each alternative (except for one base or
reference alternative) to indicate the extent to which changes in the variable value will increase
or decrease the utility of the mode to that traveler (relative to the reference alternative). The
other is to include such variables as interactions with mode related characteristics. For example,
cost can be divided by income to reflect the decreasing importance of cost with increasing annual
income. The inclusion of decision maker related variables as alternative specific variables is
demonstrated in this section. Similar treatment of trip context variables is considered in [Section 6.2.4](#section6-2-4). Interactions with mode characteristics are demonstrated in [Section 6.2.5](#section6-2-5).
We consider number of automobiles in the household, the number of autos divided by the
number of household workers and the number of autos divided by the number of persons of
driving age in the household. Since these variables are constant across all alternatives, they must
be included as distinct variables for each alternative (except for the reference alternative). This
is considered a full set of alternative specific variables. The estimation results for these
specifications and Model 7W are reported in Table 6-8.
These three new models have much better goodness-of-fit than Model 7W. Each model
rejects Model 7W as the true model at a very high level of significance. The parameters for
alternative specific automobile availability variables in all the three models have the expected
signs, negative relative to drive alone, with the exception of the shared ride 3+ variable in Model
10W which is not significant. Further, the signs and magnitude of the parameters for time, cost,
and income are stable across the models. Finally, Models 11W and 12W, which include cars-per-worker and
cars-per-number-of-adults, respectively, reject Model 10W as the true model.
Overall, Models 11W and 12W are superior to the other two models in terms of both
behavioral appeal (they provide an indication of automobile availability) and goodness of fit
(they statistically reject Models 7W and 10W). Model 11W has slightly better
goodness-of-fit than Model 12W but the difference is so small that the non-nested hypothesis test
is not able to distinguish between the two models. Therefore, selection of a preferred model is
primarily a matter of judgment. We select Model 11W but selection of Model 12W would be
equally appropriate.
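For nested comparisons such as Model 7W against Model 10W, the likelihood ratio test can be sketched in base R as follows. The function name `lr_test` and the log-likelihood values are hypothetical; the five restrictions assume six alternatives, so that one added traveler variable contributes five alternative specific parameters.

```r
# Likelihood ratio test of a restricted model against an unrestricted model
lr_test <- function(ll_restricted, ll_unrestricted, n_restrictions, p = 0.05) {
  statistic <- -2 * (ll_restricted - ll_unrestricted)
  critical  <- qchisq(1 - p, df = n_restrictions)
  list(
    statistic = statistic,
    critical = critical,
    p_value = pchisq(statistic, df = n_restrictions, lower.tail = FALSE),
    reject_restricted = statistic >= critical
  )
}

# Hypothetical log-likelihoods (not taken from Table 6-8)
result <- lr_test(ll_restricted = -3500, ll_unrestricted = -3480,
                  n_restrictions = 5)
```

This sketch does not apply to the comparison of Models 11W and 12W, which are not nested and require the non-nested hypothesis test instead.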
```{r Table 6-8 Estimation for Auto Availability, echo = F}
# all models working, but values are different to Table
sf_mlogit_autoavailability <- sf_mlogit %>%
mutate(autoperad = numveh / numadlt,
mot_ovttbydist = mot_ovtt / dist)
#Having issues setting Shared Ride 2 and 3+ to be equal to each other
model_7w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + mot_ovttbydist + cost | hhinc, data = sf_mlogit_autoavailability)
#Having issues setting Shared Ride 2 and 3+ to be equal to each other
model_10w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + mot_ovttbydist + cost | hhinc + numveh, data = sf_mlogit_autoavailability)
#Having issues setting Shared Ride 2 and 3+ to be equal to each other
model_11w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + mot_ovttbydist + cost | hhinc + vehbywrk, data = sf_mlogit_autoavailability)
#Having issues setting Shared Ride 2 and 3+ to be equal to each other
model_12w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + mot_ovttbydist + cost | hhinc + autoperad, data = sf_mlogit_autoavailability)
Autoavailability_estimation <- list(
"Model 7W" = model_7w,
"Model 10W" = model_10w,
"Model 11W" = model_11w,
"Model 12W" = model_12w
)
modelsummary(
Autoavailability_estimation, fmt = "%.4f",
title = "Estimation Results for Auto Availability Specification Testing"
)
```
### Including Trip Context Variables {#section6-2-4}
The models considered to this point include variables that describe the attributes of the alternatives
(modes) and the characteristics of decision-makers (the work commuters). The mode choice decision is
also influenced by variables that describe the context in which the trip is made. For
example, a work trip to the regional central business district (CBD) is more likely to be made by
transit than an otherwise similar trip to a suburban work place because the CBD is generally
well-served by transit, has more opportunities to make additional stops by walking and is less
auto friendly due to congestion and limited and expensive parking. This suggests that the model
specification can be enhanced by including variables related to the context of the trip, such as
destination zone location.
We consider two distinct variables to describe the trip destination context. One is a
dummy variable which indicates whether the destination zone (workplace) is located in the
CBD; the other is the employment density of different workplace destinations. The CBD
variable implies an abrupt increase in the likelihood of using public transit at the CBD boundary.
The density variable implies a continuous increase in the likelihood of using public transit with
increasing workplace density. A third option is to include both variables in the model. There is
disagreement about whether to include such combinations of variables since they both represent
the same underlying phenomenon: increasing transit use with increasing density of
development. There is no firm rule about this point; each case must be evaluated on its merits
based on statistical tests and reasonableness of the estimation results. As with the addition of
characteristics of the traveler, we introduce each variable as a full set of alternative specific
variables, each of which represents the effect of a change in that variable on the utility of the
alternative relative to the reference alternative (drive alone). Model 13W adds the alternative
specific CBD dummy variables to the variables in Model 11W. Model 14W adds the alternative
specific employment density variables and Model 15W adds both. Estimation results for these
specifications and Model 11W are reported in Table 6-9.
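The difference between the abrupt CBD effect and the continuous density effect can be seen in a short base R sketch. The coefficients and zone values are hypothetical and serve only to contrast the shapes of the two specifications.

```r
# Hypothetical transit coefficients relative to drive alone (illustrative only)
beta_cbd     <- 1.2      # alternative specific CBD dummy
beta_density <- 0.003    # alternative specific employment density

# Hypothetical workplace zones ordered by employment density
zones <- data.frame(
  empden = c(50, 150, 300, 450, 600),
  cbd    = c(0, 0, 0, 1, 1)
)

# Dummy specification: utility steps up once at the CBD boundary
zones$u_dummy   <- beta_cbd * zones$cbd
# Density specification: utility rises in proportion to density
zones$u_density <- beta_density * zones$empden
# Combined specification: both effects enter additively
zones$u_both    <- zones$u_dummy + zones$u_density
zones
```

In the combined specification the two terms move together across zones, which is why their joint inclusion has to be judged on statistical tests and the reasonableness of the estimates.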
```{r trip-context-models, echo = F}
#Estimates are close but slightly off, even after changing the travel time variables
#However, the coefficients given here provide the correct value of time values in Table 6-10
sf_mlogit_tripcontext <- sf_mlogit %>%
mutate(cbddumall = wkccbd + wknccbd,
mot_ovttbydist = mot_ovtt / dist)
model_13w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + mot_ovttbydist + cost | hhinc + vehbywrk + cbddumall, data = sf_mlogit_tripcontext)
model_14w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + mot_ovttbydist + cost | hhinc + vehbywrk + wkempden, data = sf_mlogit_tripcontext)
model_15w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + mot_ovttbydist + cost | hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext)
```
```{r trip-context-table, echo = FALSE}
tripcontext_estimation <- list(
"Model 11W" = model_11w,
"Model 13W" = model_13w,
"Model 14W" = model_14w,
"Model 15W" = model_15w
)
modelsummary(
tripcontext_estimation, fmt = "%.4f", statistic_vertical = FALSE,
title = "Estimation Results for Models with Trip Context Variables"
)
```
```{r Vot-13-14-15, echo = F}
#These functions are based on the equations given in the text above Table 6-6
#The factor 0.6 (= 60 / 100) converts cents per minute to dollars per hour
VOTsimple <- function(model, timevar, costvar) {
coef(model)[timevar]*0.6/coef(model)[costvar]
}
VOTdistance <- function(model, timevar, timedistvar, dist_value, costvar) {
(coef(model)[timevar]+(coef(model)[timedistvar]/dist_value))*0.6/coef(model)[costvar]
}
#Table 6-10 based off coefficients from models 13w, 14w, 15w
#These values are different from those in the text because the model coefficients are as well
tibble(
"Value of Time ($/hr)" = c("Value of Motorized IVT", "Value of Motorized OVT (10 mile trip)", "Value of Motorized OVT (20 mile trip)", "Value of Non-Motorized Time"),
"Model 13W" = round(c(VOTsimple(model_13w, "mot_tvtt", "cost"), VOTdistance(model_13w, "mot_tvtt", "mot_ovttbydist", 10, "cost"), VOTdistance(model_13w, "mot_tvtt", "mot_ovttbydist", 20, "cost"), VOTsimple(model_13w, "nm_tvtt", "cost")),2),
"Model 14W" = round(c(VOTsimple(model_14w, "mot_tvtt", "cost"), VOTdistance(model_14w, "mot_tvtt", "mot_ovttbydist", 10, "cost"), VOTdistance(model_14w, "mot_tvtt", "mot_ovttbydist", 20, "cost"), VOTsimple(model_14w, "nm_tvtt", "cost")),2),
"Model 15W" = round(c(VOTsimple(model_15w, "mot_tvtt", "cost"), VOTdistance(model_15w, "mot_tvtt", "mot_ovttbydist", 10, "cost"), VOTdistance(model_15w, "mot_tvtt", "mot_ovttbydist", 20, "cost"), VOTsimple(model_15w, "nm_tvtt", "cost")),2)
) %>%
kbl(align = 'c', caption = "Implied Values of Time in Models 13W, 14W, and 15W") %>%
kable_styling()
```
Each of the new models (13W, 14W and 15W) rejects Model 11W as the
true model at a very high level of significance. Further, the parameters for all of the alternative
specific CBD dummy and employment density variables have a positive sign, implying that all
else being equal, an individual is less likely to choose drive alone mode for trips destined to a
CBD and/or high employment density zones, as expected.
Since Models 13W and 14W are restricted versions of Model 15W, we can use the likelihood ratio test,
which rejects the hypothesis that each of these models is the true model.
Therefore, purely on statistical grounds, Model 15W is preferred over Models 13W and 14W.
However, this improvement in statistical fit comes at the cost of increased model complexity,
and it may be appropriate to adopt Model 13W or 14W, sacrificing statistical fit in favor of
parsimony[^parsimony]. For now, we choose Model 15W as the preferred model for its stronger statistical
results, but we will return to the issue of model complexity.
### Interactions between Trip maker and/or Context Characteristics and Mode Attributes {#section6-2-5}
Another approach to the inclusion of trip maker or context characteristics is through interactions
with mode attributes. The most common example of this approach is to take account of the
expectation that low-income travelers will be more sensitive to travel cost than high-income
travelers by using cost divided by income in place of cost as an explanatory variable. Such a
specification implies that the importance of cost in mode choice diminishes with increasing
household income. Table 6-11 portrays the estimation results for two models that differ only in
how they represent cost; Model 15W includes travel cost while Model 16W includes travel cost
divided by income.
```{r income_interaction, echo = F}
#I feel I've created the interaction term correctly and included the right variables, but values are still slightly off
model_15w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + mot_ovttbydist + cost |
hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext)
model_16w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + mot_ovttbydist + I(cost/hhinc) | hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext)
incomeinteraction_estimation <- list(
"Model 15W" = model_15w,
"Model 16W" = model_16w
)
modelsummary(
incomeinteraction_estimation, fmt = "%.4f", statistic_vertical = FALSE,
title = "Estimation Results for Models with Cost and Cost Divided by Income"
)
```
The cost by income variable has the expected sign and is statistically significant, but the
overall goodness-of-fit for the cost divided by income model is lower than that for Model 15W, which
uses cost without interaction with income. However, because theory and common sense suggest
that the importance of cost should decrease with income, we may choose Model 16W despite the
differences in the goodness-of-fit statistics. Since the estimation results contradict our
understanding of the decision making behavior, it is useful to consider other aspects of model
results. In the case of mode choice, we are particularly interested in the relative value of the time
and cost parameters because it measures the implied value of time used by travelers in choosing
their travel mode. Values of time evaluated with earlier models were somewhat lower than
expected when compared to the average wage rate. Using the cost by income formulation in
Model 16W, we can calculate the implied value of time using the relationship developed in
[Section 5.8.2](#section5-8-2).
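When cost enters utility as cost divided by income, the ratio of the time and cost derivatives depends on income. A minimal base R sketch of that relationship follows; the function name, coefficient values and income levels are hypothetical, and time is assumed to be in minutes and cost in cents, so the 60 / 100 factor matches the 0.6 conversion used in the chunks of this chapter.

```r
# Value of time when cost enters utility as cost / income:
#   V   = beta_time * time + beta_cost_inc * cost / income
#   VOT = (dV/dtime) / (dV/dcost) = beta_time * income / beta_cost_inc
vot_cost_by_income <- function(beta_time, beta_cost_inc, income) {
  beta_time * income / beta_cost_inc
}

# Hypothetical coefficients and income levels (illustrative only)
beta_time     <- -0.02
beta_cost_inc <- -0.05
income        <- c(low = 20, middle = 40, high = 80)

vot_cents_per_min  <- vot_cost_by_income(beta_time, beta_cost_inc, income)
vot_dollars_per_hr <- vot_cents_per_min * 60 / 100   # cents/min to $/hr
```

The implied value of time rises in proportion to income, which is the behavioral property that motivates the cost by income specification.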
The implied values of IVT and OVT from Model 16W are substantially higher than those
from Model 15W (Table 6-12) and more in line with our *a priori* expectations. This
improvement in the estimate of values of time more than offsets the difference in goodness-of-fit
so we adopt Model 16W as our preferred specification. Thus, our strong belief in both valuing
time relative to wage rate and higher estimates of the value of time provide evidence which is
strong enough to override the statistical test results. Nonetheless, we may still decide to impose
parameter constraints to obtain higher values of time.
``` {r impliedVOT15-16W}
VOTsimple <- function(model, timevar, costvar) {
coef(model)[timevar]*0.6/coef(model)[costvar]
}
VOTdistance <- function(model, timevar, timedistvar, dist_value, costvar) {
(coef(model)[timevar]+(coef(model)[timedistvar]/dist_value))*0.6/coef(model)[costvar]
}
model_15w <- mlogit(chosen ~ mot_tvtt + nm_tvtt + I(mot_ovtt/dist) + cost |
hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext)
model_16w <- mlogit(chosen ~ I(cost/hhinc) + mot_tvtt + nm_tvtt + I(mot_ovtt/dist) |
hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext)
tibble(
"Measure" = c("Value of In-Vehicle Time","Value of Out-of-Vehicle Time (10 mile trip)",
"Value of Out-of-Vehicle Time (20 mile trip)"),
"Model 15W" = c(
paste("$", round(VOTsimple(model_15w,"mot_tvtt","cost"),2), "/hr"),
paste("$", round(VOTdistance(model_15w, "mot_tvtt", "I(mot_ovtt/dist)", 10,"cost"),2), "/hr"),
paste("$", round(VOTdistance(model_15w, "mot_tvtt", "I(mot_ovtt/dist)", 20,"cost"),2), "/hr")
),
"Model 16W (Wage Rate = $21.20)" = c(
paste("$", round(coef(model_16w)["mot_tvtt"] / coef(model_16w)["I(cost/hhinc)"] * 21.20, 2), "/hr"),
paste("$", round((coef(model_16w)["mot_tvtt"] + coef(model_16w)["I(mot_ovtt/dist)"] / 10) / coef(model_16w)["I(cost/hhinc)"] * 21.20, 2), "/hr"),
paste("$", round((coef(model_16w)["mot_tvtt"] + coef(model_16w)["I(mot_ovtt/dist)"] / 20) / coef(model_16w)["I(cost/hhinc)"] * 21.20, 2), "/hr")
)) %>%
kbl(caption = "Implied Value of Time in Models 15W and 16W") %>%
kable_styling()
```
```{r compare-1516, echo=FALSE}
modelsummary(
list("15W" = model_15w, "16W" = model_16w),
coef_map = c("mot_tvtt" = "Motorized Travel Time",
"nm_tvtt" = "Non-motorized Travel Time",
"I(mot_ovtt/dist)" = "Motorized time per distance",
"cost" = "Trip Cost",
"I(cost/hhinc)" = "Trip cost divided by income"
)
)
```
### Additional Model Refinement
Generally, it is appropriate to test the preferred model specification against a variety of other
specifications; particularly reviewing decisions made earlier in the model development process.
Such testing would include reducing model complexity by the elimination of selected variables
(e.g., dropping either the CBD Dummy or Employment Density variables or combining some of
the alternative specific parameters), changing the form used for inclusion of different variables
(e.g., replacing income by log of income) or adding new variables which substantially improve
the explanatory power and behavioral realism of the model.
In this section, we consider simplifying the model specification by dropping variables
that are not statistically significant or by collapsing alternative specific variables that do not
differ across alternatives. The cost and time parameters are all significant and should be
included because they represent the impact of policy changes in mode service attributes. Among
the traveler and context variables, those for income have the lowest t-statistics so might be
considered for elimination; however, we prefer to keep these in the model since income
differences are important in mode selection, particularly for transit. However, the extremely low
values and lack of significance for the shared ride alternatives suggest that income has no
differential impact on the choice of drive alone versus any of the shared ride alternatives and
these variables should be dropped from the model (or constrained to zero). In addition, the
parameter for the number of automobiles by number of workers variable for shared ride 3+
alternative is smaller in magnitude than the parameter for the shared ride 2 alternative. This is
counter-intuitive as we expect shared ride 3+ travelers to be more sensitive to automobile
availability. This can be addressed by constraining the alternative specific variables for the
shared ride modes to be equal (we accomplish this by summing the two variables). The estimation
results for the simplified specification (constraining income for the shared ride alternatives to
zero, and constraining the automobile ownership by number of workers variable for the two
shared ride alternatives to be equal) and Model 16W are reported in Table 6-13.
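The idea of imposing equality by summing two alternative specific variables can be sketched on a small long-format data set in base R. The data values, the four-mode choice set and the new column names are hypothetical.

```r
# Hypothetical long-format data: one row per person-alternative pair
alts <- c("Drive alone", "Share ride 2", "Share ride 3+", "Transit")
long <- data.frame(
  id       = rep(1:2, each = length(alts)),
  alt      = rep(alts, times = 2),
  vehbywrk = rep(c(0.5, 2.0), each = length(alts))
)

# Unconstrained: one alternative specific column per shared ride mode,
# each of which would receive its own coefficient
long$vehbywrk_sr2 <- long$vehbywrk * (long$alt == "Share ride 2")
long$vehbywrk_sr3 <- long$vehbywrk * (long$alt == "Share ride 3+")

# Constrained: the sum is a single column, so one coefficient applies
# to both shared ride alternatives
long$vehbywrk_sr <- long$vehbywrk_sr2 + long$vehbywrk_sr3
long
```

A column built this way could then be entered as a variable with a single generic coefficient, so that the common value is estimated rather than fixed in advance.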
The goodness-of-fit measures for the two models are very close, suggesting that the constraints
imposed to simplify the model do not significantly impact the explanatory power of the model.
The results of the likelihood ratio test confirm that the restrictions imposed in Model 17W
cannot be statistically rejected. The parameter estimates for all the variables have the right sign
and are all statistically significant (except CBD dummy for bike and walk). We therefore select
Model 17W as our preferred model.
As discussed in the next section, the other major approach to searching for improved
models is market segmentation, that is, segmenting the population into groups which are expected to
use different criteria in making their mode choice decisions.
``` {r model16v17}
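# Model 17W: constPar holds the income coefficients for the shared ride alternatives at zero
# and holds the vehbywrk coefficients for the two shared ride alternatives at a common fixed value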
model_17w <- mlogit(chosen ~ I(cost/hhinc) + mot_tvtt + nm_tvtt + I(mot_ovtt/dist) | hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext, constPar = c("hhinc:Share ride 2" = 0, "hhinc:Share ride 3++" = 0, "vehbywrk:Share ride 2" = -0.3166, "vehbywrk:Share ride 3++" = -0.3166))
#table 6-13
list_1617 <- list(
"Model 16W" = model_16w,
"Model 17W" = model_17w
)
modelsummary(list_1617, title = "Estimation Results for Model 16W and Its Constrained Version")
```
## Market Segmentation
The models considered to this point implicitly assume that the entire population, represented by
the sample, uses the same model decision structure, variables and importance weights
(parameters) to select their commute to work mode. That is, we assume that the population is
homogeneous with respect to the importance it places on different aspects of service except as
differentiated by decision-maker characteristics included in the model specification. If this
assumption is incorrect, the estimated model will not adequately represent the underlying
decision processes of the entire population or of distinct behavioral groups within the population.
For example, mode preference may differ between low and high-income travelers as low-income
travelers are expected to be more sensitive to cost and less sensitive to time than high-income
travelers. This phenomenon is incorporated in the preceding models to a limited extent through
the use of alternative specific income variables and cost divided by income in the utility
specification. Market segmentation can be used to determine whether the impact of other
variables is different among population groups. The most common approach to market
segmentation is for the analyst to consider sample segments which are mutually exclusive and
collectively exhaustive (that is, each case is included in one and only one segment). Models are
estimated for the sample associated with each segment and compared to the pooled model (all
segments represented by a single model) to determine if there are statistically significant and
important differences among the market segments.
Market segmentation is usually based on socio-economic and trip related variables such
as income, auto ownership and trip purpose which may be used separately or jointly. Trip
purpose has already been used in our analysis by considering work commute trips exclusively.
Once segmentation variables are selected (income, auto ownership, etc.), different numbers of
segments may be considered for each dimension (e.g., we could use high, medium and low
income segments or only high and low income segments). All members of each segment are
assumed to have identical preferences and identical sensitivities to all the variables in the utility equation.
Analysts will often have some *a priori* ideas about the best segmentation variables and
the appropriate groupings of the population with respect to these variables. In the case of
continuous variables, such as income, the analyst may consider different boundaries between
segments. In cases where the analyst does not have a strong basis for selecting model segments,
he/she can test different combinations of socio-economic and trip-related variables in the data for
segmentation. This approach is limited by the fact that the number of segments grows very fast
with the number of segmentation variables (e.g., three income segments, two gender segments
and three home location segments results in 18 distinct groups). The multiplicity of segments
creates interpretational problems due to the complexity of comparing results among segments
and estimation problems due to the small number of observations in some of the segments (with
as many as 2,000 cases, eighteen segments would be likely to produce many segments with
fewer than 100 cases and some with fewer than 50 cases, well below the threshold for reliable
estimation results). The alternative of pre-defining market segments along one dimension at a
time is practical and easy to implement but it has the disadvantage that this approach does not
account for interactions among the segmentation variables.
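The growth in the number of segments can be checked with a few lines of base R. The level labels for home location are hypothetical; the counts follow the example in the text.

```r
# Segment counts along each dimension, as in the example in the text
n_levels   <- c(income = 3, gender = 2, home_location = 3)
n_segments <- prod(n_levels)

# Enumerate every combination of segment levels
segments <- expand.grid(
  income        = c("low", "medium", "high"),
  gender        = c("male", "female"),
  home_location = c("zone A", "zone B", "zone C")
)

# Average cases per segment if 2,000 cases were spread evenly
avg_cases <- 2000 / n_segments
```

Even with an even spread the average segment holds only about 111 cases, and real samples are far from evenly spread across segments.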
### Market Segmentation Tests
The determination of whether to segment the data is based on a comparison of the pooled model
for the entire sample/population and a set of segment specific models for each segment of the
sample/population. This comparison includes: (1) a statistical test, referred to as the market
segmentation or taste variation test, to determine if the segments are statistically different from
one another, (2) statistical significance and reasonableness of the parameters in each of the
segments, and (3) reasonableness of the relationships among parameters in each segment and
between parameters in the different market segments.
The statistical test for market segmentation consists of three steps. First, the sample is
divided into a number of market segments which are mutually exclusive and collectively
exhaustive. A preferred model specification is used to estimate a pooled model for the entire
data set and to estimate models for each market segment. Finally, the goodness-of-fit differences
between the segmented models (taken as a group) and the pooled model are evaluated to
determine if they are statistically different. This test is an extension of the likelihood ratio test
described earlier to test the difference between two models. In this case, the unrestricted model
is the set of all the segmented models and the restricted model is the pooled model which
imposes the restriction that the parameters for each segment are identical.
```{r segmentation}
sf_work <- read_rds("data/worktrips.rds")
base_model <- mlogit(chosen ~ tvtt + cost | wkempden, data = sf_work)
withincome <- mlogit(chosen ~ tvtt + cost | hhinc + wkempden, data = sf_work)
highincome <- mlogit(chosen ~ tvtt + cost | hhinc + wkempden,
data = sf_work %>% filter(hhinc > 50))
low_income <- mlogit(chosen ~ tvtt + cost | hhinc + wkempden,
data = sf_work %>% filter(hhinc <= 50))
list(
"Base" = base_model,
"Income" = withincome,
"High Income" = highincome,
"Low Income" = low_income
) %>%
modelsummary(fmt = "%.5f", stars = TRUE, statistic_vertical = TRUE)
```
Thus, the null hypothesis is that $\underline\beta_1 = \underline\beta_2 = \dots = \underline\beta_s = \dots = \underline\beta_S$, where $\underline\beta_s$ is the
vector of coefficients for the $s^{th}$ market segment. Following the approach described in
[CHAPTER 5](#chapter5), we reject the null hypothesis that the restricted model is the correct model at
significance level $p$ if the calculated value of the statistic is greater than the test or critical value.
That is, if:
\begin{equation}
-2 \times [l_{R} - l_{U}] \ge \chi^{2}_{n,(p)}
(\#eq:log-likelihoodtestatlevelp)
\end{equation}
Substituting the log-likelihood for the pooled model for $l_R$ and the sum of market segment
model log-likelihoods for $l_U$ in equation 5.16, the null hypothesis, that all segments have the
same choice function, is rejected at level $p$ if:
\begin{equation}
-2 \times \Bigg[l(\beta) - \sum^{S}_{s=1} l(\beta_{s})\Bigg] \ge \chi^{2}_{n,(p)}
(\#eq:rejectedlog-likelihoodtestatlevelp)
\end{equation}
Where

- $l(\beta)$ is the log-likelihood for the pooled model,
- $l(\beta_{s})$ is the log-likelihood of the model estimated with the $s^{th}$ market segment,
- $\chi^{2}_{n}$ is the chi-square distribution with $n$ degrees of freedom,
- $n$ is equal to the number of restrictions, $\sum^{S}_{s=1} K_{s} - K$,
- $K$ is the number of coefficients in the pooled model, and
- $K_{s}$ is the number of coefficients in the $s^{th}$ market segment model.

$K_{s}$ is generally equal to $K$, in which case $n$ is given by $K \times (S-1)$.[^fixedseg]
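
The market segmentation test can be sketched in base R as follows. The function name `segmentation_test` and the coefficient count `k_pooled = 20` are assumptions made for illustration; the log-likelihood values are those reported for the automobile ownership segmentation in the example that follows.

```r
# Market segmentation (taste variation) test
segmentation_test <- function(ll_pooled, ll_segments, k_pooled,
                              k_segments = rep(k_pooled, length(ll_segments)),
                              p = 0.05) {
  statistic <- -2 * (ll_pooled - sum(ll_segments))
  n <- sum(k_segments) - k_pooled          # number of restrictions
  critical <- qchisq(1 - p, df = n)
  list(statistic = statistic, df = n, critical = critical,
       reject_pooled = statistic >= critical)
}

# Log-likelihoods from the automobile ownership example in this chapter;
# k_pooled = 20 is a hypothetical coefficient count used for illustration
result <- segmentation_test(ll_pooled = -3444.2,
                            ll_segments = c(-1049.3, -2296.7),
                            k_pooled = 20)
```

With two segments and equal coefficient counts, the degrees of freedom reduce to $K \times (S-1) = K$, so the statistic is compared against a chi-square distribution with as many degrees of freedom as the pooled model has coefficients.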
### Market Segmentation Example
We illustrate the market segmentation test for two cases, automobile ownership (zero/one car
households and households with more than one car), and gender (male and female). In the case
of segmentation by automobile ownership, it is appealing to include a distinct segment for
households with no cars since the mode choice behavior of this segment is very different from
the rest of the population due to their dependence on non-automobile modes. However, the
small size of this segment in the data set (only 160 of the 5029 work trip reports are from households
with no cars) precludes use of a no car segment; this group is combined with the one car
ownership households for estimation. Using the same utility specification as in Model 17W, the
estimation results for the pooled and segmented models for auto ownership and for gender are
reported in Table 6-14 and Table 6-15.
We can make the following observations from the estimation results of the automobile
ownership segmentation models (Table 6-14):
- The segmented model rejects the pooled model at a very high level of statistical significance
$-2\times\Bigg[l(\beta) - \sum^{S}_{s=1} l(\beta_{s})\Bigg] = -2\times[-3444.2-(-1049.3-2296.7)] = 196.4$
- The alternative specific constants for all other modes relative to drive alone are much more
negative for the higher auto ownership group than for the lower auto ownership group.
These differences indicate the increased preference for drive alone among persons from
multi-car households. This makes intuitive sense, as travelers in households with fewer
automobiles are more likely to choose non-automobile modes, all else being equal.
- The alternative specific income coefficients are insignificant or marginally significant for
both segments suggesting that the effect of income differences is adequately explained by the
segment difference.
``` {r segbycars}
# uses Model 17W as the basis
model_17w_lowcars <- mlogit(chosen ~ I(cost/hhinc) + mot_tvtt + nm_tvtt + I(mot_ovtt/dist) | hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext %>% filter(numveh <= 1), constPar = c("hhinc:Share ride 2" = 0, "hhinc:Share ride 3++" = 0, "vehbywrk:Share ride 2" = -3.015, "vehbywrk:Share ride 3++" = -3.015))
model_17w_highcars <- mlogit(chosen ~ I(cost/hhinc) + mot_tvtt + nm_tvtt + I(mot_ovtt/dist) | hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext %>% filter(numveh >= 2), constPar = c("hhinc:Share ride 2" = 0, "hhinc:Share ride 3++" = 0, "vehbywrk:Share ride 2" = -0.241, "vehbywrk:Share ride 3++" = -0.241))
# table 6-14
list(
"Pooled Model" = model_17w,
"0-1 Car HH's" = model_17w_lowcars,
"2+ Car HH's" = model_17w_highcars
) %>%
modelsummary(fmt = "%.5f", stars = TRUE, statistic_vertical = TRUE, title = "Estimation Results for Market Segmentation by Automobile Ownership")
```
- The sensitivity to automobile availability is much higher among low auto ownership
households where an increase in availability (from 0) will be relatively important, than
among higher auto ownership households where the number of cars is likely to closely approximate
the number of drivers and an increase in availability will be relatively unimportant.
- The differences in the alternative specific CBD dummy variables and the Employment
Density variables are very small and not significant suggesting that these variables could be
constrained to be equal across auto ownership segments.
- The differences in the time parameters also are very small and not significant suggesting that
these variables could be constrained to be equal across auto ownership segments.
- The magnitude of the cost by income parameter is much smaller in the lower automobile
ownership segment than in the higher automobile ownership segment indicating that cost
may be of little importance in households with low car availability.
We can make the following observations from the estimation results of the gender segmentation
models (Table 6-15):
- The segmented model rejects the pooled model at a very high level of statistical significance.
- The alternative specific constants relative to the drive alone mode are less negative (more
positive) in the female segment suggesting the preference for drive alone mode is less
pronounced among females. This is especially true for the non-motorized modes (bike and
walk) where the difference in the modal constants between the two groups is large and highly
significant.
``` {r segbygender}
# uses Model 17W as the basis
model_17w_males <- mlogit(chosen ~ I(cost/hhinc) + mot_tvtt + nm_tvtt + I(mot_ovtt/dist) | hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext %>% filter(femdum == 0), constPar = c("hhinc:Share ride 2" = 0, "hhinc:Share ride 3++" = 0, "vehbywrk:Share ride 2" = 0.21, "vehbywrk:Share ride 3++" = 0.21 ))
model_17w_females <- mlogit(chosen ~ I(cost/hhinc) + mot_tvtt + nm_tvtt + I(mot_ovtt/dist) | hhinc + vehbywrk + cbddumall + wkempden, data = sf_mlogit_tripcontext %>% filter(femdum == 1), constPar = c("hhinc:Share ride 2" = 0, "hhinc:Share ride 3++" = 0, "vehbywrk:Share ride 2" = 0.607, "vehbywrk:Share ride 3++" = 0.607 ))
# table 6-15
list(
"Pooled Model" = model_17w,
"Males Only" = model_17w_males,
"Females Only" = model_17w_females
) %>%
modelsummary(fmt = "%.5f", stars = TRUE, statistic_vertical = TRUE, title = "Estimation Results for Market Segmentation by Gender")
```
- The female segment parameters for alternative specific variables; Income, Autos per Worker,