In market sizing and forecasting, we’re often tasked with estimating large data tables with limited data or only tables of subtotals. For example, one might know the number of people employed in the a country by age range, as well as the gender breakdown of total employment without age. Using iterative proportion fitting with these two data series, one can estimate the number of people employed by age range AND gender.
Iterative proportion fitting (IPF) has been used in market analysis for over 30 years. This package integrates the methodology, along with some more expansions of this technique, into the tidyverse in R.
The ipfitr package for R allows the user to input multiple high-level summary/aggregate data tables and create a single, full-dimension data table estimating individual cell values using iterative proportion fitting.
Full demo can be found here.
Data: Three one-dimensional targets
To explore the function in the ipfitr package, we’ll use the base R dataset
HairEyeColor which contains 4 hair colors, 4 eye colors, and 2 genders, along with the number of people who have each of those 32 (4x4x2) combinations.
But for now, let’s assume we don’t have that much detail. Instead we have 3 tables containing the total number of people with each hair color, total number with each eye color, and total number of each gender.
dat_hec <- as.data.frame(HairEyeColor, stringsAsFactors = T) tar1 <- dat_hec %>% group_by(Hair) %>% summarize(value = sum(Freq)) tar1 #> # A tibble: 4 x 2 #> Hair value #> <fct> <dbl> #> 1 Black 108 #> 2 Brown 286 #> 3 Red 71 #> 4 Blond 127 tar2 <- dat_hec %>% group_by(Eye) %>% summarize(value = sum(Freq)) tar2 #> # A tibble: 4 x 2 #> Eye value #> <fct> <dbl> #> 1 Brown 220 #> 2 Blue 215 #> 3 Hazel 93 #> 4 Green 64 tar3 <- dat_hec %>% group_by(Sex) %>% summarize(value = sum(Freq)) tar3 #> # A tibble: 2 x 2 #> Sex value #> <fct> <dbl> #> 1 Male 279 #> 2 Female 313
Note that the total number of people in each of the tables is equal. This is an important (but not required) feature of iterative proportion fitting. More on this later.
Expanding data with ip_expand()
The ipfitr package can be used to estimate how many people have each combination of hair color, eye color, and gender. (Remember, we’re ignoring the original data frame.) This takes our 10 data points (4+4+2) and creates 32 data points (4x4x2), while ensuring the 10 target data points remain true.
tar_list <- list(tar1, tar2, tar3) dat_1 <- ip_expand(tar_list) head(dat_1 %>% mutate_at(vars(value), round)) #> # A tibble: 6 x 4 #> Hair Eye Sex value #> <chr> <chr> <chr> <dbl> #> 1 Black Brown Male 19 #> 2 Brown Brown Male 50 #> 3 Red Brown Male 12 #> 4 Blond Brown Male 22 #> 5 Black Blue Male 18 #> 6 Brown Blue Male 49
ip_expand() is a wrapper function, performing two key IPF functions in a single step. First, the function takes all the targets (3 here) and creates a “seed” data frame that contains all combinations of the unique series in the targets along with a seed value of 1. From there, a fitting function takes over, scaling the seed value iteratively to each of the targets until all targets are met.
This is equivalent to running the two functions independently:
dat_1 <- ip_create_seed(tar_list) %>% ip_fit(tar_list)
Initializing the seed values at 1 implies that one has no additional information about the relative distribution of the elements in the final output. In this example, the user has no additional information about the distribution of hair color, eye color, and sex, so the IPF distributes the count of people evenly across the seed until the three targets are met.
This doesn’t give us the “correct” answer, which is unknown, but does tell us the most likely answer.
Seeding the IPF with ip_create_seed()
Alternatively, suppose we learn that people with blond hair are much more likely to have blue eyes. You can implement this vague piece of information using the seed. This information should impact both females and males, so we increase all seed values for blond hair and blue eyes by a factor of 4.
dat_2 <- ip_create_seed(tar_list) %>% mutate(value = ifelse(Hair == "Blond" & Eye == "Blue", 4*value, value)) %>% ip_fit(tar_list) dat_1 %>% filter(Hair == "Blond", Eye == "Blue") #> # A tibble: 2 x 4 #> Hair Eye Sex value #> <chr> <chr> <chr> <dbl> #> 1 Blond Blue Male 21.7 #> 2 Blond Blue Female 24.4 dat_2 %>% filter(Hair == "Blond", Eye == "Blue") #> # A tibble: 2 x 4 #> Hair Eye Sex value #> <chr> <chr> <chr> <dbl> #> 1 Blond Blue Male 37.3 #> 2 Blond Blue Female 41.8
Applying this updated seed to the targets changes each of the 32 estimates while still ensuring the three target values are hit.
You’ll notice that increasing the seed for blond hair and blue eyes by a factor of 4 increases the output for those elements, but by less than 4x. This change to the seed increases the intial starting point for blond hair for relative to eye colors and then blue eyes compared to all hair colors. From there, the scaling take over, redistributing people assigned to each row until all targets are met.
The seed values do not need to be based around 1 - any non-negative values will work. For example, you could use a completely different data frame with counts of a completely different sample of people by hair color, eye color, and gender if you believe the distributions are similar.
Any seed values of 0 will remain 0 in the final IPF output. For example, if you know for sure that there are no green-eyed, red-haired males in the sample, setting the seed here to 0 will retain that 0 in the final results.
Masking the seed with ip_mask_seed()
In some cases, you will want to make many edits to the seed and using
ifelse() can become burdensome. As an alternative, you can use
ip_mask_seed(). See this vignette for more detail.
Implementing specific information with ip_fit()
The seed informs the starting “best guess” of the distribution of people by the three series. Alternatively, you may start out knowing the “true” value for one or more of the full combinations. For example, suppose you had external information telling you that there are 66 brown-haired, brown-eyed females in your sample, as well as 5 blond hazel-eyed females.
The final output of
ip_fit() will need to reflect this information, while also considering the seed and the target subtotals. In this case, we will “freeze” the output for those two new datapoints.
frz.cells #> # A tibble: 2 x 4 #> Hair Eye Sex value #> <chr> <chr> <chr> <dbl> #> 1 Brown Brown Female 66 #> 2 Blond Hazel Female 5 dat_3 <- ip_create_seed(tar_list) %>% mutate(value = ifelse(Hair == "Blond" & Eye == "Blue", 4*value, value)) %>% ip_fit(tar_list, freeze_cells = frz.cells) #> Previous output dat_2 %>% filter((Hair == "Brown" & Eye == "Brown" & Sex == "Female") | (Hair == "Blond" & Eye == "Hazel" & Sex == "Female")) #> # A tibble: 2 x 4 #> Hair Eye Sex value #> <chr> <chr> <chr> <dbl> #> 1 Brown Brown Female 62.5 #> 2 Blond Hazel Female 6.25 #> New Output with Frozen cells dat_3 %>% filter((Hair == "Brown" & Eye == "Brown" & Sex == "Female") | (Hair == "Blond" & Eye == "Hazel" & Sex == "Female")) #> # A tibble: 2 x 4 #> Hair Eye Sex value #> <chr> <chr> <chr> <dbl> #> 1 Brown Brown Female 66 #> 2 Blond Hazel Female 5
Similar options are available for cases when you have incomplete targets or only know the approximate ranges of subsets of the data.