Intro to ipfitr

In market sizing and forecasting, we’re often tasked with estimating large data tables with limited data or only tables of subtotals. For example, one might know the number of people employed in the a country by age range, as well as the gender breakdown of total employment without age. Using iterative proportion fitting with these two data series, one can estimate the number of people employed by age range AND gender.

Iterative proportion fitting (IPF) has been used in market analysis for over 30 years. This package integrates the methodology, along with some more expansions of this technique, into the tidyverse in R.

The ipfitr package for R allows the user to input multiple high-level summary/aggregate data tables and create a single, full-dimension data table estimating individual cell values using iterative proportion fitting.

Full demo can be found here.

Data: Three one-dimensional targets

To explore the function in the ipfitr package, we’ll use the base R dataset HairEyeColor which contains 4 hair colors, 4 eye colors, and 2 genders, along with the number of people who have each of those 32 (4x4x2) combinations.

But for now, let’s assume we don’t have that much detail. Instead we have 3 tables containing the total number of people with each hair color, total number with each eye color, and total number of each gender.

dat_hec <- as.data.frame(HairEyeColor, stringsAsFactors = T)

tar1 <- dat_hec %>% 
  group_by(Hair) %>% 
  summarize(value = sum(Freq))
tar1
#> # A tibble: 4 x 2
#>   Hair  value
#>   <fct> <dbl>
#> 1 Black   108
#> 2 Brown   286
#> 3 Red      71
#> 4 Blond   127

tar2 <- dat_hec %>% 
  group_by(Eye) %>% 
  summarize(value = sum(Freq))
tar2
#> # A tibble: 4 x 2
#>   Eye   value
#>   <fct> <dbl>
#> 1 Brown   220
#> 2 Blue    215
#> 3 Hazel    93
#> 4 Green    64

tar3 <- dat_hec %>% 
  group_by(Sex) %>% 
  summarize(value = sum(Freq))
tar3
#> # A tibble: 2 x 2
#>   Sex    value
#>   <fct>  <dbl>
#> 1 Male     279
#> 2 Female   313

Note that the total number of people in each of the tables is equal. This is an important (but not required) feature of iterative proportion fitting. More on this later.

Expanding data with ip_expand()

The ipfitr package can be used to estimate how many people have each combination of hair color, eye color, and gender. (Remember, we’re ignoring the original data frame.) This takes our 10 data points (4+4+2) and creates 32 data points (4x4x2), while ensuring the 10 target data points remain true.

tar_list <- list(tar1, tar2, tar3)

dat_1 <- ip_expand(tar_list)

head(dat_1 %>% mutate_at(vars(value), round))
#> # A tibble: 6 x 4
#>   Hair  Eye   Sex   value
#>   <chr> <chr> <chr> <dbl>
#> 1 Black Brown Male     19
#> 2 Brown Brown Male     50
#> 3 Red   Brown Male     12
#> 4 Blond Brown Male     22
#> 5 Black Blue  Male     18
#> 6 Brown Blue  Male     49

ip_expand() is a wrapper function, performing two key IPF functions in a single step. First, the function takes all the targets (3 here) and creates a “seed” data frame that contains all combinations of the unique series in the targets along with a seed value of 1. From there, a fitting function takes over, scaling the seed value iteratively to each of the targets until all targets are met.

This is equivalent to running the two functions independently:

dat_1 <- ip_create_seed(tar_list) %>% ip_fit(tar_list)

Initializing the seed values at 1 implies that one has no additional information about the relative distribution of the elements in the final output. In this example, the user has no additional information about the distribution of hair color, eye color, and sex, so the IPF distributes the count of people evenly across the seed until the three targets are met.

This doesn’t give us the “correct” answer, which is unknown, but does tell us the most likely answer.

Seeding the IPF with ip_create_seed()

Alternatively, suppose we learn that people with blond hair are much more likely to have blue eyes. You can implement this vague piece of information using the seed. This information should impact both females and males, so we increase all seed values for blond hair and blue eyes by a factor of 4.

dat_2 <- ip_create_seed(tar_list) %>% 
  mutate(value = ifelse(Hair == "Blond" & Eye == "Blue", 4*value, value)) %>% 
  ip_fit(tar_list)

dat_1 %>% filter(Hair == "Blond",  Eye == "Blue")
#> # A tibble: 2 x 4
#>   Hair  Eye   Sex    value
#>   <chr> <chr> <chr>  <dbl>
#> 1 Blond Blue  Male    21.7
#> 2 Blond Blue  Female  24.4

dat_2 %>% filter(Hair == "Blond",  Eye == "Blue")
#> # A tibble: 2 x 4
#>   Hair  Eye   Sex    value
#>   <chr> <chr> <chr>  <dbl>
#> 1 Blond Blue  Male    37.3
#> 2 Blond Blue  Female  41.8

Applying this updated seed to the targets changes each of the 32 estimates while still ensuring the three target values are hit.

You’ll notice that increasing the seed for blond hair and blue eyes by a factor of 4 increases the output for those elements, but by less than 4x. This change to the seed increases the intial starting point for blond hair for relative to eye colors and then blue eyes compared to all hair colors. From there, the scaling take over, redistributing people assigned to each row until all targets are met.

The seed values do not need to be based around 1 - any non-negative values will work. For example, you could use a completely different data frame with counts of a completely different sample of people by hair color, eye color, and gender if you believe the distributions are similar.

Any seed values of 0 will remain 0 in the final IPF output. For example, if you know for sure that there are no green-eyed, red-haired males in the sample, setting the seed here to 0 will retain that 0 in the final results.

Masking the seed with ip_mask_seed()

In some cases, you will want to make many edits to the seed and using mutate() + ifelse() can become burdensome. As an alternative, you can use ip_mask_seed(). See this vignette for more detail.

Implementing specific information with ip_fit()

The seed informs the starting “best guess” of the distribution of people by the three series. Alternatively, you may start out knowing the “true” value for one or more of the full combinations. For example, suppose you had external information telling you that there are 66 brown-haired, brown-eyed females in your sample, as well as 5 blond hazel-eyed females.

The final output of ip_fit() will need to reflect this information, while also considering the seed and the target subtotals. In this case, we will “freeze” the output for those two new datapoints.


frz.cells
#> # A tibble: 2 x 4
#>   Hair  Eye   Sex    value
#>   <chr> <chr> <chr>  <dbl>
#> 1 Brown Brown Female    66
#> 2 Blond Hazel Female     5

dat_3 <- ip_create_seed(tar_list) %>% 
  mutate(value = ifelse(Hair == "Blond" & Eye == "Blue", 4*value, value)) %>% 
  ip_fit(tar_list, freeze_cells = frz.cells)

#> Previous output
dat_2 %>% filter((Hair == "Brown" &  Eye == "Brown" & Sex == "Female") |
                 (Hair == "Blond" &  Eye == "Hazel" & Sex == "Female"))
#> # A tibble: 2 x 4
#>   Hair  Eye   Sex    value
#>   <chr> <chr> <chr>  <dbl>
#> 1 Brown Brown Female 62.5 
#> 2 Blond Hazel Female  6.25

#> New Output with Frozen cells
dat_3 %>% filter((Hair == "Brown" &  Eye == "Brown" & Sex == "Female") |
                 (Hair == "Blond" &  Eye == "Hazel" & Sex == "Female"))
#> # A tibble: 2 x 4
#>   Hair  Eye   Sex    value
#>   <chr> <chr> <chr>  <dbl>
#> 1 Brown Brown Female    66
#> 2 Blond Hazel Female     5

Similar options are available for cases when you have incomplete targets or only know the approximate ranges of subsets of the data.