brms_varying_effects.Rmd

---
title: "Model"
author: "Anders Sundelin"
date: "2022-12-14"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(tidyr)
library(dplyr)
library(brms)
library(tidybayes)
library(bayesplot)
library(patchwork)
```

## Data Ingestion and Model Building

Models are cached in the `.cache` subdirectory, as they take considerable time to run. Expect multiple hours for the complete multi-level model on the full data set.
You might need to remove the cached models when changing some parameters (though formula or data changes are usually detected by brms).

We move away from zero-based scaling, instead opting for scaling the logarithm of the severely right-skewed values (added, removed, complexity, duplicated lines).
We are also heeding the advice of Gelman: https://statmodeling.stat.columbia.edu/2019/08/21/you-should-usually-log-transform-your-positive-data/
This also means that the linear model will be based on the _magnitude_ of the change in parameter (e.g. 0, 1, 2.7, 7.3, 19.8 added lines/existing complexity). Each additional line therefore has a marginally lower impact, but the scale of the parameter still matters.
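
To make the notion of magnitude concrete, here is a minimal base R sketch. The counts are purely illustrative and not taken from the data set. It applies the same `log(x + 1)` transform that the ingest chunk uses for `logADD`, and compares a step at the low end of the scale with a much larger step at the high end.

```r
# Illustrative counts only; not taken from the data set.
added <- c(0, 1, 10, 100, 1000)

# log(x + 1), the same transform as logADD in the ingest chunk
magnitude <- log1p(added)

# Going from 0 to 10 added lines moves the magnitude more than
# going from 100 to 1000 added lines.
step_small <- magnitude[3] - magnitude[1]   # log(11) - log(1)
step_large <- magnitude[5] - magnitude[4]   # log(1001) - log(101)

round(magnitude, 2)
c(step_small = step_small, step_large = step_large)
```

On the log scale, the first ten lines count for slightly more than the nine hundred lines between 100 and 1000.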

A simple scientific model, where the rates of change are taken from the population only and the intercept varies per group, becomes:

$\log(\lambda_i) = \beta_{0,i} + \beta_A A_i + \beta_C C_i + \beta_D D_i$

where 

$A_i = \frac{ln(added_i+1) - \hat{\mu_{\alpha}}}{\hat{\sigma_{\alpha}}}$

and 
$\hat{\mu_{\alpha}} = mean(ln(added+1))$
$\hat{\sigma_{\alpha}} = stddev(ln(added+1))$

$C_i = \frac{ln(complexity_i+1) - \hat{\mu_{\gamma}}}{\hat{\sigma_{\gamma}}}$

and 
$\hat{\mu_{\gamma}} = mean(ln(complexity+1))$
$\hat{\sigma_{\gamma}} = stddev(ln(complexity+1))$

and

$D_i = \frac{ln(dupblocks_i+1) - \hat{\mu_{\delta}}}{\hat{\sigma_{\delta}}}$

and 
$\hat{\mu_{\delta}} = mean(ln(dupblocks+1))$
$\hat{\sigma_{\delta}} = stddev(ln(dupblocks+1))$

This corresponds to the following multiplicative model for $\lambda$ (also called $\mu$ in the negative binomial context):

$\lambda_i = e^{\beta_{0,i}} \left(\frac{added_i+1}{e^{\hat{\mu_\alpha}}}\right)^{\frac{\beta_A}{\hat{\sigma_\alpha}}} \left(\frac{complexity_i+1}{e^{\hat{\mu_\gamma}}}\right)^{\frac{\beta_C}{\hat{\sigma_\gamma}}} \left(\frac{dupblocks_i+1}{e^{\hat{\mu_\delta}}}\right)^{\frac{\beta_D}{\hat{\sigma_\delta}}}$

In this equation, $added$, $complexity$ and $dupblocks$ are on their natural scale, starting at 0 and counting upwards.

Depending on which parameters we include in the group-level effect, the exponent will change accordingly (hence the $i$ in $\beta_{0,i}$ for the model where only the intercept changes per team and repo).
The other parameters, $\hat{\mu_\alpha}$, $\hat{\sigma_\alpha}$ and their counterparts for the other metrics, are just normalizing constants, defined as the mean and standard deviation of the logarithm of the corresponding metric count.
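
The equivalence between the additive form on the log scale and the multiplicative form on the natural scale can be checked numerically. The following is a minimal base R sketch. The coefficients and counts are hypothetical, chosen only for illustration, and `sd()` is used for the standard deviation because that is what `scale()` uses.

```r
# Hypothetical coefficients and counts, chosen only for illustration.
beta_0 <- 0.3; beta_A <- 0.8; beta_C <- -0.1; beta_D <- 0.4
added      <- c(0, 3, 25, 400)
complexity <- c(1, 12, 60, 7)
dupblocks  <- c(0, 0, 5, 2)

# Normalizing constants: mean and standard deviation of log(x + 1)
norm_consts <- function(x) list(mu = mean(log1p(x)), sigma = sd(log1p(x)))
nA <- norm_consts(added)
nC <- norm_consts(complexity)
nD <- norm_consts(dupblocks)

# Standardized predictors, as in the definitions of A_i, C_i and D_i
standardize <- function(x, n) (log1p(x) - n$mu) / n$sigma
A <- standardize(added, nA)
C <- standardize(complexity, nC)
D <- standardize(dupblocks, nD)

# Additive form on the log scale
lambda_additive <- exp(beta_0 + beta_A * A + beta_C * C + beta_D * D)

# Multiplicative form on the natural scale
term <- function(x, n, beta) ((x + 1) / exp(n$mu))^(beta / n$sigma)
lambda_multiplicative <- exp(beta_0) *
  term(added, nA, beta_A) *
  term(complexity, nC, beta_C) *
  term(dupblocks, nD, beta_D)

all.equal(lambda_additive, lambda_multiplicative)
```

Both forms give the same $\lambda_i$ up to floating-point error, so the standardized coefficients can be read as exponents on the natural-scale counts.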

```{r ingest}
df <- read.csv("samples/authors-team-impact.csv")

data_logscaled <- df %>% mutate(repo=as.factor(repo),
                             file=as.factor(file),
                             ISTEST=istestfile,
                             ISNEW=isnewfile,
                             author=as.factor(author),
                             authorteam=as.factor(authorteam),
                             committer=as.factor(committer),
                             committerteam=as.factor(committerteam),
                             ADD=added,
                             DEL=removed,
                             REASON=as.factor(changereason),
                             CLOC=currCloc,
                             COMPLEX=currComplex,
                             DUP=prevDupBlocks, # duplicates prior to the change
                             INTROD=if_else(delta >= 0, delta, as.integer(0)),
                             REMOVED=if_else(delta <= 0, abs(delta), as.integer(0)),
                             logADD=log(added+1),
                             logDEL=log(removed+1),
                             logCOMPLEX=log(currComplex+1),
                             logDUP=log(prevDupBlocks+1)) %>%
  select(repo, file, ISTEST, ISNEW, author, authorteam, committer, committerteam, ADD, DEL, REASON, CLOC, COMPLEX, DUP, logADD, logDEL, logCOMPLEX, logDUP, INTROD, REMOVED)

data <- data_logscaled |> mutate(A = scale(logADD)[,1],
                             R = scale(logDEL)[,1],
                             C = scale(logCOMPLEX)[,1],
                             D = scale(logDUP)[,1])

# adapt these to fit your machine and willingness to wait for results. Models are cached, you might need to remove the cache in order to rebuild if these values are changed.
CHAINS <- 4
CORES <- 2
ITERATIONS <- 4000
THREADS <- 2
ADAPT_DELTA <- 0.95
SAVE_PARS <- save_pars(all = FALSE)
```
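
As a rough guide to what these settings imply, the sketch below computes the number of posterior draws that are kept. It assumes that `iter` in `brm()` counts warmup iterations as well, and that the CPU threads in use are the product of cores and threads per chain. The lower-case names are local to this sketch.

```r
# Values mirror the settings above; warmup is the value passed to brm()
# in the model chunks.
chains <- 4
iterations <- 4000
warmup <- 1000
cores <- 2
threads <- 2

# iter includes the warmup draws, so each chain keeps iter - warmup draws
post_warmup_draws <- chains * (iterations - warmup)

# Chains run in parallel on `cores` processes, each using `threads` threads
cpu_threads_in_use <- cores * threads

c(post_warmup_draws = post_warmup_draws, cpu_threads_in_use = cpu_threads_in_use)
```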

```{r}
source("conditional_effects.R")
```

# Exploratory Data Analysis

Introduced duplicates in files, per repo.
```{r}
data |> filter(INTROD > 0) |> group_by(repo, INTROD) |> tally()
```

Most of the introduced duplicates are small (single digits), but some are large, ranging into the hundreds in the IntTest repo.

```{r}
changesPerRepoAndTeam <- data |> group_by(repo, authorteam) |> tally()
zerosPerRepoAndTeam <- data |> filter(INTROD == 0) |> group_by(repo, authorteam) |> summarise(zeros=n())
zeros_ratio <- merge(changesPerRepoAndTeam, zerosPerRepoAndTeam) |> mutate(zeroIntrodRatio = zeros/n) |> arrange(zeroIntrodRatio) 

zeros_ratio |> ggplot(aes(x=authorteam, y=1-zeroIntrodRatio, color=repo, size=n)) + geom_point() + ggtitle("Proportion of file changes that introduced duplicates, per repo and team")
```
```{r}
data |> group_by(repo) |> ggplot(aes(x=logADD, y=logDEL, color=authorteam)) + geom_point() + facet_wrap( ~ repo)
```

```{r}
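# `d` and `teams` are only defined in later chunks, so build the EDA frame here
# (using the committer team, as in the models below).
d_eda <- data |> select(y=INTROD, A=A, R=R, team=committerteam, repo=repo)
teams_eda <- levels(d_eda$team)
additions_per_team <- function(t) {
  d_eda |> filter(team == t) |> ggplot(aes(x=A, y=R, color=repo)) + geom_point() + ggtitle(paste0("Scaled additions and removals by team ", t))
}
lapply(teams_eda, function(t) additions_per_team(t))
```

```{r}
introds_per_team <- function(t) {
  d_eda |> filter(team == t, y > 0) |> ggplot(aes(x=A, y=R, color=repo, size=y)) + geom_point() + ggtitle(paste0("Introductions by additions and removals by team ", t))
}
lapply(teams_eda, function(t) introds_per_team(t))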

```


```{r}
unknownFileChanges <- changesPerRepoAndTeam |> filter(authorteam == "Unknown")
allChanges <- data |> group_by(repo) |> summarize(filechanges = n())
merge(unknownFileChanges |> select(repo, n), allChanges) |> mutate(ratio = n/filechanges)
```

The Unknown team accounts for about 6-12% of the file changes, depending on the repo.

```{r}
head(zeros_ratio |> filter(repo == "Neptune"), 10)
```

Teams have a different proportion of zero introduced duplicates. There seems to be some pattern here for us to investigate.


```{r}
sum_per_team <- data |> group_by(authorteam) |> summarize(filechanges = n())
data_with_sum_per_team <- merge(data, sum_per_team)
data_with_sum_per_team |> filter(repo == "Jupiter", INTROD > 0) |> group_by(authorteam) |> ggplot(aes(x=authorteam, y=INTROD, group=authorteam, color=filechanges)) + geom_boxplot() + ggtitle("Introduced issues per authorteam in repo Jupiter")
```
```{r}
data |> filter(repo == "Jupiter", INTROD > 0, authorteam %in% c("Red","Blue")) |> group_by(authorteam) |> ggplot(aes(x=INTROD, color=authorteam, fill=authorteam)) + geom_histogram(position = "dodge", binwidth = 1) + ggtitle("Number of introduced duplicates per file in repo Jupiter, by team Red and Blue")
```

In the Jupiter repository, there are differences in the number of introduced issues between the Red and Blue teams.

```{r}
data_with_sum_per_team |> filter(repo == "Neptune", INTROD > 0) |> group_by(authorteam) |> ggplot(aes(x=authorteam, y=INTROD, group=authorteam, color=filechanges)) + geom_boxplot() + ggtitle("Introduced issues per authorteam in repo Neptune")
```

Likewise in the Neptune repository.

```{r}
data |> filter(repo == "Neptune", INTROD > 0, authorteam %in% c("Green","Blue")) |> group_by(authorteam) |> ggplot(aes(x=INTROD, color=authorteam, fill=authorteam)) + geom_histogram(position = "dodge", binwidth = 1)
```

Green and Blue teams seem to behave similarly in the Neptune repository. The other teams are not as active there.

```{r}
data_with_sum_per_team |> filter(repo == "IntTest", INTROD > 0) |> group_by(authorteam) |> ggplot(aes(x=authorteam, y=INTROD, group=authorteam, color=filechanges)) + geom_boxplot() + ggtitle("Introduced issues per authorteam in repo IntTest")
```
```{r}
data |> filter(repo == "IntTest", INTROD > 0, authorteam == "Green") |> group_by(authorteam) |> ggplot(aes(x=INTROD, color=authorteam, fill=authorteam)) + geom_histogram(position = "dodge")

data |> filter(repo == "IntTest", INTROD > 0, authorteam %in% c("Arch", "Orange", "Violet", "Green")) |> group_by(authorteam) |> ggplot(aes(x=INTROD, color=authorteam, fill=authorteam)) + geom_histogram(position = "dodge", binwidth = 1)
```

The Arch team introduced very few duplicates in the IntTest repository. The other teams introduced more, especially the Green team.

```{r}
data |> filter(repo == "IntTest", INTROD > 0, authorteam %in% c("Pink", "Arch", "UI")) |> group_by(authorteam) |> ggplot(aes(x=INTROD, color=authorteam, fill=authorteam)) + geom_histogram(position = "dodge", binwidth = 1) + ggtitle("Small contributors to IntTest repo, vs. Arch team")
```

Some teams introduced very few duplicates, but the UI team still had three files with between 15 and 31 introduced duplicates. The Arch team had three files with 1, 2 and 4 introduced duplicates, respectively.

# Working Model

* Terminology: $Weibull(\lambda, k)$, where $\lambda > 0$ is the *scale* parameter, and $k$ is the *shape* parameter
* A Weibull distribution reduces to an exponential distribution when $k = 1$
* A Weibull distribution is the same as a Generalized Gamma distribution with both its shape parameters $d$ and $p$ equal to $k$.
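
The first reduction can be verified with base R's density functions, as in the sketch below. It also summarizes a Weibull distribution with shape 2 and scale 1. This assumes that the `weibull(2, 1)` prior used for the group-level standard deviations in the model chunks follows Stan's (shape, scale) argument order, which is the reverse of the $Weibull(\lambda, k)$ notation above.

```r
x <- seq(0.1, 5, by = 0.1)

# Weibull with shape k = 1 and scale lambda equals an exponential
# distribution with rate 1 / lambda
lambda <- 2
weibull_is_exponential <- all.equal(dweibull(x, shape = 1, scale = lambda),
                                    dexp(x, rate = 1 / lambda))

# Median and 95% quantile of Weibull(shape = 2, scale = 1)
prior_quantiles <- qweibull(c(0.5, 0.95), shape = 2, scale = 1)

weibull_is_exponential
round(prior_quantiles, 2)
```

With shape 2, the density is zero at the origin and most of the mass sits below 2. As a prior on a standard deviation on the log scale, that is weakly regularizing.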

# Intercept-only model (baseline)

The simplest possible model has only intercepts: one on the population level, one per committer team, and one per repo within each team (`team/repo` expands to `team + team:repo`):

```{r}
d <- data |> select(y=INTROD,
                    team=committerteam,
                    repo=repo)
formula <- bf(y ~ 1 + (1 | team/repo),
              zi ~ 1 )

get_prior(data=d,
          family=zero_inflated_negbinomial,
          formula=formula)
```
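
The nested term `(1 | team/repo)` gives one intercept offset per team plus one per observed team-repo combination. The sketch below enumerates those grouping levels for a toy data frame. The team and repo pairs are hypothetical, and the underscore-joined labels are an assumption based on the `r_team:repo[...]` parameter names that the regular expressions later in this document match.

```r
# Hypothetical team/repo pairs, for illustration only
toy <- data.frame(
  team = c("Arch", "Arch", "Green", "Green", "Green"),
  repo = c("IntTest", "Jupiter", "IntTest", "IntTest", "Neptune")
)

# Levels of the `team` grouping factor: one offset per team
team_levels <- unique(toy$team)

# Levels of the `team:repo` grouping factor: one offset per observed combination
team_repo_levels <- unique(paste(toy$team, toy$repo, sep = "_"))

team_levels
team_repo_levels
```

Combinations that never occur in the data (for example Arch in Neptune in this toy frame) get no level of their own.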

```{r}
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(weibull(2, 1), class = sd),
            prior(normal(0, 0.5), class = Intercept, dpar=zi),
            prior(gamma(1, 0.1), class = shape))
validate_prior(prior=priors,
               formula=formula,
               data=d,
               family=zero_inflated_negbinomial)
```
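
To get a feeling for what a `normal(0, 0.5)` intercept prior means on the outcome scale, the sketch below back-transforms the prior mean plus and minus two standard deviations. It assumes the brms default links for this family: a log link for the count part and a logit link for `zi`.

```r
# Prior mean 0 and standard deviation 0.5: +/- 2 sd spans -1 to 1 on the link scale
link_values <- c(lower = -1, mean = 0, upper = 1)

# Count part, log link (assumed default): expected count before group-level offsets
expected_count <- exp(link_values)

# Zero-inflation part, logit link (assumed default): probability of a structural zero
zi_probability <- plogis(link_values)

round(expected_count, 2)
round(zi_probability, 2)
```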


```{r}
M_intercepts_only <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = ".cache/added-M_intercepts_only",
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter  = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend="cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)
```
Expected runtime: ~3600 seconds

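We use the default `1` intercept notation, which means that brms centers the population-level predictors internally. The `Intercept` then represents the expected response (on the link scale) when all predictors are at their means, and our `Intercept` prior applies to that quantity.

If we were to use the `0 + Intercept` notation instead, brms would not center the predictors, and the intercept would get its prior as an ordinary population-level (`b`) coefficient. Without an explicit prior, brms would fall back to its default prior for the intercept.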
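
The effect of centering on the intercept can be illustrated without brms. The sketch below uses a plain Poisson `glm()` on simulated data as a stand-in, not the model from this document. It shows that the intercept of a fit with a centered predictor equals the linear predictor of the uncentered fit evaluated at the predictor mean.

```r
set.seed(1)
# Simulated data, for illustration only
x <- rnorm(200, mean = 3)
y <- rpois(200, lambda = exp(0.2 + 0.3 * x))

fit_raw      <- glm(y ~ x, family = poisson)
fit_centered <- glm(y ~ I(x - mean(x)), family = poisson)

# Linear predictor of the uncentered fit, evaluated at the mean of x
at_mean <- unname(coef(fit_raw)[1] + coef(fit_raw)[2] * mean(x))

# Intercept of the centered fit
centered_intercept <- unname(coef(fit_centered)[1])

all.equal(at_mean, centered_intercept, tolerance = 1e-6)
```

A prior on the centered intercept is therefore a prior on the typical outcome, which is usually easier to reason about than the outcome at predictor values of zero.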

The model has reasonable Rhat values and Neff ratios:

```{r, eval=FALSE}
rhat(M_intercepts_only)
min(neff_ratio(M_intercepts_only))
```
```{r}
M_intercepts_only <- add_criterion(M_intercepts_only, criterion = "loo")
```

```{r}
m <- M_intercepts_only
#stopifnot(rhat(m) < 1.01)
#stopifnot(neff_ratio(m) > 0.2)
```

```{r}
p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
plots <- seq(1, to=length(pars), by=12)
lapply(plots, function(i) { 
  start <- i
  end <- start+11
  mcmc_trace(m, pars = na.omit(pars[start:end]))
  })

```

```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff()
```

Both Rhat and Neff ratio look good.

```{r}
loo <- loo(m)
loo
plot(loo)
```

We have 7 observations with high Pareto k values, most likely due to sparse data for some observations.

We can do a more exact Pareto calculation by refitting the model 7 times, each time excluding one of the problematic observations.
If these calculations produce reasonable Pareto values (< 0.7), then we can still trust the results.
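
The selection rule is simply a threshold on the Pareto k values. The sketch below applies it to a hypothetical vector of k values. In a real session the values would come from the `loo` object, and, as far as I know, the loo package offers `pareto_k_ids()` for the same purpose.

```r
# Hypothetical Pareto k values, for illustration only
pareto_k <- c(0.12, 0.35, 0.71, 0.05, 0.93, 0.48, 1.20)

threshold <- 0.7

# Observations above the threshold are the ones that would be left out and refitted
problematic <- which(pareto_k > threshold)

problematic           # indices of the problematic observations
length(problematic)   # number of refits needed
```

Each problematic observation costs one full model refit, which is why the running time grows linearly with the number of high k values.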

```{r, eval=FALSE}
reloo <- reloo(m, loo, chains=CHAINS)
```
Doing the reloo takes about 7 hours, so it is best done manually (e.g. overnight).

If reloo sorts out the problematic observations, we can still trust the LOO results.

```{r}
summary(m)
```

```{r}
ranef(m)
```

```{r}
mcmc_areas(m, regex_pars = c("^b_", "^b_zi", "^sd_"))
```

```{r}
mcmc_areas(m, regex_pars = c("^r_team[[]"))

```

```{r}
teams <- c("Arch", "Blue", "Brown", "Green", "Orange", "Pink", "Red", "UI", "Unknown", "Violet", "Yellow")
lapply(teams, function(t) mcmc_areas(m, regex_pars = c(paste0("^r_team:repo[[]", t))))
```

```{r}
repos <- c("IntTest", "Jupiter", "Saturn", "Uranus", "Neptune", "Venus", "Mars", "Mercury")
lapply(repos, function(t) mcmc_areas(m, regex_pars = c(paste0("^r_team:repo[[].*_", t,",Intercept[]]"))))

```

Clearly there are some differences between teams.

## Posterior Predictive Checks

```{r}
yrep <- posterior_predict(m)
```

Proportion of zeros

```{r}
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0))
```

The zero-inflation seems to be working very well.
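
The check above compares one number, the observed proportion of zeros, with the same statistic computed on every replicated data set. The base R sketch below reproduces that logic on simulated zero-inflated negative binomial data. All parameter values are hypothetical, and `yrep_sim` stands in for the draws-by-observations matrix that `posterior_predict()` returns.

```r
set.seed(42)
n_obs <- 500
n_rep <- 200

# Hypothetical parameters of a zero-inflated negative binomial
zi <- 0.6; mu <- 3; shape <- 0.5

# With probability zi the outcome is a structural zero, otherwise a negbin draw
rzinb <- function(n) {
  ifelse(rbinom(n, 1, zi) == 1, 0, rnbinom(n, size = shape, mu = mu))
}

y_obs    <- rzinb(n_obs)                        # stand-in for the observed outcome
yrep_sim <- t(replicate(n_rep, rzinb(n_obs)))   # one row per replicated data set

prop_zero <- function(y) mean(y == 0)
obs_stat  <- prop_zero(y_obs)
rep_stats <- apply(yrep_sim, 1, prop_zero)

# Share of replicated data sets with at least as many zeros as observed;
# values near 0 or 1 would indicate a misfit
mean(rep_stats >= obs_stat)
```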

The observed max value falls well within the range of predicted max values. There are no extreme max values either, indicating that the model converged appropriately (the data/likelihood tamed the priors).

```{r}
ppc_stat(y = d$y, yrep, stat = "max")
```

Rootogram, full scale

```{r}
rootogram <- pp_check(m, type = "rootogram", style="suspended")
rootogram
```

Rootogram, sized according to reasonable (observed) values.

```{r}
rootogram + scale_y_continuous(limits=c(0, 35))  + scale_x_continuous(limits=c(0,150))
```

The suspended rootogram shows that our model seems to somewhat overestimate the introduced issues (all light blue buckets protrude upwards from the x-axis).

## Conditional effects



```{r}
ftotArch <- condeffect_logCOMPLEX_by_logADD(m, d, "Arch", "IntTest")
ftotGreen <- condeffect_logCOMPLEX_by_logADD(m, d, "Green", "IntTest")
```

In the conditional effects graphs (showing the expected value for a given set of predictor values), the x-axis shows one variable (A, C or D), and the coloured lines show the expected outcome (y) for various values of one other variable. The plots below show how complexity (C) interacts with the number of added lines (A) for two teams with different behaviour in the IntTest repository.

```{r}
plot_logCOMPLEX_by_logADD(m, d, ftotArch, "Arch", "IntTest")
```

```{r}
plot_logCOMPLEX_by_logADD(m, d, ftotGreen, "Green", "IntTest")
```

Because the model is so simple, all teams and repos share the same slope; only the intercepts differ.

We can do better, given more compute power.

# Full model (incorporating A, C and D)

Following our DAG, we could include A, R, C, D and possibly Reason (depending on whether we want to test if it impacts the number of added or removed lines) in our model.
However, the model also has to converge reasonably well.

We start with A, C and D.
```{r}
d <- data |> select(y=INTROD,
                    A=A,
                    C=C,
                    D=D,
                    R=R,
                    team=committerteam,
                    repo=repo)
formula <- bf(y ~ 1 + A + C + D + (1 + A + C + D | team/repo),
              zi ~ 1 + A )

get_prior(data=d,
          family=zero_inflated_negbinomial,
          formula=formula)
```

```{r}
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(normal(0, 0.25), class = b),
            prior(weibull(2, 1), class = sd),
            prior(lkj(2), class = cor),
            prior(normal(0, 0.5), class = Intercept, dpar=zi),
            prior(normal(0.5, 0.5), class = b, dpar=zi),
            prior(gamma(1, 0.1), class = shape))
validate_prior(prior=priors,
               formula=formula,
               data=d,
               family=zero_inflated_negbinomial)

```

Full model takes about 15000 seconds to complete on my i9 laptop.

```{r}
M_full_model <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = ".cache/added-M_full_model",
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter  = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend="cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = save_pars(all = TRUE),
      adapt_delta = ADAPT_DELTA)
```

```{r}
M_full_model <- add_criterion(M_full_model, "loo")
```


```{r}
m <- M_full_model
#stopifnot(rhat(m) < 1.01)
#stopifnot(neff_ratio(m) > 0.2)
p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
plots <- seq(1, to=length(pars), by=12)
lapply(plots, function(i) { 
  start <- i
  end <- start+11
  mcmc_trace(m, pars = na.omit(pars[start:end]))
  })
```
```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff()
```

Both Rhat and Neff ratio look good.

```{r}
loo <- loo(m)
loo
plot(loo)
```

We have 18 high Pareto k values, so we will have to refit the model 18 times.

Reloo will take about 3 hours per refit, so it is safest to run it in the background while doing other productive tasks.

```{r, eval=FALSE}
reloo <- reloo(m, loo, chains=CHAINS)
```

### Model parameters

```{r}
summary(m)
```

```{r}
ranef(m)
```


```{r}
mcmc_areas(m, regex_pars = c("^b_", "^b_zi"))
```

```{r}
mcmc_areas(m, regex_pars = c("^sd_"))
```

```{r}
params <- c("Intercept", "A", "C", "D")
lapply(params, function(p) mcmc_areas(m, regex_pars = paste0("^r_team[[].*,", p, "[]]")) + ggtitle(paste("Team-level difference for parameter", p)))
```

```{r}
teams <- c("Arch", "Blue", "Brown", "Green", "Orange", "Pink", "Red", "UI", "Unknown", "Violet", "Yellow")
lapply(params, function(p) lapply(teams, function(t) mcmc_areas(m, regex_pars = c(paste0("^r_team:repo[[]", t, ".*,", p, "[]]"))) + ggtitle(paste("Repo-level difference for team", t, "parameter", p))))
```

```{r}
repos <- c("IntTest", "Jupiter", "Saturn", "Uranus", "Neptune", "Venus", "Mars", "Mercury")
lapply(params, function(p) lapply(repos, function(r) mcmc_areas(m, regex_pars = c(paste0("^r_team:repo[[].*_", r,",", p, "[]]"))) + ggtitle(paste("Team-level differences for repo", r, "parameter", p))))
```

### Posterior Predictive Checks


```{r}
yrep <- posterior_predict(m)
```

Proportion of zeros

```{r}
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0))
```

The zero-inflation seems to be working very well.

The observed max value falls well within the range of predicted max values.
Compared to the intercept-only model, the predicted max values are about an order of magnitude higher. This indicates that some of our priors might be a bit too wide, or that there is some imbalance in our data (which we know we have: some teams make few or no commits in some repos).
But there are still no extreme max values (e.g. 1e6 or more). We know that the total number of observed lines of code is around 1e5, so it is impossible to have more duplicates than that.

```{r}
ppc_stat(y = d$y, yrep, stat = "max")
```
```{r}
rootogram <- pp_check(m, type = "rootogram", style="suspended")
```
```{r}
rootogram
```

Rootogram, sized according to reasonable (observed) values.

```{r}
rootogram + scale_y_continuous(limits=c(0, 35))  + scale_x_continuous(limits=c(0,150))
```

Our model slightly overestimates the number of introduced issues.

### Conditional Effects

```{r}
ftotArch <- condeffect_logCOMPLEX_by_logADD(m, d, "Arch", "IntTest")
ftotGreen <- condeffect_logCOMPLEX_by_logADD(m, d, "Green", "IntTest")
ftotBlue <- condeffect_logCOMPLEX_by_logADD(m, d, "Blue", "IntTest")
```

How to read these conditional plots:

* X axis is the continuously varying predictor (in a scaled log scale)
* Y axis is the predicted number of introduced issues (in the natural count scale)
* Existing observations for the observed team and repo are plotted as points.
  * The color of the point is the value of the second predictor (rounded to the nearest integer value)
  * The size of the point is the Pareto k value of the observation (a larger size means a more influential point)
* The lines are the predicted estimate of the y value
  * The color of the line is the value of the second predictor (same as the observations)
  * The intervals refer to the Q5.5-Q94.5 credible intervals. That is, we expect 89% of the predictions to be located inside these intervals.
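
The Q5.5-Q94.5 interval is a plain quantile interval over posterior draws. The sketch below computes one in base R for a hypothetical set of draws of an expected count (simulated from a gamma distribution purely for illustration) and confirms that it covers about 89% of them.

```r
set.seed(7)
# Hypothetical posterior draws of an expected count, for illustration only
draws <- rgamma(4000, shape = 2, rate = 0.5)

# Lower and upper bound of the 89% quantile interval
interval <- quantile(draws, probs = c(0.055, 0.945))

# Share of draws that fall inside the interval
coverage <- mean(draws >= interval[1] & draws <= interval[2])

interval
coverage
```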

```{r}
plot_logCOMPLEX_by_logADD(m, d, ftotArch, "Arch", "IntTest")
```

```{r}
plot_logCOMPLEX_by_logADD(m, d, ftotGreen, "Green", "IntTest")
```

```{r}
plot_logCOMPLEX_by_logADD(m, d, ftotBlue, "Blue", "IntTest")
```


In the IntTest repo, the complexity has little influence on the number of introduced duplicates. But some teams (e.g. Green) have more impact than others (e.g. Arch).

Next, we reverse the plots and check the impact of added lines, relative to C and D:

```{r}
ftotArch <- condeffect_logADD_by_logCOMPLEX(m, d, "Arch", "IntTest")
ftotGreen <- condeffect_logADD_by_logCOMPLEX(m, d, "Green", "IntTest")
ftotBlue <- condeffect_logADD_by_logCOMPLEX(m, d, "Blue", "IntTest")
```

```{r}
plot_logADD_by_logCOMPLEX(m, d, ftotArch, "Arch", "IntTest") #+ scale_y_continuous(limits=c(0, 50))
```

```{r}
plot_logADD_by_logCOMPLEX(m, d, ftotGreen, "Green", "IntTest") #+ scale_y_continuous(limits=c(0, 50))
```
```{r}
plot_logADD_by_logCOMPLEX(m, d, ftotBlue, "Blue", "IntTest") #+ scale_y_continuous(limits=c(0, 50))
```

Again, we see that the complexity of the file has no significant impact on the number of introduced duplicates. The lines largely overlap.

```{r}
ftotArch <- condeffect_logADD_by_logDUP(m, d, "Arch", "IntTest")
ftotGreen <- condeffect_logADD_by_logDUP(m, d, "Green", "IntTest")
ftotBlue <- condeffect_logADD_by_logDUP(m, d, "Blue", "IntTest")
```

```{r}
plot_logADD_by_logDUP(m, d, ftotArch, "Arch", "IntTest") #+ scale_y_continuous(limits=c(0, 75))
plot_logADD_by_logDUP(m, d, ftotGreen, "Green", "IntTest") #+ scale_y_continuous(limits=c(0, 75))
plot_logADD_by_logDUP(m, d, ftotBlue, "Blue", "IntTest") #+ scale_y_continuous(limits=c(0, 75))
```

For team Green, in particular, there seems to be an impact of the existing duplicates in the file. But for other teams, there is much less variation between the different D lines, so existing duplicates seem to play a smaller part.
What is clear is that the number of added lines plays a part in the number of introduced duplicates, as can be expected.

### Looking at another repo (Jupiter)

```{r}
ftotArch <- condeffect_logADD_by_logCOMPLEX(m, d, "Arch", "Jupiter")
ftotGreen <- condeffect_logADD_by_logCOMPLEX(m, d, "Green", "Jupiter")
ftotBlue <- condeffect_logADD_by_logCOMPLEX(m, d, "Blue", "Jupiter")
```

```{r}
plot_logADD_by_logCOMPLEX(m, d, ftotArch, "Arch", "Jupiter")  + scale_y_continuous(limits=c(0, 50))
plot_logADD_by_logCOMPLEX(m, d, ftotGreen, "Green", "Jupiter")+ scale_y_continuous(limits=c(0, 50))
plot_logADD_by_logCOMPLEX(m, d, ftotBlue, "Blue", "Jupiter")  + scale_y_continuous(limits=c(0, 50))
```
```{r}
ftotArch <- condeffect_logADD_by_logDUP(m, d, "Arch", "Jupiter")
ftotGreen <- condeffect_logADD_by_logDUP(m, d, "Green", "Jupiter")
ftotBlue <- condeffect_logADD_by_logDUP(m, d, "Blue", "Jupiter")
```

```{r}
plot_logADD_by_logDUP(m, d, ftotArch, "Arch", "Jupiter") #+ scale_y_continuous(limits=c(0, 75)) #+ scale_x_continuous(limits=c(-2, 3.2))
plot_logADD_by_logDUP(m, d, ftotGreen, "Green", "Jupiter") #+ scale_y_continuous(limits=c(0, 75)) #+ scale_x_continuous(limits=c(-2, 3.2))
plot_logADD_by_logDUP(m, d, ftotBlue, "Blue", "Jupiter") #+ scale_y_continuous(limits=c(0, 75)) #+ scale_x_continuous(limits=c(-2, 3.2))
```
In the Jupiter repo, we see clear team-level impacts of the existing number of duplicates. The Arch team is unlikely to introduce more than a few duplicates, while the Green and Blue teams are correspondingly more likely to introduce duplicates, provided that the magnitude of the number of added lines is 2 or more.
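
To translate between the plot axes and actual line counts, the scaling has to be inverted. The sketch below shows the round trip in base R. The normalizing constants are hypothetical placeholders (the real ones would be `mean(data$logADD)` and `sd(data$logADD)`), and the helper names are not part of `conditional_effects.R`.

```r
# Hypothetical normalizing constants, for illustration only
mu_logadd <- 2.1
sigma_logadd <- 1.4

# Natural count -> scaled log value, and back again
to_scaled <- function(added) (log1p(added) - mu_logadd) / sigma_logadd
to_count  <- function(a) expm1(a * sigma_logadd + mu_logadd)

counts <- c(0, 5, 50)
round_trip <- to_count(to_scaled(counts))
round_trip

# An unscaled log-magnitude of 2 corresponds to about 6.4 added lines
lines_at_magnitude_2 <- expm1(2)
lines_at_magnitude_2
```

Whether a magnitude of 2 on a given plot refers to the scaled or the unscaled log value depends on how the axis was constructed, so the constants need to be checked against the data before reading off counts.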


```{r}
ftotArch <- condeffect_logADD_by_logREMOVED(m, d, "Arch", "Jupiter")
ftotGreen <- condeffect_logADD_by_logREMOVED(m, d, "Green", "Jupiter")
ftotBlue <- condeffect_logADD_by_logREMOVED(m, d, "Blue", "Jupiter")
```

```{r}
plot_logADD_by_logREMOVED(m, d, ftotArch, "Arch", "Jupiter") + scale_y_continuous(limits=c(0, 75)) #+ scale_x_continuous(limits=c(-2, 3.2))
plot_logADD_by_logREMOVED(m, d, ftotGreen, "Green", "Jupiter") + scale_y_continuous(limits=c(0, 75)) #+ scale_x_continuous(limits=c(-2, 3.2))
plot_logADD_by_logREMOVED(m, d, ftotBlue, "Blue", "Jupiter") + scale_y_continuous(limits=c(0, 75)) #+ scale_x_continuous(limits=c(-2, 3.2))

```

```{r}
library(marginaleffects)
```
```{r}
marginaleffects(m)
```


```{r}
bayes_R2(m)
```

```{r}
loo_compare(M_intercepts_only, M_full_model)
```

```{r}
bayes_R2(M_intercepts_only)
```

We see both from the `bayes_R2` and the `loo_compare` functions that the full model is to be preferred over the intercept-only model.
But it still only explains about 50% of the variance in the data.

```{r}
summary(M_full_model)
```

```{r}
ranef(M_full_model)
```


# Adding removed lines

```{r}
d <- data |> select(y=INTROD,
                    A=A,
                    C=C,
                    D=D,
                    R=R,
                    team=committerteam,
                    repo=repo)
formula <- bf(y ~ 1 + A + C + D + R + (1 + A + C + D + R | team/repo),
              zi ~ 1 + A )

get_prior(data=d,
          family=zero_inflated_negbinomial,
          formula=formula)
```

```{r}
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(normal(0, 0.25), class = b),
            prior(weibull(2, 1), class = sd),
            prior(lkj(2), class = cor),
            prior(normal(0, 0.5), class = Intercept, dpar=zi),
            prior(normal(0.5, 0.5), class = b, dpar=zi),
            prior(gamma(1, 0.1), class = shape))
validate_prior(prior=priors,
               formula=formula,
               data=d,
               family=zero_inflated_negbinomial)

```

Full model takes about 15000 seconds to complete on my i9 laptop.

```{r}
M_fuller_model <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = ".cache/added-M_fuller_model",
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter  = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend="cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)
```

```{r}
M_fuller_model <- add_criterion(M_fuller_model, "loo")
```
```{r}
m <- M_fuller_model
#stopifnot(rhat(m) < 1.01)
#stopifnot(neff_ratio(m) > 0.2)
```
```{r}
p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
plots <- seq(1, to=length(pars), by=12)
lapply(plots, function(i) { 
  start <- i
  end <- start+11
  mcmc_trace(m, pars = na.omit(pars[start:end]))
  })
```
```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff()
```

Both Rhat and Neff ratio look good.

```{r}
loo <- loo(m)
loo
plot(loo)
```

```{r}
summary(m)
```

```{r}
mcmc_areas(m, regex_pars = c("^b_"))
```

```{r}
mcmc_areas(m, regex_pars = c("^sd_"))
```

```{r}
pars <- c("Intercept", "A", "D", "C", "R")
lapply(pars, function(p) mcmc_areas(m, regex_pars = c(paste0("^r_team[[].*,", p, "[]]"))))
```

```{r}
lapply(pars, function(p) lapply(teams, function(t) mcmc_areas(m, regex_pars = c(paste0("^r_team:repo[[]", t, ".*,", p, "[]]")))))
```

```{r}
ftotArch <- condeffect_logADD_by_logDUP(m, d, "Arch", "IntTest")
ftotGreen <- condeffect_logADD_by_logDUP(m, d, "Green", "IntTest")
ftotBlue <- condeffect_logADD_by_logDUP(m, d, "Blue", "IntTest")
```

```{r}
plot_logADD_by_logDUP(m, d, ftotArch, "Arch", "IntTest") + scale_y_continuous(limits=c(0, 75))
plot_logADD_by_logDUP(m, d, ftotGreen, "Green", "IntTest") + scale_y_continuous(limits=c(0, 75))
plot_logADD_by_logDUP(m, d, ftotBlue, "Blue", "IntTest") + scale_y_continuous(limits=c(0, 75))
```

```{r}
lapply(pars, function(p) lapply(repos, function(r) mcmc_areas(m, regex_pars = c(paste0("^r_team:repo[[].*_", r,",", p, "[]]")))))
```


```{r}
plot(hypothesis(m, "R = 0"))
```

```{r}
hypC <- hypothesis(m, "C = 0")
hypC |> glimpse()
plot(hypC)
```

```{r}
loo_compare(M_intercepts_only, M_full_model, M_fuller_model)
```


```{r}
library(marginaleffects)
```

```{r}
mareff <- marginaleffects(M_fuller_model, by="team", newdata="mean")
```

```{r}
print(mareff, nrows = 200)
```


```{r}
bayes_R2(M_intercepts_only)
```

```{r}
bayes_R2(M_full_model)
```

```{r}
bayes_R2(M_fuller_model)
```

```{r}
rootogram_intercepts <- pp_check(M_intercepts_only, type = "rootogram", style="suspended")
```
```{r}
rootogram_intercepts + scale_y_continuous(limits=c(0, 35))  + scale_x_continuous(limits=c(0,150))
```

```{r}
rootogram_full <- pp_check(M_full_model, type = "rootogram", style="suspended")
```
```{r}
rootogram_full + scale_y_continuous(limits=c(0, 35))  + scale_x_continuous(limits=c(0,150))
```

```{r}
rootogram_fuller <- pp_check(M_fuller_model, type = "rootogram", style="suspended")
```
```{r}
rootogram_fuller + scale_y_continuous(limits=c(0, 35))  + scale_x_continuous(limits=c(0,150))
```

# Causal model

```{r}
d <- data |> select(y=INTROD,
                    A=A,
                    C=C,
                    D=D,
                    R=R,
                    team=committerteam,
                    repo=repo)
formula <- bf(y ~ 1 + A + R + (1 + A + R | team) + (1 + A + R | repo),
              zi ~ 1 + A + R)

get_prior(data=d,
          family=zero_inflated_negbinomial,
          formula=formula)

```

```{r}
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(normal(0, 0.25), class = b),
            prior(weibull(2, 1), class = sd),
            prior(lkj(2), class = cor),
            prior(normal(0, 0.5), class = Intercept, dpar=zi),
            prior(normal(0, 0.5), class = b, dpar=zi),
            prior(gamma(1, 0.1), class = shape))
validate_prior(prior=priors,
               formula=formula,
               data=d,
               family=zero_inflated_negbinomial)
```

```{r}
M_crossed_model <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = ".cache/added-M_crossed_model",
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter  = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend="cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)

```

```{r}
M_crossed_model <- add_criterion(M_crossed_model, "loo")
```

```{r}
m <- M_crossed_model
```

```{r}
p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
plots <- seq(1, to=length(pars), by=12)
lapply(plots, function(i) { 
  start <- i
  end <- start+11
  mcmc_trace(m, pars = na.omit(pars[start:end]))
  })
```
```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff()
```

Both Rhat and Neff ratio look good.

```{r}
loo <- loo(m)
loo
plot(loo)
```
```{r}
reloo <- reloo(m, loo, chains=CHAINS)
```
```{r}
plot(reloo)
```

```{r}
reloo
```

```{r}
summary(m)
```

```{r}
yrep <- posterior_predict(m)
```

Proportion of zeros

```{r}
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0))
```

The zero-inflation seems to be working very well.

The observed max value falls well within the range of predicted max values.
Compared to the intercept-only model, the predicted max values are about an order of magnitude higher. This indicates that some of our priors might be a bit too wide, or that there is some imbalance in our data (which we know we have: some teams make few or no commits in some repos).
But there are still no extreme max values (e.g. 1e6 or more). We know that the total number of observed lines of code is around 1e5, so it is impossible to have more duplicates than that.

```{r}
ppc_stat(y = d$y, yrep, stat = "max")
```
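The two checks above (proportion of zeros and maximum) follow the same recipe: evaluate a statistic on the observed data and on every posterior predictive draw, then compare. The sketch below spells that out numerically. `ppc_summary` is a hypothetical helper, and the data are simulated stand-ins for `d$y` and `posterior_predict(m)` with made-up parameter values, so the numbers say nothing about the fitted model.

```{r}
# Hypothetical helper: evaluate a test statistic on the observed data and on every
# posterior predictive draw (one draw per row of yrep).
ppc_summary <- function(y, yrep, stat) {
  t_obs <- stat(y)
  t_rep <- apply(yrep, 1, stat)
  c(observed = t_obs,
    rep_median = median(t_rep),
    rep_lower = unname(quantile(t_rep, 0.025)),
    rep_upper = unname(quantile(t_rep, 0.975)),
    p_exceeds = mean(t_rep >= t_obs))  # share of draws at or above the observed value
}

# Simulated zero-inflated counts; all parameter values are made up.
set.seed(2022)
sim_n <- 500
sim_draw <- function() rbinom(sim_n, 1, 0.7) * rnbinom(sim_n, size = 0.8, mu = 3)
y_sim <- sim_draw()
yrep_sim <- t(replicate(200, sim_draw()))  # 200 draws x 500 observations

zero_check <- ppc_summary(y_sim, yrep_sim, function(x) mean(x == 0))
max_check <- ppc_summary(y_sim, yrep_sim, max)
zero_check
max_check

# Share of draws whose maximum exceeds a hard upper bound (here 1e5 lines of code)
share_above_bound <- mean(apply(yrep_sim, 1, max) > 1e5)
share_above_bound
```

A `p_exceeds` close to 0 or 1 would indicate that the model systematically under- or over-predicts that statistic.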
```{r}
rootogram <- pp_check(m, type = "rootogram", style="suspended")
```
```{r}
rootogram
```

Rootogram, sized according to reasonable (observed) values.

```{r}
rootogram + scale_y_continuous(limits=c(0, 35))  + scale_x_continuous(limits=c(0,150))
```
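To make the rootograms above easier to read, the sketch below computes the quantities behind them on simulated data: the square roots of the observed and expected frequency of each count, and their difference. `rootogram_table` is a hypothetical helper, not part of bayesplot, and the sign convention for the suspended bars is an assumption.

```{r}
# Sketch of the quantities behind a rootogram, for counts 0..max_count.
rootogram_table <- function(y, yrep, max_count = 10) {
  counts <- 0:max_count
  observed <- sapply(counts, function(k) sum(y == k))
  # expected frequency of each count: average over posterior predictive draws (rows)
  expected <- sapply(counts, function(k) mean(rowSums(yrep == k)))
  data.frame(count = counts,
             sqrt_observed = sqrt(observed),
             sqrt_expected = sqrt(expected),
             # assumed sign convention: positive means the model over-predicts this count
             suspended = sqrt(expected) - sqrt(observed))
}

# Simulated stand-ins for the observations and the predictive draws (made-up parameters)
set.seed(2022)
y_toy <- rnbinom(300, size = 1, mu = 2)
yrep_toy <- matrix(rnbinom(100 * 300, size = 1, mu = 2), nrow = 100)
root_tab <- rootogram_table(y_toy, yrep_toy, max_count = 5)
root_tab
```

Values of `suspended` near zero across all counts correspond to bars that stay close to the zero line in the suspended rootogram.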

# teams per repo

```{r}
d <- data |> select(y=INTROD,
                    A=A,
                    C=C,
                    D=D,
                    R=R,
                    team=committerteam,
                    repo=repo)
formula <- bf(y ~ 1 + A + R + (1 + A + R + repo | team) + (1 + A + R | repo),
              zi ~ 1 + A + R)

get_prior(data=d,
          family=zero_inflated_negbinomial,
          formula=formula)

```

```{r}
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(normal(0, 0.25), class = b),
            prior(weibull(2, 1), class = sd),
            prior(lkj(2), class = cor),
            prior(normal(0, 0.5), class = Intercept, dpar=zi),
            prior(normal(0, 0.5), class = b, dpar=zi),
            prior(gamma(1, 0.1), class = shape))
validate_prior(prior=priors,
               formula=formula,
               data=d,
               family=zero_inflated_negbinomial)
```

```{r}
M_crossed_team_repo_model <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = ".cache/added-M_crossed_team_repo_model",
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter  = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend="cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)

```

```{r}
M_crossed_team_repo_model <- add_criterion(M_crossed_team_repo_model, "loo")
```

```{r}
m <- M_crossed_team_repo_model
```

```{r}
p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
plots <- seq(1, to=length(pars), by=12)
lapply(plots, function(i) { 
  start <- i
  end <- start+11
  mcmc_trace(m, pars = na.omit(pars[start:end]))
  })
```

```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff()
```

Both Rhat and the Neff ratio look good.

```{r}
loo <- loo(m)
loo
plot(loo)
```

```{r}
summary(m)
```

```{r}
yrep <- posterior_predict(m)
```

Proportion of zeros

```{r}
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0))
```

```{r}
ppc_stat(y = d$y, yrep, stat = "max")
```
```{r}
rootogram <- pp_check(m, type = "rootogram", style="suspended")
```
```{r}
rootogram
```

Rootogram, sized according to reasonable (observed) values.

```{r}
rootogram + scale_y_continuous(limits=c(0, 35))  + scale_x_continuous(limits=c(0,150))
```


```{r}
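team_predict <- function(t, rep=levels(data$repo), a=1, r=1, c=0, d=0) {
  # The argument `d` shadows the global data frame `d` inside this function,
  # so repo names come from `rep` (default: all repos) rather than levels(d$repo).
  nd <- expand_grid(A=a, R=r, C=c, D=d, repo=rep, team=t)
  pred <- posterior_predict(m, newdata=nd)
  colnames(pred) <- nd$repo
  return(data.frame(pred) |> pivot_longer(everything(), names_to="repo", values_to = "pred"))
}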
```

```{r}
newdata <- expand_grid(A=1,
                       R=1,
                       repo=levels(d$repo),
                       team="Arch")
teamArch <- posterior_predict(m, newdata=newdata, seed=12354)

colnames(teamArch) <- levels(d$repo)

#head(teamArch)

summary(teamArch)
```

```{r}
bluePred <- team_predict("Blue", a=1, r=1)

```
TODO: figure out how to work with `stat_halfeye`.

A raincloud plot draws all predictions as dots under the distribution, but the distribution already summarizes those predictions. It might make more sense to plot the (relative) number of observations as the dots instead.
The distribution would then show the posterior predictions, and the raindrops the raw data.
Note that this would only be useful as a posterior predictive check, where the predictions arise from the exact same data as the observations. For predictions made at fixed predictor values it would be pointless, and the standard tools are a better fit.


```{r}
ypred <- posterior_predict(m)
#Sobs <- d |> filter(repo == "IntTest") |> select(y, repo)
```


```{r}
bluePred |> filter(repo == "IntTest") |> ggplot(aes(x=pred, group=repo, color=repo, fill=repo)) + stat_halfeye() + scale_x_continuous(limits=c(0,15)) #+ stat_dots(side="left", binwidth=0.25) #+ facet_wrap(~ repo) 
```

```{r}
team_predict("Blue", a=2, r=2) |> filter(repo == "IntTest") |> ggplot(aes(x=pred, group=repo, color=repo, fill=repo)) + stat_halfeye() + scale_x_continuous(limits=c(0,15)) 

```

```{r}
predPerRepo <- team_predict("Blue", a=2, r=2) |> filter(repo == "IntTest") |> tally()

team_predict("Blue", a=2, r=2) |> filter(repo == "IntTest") |> group_by(pred) |> summarize(proportion=round(n()/predPerRepo$n, 3))
```

```{r}
team_predict("Red", a=2, r=2) |> filter(repo == "IntTest") |> group_by(pred) |> summarize(proportion=round(n()/predPerRepo$n, 3))
```

With A = 3 and R = 0, the model gives a relatively low probability of zero added duplicates. How come? Perhaps because A and R tend to cancel each other out?

```{r}
team_predict("Arch", a=3, r=0) |> filter(repo == "IntTest") |> group_by(pred) |> summarize(proportion=round(n()/predPerRepo$n, 3))

```
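One way to reason about this question is to write out the probability of a zero under a zero-inflated negative binomial likelihood: $P(y = 0) = zi + (1 - zi) \left(\frac{\phi}{\phi + \mu}\right)^{\phi}$, where $\phi$ is the shape. Both `zi` (logit link) and `mu` (log link) depend on A and R, so the two predictors can pull the zero probability in opposite directions. The sketch below uses made-up population-level coefficients and ignores the varying effects, so it only shows the mechanism, not the fitted model. `zinb_p_zero` and `toy_p_zero` are hypothetical helpers.

```{r}
# Probability of observing zero under a zero-inflated negative binomial:
# a structural zero (probability zi) or a zero from the count component.
zinb_p_zero <- function(mu, shape, zi) {
  zi + (1 - zi) * dnbinom(0, size = shape, mu = mu)
}

# Made-up coefficients, for illustration only (not estimates from any model here)
toy_p_zero <- function(A, R) {
  mu <- exp(0.2 + 0.8 * A - 0.3 * R)     # log link for the count component
  zi <- plogis(0.5 - 1.0 * A + 0.4 * R)  # logit link for the zero-inflation component
  zinb_p_zero(mu, shape = 1, zi = zi)
}

toy_p_zero(A = 0, R = 0)  # average additions and removals
toy_p_zero(A = 3, R = 0)  # large additions, average removals
toy_p_zero(A = 3, R = 3)  # large additions and large removals
```

With coefficients of opposite sign, as assumed here, a high R partly offsets a high A. Whether that is what happens in the fitted model has to be read off `summary(m)`.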

```{r}
range(d$A)
range(d$R)
mean(data$logADD)/sd(data$logADD)

exp((range(d$A)*sd(data$logADD)+mean(data$logADD)))-1
exp((range(d$R)*sd(data$logDEL)+mean(data$logDEL)))-1
```
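The back-transformation used in the chunk above inverts the scaling $A_i = \frac{ln(added_i+1) - \hat{\mu}}{\hat{\sigma}}$, giving $added_i = exp(A_i \hat{\sigma} + \hat{\mu}) - 1$. The sketch below checks the round trip on made-up values. `to_std` and `from_std` are hypothetical helpers, and `added_toy` merely stands in for the raw added-lines column.

```{r}
# Hypothetical helpers mirroring the scaling used for A, R, C and D:
# standardize log(x + 1), and map a standardized value back to a raw line count.
to_std <- function(x, mu_log, sd_log) (log(x + 1) - mu_log) / sd_log
from_std <- function(z, mu_log, sd_log) exp(z * sd_log + mu_log) - 1

# Made-up raw values, for illustration only
added_toy <- c(0, 1, 5, 20, 100, 1500)
mu_toy <- mean(log(added_toy + 1))
sd_toy <- sd(log(added_toy + 1))

z_toy <- to_std(added_toy, mu_toy, sd_toy)
round(from_std(z_toy, mu_toy, sd_toy), 6)  # recovers added_toy
from_std(c(-1, 0, 1, 2), mu_toy, sd_toy)   # raw counts at -1, 0, 1 and 2 standard deviations
```

Because the scale is logarithmic, each additional standard deviation multiplies `added + 1` by the same factor, `exp(sd_toy)`.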

```{r}
percentage_zeros <- function(model, aTeam) {
  items <- 5000
  repos <- data.frame(team=aTeam, repo=levels(d$repo))
  as <- data.frame(A=seq(from=min(d$A), to=4, length.out=20))
  rs <- data.frame(R=seq(from=min(d$R), to=4, length.out=20))
  grid <- expand_grid(repos, expand_grid(as, rs))
  perczeros <- posterior_predict(model, newdata=grid, ndraws=items) |> data.frame() |> sapply(function(x) { length(which(x==0))/length(x) } ) 
  grid$pct <- perczeros
  grid$added <- exp(grid$A*sd(data$logADD)+mean(data$logADD))-1
  grid$removed <- exp(grid$R*sd(data$logDEL)+mean(data$logDEL))-1
  return(grid)
}
```

```{r}
p <- percentage_zeros(m, "Arch")
head(p, 20)
```

```{r}
zero_introd_per_team_repo <- function(postpercentage, aRepo) {
  team <- postpercentage |> select(team) |> distinct()
  ybreaks <- c(-1, 0, 1, 2, 3, 4)
  ylabels <- round(exp(ybreaks*sd(data$logDEL)+mean(data$logDEL))-1, 0)
  xbreaks <- c(-1, 0, 1, 2, 3, 4)
  xlabels <- round(exp(xbreaks*sd(data$logADD)+mean(data$logADD))-1, 0)
  postpercentage |> mutate(probDup = 1-pct) |> filter(repo == aRepo) |>
    ggplot(aes(x=A, y=R, fill=probDup)) + geom_tile() +
    scale_fill_gradient(breaks=c(0, 0.2, 0.4, 0.6, 0.8, 1), low="white", high="black", limits=c(0,1)) +
    xlab("added") + scale_x_continuous(breaks=xbreaks, labels=xlabels) +
    ylab("deleted") + scale_y_continuous(breaks=ybreaks, labels=ylabels) +
    ggtitle(paste0("Probability of team ", team, " introducing duplicates in ", aRepo))
}
```

```{r}
library(ggExtra)
```


```{r}
zero_introd_per_team_repo(p, "IntTest")
```

```{r}
zero_introd_per_team_repo(p, "Jupiter")
```
```{r}
lapply(repos, function(r) zero_introd_per_team_repo(p, r))
```

```{r}
pRed <- percentage_zeros(m, "Red")
```


```{r}
pRed |> filter( repo == "Venus", added > 5000)
```

```{r}
p |> filter(repo == "Venus", added > 5000)
```


```{r}
lapply(repos, function(r) zero_introd_per_team_repo(pRed, r))
```

```{r}
pOrange <- percentage_zeros(m, "Orange")
```

```{r}
lapply(repos, function(r) zero_introd_per_team_repo(pOrange, r))
```

```{r}
pBlue <- percentage_zeros(m, "Blue")
```

```{r}
lapply(repos, function(r) zero_introd_per_team_repo(pBlue, r))
```


```{r}
pYellow <- percentage_zeros(m, "Yellow")
```

```{r}
lapply(repos, function(r) zero_introd_per_team_repo(pYellow, r))

```

```{r}
zero_introd_per_team_repo(p, "IntTest") | zero_introd_per_team_repo(pBlue, "IntTest") 
```

```{r}
zero_introd_per_team_repo(p, "Neptune") | zero_introd_per_team_repo(pBlue, "Neptune") 
```
```{r}
pGreen <- percentage_zeros(m, "Green")
```

```{r}
lapply(repos, function(r) zero_introd_per_team_repo(pGreen, r))

```

```{r}
zero_introd_per_team_repo(p, "Neptune") | zero_introd_per_team_repo(pGreen, "Neptune") #| zero_introd_per_team_repo(pYellow, "Neptune")

```

```{r}
zero_introd_per_team_repo(pGreen, "Neptune") | zero_introd_per_team_repo(pBlue, "Neptune") #| zero_introd_per_team_repo(pYellow, "Neptune")

```

```{r}
tail(p)
```


# Team, Repo, Added, Removed, Duplicates, Complexity

```{r}
d <- data |> select(y=INTROD,
                    A=A,
                    C=C,
                    D=D,
                    R=R,
                    team=committerteam,
                    repo=repo)
formula <- bf(y ~ 1 + A + R + C + D + (1 + A + R + C + D + repo | team) + (1 + A + R | repo),
              zi ~ 1 + A + R + C + D)

get_prior(data=d,
          family=zero_inflated_negbinomial,
          formula=formula)

```

```{r}
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(normal(0, 0.25), class = b),
            prior(weibull(2, 1), class = sd),
            prior(lkj(2), class = cor),
            prior(normal(0, 0.5), class = Intercept, dpar=zi),
            prior(normal(0, 0.5), class = b, dpar=zi),
            prior(gamma(1, 0.1), class = shape))
validate_prior(prior=priors,
               formula=formula,
               data=d,
               family=zero_inflated_negbinomial)
```

```{r}
M_crossed_team_repo_complex_dup_model <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = ".cache/added-M_crossed_team_repo_complex_dup_model",
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter  = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend="cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)
```

```{r}
M_crossed_team_repo_complex_dup_model <- add_criterion(M_crossed_team_repo_complex_dup_model, "loo")
```


```{r}
d <- data |> select(y=INTROD,
                    A=A,
                    C=C,
                    D=D,
                    R=R,
                    team=committerteam,
                    repo=repo)
formula <- bf(y ~ 1 + A + R + C + D + repo + (1 + A + R + C + D + repo | team),
              zi ~ 1 + A + R + C + D + repo)

get_prior(data=d,
          family=zero_inflated_negbinomial,
          formula=formula)

```

```{r}
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(normal(0, 0.25), class = b),
            prior(weibull(2, 1), class = sd),
            prior(lkj(2), class = cor),
            prior(normal(0, 0.5), class = Intercept, dpar=zi),
            prior(normal(0, 0.5), class = b, dpar=zi),
            prior(gamma(1, 0.1), class = shape))
validate_prior(prior=priors,
               formula=formula,
               data=d,
               family=zero_inflated_negbinomial)
```

```{r}
M_crossed_team_repo_complex_dup_repo_pop <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = ".cache/added-M_crossed_team_repo_complex_dup_repo_pop",
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter  = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend="cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)
```

```{r}
M_crossed_team_repo_complex_dup_repo_pop <- add_criterion(M_crossed_team_repo_complex_dup_repo_pop, "loo")
```

```{r}
m <- M_crossed_team_repo_complex_dup_repo_pop
```


```{r}
nd <- data.frame(A=0, R=0, C=0, D=0, team="Arch", repo="IntTest")
pp <- posterior_predict(M_crossed_team_repo_complex_dup_repo_pop, newdata = nd)

```


```{r}
data.frame(teamArch) |> pivot_longer(everything(), names_to="repo", values_to = "pred") |> ggplot(aes(x=pred, group=repo, color=repo)) + stat_halfeye() + scale_x_continuous(limits=c(0,10))
```
```{r}
nd <- tibble(A=1, R=1, C=1, D=1, repo="IntTest", team="Arch")
newdata[4:5,]
pred <- posterior_predict(M_crossed_team_repo_model, newdata=nd, seed=12354)
summary(pred)
```


```{r}
posterior_predict(m, newdata = expand_grid(A=1,
                       R=1,
                       C=0,  # m includes C and D as predictors; 0 is the mean on the standardized scale
                       D=0,
                       repo=levels(d$repo),
                       team="Arch"), resp=levels(d$repo), ndraws=10000) |> data.frame() |> 
  pivot_longer(everything(), names_to="repoid", values_to="count") |> mutate(repoid = as.factor(repoid)) |> group_by(repoid) |> summarise(sum(count)/1e4, max(count))
```

```{r}
posterior_predict(m, newdata = expand_grid(A=1,
                       R=1,
                       C=0,  # m includes C and D as predictors; 0 is the mean on the standardized scale
                       D=0,
                       repo=levels(d$repo),
                       team="Arch"), resp=levels(d$repo), ndraws=10000) |> data.frame() |> 
  pivot_longer(everything(), names_to="repoid", values_to="count") |> mutate(repoid = as.factor(repoid)) |> group_by(repoid) |>
  filter(count > 0) |> ggplot(aes(x=count, group=repoid, color=repoid, fill=repoid)) + geom_histogram(position = "dodge", binwidth = 1)
#  summarise(sum(count)/1e4, max(count))

```
```{r}
library(tidybayes)

#summarise_draws(tidy_draws(m))

add_predicted_draws(m, newdata=expand_grid(A=1,
                       R=1,
                       C=0,  # m includes C and D as predictors; 0 is the mean on the standardized scale
                       D=0,
                       repo=levels(d$repo),
                       team=levels(d$team))) |> select(repo, team, .prediction) |> 
  mutate(repo = as.factor(repo), team=as.factor(team), ypred=.prediction) |> select(-.prediction) |> 
  group_by(repo, team) |> ggplot(aes(x=ypred, color=team, fill=team)) + geom_histogram(position = "dodge") + facet_wrap(~ repo)
```
```{r}
loo_compare(M_intercepts_only, M_full_model, M_fuller_model, M_crossed_model, M_crossed_team_repo_model)
```

```{r}
loo_compare(M_intercepts_only, M_crossed_model, M_crossed_team_repo_model)
```

```{r}
m <- M_crossed_team_repo_complex_dup_repo_pop
```


```{r}
posterior_predict(m, newdata = expand_grid(A=1,
                       R=1,
                       C=0,  # m includes C and D as predictors; 0 is the mean on the standardized scale
                       D=0,
                       repo=levels(d$repo),
                       team=levels(d$team)), ndraws=10000) |> data.frame() |> head()
#  pivot_longer(everything(), names_to="repoid", values_to="count") |> mutate(repoid = as.factor(repoid)) |> head() #group_by(team, repoid) |>
 # filter(count > 0) |> ggplot(aes(x=count, group=repoid, color=repoid, fill=repoid)) + geom_histogram(position = "dodge", binwidth = 1)
#  summarise(sum(count)/1e4, max(count))

```