NIH peer review: Criterion scores completely account for racial disparities in overall impact scores

Black-white disparities in preliminary peer review scores for NIH R01 grant applications are examined.


Study Data
This section describes the study data, including key information from the main text for completeness. The data come from the NIH IMPAC II (Information for Management, Planning, Analysis, and Coordination) grant data system for council years 2014-2016. Because this study focused on black-white disparities, we did not include 1,771 applications submitted by PIs whose race was American Indian or Alaskan, Asian, Native Hawaiian or Pacific Islander, or who indicated more than one race, nor 8,648 applications for which PI race was withheld or unknown. Table 3 in the main text summarizes all variables used in our analyses and their definitions. PI demographics are voluntarily reported by applicants at the time of application. CSR is not aware of patterns among PIs who did not report their demographic characteristics; this issue nonetheless deserves further attention, because if non-reporting PIs are systematically different from others, conclusions could be biased (50).
In the full set of 54,740 applications, approximately 15% of the applications from black and white PIs were missing information on PI gender, ethnicity (Hispanic/Latino or not), or degree, and were excluded from the study. Specifically, 232 were missing gender information, 7,409 were missing ethnicity information, and 1,639 were missing degree information. The remaining 46,226 applications (1,015, or 2.2%, from black PIs and 45,211, or 97.8%, from white PIs) were evaluated by 19,197 unique reviewers who wrote 139,216 reviews. 73.7% of the reviewers reviewed in just one of the three council years, 2014-2016, for which we have data, while 22.9% reviewed in two and 3.4% reviewed in three of those years. Because PIs can amend each application that is not funded initially and submit multiple applications for different projects, there are fewer unique PIs: 500 (2.5%) black and 19,653 (97.5%) white. Among these applications with no missing data, the 1,015 applications from 500 unique black PIs received 3,064 reviews from 2,322 unique reviewers (Table S1), and the 45,211 applications from 19,653 unique white PIs received 136,152 reviews from 19,100 unique reviewers.
Study codes (Human Subjects, Animal Subjects, Child, Gender, and Minority) are categorical variables that take on a number of values. For our analyses, all study codes were re-coded/coarsened to "Acceptable", "Unacceptable", or "Inapplicable", to avoid numerical estimation problems with rare categories and for ease of interpretability. Below, we describe only codes that occurred in our study data (for example, code 20, under which no exemption is designated and so an award cannot be processed, never occurs in our data and is not discussed).
Links to the current NIH study codes are provided as hyperlinked URLs in the text for ease of reference. For Human Subjects codes (https://www.niaid.nih.gov/grants-contracts/human-subjects-involvement-codes), code 10 (no human subjects involved) was re-coded to "Inapplicable," code 44 (human subjects involved, SRG concerns) to "Unacceptable," and other codes (30, certified with no SRG concerns; 54, previous concerns resolved; and exemptions) to "Acceptable." For Animal Subjects codes (https://www.niaid.nih.gov/grants-contracts/research-animals-involvement-codes), code 10 (no animal subjects involved) was re-coded to "Inapplicable," code 44 (animal subjects involved, SRG concerns) to "Unacceptable," and others (30, animals involved with no SRG concerns; 32, animals involved with SRG comments; 48, conditional award with terms and conditions; 54, previous concerns resolved) to "Acceptable." For Gender codes (https://www.niaid.nih.gov/grants-contracts/human-subjects-inclusion-codes), the categories of interest indicated whether or not women were knowingly included in the proposed study. Codes "1A" and "2A" were re-coded as "Acceptable," because they represent studies in which the researchers knowingly included women in the study design and that were deemed acceptable. Codes "1U" and "2U" were re-coded as "Unacceptable," as they represent studies in which the researchers knowingly included women in the study design but that were deemed unacceptable. The remaining applications, those whose proposed studies did not include human subjects or did not knowingly include women, were re-coded as "Inapplicable."
The Minority codes (https://www.niaid.nih.gov/grants-contracts/human-subjects-inclusion-codes) and Child codes (the NIH Child Subjects codes for the data used in this study have since been updated to "Age codes" and can be found at https://grants.nih.gov/grants/funding/lifespan/review codes.doc) are structured similarly to the Gender code and were re-coded analogously, with applications in which minority subjects or child subjects were knowingly included separated from the others in the re-coding. Note that, because the Human Subjects code indicates whether or not human subjects were included in the proposed study, the human subjects studies at the "Inapplicable" level of the Gender, Minority, and Child Subjects codes are still recognized by our models as distinct from those without human subjects.
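The re-coding scheme above can be sketched as a simple lookup. The snippet below is an illustrative sketch only, not the study's actual pipeline: the dictionary layout and function name are ours, and the Human Subjects exemption codes (which also map to "Acceptable") are omitted for brevity.

```python
# Illustrative coarsening of NIH study codes to the three levels used in the
# analyses. Code meanings follow the text above.
HUMAN_SUBJECTS = {
    "10": "Inapplicable",   # no human subjects involved
    "44": "Unacceptable",   # human subjects involved, SRG concerns
    "30": "Acceptable",     # certified with no SRG concerns
    "54": "Acceptable",     # previous concerns resolved
}

GENDER = {
    "1A": "Acceptable",   "2A": "Acceptable",     # women knowingly included, acceptable
    "1U": "Unacceptable", "2U": "Unacceptable",   # women knowingly included, unacceptable
}

def coarsen(code, table):
    """Return the coarsened level; codes outside the table fall to "Inapplicable"."""
    return table.get(code, "Inapplicable")

print(coarsen("44", HUMAN_SUBJECTS), coarsen("1A", GENDER))
```

For the Gender, Minority, and Child codes, the fall-through default matches the text (studies not knowingly including the group are "Inapplicable"); for Human Subjects, a full table would enumerate the exemption codes explicitly.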
Finally, the NIH Funding Bin variable is determined by the amount of NIH R01 funding given to all investigators at an institution in 2014, and then split into 5 bins with roughly equal numbers of black PIs in each bin. These bins are delineated in Table S2.
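A minimal sketch of how such bins could be constructed, assuming the four cut points are quantiles of 2014 institutional R01 funding taken over black PIs' institutions, so that each bin holds roughly the same number of black PIs. The data and function are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2014 institutional R01 funding totals for 500 black PIs
funding_black = rng.lognormal(mean=16, sigma=1.5, size=500)

# Cut points at the 20th/40th/60th/80th percentiles of the black-PI distribution
cuts = np.quantile(funding_black, [0.2, 0.4, 0.6, 0.8])

def funding_bin(amount):
    """Assign an institution's funding total to one of 5 bins (1..5)."""
    return int(np.searchsorted(cuts, amount)) + 1

bins = np.array([funding_bin(x) for x in funding_black])
counts = np.bincount(bins)[1:]
# Roughly 100 of the 500 hypothetical black PIs land in each bin
print(counts.tolist())
```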

Matching and Study Subsets Selection
While the observational units in our study are reviews, matching occurred at the application level. We used exact matching on eight key variables thought to be related to scores. CEM has several desirable properties, including congruence (i.e., matching is performed on the data space rather than in the space of some metric such as the propensity score), relatively easy and flexible implementation, and Monotonic Imbalance Bounding (specifying the coarsening level for each variable automatically bounds the imbalance allowed for that covariate) (30). It is recommended that coarsening levels be chosen based on subject matter knowledge about the measurement and the likely importance of different covariates (30). Due to the high number of categorical covariates, we chose to carry out exact matching on the eight key variables and implement complete coarsening for the rest; this choice achieved a good trade-off between improved balance and sample size. We tested coarsened exact matching on various additional covariates, but the ensuing reduction in sample size was prohibitive. Our matching procedure improved balance on all the matching variables and on most other applicant- and application-specific covariates (see Table S3). The improved balance makes estimates from the matched subset analysis more robust, or less susceptible to model misspecification, compared to those from a random sample (31)(32).

Matching Algorithm
This section relates the details of the matching algorithm, which was constructed to:
1. Maximize the number of applications in the matched data set;
2. Maximize the number of reviews of applications in the matched data set;
3. Enforce exact matching on the 8 matching variables (the remainder are "fully coarsened" and thus trivially matched in a CEM); and,
4. Respect the constraint that no more than four applications in the entire matched data set may come from the same reviewer. This constraint was implemented due to the sensitive nature of the data, in order to ensure the privacy and confidentiality of reviewers.
A near 1:2 matching was performed: each application from a black PI was matched with up to two applications from white PIs on the eight matching variables thought to be related to scores and/or award rates (see Table 1).
The selection of the matched data set was performed in two stages. First, black applications were matched to white applications to form a "matched tuple" of 1 black and either 1 or 2 white applications. Then, review records were selected for each set of matched applications using the following algorithm:
1. Create a dictionary in which the keys are reviewers and the values are the number of reviews from the given reviewer that have been selected. Initialize all values to zero. If a reviewer's value is less than five, that reviewer's reviews are "eligible"; at five or more, they are ineligible.
2. For each matched tuple:
(a) For each application in the matched tuple:
i. If any reviews of the application are eligible, select one at random, add it to the matched data set, and add one to the value of the appropriate reviewer key.
ii. If no reviews of the application are eligible:
A. If the application is from a black applicant, discard the application and the associated matched white application(s).
B. If the application is from a white applicant, discard the application and replace it with another exact match from among the white applications, if available.
3. Repeat step 2 until there are no remaining eligible reviews to be selected.
This algorithm attempts to maximize the number of records in the data set by minimizing the number of matched tuples/black applications that must be discarded. It does this by selecting only one review for as many applications as possible before selecting the remaining eligible reviews. No black applications and nine white applications were discarded by this algorithm when the final study data were selected.
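The review-selection loop can be sketched as follows. This is a simplified illustration, not the study's code: it implements the per-reviewer cap (read here as at most four selected reviews per reviewer, per the stated constraint) and the one-review-per-application-per-pass behavior, but omits the discard/replacement handling of applications with no eligible reviews; all identifiers and the data layout are our own.

```python
import random
from collections import defaultdict

MAX_PER_REVIEWER = 4  # privacy constraint: at most four selected reviews per reviewer

def select_reviews(applications, rng=None):
    """applications: list of dicts with 'id' and 'reviews' (a list of reviewer ids).
    Returns (reviews chosen per application, number of selections per reviewer)."""
    rng = rng or random.Random(0)
    load = defaultdict(int)      # step 1: reviews selected per reviewer
    chosen = defaultdict(list)   # application id -> reviewers whose reviews were taken
    progress = True
    while progress:              # step 3: repeat passes until nothing is eligible
        progress = False
        for app in applications:                  # one review per application per pass
            eligible = [r for r in app["reviews"]
                        if load[r] < MAX_PER_REVIEWER and r not in chosen[app["id"]]]
            if eligible:
                r = rng.choice(eligible)          # step 2(a)i: select one at random
                chosen[app["id"]].append(r)
                load[r] += 1
                progress = True
    return chosen, load

# Toy example: reviewer A wrote six reviews but can contribute at most four
apps = [{"id": i, "reviews": ["A"]} for i in range(5)] + [{"id": 5, "reviews": ["A", "B"]}]
chosen, load = select_reviews(apps)
print(dict(load))  # reviewer A is capped; the sixth application falls back to B
```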

Random Subset Selection
The random subset selection algorithm was designed to generate as representative a set of reviews of white applicants as possible while respecting the constraint that no more than four reviews from a given reviewer may be in the data set. Its steps are as follows:
1. Randomly select twice the number of available black applications from the full set of white applications (for this data set, there were 1,015 black applications, so 2,030 randomly selected white applications were included). Call this number n (here n = 2,030).
2. For each application, select one eligible review, or discard the application if no reviews are eligible. Repeat until there are no remaining eligible reviews.
3. While the number of applications in the random sample is less than n, randomly select an additional application from the set of white applications that has at least one eligible review and at least one reviewer who reviewed 5 or more applications. Add one to the value associated with the appropriate reviewer key of the dictionary described in step 1 of the matching algorithm. Select reviews from this application as in step 2(a) of the matching algorithm.
A naive approach to random subset selection would replace each application with no eligible reviews with another randomly selected application. However, this would systematically bias the sample toward applications reviewed by less experienced reviewers at a higher rate than in the population. By replacing applications with no eligible reviews with applications that were also reviewed by at least one experienced reviewer (a reviewer with 5 or more reviews), we mitigate this bias.

Coarsened Exact Matching with Exact Matching on a Subset of Covariates
In this section, we prove that exact matching on selected variables is a version of Coarsened Exact Matching (CEM). In this section only, we use the language of potential outcomes and treatment effects for ease of exposition and to reflect the language used in (30). We emphasize that although our analysis relies on matching, it reports conditional associations and does not present estimates of causal effects.
Exact matching on a strict subset of the covariates is CEM with "full coarsening" on the unmatched covariates. One may verify this by checking that the proofs in (30) do not assume that each coarsened variable has at least 2 coarsened levels. We now provide an in-depth example proof for the boundedness of the SATT (Sample Average Treatment Effect on the Treated) estimation error. Equation (7) of (30) states that as long as the true treatment effect is a Lipschitz function of the observed covariates and the maximum width of a coarsening interval (or set, for categorical variables) is $\epsilon_j < \infty$ for the $j$-th covariate, then the SATT absolute estimation error is bounded. For this to hold with exact matching on a covariate subset, we simply require, in addition to the Lipschitz requirement, that the range of each non-matching variable be bounded by some $\epsilon_j$ in an appropriate metric. A bounded range is equivalent to a bounded coarsening interval width, which is nearly always the case in practice; furthermore, if no such $\epsilon_j$ exists, then no coarsening will yield a finite $\epsilon_j$, so the requirement is not restrictive. Explicitly, for continuous non-matching variables we require a finite range; for ordinal non-matching variables we require a finite number of levels; and for categorical non-matching variables we impose no restriction, as the distance between any two levels of a categorical variable can be taken to be one. For the exact matching variables, we have $\epsilon_j = 0$ by definition. These assumptions do not impose much in practice. Let $X_1, \dots, X_k$ be the observed variables, with $X_{i1}, \dots, X_{ik}$ the values for the $i$-th observation. Then let the potential outcome at treatment level 0 be $Y_i(0)$ (in this paper, treatment level zero corresponds to being white) for the $i$-th individual.
Under the ignorability assumption, we can write
$$Y_i(0) = g_0(X_{i1}, \dots, X_{ik}),$$
where we have omitted possible mean-zero noise for ease of exposition (this is justifiable because ignorability guarantees that this noise is independent, so it does not contribute any bias to our estimator; it only adds to its variability). Since we always observe $Y_i(1)$ for treated individuals, our estimate of the treatment effect for the treated is
$$\hat{\tau}_i = Y_i(1) - \widehat{Y_i(0)},$$
and the true treatment effect for the treated is
$$\tau_i = Y_i(1) - Y_i(0).$$
For the difference-in-means estimator, $\widehat{Y_i(0)}$ is also the observed outcome for a matched untreated unit. We then have
$$\hat{\tau}_i - \tau_i = Y_i(0) - \widehat{Y_i(0)} = g_0(X_i) - g_0(\tilde{X}_i),$$
where $\tilde{X}_i$ represents the covariate vector for the observation matched to the $i$-th treated unit, with observed outcome $\widehat{Y_i(0)} = g_0(\tilde{X}_i)$. Taking an average over the $n_T$ treated units, we get
$$\widehat{\mathrm{SATT}} - \mathrm{SATT} = \frac{1}{n_T} \sum_i \left[ g_0(X_i) - g_0(\tilde{X}_i) \right].$$
Now, assume that $g_0$ is Lipschitz in the sense that replacing $X_j$ with any $\tilde{X}_j$ within the coarsened matching caliper of width $\epsilon_j$, holding all other variables fixed, changes the value of $g_0$ by at most $L_j$, for any $j$. Then for any $i$,
$$\left| g_0(X_i) - g_0(\tilde{X}_i) \right| \leq \sum_{j=1}^{k} L_j,$$
and, as desired, it immediately follows that
$$\left| \widehat{\mathrm{SATT}} - \mathrm{SATT} \right| \leq \sum_{j=1}^{k} L_j.$$
For another example, consider Section 4.1 of (30), regarding the maximum imbalance bound. For the non-matching variables, the imbalance bound is simply 1 (although in practice the imbalance is typically much less than 1, as can be seen from Table S3), and it remains true that specifying a coarsening for one variable does not affect the imbalance bound for other variables, because the maximum possible imbalance under the $L_1$ distance is 1. As noted by (30), this property stands in contrast to certain Mahalanobis distance matching methods in which the user demands a certain sample size from the matching, where additionally imposing an upper bound on balance or coarsening for one variable can increase the imbalance bound for other variables.
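The bound can be checked numerically on a toy example with invented data: exact matching on $x_1$ (so $\epsilon_1 = 0$) and full coarsening on $x_2$ with range 1, with $g_0$ Lipschitz with constant $L_2 = 2$ in $x_2$, so the absolute SATT estimation error cannot exceed $L_2 \cdot 1 = 2$.

```python
import numpy as np

def g0(x1, x2):
    """Untreated potential-outcome surface; Lipschitz constant 2 in x2."""
    return 3.0 * x1 + 2.0 * x2

rng = np.random.default_rng(1)
n = 200
x1_t = rng.integers(0, 3, n)     # exact-matching variable (epsilon = 0)
x2_t = rng.uniform(0, 1, n)      # fully coarsened variable, range 1
x2_c = rng.uniform(0, 1, n)      # matched controls share x1, differ in x2

tau = 1.0                        # constant true treatment effect
y1 = g0(x1_t, x2_t) + tau        # observed treated outcomes
y0_hat = g0(x1_t, x2_c)          # matched-control outcomes (same x1, arbitrary x2)

satt_hat = np.mean(y1 - y0_hat)
err = abs(satt_hat - tau)        # |SATT_hat - SATT|
print(round(err, 3))             # always within the bound L2 * eps2 = 2
```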

Balance Analysis
The goal of matching is to increase balance between treated and untreated units. After matching, we check balance on the application- and applicant-specific covariates between black and white applicants. Our measure of balance is $L_1$ overlap, or one minus the total variation distance. Rather than simply assessing differences in means and standard deviations by covariate, the $L_1$ overlap measures how different the entire empirical distributions of the variables are between the black and white subsets. The $L_1$ approach to measuring overlap, recommended by (30), is superior because the entire distribution of a covariate is of concern when a model is misspecified, and model misspecification is one of the main concerns matching is designed to address. Table S3 displays the $L_1$ overlap (on a zero-to-one scale) for the random and matched subsets, as well as the percentage increase in overlap for the matched subset. Exact matching variables are in bold. Note that overlap for exact matched variables may not be exactly 1 because the matching is not strictly one-to-one, so the distributions in the white and black matched subsets may differ slightly. Overall, the overlap noticeably improved for the 8 exact matching covariates and improved moderately for most other covariates. Note that the overlap for the Institution Sector and Institution Lookup variables declined slightly after matching. While this is not a major concern because the overlap is still quite high, we note that CEM does not guarantee that imbalance will improve on every covariate, merely that there is an upper bound on imbalance for each covariate.
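For a single coarsened covariate, the $L_1$ overlap reduces to one minus the total variation distance between the two empirical distributions. A minimal sketch (the function name and toy data are our own):

```python
import numpy as np

def l1_overlap(a, b, levels):
    """L1 overlap between two samples of a categorical covariate.
    a, b: sequences of observed levels; levels: the covariate's support."""
    pa = np.array([np.mean(np.asarray(a) == v) for v in levels])
    pb = np.array([np.mean(np.asarray(b) == v) for v in levels])
    tv = 0.5 * np.abs(pa - pb).sum()   # total variation distance
    return 1.0 - tv

# Identical empirical distributions overlap perfectly:
print(l1_overlap(["M", "F"], ["F", "M"], ["M", "F"]))   # 1.0
# Disjoint distributions have zero overlap:
print(l1_overlap(["M", "M"], ["F", "F"], ["M", "F"]))   # 0.0
```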

Multilevel Modeling
This section provides a description of the multilevel (hierarchical) linear models, starting with the model equations given in the Materials and Methods section of the main paper. We specified linear models for preliminary overall impact scores at the review level, relying on the NIH review structure and distinguishing between "structural variables" and other covariates that could potentially be associated with preliminary overall impact scores. IRG, SRG, and administering institute, as well as reviewer and PI indicators, are structural variables, as they represent various levels of clustering in the data. The structure, as illustrated in Figure 1 of the main paper, is as follows. Reviews are clustered in a mixed hierarchy: reviews are nested within applications, which are nested within PIs. But reviewers can review multiple PIs just as PIs are reviewed by multiple reviewers: reviewer and PI are "crossed." Applications are nested within SRG, IRG, and administering institute, but PIs are not: over 200 PIs had applications reviewed in more than one SRG, IRG, or administering institute. All SRGs are nested within IRG, and IRG and SRG are both crossed with administering institute. All special emphasis panels within an IRG were modeled as a single study section. All of our models account for structural dependencies in the data via random and fixed effects for the structural variables. Random intercepts are appropriate when the observed values of a clustering variable can be regarded as a sample from some larger population, whereas fixed effects are appropriate when the levels of a clustering variable are considered fixed and known. To this end, we model fixed effects for IRG and administering institute, because these are large, well-established units that are unlikely to change over time. We fit random intercepts for SRGs, because SRGs are created and disbanded routinely (and as a result of a Hausman test, discussed subsequently).
PIs, applications, and reviewers all merit random effects because the observed values of these variables are samples from larger populations of PIs, applications, and reviewers. Application random intercepts were excluded from the models because they were redundant (i.e., estimated to have zero variability) once PI random intercepts were included.

Model Specifications
Let $Y_{ijklm}$ be the preliminary overall impact score for the $i$th review of the $j$th application from the $k$th PI (reviewed by the $l$th reviewer in the $m$th SRG), $R_k$ a race indicator (1 indicates a black PI), and $X_{jk}$ a vector of application- and applicant-specific control variables. To estimate racial disparities, we consider the following mixed effects model formulation:
$$Y_{ijklm} = \alpha + \beta_R R_k + \beta^\top X_{jk} + \gamma_k + \xi_l + \eta_m + \epsilon_{ij},$$
where $\alpha$ is the model intercept; $\beta_R$ is the race coefficient; $\beta$ is the vector of coefficients for control variables; $\gamma_k$, $\xi_l$, and $\eta_m$ are random intercepts for PI, reviewer, and SRG; and the $\epsilon_{ij}$ are within-application independent Gaussian error terms. Here, only SRG could have potentially been specified as a fixed effect. Conducting the Hausman test (51) for specification of the SRG effects, we conclude that the random effects specification is plausible; additionally, it aligns well with our substantive knowledge that SRGs are occasionally disbanded or created over time and can thus be thought of as coming from a hypothetical "population of SRGs." We examine estimates of the race coefficient $\beta_R$ from a series of models: first adjusting only for the structural covariates, then also including applicant- and application-level characteristics and preliminary criterion scores among the control covariates $X$.
To study commensuration practices, we focus on interaction effects between race and the preliminary criterion scores. Let $Z_{ij}$ be the vector of criterion scores associated with the $i$th review of the $j$th application. The linear commensuration model for the preliminary overall impact score $Y_{ijklm}$ of the $i$th review of the $j$th application from the $k$th PI (reviewed by the $l$th reviewer in the $m$th SRG) is specified by
$$Y_{ijklm} = \alpha + \beta_R R_k + \beta_C^\top Z_{ij} + \beta_I^\top (R_k Z_{ij}) + \beta^\top X_{jk} + \gamma_k + \xi_l + \eta_m + \epsilon_{ij},$$
where $\alpha$ is the model intercept; $\beta_R$ is the race coefficient; $\beta_C$ is a vector of preliminary criterion score coefficients; $\beta_I$ is the vector of coefficients for the interactions between race and the preliminary criterion scores ("commensuration coefficients"); $\beta$ is the vector of coefficients for the control variables $X_{jk}$; $\gamma_k$, $\xi_l$, and $\eta_m$ are random intercepts for PIs, reviewers, and SRGs; and the $\epsilon_{ij}$ are within-application independent Gaussian error terms. For commensuration models, the control variables $X$ include structural and applicant- and application-level characteristics. Note that because the focus of our analyses is on the relationship between preliminary criterion and preliminary overall impact scores, and because we have no reason to think that reviewers' consideration of preliminary criterion scores in assigning the preliminary overall impact score would differ by application type (Type 1 or Type 2, new submission or resubmission), we use data for all application types in our models and control for application type with fixed effects.
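The fixed-effects part of the commensuration model can be illustrated numerically. All coefficient values below are invented for illustration (they are not our estimates), and we assume the five criterion scores are ordered Significance, Investigator, Innovation, Approach, Environment:

```python
import numpy as np

alpha = 2.0
beta_R = 0.05                                        # race coefficient (hypothetical)
beta_C = np.array([0.45, 0.15, 0.10, 0.15, 0.10])    # criterion coefficients (hypothetical)
beta_I = np.array([-0.03, 0.01, 0.02, 0.04, -0.01])  # race-by-criterion interactions (hypothetical)

def predicted_score(R, Z, X_part=0.0):
    """Fixed-effects linear predictor; random intercepts average to zero."""
    return alpha + beta_R * R + beta_C @ Z + beta_I @ (R * Z) + X_part

Z = np.array([2.0, 2.0, 2.0, 3.0, 2.0])              # one hypothetical review's criteria
gap = predicted_score(1, Z) - predicted_score(0, Z)  # black-minus-white difference
print(round(gap, 3))
```

For otherwise identical applications, the predicted difference is $\beta_R + \beta_I^\top Z$, so its sign and size depend on the criterion profile $Z$, which is exactly why the race coefficient cannot be interpreted in isolation in a model with interactions.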

Hausman Test for SRG E↵ects
Our models may account for clustering by SRG using either fixed or random effects. Random effects have the benefit of aligning with our substantive knowledge that SRGs at NIH are not always fixed but may appear or disappear over time; random effects models use fewer degrees of freedom; and, if the assumption of no endogeneity holds, in that there is no correlation between the random intercepts and model residuals, fixed effects estimates from random effects models are asymptotically efficient. However, if endogeneity is present, the random effects model estimates for the fixed effects (the coefficients of interest in this paper) are inconsistent, while those of the fixed effects model are still consistent.
The null hypothesis for the Hausman test is that the fixed-effects coefficients are consistent in both the SRG random effects and SRG fixed effects models, and consequently that the SRG random effects model estimates are efficient. It is shown in (51) that the covariance between an asymptotically efficient estimator and its difference from a different consistent but inefficient estimator is asymptotically zero. A simple chi-squared test can be constructed from this result with the following statistic and null distribution:
$$H = \left( \hat{\beta}_{RE} - \hat{\beta}_{FE} \right)^\top \left[ \widehat{\mathrm{Var}}\left( \hat{\beta}_{FE} \right) - \widehat{\mathrm{Var}}\left( \hat{\beta}_{RE} \right) \right]^{-} \left( \hat{\beta}_{RE} - \hat{\beta}_{FE} \right) \sim \chi^2_p,$$
where RE stands for random effects and FE for fixed effects, $p$ is the number of fixed-effect coefficients estimated in the model (and the number of degrees of freedom of the chi-square distribution), and the inverse is a pseudo-inverse. Because under local alternatives to the null (i.e., slight model misspecification) the test statistic has a noncentral chi-square distribution (i.e., a larger expected value), we reject the null if the test statistic is too large. For the matched subset analysis of the full data set (the analysis presented in the main text), this statistic was 28.76 on 115 degrees of freedom, with p-value approximately 1. The test fails to reject the null hypothesis of no endogeneity, and since random effects align well with our substantive knowledge of the SRGs, we elect to fit SRG random effects. In practice, both random effects and fixed effects models lead to the same substantive conclusions and very similar coefficient estimates for our data.
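The statistic can be computed directly from the two sets of coefficient estimates and their covariance matrices; the sketch below uses invented inputs and a Moore-Penrose pseudo-inverse, as in the text, and the resulting $H$ would then be compared to a $\chi^2_p$ critical value.

```python
import numpy as np

def hausman_statistic(b_re, b_fe, V_re, V_fe):
    """H = d' [Var(b_FE) - Var(b_RE)]^+ d with d = b_RE - b_FE;
    compare H to a chi-square with p = len(d) degrees of freedom."""
    d = b_re - b_fe
    return float(d @ np.linalg.pinv(V_fe - V_re) @ d)

# Invented estimates from hypothetical RE and FE fits (p = 2)
b_re = np.array([0.10, -0.20]); V_re = np.diag([0.01, 0.01])
b_fe = np.array([0.12, -0.18]); V_fe = np.diag([0.02, 0.02])

H = hausman_statistic(b_re, b_fe, V_re, V_fe)
print(round(H, 3))  # a small H fails to reject: random effects are plausible
```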

Model Diagnostics
We assess the fit of our commensuration model (from the matched subset analysis) to the data to ensure that hierarchical linear mixed-effects model assumptions are satisfied. For mixed-effects models, residual analysis constitutes the bulk of the diagnostics. There are three main types of residuals in mixed-effects models: conditional residuals, marginal residuals, and BLUPs (best linear unbiased predictors). If $y$ is the outcome, $X$ the observed covariates, $\hat{\beta}$ the estimated fixed-effect coefficients, and $Z\hat{u}$ the best linear unbiased predictor of the random effects, then:
- the conditional residuals are $e_c = y - X\hat{\beta} - Z\hat{u}$,
- the marginal residuals are $e_m = y - X\hat{\beta}$, and
- the BLUPs are $Z\hat{u} = e_m - e_c$.
Any one of these residuals can be computed from the other two; hence only two types of residuals need to be examined. For our analyses, we examined normal quantile-quantile plots for the conditional residuals and for the BLUPs of the three random intercepts included in the commensuration model (matched subset analysis) (plots not shown). The conditional residuals and BLUPs displayed wider tails than a normal distribution, indicating some excess kurtosis, but not enough to raise concerns given the large sample size and the robustness of linear regression to deviations from residual normality. Furthermore, neither the conditional residuals nor the BLUPs displayed evidence of heteroscedasticity or dependence on the main covariates of interest (i.e., race, the preliminary criterion scores, requested costs, which is the only purely continuous covariate, and terminal degree year, the only ordinal covariate aside from the preliminary criterion scores). For both residual types, residual analysis plots indicated that the assumptions of homoscedasticity, independence between residuals and covariates, and approximate Gaussianity are reasonable, and that our model estimates are valid under the proposed class of hierarchical linear mixed-effects models.
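The identity among the three residual types is easy to verify numerically; the vectors below are invented stand-ins for the fitted components of a mixed model.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
y  = rng.normal(size=n)   # outcomes
Xb = rng.normal(size=n)   # fitted fixed-effects part X @ beta_hat (stand-in)
Zu = rng.normal(size=n)   # predicted random-effects part Z @ u_hat (stand-in)

e_c = y - Xb - Zu         # conditional residuals
e_m = y - Xb              # marginal residuals

# Any one of the three quantities is determined by the other two:
assert np.allclose(Zu, e_m - e_c)
```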

Commensuration Practices
To interpret the magnitude of the estimated commensuration coefficients, we examine estimated expected differences in scores for black versus white applications (Figure S1) and a hypothetical example of differences in predicted preliminary overall impact scores for given preliminary criterion scores between black and white applications (Table S4). Figure S1 shows the expected change in preliminary overall impact score for all black applications if their criterion scores were commensurated into overall impact scores as if they were applications submitted by white PIs with otherwise identical values for the observed application- and applicant-specific covariates. For just 15% of black applications would we expect an otherwise identical (on the observed covariates) white application to score differently by at least 0.1 points (better, 11%; worse, 4%) in the preliminary overall impact score due to commensuration differences. A difference of 0.1 points in preliminary overall impact score is not large relative to the variability due to other sources. As explained in the main paper, a change in an application's overall impact score of 0.3 points is substantial because it could tangibly affect funding decisions. At the same time, we point out that 0.3 points is similar in magnitude to the estimated standard deviation due to reviewer variability, and is about half the estimated residual standard deviation after controlling for preliminary criterion scores (Models 3 and 4, Table 5).
Figure S1: Distribution of estimated expected preliminary overall impact score differences due to commensuration (histogram) and distributions of reviewer intercepts and model residuals (colored lines), under the matched subset commensuration model (Table 6). Histogram and densities have been scaled to have a common maximum for ease of visualizing differences in variability.
Based on the commensuration model for the matched subset, 15% of all applications can expect the preliminary overall impact score to differ by at least 0.1 points from the average score due to random reviewer variability, and 86% due to residual variability that is not explained by the model. No black application would expect to see a score difference greater than 0.3 points because of commensuration practices as estimated on the matched subset (Figure S1). The population-weighted expected average difference in preliminary overall impact score between black and white applications, conditional on the control covariates, is 0.004, which is practically negligible. This quantity is small because the interaction coefficient estimates are balanced between positive and negative values and all the preliminary criterion scores are positively correlated. Combined, these two facts lead to a "cancelling" phenomenon in which the estimated expected difference in preliminary overall impact score is small for the vast majority of applicants.
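The cancelling phenomenon can be illustrated with a toy simulation: when the interaction coefficients are balanced in sign (summing to roughly zero) and the criterion scores share a common positive factor, the per-review expected difference $\beta_I^\top Z$ concentrates near zero. All values below are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
beta_I = np.array([0.05, -0.04, 0.03, -0.05, 0.01])  # sign-balanced; sums to 0

# Positively correlated criterion scores via a shared latent quality factor
latent = rng.normal(size=(2000, 1))
Z = 3.0 + latent + 0.5 * rng.normal(size=(2000, 5))

diffs = Z @ beta_I   # per-review expected score difference due to commensuration
mean_abs = float(np.mean(np.abs(diffs)))
print(round(mean_abs, 3))  # the shared factor cancels; only small noise remains
```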
In addition, to illustrate the potential impact of commensuration differences on the preliminary overall impact score, we consider two situational pairs, each of a hypothetical black applicant and a hypothetical white applicant (Table S4). Here, we pick scenarios in which the discrepancies between preliminary Approach and Significance/Innovation scores are extreme but still plausible: each combination of criterion scores occurs in our dataset for both white and black applicants. The "Innovative" preliminary criterion score combination occurs twice in the set of reviews of black applications and once in the set of reviews of white applications; the "Thorough" combination occurs twice in the set of white applicants and twice in the set of black applicants. In each hypothetical scenario, we assume the two applicants and applications are identical on all observed covariates except race.
Table S4: Hypothetical preliminary criterion score scenarios; the "Innovative" scenario has relatively high Innovation and Significance scores and a low Approach score, and vice versa for the "Thorough" scenario.
Under the "Innovative" scenario, the application review scores indicate that the proposed research is innovative and significant, but that the approach is sub-par. Based on our matched subset commensuration analysis, in such a scenario the matched white researcher's score would be 0.12 points better than the black researcher's score (p < 0.005). This difference occurs because, on average, reviewers weigh the preliminary Approach score more heavily for black applicants than for matched whites, and the preliminary Innovation and Significance scores less heavily. Conversely, under the "Thorough" scenario, in which the research proposals are scored as rigorous but not significant or innovative, our model predicts that the black applicant will, on average, receive an impact score 0.20 points better than the white researcher (p < 0.005). As noted earlier, these differences are small in magnitude compared to the reviewer random effect variability or residual variability.

Random Subset Analyses
While we emphasize the results of our matched subset analyses as less susceptible to model specification, we also performed random subset analyses in the interests of comparison and to allay concerns that our matching design was suboptimal. The main results of these analyses are presented alongside the matched subset analyses below in Tables S5 and S6. The conclusions one draws from the random subset analysis are largely the same as those of the matched subset analysis, given the very similar coe cient estimates and significance levels. One point of di↵erence is that the race coe cient in the random subset commensuration analysis is statistically significant, while it is not for the matched subset analysis. However, because the commensuration model includes interaction coe cients between PI race and the criterion scores, this coe cient cannot be interpreted on its own. Figure S2 shows the expected change between black and white applications due to commensuration practices as estimated with the random subset analysis. As with our main analysis (Figure S1), estimated expected di↵erences in preliminary overall impact score of 0.1 points or more as a result of commensuration di↵erences are rare. The population-weighted expected average di↵erence in preliminary overall impact score between black and white applications, conditional on the control covariates, is 0.038 (in favor of white applications) under the random subset analysis which is similar to that of 0.004 under the matched subset analysis. These expected changes in preliminary overall impact scores due to commensuration practices are practically negligible. Figure S2: Distribution of estimated expected preliminary overall impact score di↵erences due to commensuration (histogram) and distributions of reviewer intercepts and model residuals (colored lines), under the random subset commensuration model (Table S6). 
Histogram and densities have been scaled to have a common maximum for ease of visualizing differences in variability.

Table S5: Race coefficient estimates and their effect sizes; preliminary criterion score fixed effects and their standard errors; and variance components estimates from four hierarchical linear models for preliminary overall impact scores fit on n = 7471 reviews of 2566 applications (matched subset) and n = 8595 reviews of 3045 applications (random subset). Model 1 controls for structural covariates; Model 2 controls for structural and applicant-/application-specific covariates; Model 3 controls for structural covariates and preliminary criterion scores; Model 4 controls for structural, applicant-/application-specific covariates, and preliminary criterion scores. Control variables are listed in Table 3 of the main paper. Coefficient estimates for control variables are not shown. Significance * is reported for the race fixed effect estimate for p < .005. In mixed effects models, multiple effect sizes exist for a given coefficient; we report the coefficient divided by the residual standard deviation. For more information, see (49).

Table S6: Preliminary criterion, race, commensuration (race-criterion interaction) coefficients and variance components estimates for preliminary overall impact scores on n = 7471 reviews of 2566 applications (matched subset) and n = 8595 reviews of 3045 applications (random subset). Coefficient estimates for control variables, which include structural and applicant-/application-specific covariates as listed in Table 3 of the main paper, are not shown. Significance * is reported for p < .005.
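The population-weighted expected average difference discussed above averages, over applications, the per-application change in predicted overall impact score implied by the race-by-criterion interactions. A minimal sketch of that computation, using synthetic criterion scores and assumed (not the study's) coefficients:

```python
import numpy as np

# Synthetic criterion scores for a population of applications (NIH 1-9 scale).
rng = np.random.default_rng(0)
crit = rng.integers(1, 10, size=(1000, 3)).astype(float)

# Hypothetical race main effect and race-by-criterion interaction coefficients.
beta_race = 0.0
gamma = np.array([0.05, -0.03, -0.03])

# Expected change in preliminary overall impact score for each application if
# its criterion scores were commensurated as if the PI were of the other race.
per_application = beta_race + crit @ gamma

# Population-weighted average across applications.
avg_change = float(per_application.mean())
```

Figures S1 and S2 are, in effect, histograms of `per_application`-style quantities estimated from the fitted models, with `avg_change` the headline summary.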

Post-Discussion Scores
This analysis is restricted to applications that reached the SRG discussion stage. Not all reviewers change their criterion and overall impact scores after discussion (Table S7). Among post-discussion reviews, 20% saw a change in both the overall impact score and at least one criterion score, 27% saw a change in the overall impact score but not in the criterion scores, 4% saw a change in the criterion scores but not in the overall impact score, and 49% saw no change in any scores. Thus only 20% of reviewers revised both the criterion scores and the overall impact score based on the discussion (sometimes a reviewer will decide that their assessment remains correct after the discussion, or that changes to criterion scores do not merit a change in the overall impact score).

Table S7: Score change behavior after discussion; discussed reviews only.
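The four categories above partition discussed reviews by whether the overall impact score and any criterion score changed. A minimal sketch of that classification (the function and field names are illustrative, not from the study's code):

```python
def change_category(pre_impact, post_impact, pre_criteria, post_criteria):
    """Classify a discussed review into one of the four change categories.

    pre_criteria and post_criteria are equal-length sequences of the
    review's criterion scores before and after discussion.
    """
    impact_changed = pre_impact != post_impact
    criteria_changed = any(a != b for a, b in zip(pre_criteria, post_criteria))
    if impact_changed and criteria_changed:
        return "impact and criteria changed"
    if impact_changed:
        return "impact only"
    if criteria_changed:
        return "criteria only"
    return "no change"
```

For example, a reviewer who moves their impact score from 3 to 2 while leaving all criterion scores at 3 falls in the "impact only" group, the 27% category above.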
With the caveats stated above, we replicate the racial disparity analysis for final overall impact scores (Table S8). The results are largely the same: controlling for final criterion scores accounts for essentially all racial disparities in final overall impact scores. Note, though, that the racial disparities observed without controlling for final criterion scores are not as large as those for preliminary scores.

Reproducibility
Because of the sensitive nature of individual-level data, a reduced data set that contains the same reviews but fewer covariates is available for public use. This public-use data set includes all of the covariates of interest (applicant race, preliminary criterion and overall impact scores), the structural covariates (PI ID, application ID, reviewer ID, administering institute ID, IRG ID, and SRG ID), the matching variables (contact PI's gender, ethnicity, career stage, degree type, institution's NIH funding bin, application type, application's amended status, and the area of science represented by the Integrated Review Group), as well as the final overall impact score. We provide the URL of the public-use data repository in the Acknowledgements section of the main text.
Here, we reproduce the results of the multilevel analysis of racial disparities in preliminary overall impact scores from Table 5, and of commensuration practices from Table 6, for the matched and random subsets, using the public-use data set. We also reproduce Figures S1 and S2, which show the expected change in preliminary overall impact score for all black applicants if their preliminary criterion scores were commensurated into preliminary overall impact scores as if they were white, using the public-use reduced-covariates data set.
Racial disparities: Table S9 presents multilevel modeling results from the public-use data that are analogous to those reported in Tables 5 and S5. We find that the race coefficient estimates from Models 1 and 2 (which do not control for preliminary criterion scores) obtained from the public-use data are positive, statistically significant, and very similar in magnitude to those reported in Tables 5 and S5. Once preliminary criterion scores are included (Models 3 and 4), the race coefficient estimates obtained from the matched and random subsets of the public-use data set (Table S9) become practically and statistically insignificant, similar to our results from the matched (Table 5) and random (Table S5) subsets of the confidential data set. These results are consistent with our interpretation of racial disparities in the main paper. While the main results concerning the race coefficient estimates are strikingly similar between the confidential and the public-use data sets, we note that the random intercept variability for PIs and SRGs is somewhat larger for Model 2 fit to the matched subset of the public-use data set (Table S9) than for Model 2 fit to the matched subset of the confidential study data set (Table 5). This is because fewer covariates are available in the public data set to explain PI variability.
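The qualitative pattern across Models 1-4 (a race coefficient that attenuates to near zero once criterion scores are controlled) can be illustrated on synthetic data. This is a toy OLS sketch, not the paper's hierarchical model; all coefficients and data below are simulated and carry no empirical content.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
black = rng.binomial(1, 0.02, n).astype(float)

# Simulate a disparity carried entirely by a criterion score: black PIs
# receive worse (numerically higher) Approach scores, and the overall
# impact score depends only on Approach, not directly on race.
approach = 4.0 + 1.0 * black + rng.normal(0.0, 1.0, n)
impact = 1.0 + 0.8 * approach + rng.normal(0.0, 0.5, n)

def ols(X, y):
    """Least-squares coefficient vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# Model 1 analogue: race only.
gap_uncontrolled = ols(np.column_stack([ones, black]), impact)[1]
# Model 3 analogue: race plus the criterion score.
gap_controlled = ols(np.column_stack([ones, black, approach]), impact)[1]
```

In this simulation the uncontrolled race coefficient sits near 0.8 (the criterion gap times its weight), while the controlled coefficient is near zero, mirroring the attenuation from Models 1/2 to Models 3/4 in Tables 5, S5, and S9.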
Commensuration practices: Table S10 contains the relevant parameter estimates from the linear commensuration models fit to the public data for both the matched and random white subsets. For the matched subset, which is less susceptible to model misspecification (31)(32), the signs and magnitudes of the interaction coefficient estimates are strikingly similar to our main results on commensuration practices from Table 6. For the random subset, we note that the pattern of significant commensuration practices coefficients changes slightly (compare with Table S6). However, as Figures S3 and S4 demonstrate, the combined extent and magnitude of commensuration differences across all preliminary criterion scores remains small: expected differences for black applications in the preliminary overall impact score of 0.1 or more as a result of commensuration practices are rare. This finding is consistent across all our analyses: the random and matched subsets of both the confidential and the reduced-covariate public-use data sets.

Table S9: Public-use data set: Race coefficient estimates, their effect sizes, and variance components estimates from four hierarchical linear models for preliminary overall impact scores fit on n = 7471 reviews of 2566 applications (matched subset) and n = 8595 reviews of 3045 applications (random subset). Model 1 controls for structural covariates; Model 2 controls for structural and matching covariates; Model 3 controls for structural covariates and criterion scores; Model 4 controls for structural, matching covariates, and criterion scores. Coefficient estimates for control variables are not shown. Significance * is reported for p < .005.

Table S10: Public-use data set: Preliminary criterion, race, commensuration (race-criterion interaction) coefficients, and variance components estimates for preliminary overall impact scores on n = 7471 reviews of 2566 applications (matched subset) and n = 8595 reviews of 3045 applications (random subset). Control variables (coefficient estimates are not shown) are the matching variables. Significance * is reported for p < .005.

Figure S3: Public-use data set: Distribution of estimated expected preliminary overall impact score differences due to commensuration (histogram) and distributions of reviewer intercepts (red line) and model residuals (blue line), under the matched subset commensuration model (Table S10).

Figure S4: Public-use data set: Distribution of estimated expected preliminary overall impact score differences due to commensuration (histogram) and distributions of reviewer intercepts (red line) and model residuals (blue line), under the random subset commensuration model (Table S10).