Research Article: SOCIAL SCIENCES

Cross-disciplinary evolution of the genomics revolution

Science Advances  15 Aug 2018:
Vol. 4, no. 8, eaat4211
DOI: 10.1126/sciadv.aat4211

Abstract

Born out of the Human Genome Project (HGP), the field of genomics evolved with phenomenal speed into a dominant scientific and business force. While other efforts were intent on estimating the economic impact of the genomics revolution, we shift focus to the social and cultural capital generated by bridging biology and computing, two of the constitutive disciplines of “genomics.” We quantify this capital by measuring the pervasiveness of bio-computing cross-disciplinarity (XD) in genomics research during and after the HGP. To provide interlocking perspectives at the career and epistemic levels, we assembled three data sets to measure XD via (i) the collaboration network between 4190 biology and computing faculty from 155 departments in the United States, (ii) cross-departmental affiliations within a comprehensive set of human genomics publications, and (iii) the application of computational concepts and methods in research published in a preeminent genomics journal. Our results show the following: First, research featuring XD collaborations has higher citation impact than other disciplinary research, an effect observed at both the career and individual article levels. Second, genomics articles featuring XD methods tend to have higher citation impact than epistemically pure articles. Third, XD researchers of computing pedigree are drawn to the biology culture. This statistical evidence acquires deeper meaning when viewed against the organizational and knowledge transfer mechanisms revealed by the data models. With cross-disciplinary initiatives set to dominate the agenda of funding agencies, our case study provides a framework for appreciating the long-term effects of these initiatives on science and its standard-bearers.

INTRODUCTION

With the coming of the 21st century, biology has emerged as the vanguard of the scientific enterprise and computing as the epicenter of engineering and technology (1). These two fields were primed as complements in the Human Genome Project (HGP). The successful completion of the HGP in the early 2000s ushered in the genomics revolution that continues to transform our capacity to understand, predict, and edit life (2–5).

A significant amount of work has focused on estimating the impact of the HGP on human health and the return on investment in the U.S. economy (6–8). These efforts proved surprisingly difficult, highlighting the broader challenge of evaluating the socioeconomic impact of science policy (9, 10).

Instead of focusing on economic and health outcomes, we focus here on the evolution of social and cultural capital within the genomics revolution. To this end, we build on scholarship of epistemic (11) and network analysis (12–16) to quantify the factors and career incentives that contribute to the formation of new fields (17–19) in a team science context (20–25).

The HGP (1990–2003) was a singular opportunity for scientists from several disciplines, with biology and computing being prominent among them. For this reason, this project serves as a rich case study for science of science (26) to investigate the social and behavioral elements underlying cross-disciplinary research. Consequently, we adopt a mixed-methods analytic approach that focuses not only on the epistemic level but also on the scholar level.

Specifically, we begin with a network analysis, focused on U.S. academia, of the administratively invisible or informal cross-disciplinary biology-computing collaborations that we dub a “college,” to illustrate the explosion of the cross-disciplinary population parallel to, and also in the wake of, the HGP. By cross-disciplinary population, we refer to faculty from biology and computing who achieve their research objectives via collaboration across this disciplinary boundary. Upon further inspection, we find that the overwhelming majority (90%) of the faculty forming this biology-computing bridge have been active in genomics research. Building on this insight from the descriptive analysis, we then apply cross-sectional regression to the 4190 faculty in our data set, showing a positive correlation between one’s inclination toward cross-disciplinary collaboration and total career citation impact. To pinpoint the source of this lifetime advantage, we implement a longitudinal panel regression, demonstrating that, within each scholarly career, the cross-disciplinary publications have significantly higher citation impact than the disciplinary ones.

As this scholar-centered result is based on a U.S. academic data set, to test for its broader relevance, we analyze a comprehensive set of human genomics publications from the international literature. We find that, in this set, publications with joint authorship from biology and computing scholars also have significantly higher citation impact than publications with biology authorship only.

Finally, we analyze cross-disciplinarity from an epistemological perspective, operationalizing the mixing of domain knowledge rather than the mixing of scholars. To be specific, we use publication-level keywords to identify the methods applied within research articles from a leading genomics journal. Our results demonstrate that articles with an explicit computational component have significantly higher citation impact than articles without an explicit computational component.

Together, these outcomes show that cross-disciplinarity in genomics is pervasive and impactful, creating upward mobility in morphed careers and generating dominant hybrid intellectual and social capital that have persisted long after the end of the HGP. The legacy of the HGP also survives in organizational capital that is fundamental to “consortium science,” whereby teams of teams organize around central challenges, with a common goal to share benefits equitably within and beyond institutional boundaries.

RESULTS

Cross-disciplinarity in genomics careers

Career data set. We anchor our analysis on individual scholars so that we can control for scholar-specific attributes and account for career-level decisions regarding one’s orientation toward cross-disciplinary activity. A principal challenge in this endeavor was that there were no complete, validated, and publicly available data sets for the biology-computing college. Hence, we synthesized a large longitudinal data set using criteria that could, in principle, be generalized to other case studies.

We focused on the biology-computing college in the United States, because this is the source of the HGP, and it was a task that we could practically complete. Our entry point for classifying scholars was through their primary academic affiliation. Specifically, we accessed the websites of 155 biology and computing departments in the United States (table S1), matching the faculty listings to individual Google Scholar (GS) profiles. We consider the primary departmental affiliation of each faculty scholar to be a lifelong disciplinary trait, because while faculty may change institutions, the likelihood that they change from a computing department to a biology department, or vice versa, is very low. We verified this premise by examining publicly available curricula vitae (CVs) for many of the faculty.

As such, from this point forward, we will refer to these biology (n = 2077) and computing (n = 2113) faculty as F, indexed by i (n = 4190 in total). We carefully examined the publication profile of each $F_i$, removing spurious content (for example, articles that did not include the respective scholar’s name). This process yielded a total of 413,565 publications that were incorporated into the faculty career data set.

Collaboration network and analytical framework for the career data set. To build the collaboration network of the biology-computing college, we inspected each disambiguated $F_i$ profile for instances of direct collaboration with another $F_{i'}$; as a result, we identified 3900 $F_i$ who collaborated with at least one other $F_{i'}$, forming 16,799 links. Within the subset of connected $F_i$, the size of the largest (“giant”) connected component was 3869. Therefore, just 7.6% of the $F_i$ were not part of the largest connected component, and only 6.9% of the $F_i$ were completely disconnected. A description of the name disambiguation method and a thorough investigation of the network’s structural properties can be found in sections S1 and S2, respectively.

We operationalized cross-disciplinarity by investigating both the direct collaborations and indirect associations within the F network, both of which are important to knowledge transfer and science development. In more detail, the network’s nodes can be connected via two types of links, as illustrated in Fig. 1: “Direct collaboration” refers to a link between two faculty $F_i$ and $F_{i'}$ who appear together in at least one publication, and “mediated association” refers to a link between two faculty $F_i$ and $F_{i'}$ who are indirectly associated via a common non-F coauthor. This non-F coauthor creates the link via “triadic closure” between the two F. Because many published researchers are not faculty in one of the 155 listed departments, the typical $F_i$ has many more mediated associations than direct collaborations with other faculty in our data set (fig. S2).
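The two link types can be illustrated with a minimal Python sketch. The function name `build_links`, the input layout (one author list per publication), and the toy names are hypothetical, not taken from the study. As a simplifying assumption, the sketch counts any shared non-F coauthor as a mediator, even when the two faculty also appear on the same paper.

```python
from itertools import combinations

def build_links(publications, faculty):
    """Return (direct, mediated) sets of faculty pairs.

    publications: iterable of author-name lists, one list per paper.
    faculty: set of names that belong to the faculty set F.
    """
    direct = set()
    faculty_of = {}  # non-F coauthor -> set of faculty they published with
    for authors in publications:
        f_authors = sorted(set(authors) & faculty)
        non_f = set(authors) - faculty
        # Direct collaboration: two faculty appear on the same paper.
        for a, b in combinations(f_authors, 2):
            direct.add((a, b))
        for p in non_f:
            faculty_of.setdefault(p, set()).update(f_authors)
    # Mediated association: two faculty share a common non-F coauthor.
    mediated = set()
    for fac in faculty_of.values():
        for a, b in combinations(sorted(fac), 2):
            mediated.add((a, b))
    return direct, mediated

# Toy example: A and B coauthor directly; A and C are linked only via p2.
pubs = [["A", "B", "p1"], ["A", "p2"], ["C", "p2"]]
direct, mediated = build_links(pubs, faculty={"A", "B", "C"})
```

In the toy example, the pair (A, C) has no direct link but gains a mediated one through the shared coauthor p2, which is the triadic-closure case described above.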

Fig. 1 Construction of the F network.

Schematic network, serving as an instructive example of our method for classifying the faculty $F_i$, their pollinator coauthors $P_j$, and the links between them. The network corresponds to the table on the right. Two types of links connect the faculty nodes: a direct link (solid line) if $F_i$ and $F_{i'}$ are coauthors of at least one publication together, and a mediated link (dashed line) if there is at least one $P_j$ that has coauthored separately with $F_i$ and $F_{i'}$, thereby mediating a triadic closure between the two F. We classified each $F_i$ according to her/his main discipline, BIO (biology) or CS (computing), unless they have collaborated with at least one $F_{i'}$ from the other discipline, in which case the classification XD supersedes their original disciplinary classification. We classified the non-F coauthors $P_j$ as bridge pollinators if they coauthored with two or more faculty; otherwise, these $P_j$ are classified as leaf pollinators. Among the bridge pollinators, we classified those $P_j$ who coauthor with faculty from both biology and computing as cross-pollinators. Thus, the solid link connecting A-B represents a direct cross-disciplinary link, the dashed link connecting C-A represents a mediated cross-disciplinary link, and pollinators 7 and 8 are cross-pollinators because they have collaborated with faculty from each discipline. N/A, not applicable.

We use the primary departmental affiliations, which we treat as time-invariant traits, to define three disciplinary orientations O for F. If $F_i$ collaborated with at least one $F_{i'}$ from the opposite department, then we classify her/him as cross-disciplinary (XD); otherwise, $F_i$ is classified as BIO or CS, depending on her/his primary departmental affiliation. The group sizes are comparable: BIO, n = 1353; CS, n = 1590; and XD, n = 1247. We further examined each member of the XD group by finding their corresponding Scopus author profile, which contains career-level keywords derived from their publications. We found that 90% of the XD faculty feature the Scopus keyword “genetics” in their curated profiles, indicating that the overwhelming majority of the XD group have been involved in genomics research. This consistency check supports the soundness of our XD classification scheme.

As mentioned earlier, there are many collaborators of F who are not explicitly included in our starting sample, possibly because they are not faculty in one of the listed biology or computing departments (for example, PhD students, postdocs, and other international researchers). These collaborators are still crucial for understanding the role of cross-disciplinarity in the genomics revolution, as they constitute the academic ecosystem or “invisible college” surrounding tenure-track faculty (27). We identify these non-F collaborators as pollinators P, indexed by j.

In contradistinction to faculty F, we do not have knowledge of the departmental affiliations of pollinators P. Hence, we infer their disciplinary orientation by observing their coauthorship patterns with faculty F. Specifically, we classify them as (i) biology pollinators, if they collaborated with F from biology departments only; (ii) computing pollinators, if they collaborated with F from computing departments only; and (iii) cross-pollinators, if they collaborated with F from both biology and computing departments. Those pollinators who appeared in just a single scholar profile are named “leaf pollinators”; they are not central to our analytic framework.
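The classification rules for faculty and pollinators can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the string labels "BIO", "CS", "XD", and "leaf", and the toy data are hypothetical. Leaf pollinators are given their own label here purely for convenience.

```python
def classify_faculty(dept, direct_links):
    """dept: faculty name -> 'BIO' or 'CS' (primary department).
    direct_links: set of (a, b) faculty pairs who coauthored directly.
    A faculty member becomes 'XD' after any cross-department collaboration.
    """
    orientation = dict(dept)
    for a, b in direct_links:
        if dept[a] != dept[b]:
            orientation[a] = orientation[b] = "XD"
    return orientation

def classify_pollinators(faculty_of, dept):
    """faculty_of: pollinator name -> set of faculty coauthors.
    The label is inferred from the departments of those faculty coauthors.
    """
    labels = {}
    for p, fac in faculty_of.items():
        if len(fac) < 2:
            labels[p] = "leaf"  # appears in a single scholar profile
            continue
        depts = {dept[f] for f in fac}
        labels[p] = "XD" if depts == {"BIO", "CS"} else depts.pop()
    return labels

dept = {"A": "BIO", "B": "CS", "C": "BIO"}
orientation = classify_faculty(dept, {("A", "B")})
pollinators = classify_pollinators(
    {"p1": {"A"}, "p2": {"A", "C"}, "p3": {"B", "C"}}, dept
)
```

Note that the pollinator label depends on the departments of the faculty coauthors, not on whether those faculty were themselves reclassified as XD.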

HGP and evolution of cross-disciplinarity in the biology-computing college. Figure 2A shows the evolution of the largest connected component of the biology-computing collaboration network from the pre-HGP era (around 1990) to the post-HGP era (beyond 2003), where nodes correspond to $F_i$ and links represent only the direct collaborations. We sized the nodes according to their relative importance within the network, given by a centrality measure calculated up to time t. We calculated three different centrality measures: degree, PageRank, and betweenness. The degree centrality counts the number of faculty F connected to a given faculty $F_i$. The PageRank centrality self-consistently incorporates the centrality of the neighboring F into the centrality of $F_i$ (28, 29). The betweenness centrality counts the number of shortest paths between other nodes that intersect $F_i$ and is an indicator of between-group brokerage (30). Although these three centrality variables quantify different properties of the nodes within the network, we found them to be correlated with one another. For visual comparison, we illustrate these three measures simultaneously in fig. S3, which identifies Eric Lander, one of the leaders of the HGP, as the most prominent faculty member according to all three measures.
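To make two of these measures concrete, the sketch below computes degree centrality and a basic power-iteration PageRank on an undirected toy graph. It is illustrative only: the damping factor of 0.85 and the fixed iteration count are conventional assumptions, not values reported in the text, and the graph is assumed to have no isolated nodes. Betweenness centrality is omitted for brevity.

```python
def degree_centrality(adj):
    """adj: node -> set of neighbours (undirected graph)."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank; assumes every node has at least one neighbour."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        new = {}
        for v in adj:
            # Each neighbour u passes on an equal share of its own rank.
            inflow = sum(rank[u] / len(adj[u]) for u in adj[v])
            new[v] = (1 - damping) / n + damping * inflow
        rank = new
    return rank

# Toy star network: A collaborates with B, C, and D.
adj = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
deg = degree_centrality(adj)
pr = pagerank(adj)
```

In the star example, the hub A ranks highest on both measures, which mirrors the observation that the measures tend to agree on the most prominent nodes.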

Fig. 2 Growth of cross-disciplinary social capital.

(A) Evolution of the giant component in the U.S. biology-computing network. Green and magenta nodes represent faculty $F_i$ with BIO and CS affiliation, respectively; black nodes represent faculty $F_i$ that, by time t, published at least one cross-disciplinary publication and joined the XD group; node size is proportional to the logarithm of the degree centrality of $F_i$ at time t. (B) Evolution of the fraction of collaboration links in the F network that are cross-disciplinary. We calculated $f_{\cdot,\mathrm{XD}}(t)$ using direct links between faculty (blue line) or association links mediated by pollinators (red line). For comparison, the black line shows the evolution of cross-disciplinary links in the human genomics literature per Web of Science (WoS); these values are divided by two to facilitate trend comparison. The orange area marks the HGP project period.

In Fig. 2A, we chose to size the nodes according to the degree measure, which is an intuitive count variable that facilitates comparisons across the different networks. We colored the nodes green if $F_i$ belonged to the BIO group, magenta if they belonged to the CS group, and black if they belonged to the cross-disciplinary XD group. To illustrate the evolution of cross-disciplinarity within the F network, we initially classify (color) each faculty node according to her/his primary departmental affiliation and only change this classification (color) to XD once the year of her/his first direct cross-disciplinary collaboration is reached. As time passes, the giant component of the F network experiences impressive growth in size and complexity; within it, the cross-disciplinary nodes grow in numbers and prominence. Part of the giant component’s growth appears to be fueled by the increasing assimilation of formerly less connected scholars, as the diminishing set of nongiant components shown in fig. S4 suggests.

While Fig. 2A depicts the emergence and centrality of cross-disciplinary scholars in the network during and after the HGP, Fig. 2B quantifies this evolution. We determined the overall fraction of collaborations that are within- or cross-disciplinary, from the perspective of both the direct and the mediated links. More specifically, we first disaggregated the publication data by nonoverlapping 2-year periods. Then, for each period, we tallied the number of direct links that were within-discipline and the number that were cross-discipline, with their sum giving the total number of direct links. Similarly, we constructed the total number of mediated links realized via pollinator connections.

Next, we estimated the fraction of collaborations that are cross-disciplinary, $f_{\cdot,\mathrm{XD}}(t)$, from two perspectives: one using the direct collaboration links and one using the mediated links (Fig. 2B). In each case, the fraction is the number of cross-disciplinary links divided by the total number of links of that type in the period. We complement these two estimates with a third estimation using a separate international data set of “Human Genome” publications, reporting the fraction of publications that include both CS and BIO author affiliations (see the “Assembly of the WoS data set” section).
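The per-period tally can be sketched in a few lines of Python. The function name, the `(year, a, b)` input layout, and the choice of 1990 as the origin of the 2-year bins are assumptions made for illustration. The caller is assumed to supply each link once per period.

```python
from collections import defaultdict

def xd_fraction_by_period(links, dept, start=1990, width=2):
    """links: iterable of (year, a, b) tuples for one link type.
    dept: faculty name -> 'BIO' or 'CS'.
    Returns {period_start_year: fraction of links that are cross-disciplinary}.
    """
    total = defaultdict(int)
    cross = defaultdict(int)
    for year, a, b in links:
        # Map the year onto the start of its nonoverlapping bin.
        period = start + ((year - start) // width) * width
        total[period] += 1
        if dept[a] != dept[b]:
            cross[period] += 1
    return {p: cross[p] / total[p] for p in sorted(total)}

dept = {"A": "BIO", "B": "CS", "C": "BIO"}
links = [(1990, "A", "B"), (1991, "A", "C"), (1992, "B", "C")]
fractions = xd_fraction_by_period(links, dept)
```

The same function can be applied to the direct links and to the mediated links separately, giving the two curves that are compared in Fig. 2B.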

The relative frequency of mediated cross-disciplinary associations shows marked growth during and in the wake of the HGP, reaching ~30% of the total mediated associations by 2015. The relative frequency of direct cross-disciplinary collaboration shows slower growth. This feature may arise from the different competitive and leadership perspectives between the faculty F and the pollinators P, leading to different capacities to explore cross-disciplinary projects. The difference between the mediated and direct $f_{\cdot,\mathrm{XD}}(t)$ supports the importance of mobility in the academic ecosystem as an underlying conduit for knowledge transfer, in addition to direct collaboration.

The impetus for this increasing rate of cross-disciplinarity is intriguing. Recent work identifies groundbreaking discoveries as one type of impetus leading to the densification and emergence of scholarly communities (15). However, in the case of the HGP (1990–2003), the evolution of the collaboration network was likely, to some degree, pulled ahead by the specification of a grand challenge that led to the organization of agents around a common agenda, not unlike the case of sustainability science (18). This alternative type of impetus is evident in the early growth of cross-disciplinarity (from the mid-1990s to the early 2000s), when the HGP was in full swing, but the breakthrough of sequencing the human genome was not yet fully realized. As such, the co-occurrence of the start of the HGP and the increasing rate of cross-disciplinary activity provides preliminary evidence that incentivizing this activity around a unifying grand challenge was effective in bridging university disciplines. However, additional data and a specifically tailored research design would be necessary to more conclusively estimate the magnitude of the HGP’s impact on cross-disciplinary orientation in genomics research, which we leave for future work.

Career benefits of cross-disciplinarity. Figure 3 presents the descriptive statistics of the career data set. Figure 3A shows that the typical $F_i$ career in all three faculty groups began in the early 1990s. This is ideal for studying the evolution of genomics, as the HGP, arguably the field’s constitutional project, was formally started in 1990. Figure 3B shows the significantly higher degree of collaboration in the XD group (370 ± 440) with respect to the BIO (122 ± 98) and CS (165 ± 175) groups.

Fig. 3 Descriptive statistics for the career data set.

Vertical lines indicate distribution means for the corresponding subsets. (A) Probability distribution of the year of first publication by $F_i$. (B) Probability distribution of $K_i$, the total number of collaborators for a given $F_i$. (C) Probability distribution of $\chi_i$, the fraction of the collaborators of $F_i$ who are cross-disciplinary. (D) Probability distribution of the PageRank centrality of $F_i$; it is scaled by the number of $F_i$ so that the mean value of this scaled quantity across all $F_i$, independent of the discipline subset, is 1. (E) Probability distribution of the mean impact factor of the publication record of $F_i$. (F) Probability distribution of the total citations $\log_{10} C_i$ of $F_i$.

Figure 3C shows that the XD group exhibits a significantly higher degree of cross-disciplinarity (0.3 ± 0.19) than the other two groups (0.1 ± 0.07 in both cases). The degree of cross-disciplinarity $\chi_i$ of $F_i$ is defined as the fraction of her/his collaborators who are cross-disciplinary. Specifically, $\chi_i = k_{i,\mathrm{XD}}/K_i \in [0, 1]$, where $K_i$ is the total number of collaborators of $F_i$, while $k_{i,\mathrm{XD}}$ is the number of her/his cross-disciplinary collaborators; the collaborators include both other faculty F and pollinators P. We focus on one additional network characteristic, the scholar’s PageRank centrality, which is measured relative to other members of the network. We use rescaled units so that the mean centrality value across all $F_i$ is 1, which facilitates comparison. Figure 3D shows that the mean centrality of the XD group (1.4 ± 1.1) is significantly higher than the mean centrality of the BIO (0.7 ± 0.5) and CS (0.9 ± 0.6) groups.
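Both quantities are simple to compute once collaborators are labeled. The sketch below is a minimal illustration; the function names and the toy inputs are hypothetical, and the collaborator labels are assumed to be already available as strings.

```python
def cross_disciplinarity(collaborator_labels):
    """chi_i = k_XD / K_i for one scholar.

    collaborator_labels: one orientation label per distinct collaborator
    (faculty and pollinators alike); 'XD' marks a cross-disciplinary one.
    """
    K = len(collaborator_labels)
    k_xd = sum(1 for label in collaborator_labels if label == "XD")
    return k_xd / K if K else 0.0

def rescale_to_unit_mean(centrality):
    """Rescale centrality values so that their mean across all nodes is 1."""
    n = len(centrality)
    total = sum(centrality.values())
    return {v: n * c / total for v, c in centrality.items()}

chi = cross_disciplinarity(["XD", "BIO", "XD", "CS"])  # 2 of 4 collaborators
scaled = rescale_to_unit_mean({"A": 0.5, "B": 0.3, "C": 0.2})
```

Because PageRank values sum to 1 across the network, multiplying each value by the number of nodes is equivalent to the rescaling shown here.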

Figure 3E indicates that XD faculty have publishing patterns similar to those of BIO faculty, that is, they tend to publish in high–impact factor (IF) journals. To be specific, we calculated the mean Journal Citation Reports (JCR) impact factor among the publication set of each $F_i$: The distribution of this mean IF among the BIO faculty is 7.1 ± 3.7; for the CS faculty, it is 2.0 ± 1.4; and for the XD faculty, it is 6.5 ± 4.5.

Given the relatively balanced composition of the cross-disciplinary group (n = 724 with biology pedigree versus n = 523 with computing pedigree), one would expect the XD mean to be more balanced in its distance from the BIO and CS mean values. Looking inside the XD group, we find that the mean IF of the cross-disciplinary subgroup with biology pedigree manifests a small shift with respect to the core BIO faculty (+20.8%). The mean IF of the cross-disciplinary subgroup with computing pedigree manifests a massive shift with respect to the core CS faculty (+77.5%). Hence, on the one hand, biology cross-disciplinary faculty maintain a publication culture that is on par with their disciplinary norms. On the other hand, computing cross-disciplinary faculty feature a publication culture that breaks away from their disciplinary norms and trends in the direction of biology. As a result, the overall mean of the XD group remains very close to the mean of the BIO group and far away from the mean of the CS group, revealing a degree of cultural assimilation.

Figure 3F shows the higher mean citation impact (in $\log_{10}$) in the XD group (3.8 ± 0.5) with respect to the BIO (3.4 ± 0.5) and CS (3.6 ± 0.6) groups. Because of the importance of total citation impact as a quantitative measure of career achievement (31), we begin by modeling $C_i$ using cross-sectional analysis. Recent studies have demonstrated how collaboration factors can explain long-term success at the publication and career level (25, 32, 33). Consequently, here, we also account for the role of network attributes, reflecting the position of $F_i$ in the collaboration network, in addition to controlling for standard CV attributes, such as her/his h-index, funding, and institutional prestige.

Our principal interest is to test whether a stronger cross-disciplinary orientation of $F_i$ (that is, a higher $\chi_i$) correlates with a higher $C_i$. To this end, we used time-aggregated measures calculated through 2017 to estimate the parameters of the following cross-sectional ordinary least squares (OLS) regression model

[Eq. 1; equation image not recovered] (1)

where $C_i$ is the total number of citations for $F_i$, $r_i$ is the ranking of her/his department, and $h_i$ is her/his h-index, serving here as a productivity measure. The remaining covariates are the total counts of her/his National Science Foundation (NSF) and National Institutes of Health (NIH) grants, the total monies from the NSF and NIH grants deflated to constant 2010 USD, her/his PageRank centrality within the F network, and $\chi_i$, the fraction of her/his total $K_i$ coauthors who are cross-disciplinary. We include two dummy variables, the first capturing the three possible disciplinary orientations (BIO, CS, and XD), and the second capturing age cohort variation, based on the year of the faculty member’s first publication grouped into nonoverlapping 5-year intervals. Last, $\epsilon$ is white noise.

Table S2 shows the full set of parameter estimates for the model expressed in Eq. 1, while Fig. 4 summarizes the relevant coefficient estimates for the funding and collaboration variables. The main result of this model shows that higher degrees of cross-disciplinary activity ($\beta_\chi > 0$, P < 0.001) correlate with higher career citations. To be specific, our estimates indicate that an increase in $\chi$ by 0.1 correlates with a $10 \times \beta_\chi = 5.7\%$ increase in $C_i$.
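The percentage interpretation follows from the log-linear form of the model, in which the dependent variable is $\ln C_i$ and $\chi_i$ enters linearly. The short derivation below is a sketch of that reasoning. Only the $\chi_i$ term is written out, and the value $\beta_\chi \approx 0.57$ is inferred from the reported 5.7% rather than quoted from table S2.

```latex
% Only the term relevant to the interpretation is written out.
\ln C_i = \dots + \beta_\chi \chi_i + \dots
\quad\Longrightarrow\quad
\Delta \ln C_i = \beta_\chi \, \Delta\chi_i

% Small-change approximation, expressed in percent:
100 \times \frac{\Delta C_i}{C_i} \approx 100 \, \beta_\chi \, \Delta\chi_i
= 10 \, \beta_\chi \quad \text{for } \Delta\chi_i = 0.1

% Reported: 10 \beta_\chi = 5.7, hence \beta_\chi \approx 0.57.
% Exact change implied by the log-linear form:
100 \times \left( e^{0.057} - 1 \right) \approx 5.9
```

The approximate and exact figures differ only slightly because the change in $\ln C_i$ is small.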

Fig. 4 Career cross-sectional regression model.

OLS parameter estimates for the linear regression model in Eq. 1. The coefficients for the relevant covariates are shown in two categories, depending on whether the information can be found in the researcher’s CV or by analyzing her/his collaboration network. To facilitate comparison of the relative strength of the parameter estimates, the standardized beta coefficients are shown, representing the change in the dependent variable $\ln C_i$ that corresponds to a 1-SD shift in a given covariate. See table S2 for the complete list of parameter estimates. The levels of statistical significance are as follows: ***P ≤ 0.001.

We tested the robustness of this cross-sectional model by exploring several variations (table S3). In the first two variants, we replaced the PageRank centrality measure with one of two alternative centrality measures, that is, the betweenness centrality and the degree centrality. In the third variant, we removed the variables related to the number of grants, leaving only the variables related to total funding, suspecting correlation effects. In the fourth variant, we removed the department rank variable $r_i$, because it is based only on the most recent university affiliation of a given $F_i$ and thus could inaccurately represent her/his career. In all cases, the results of the modified regression estimates are not significantly different, indicating the robustness of our specification with respect to these adjustments.

The results of the cross-sectional model in Eq. 1, featuring the full set of funding variables (Fig. 4), point to a key career dilemma with respect to the pursuit of extramural grants. While our estimates confirm the benefit of total NIH funding, the correlation with the number of NIH grants is negative, pointing to the sunk costs associated with the management of several smaller grants (for example, R21) versus fewer bigger grants (for example, R01). Neither of the estimates for the NSF variables is significant, suggesting different levels of reliance on NIH/NSF between the biology and computing faculty.

Cross-disciplinary versus disciplinary production within careers. Motivated by the results of our pooled cross-sectional analysis, we implemented a panel regression model that leverages the longitudinal dimension of the career data disaggregated at the publication level. This enabled us to test whether the cross-disciplinary citation premium, indicated by $\beta_\chi > 0$ in the cross-sectional career model, stems from the scholar’s cross-disciplinary publications rather than other factors. In particular, we use a specification with individual $F_i$ fixed effects so that parameter estimates leverage the within-career comparison of publications that are cross-disciplinary with respect to those that are not. Hence, by identifying a clear counterfactual, this first panel model provides an estimate of the causal link between cross-disciplinary orientation and scientific impact.

To reduce false-positive (type I) classification errors, we do not use the disciplinary orientation of pollinators to classify individual publications. This is because the discipline of pollinators is not directly known and is based on inferences that may lead to overestimation. Consequently, for the classification of publications (hereafter denoted by p), we exclusively use the departmental affiliation of faculty, who are the only authors for whom we have ground-truth information. Within the profile of a faculty member $F_i$, for each p published in year $t_p$, we assign an indicator value $I_{i,p} = 0$ if all of the faculty authors are from the same discipline or, conversely, $I_{i,p} = 1$ if there is at least one faculty author from BIO and at least one faculty author from CS. For example, a publication whose three faculty authors include at least one from each discipline will have $I_{i,p} = 1$; however, if all three belong to the same discipline, then $I_{i,p} = 0$. Using this strict rule, out of the 413,565 publications (observations) in our faculty network sample, we classify with high confidence 4207 publications, or 1% of the entire sample, as cross-disciplinary.
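The strict publication-level rule can be expressed as a short function. This is a minimal sketch; the function name `xd_indicator`, the department labels "BIO" and "CS", and the toy author names are hypothetical. Authors without a known department (pollinators) are deliberately ignored, in line with the rule above.

```python
def xd_indicator(author_names, dept):
    """Return 1 if the faculty authors of a paper span both departments, else 0.

    author_names: all authors listed on the publication.
    dept: faculty name -> 'BIO' or 'CS'; non-faculty authors are absent
          from this mapping and therefore play no role in the decision.
    """
    faculty_depts = {dept[a] for a in author_names if a in dept}
    return 1 if {"BIO", "CS"} <= faculty_depts else 0

dept = {"A": "BIO", "B": "CS", "C": "BIO"}
mixed = xd_indicator(["A", "B", "postdoc"], dept)      # BIO and CS faculty
same = xd_indicator(["A", "C", "postdoc"], dept)       # BIO faculty only
solo = xd_indicator(["B", "postdoc"], dept)            # one faculty author
```

A paper written by a single faculty member with pollinators from another field is therefore counted as disciplinary, which is what makes the rule conservative.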

Critical to the panel framework is the definition of a dependent variable measuring an article’s long-term citation impact, one that is comparable across both different years t and disciplines s. This is a common difficulty in citation analysis and arises from a combination of three statistical biases: (i) varying citation rates across disciplines of different size, (ii) right censoring bias in the tallying of raw citation counts from a single census year (that is, the year in which citation data are downloaded from GS or another repository), and (iii) “citation inflation.” The first bias reflects the fact that larger, more prolific disciplines produce more citations than smaller disciplines. The second bias refers to the fact that older publications have had more time to accrue citations than newer ones. The final bias arises from the change in the relative significance of a single citation over time, due to increasing publication rates and reference list lengths (34). By way of example, consider two publications, each cited 10 times in their first 10 years: If the first was published in 1980 and the second in 2007, then, in relative terms, the first article has higher citation impact than the second.

The citation tallies reported by GS suffer from each of these problems. To neutralize these statistical biases, we applied a normalization formula that maps the GS citation count ci,p,s,t (for an article p published in year t by a faculty member Fi from discipline s) to a citation score zi,p that is comparable across s and t. A detailed description of the citation normalization formula is given in Materials and Methods.

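Consequently, we formulate the following hierarchical panel regression model
\[ z_{i,p} = \beta_i + \beta_I I_{XD,p} + \beta_a \ln a_{i,p} + \beta_\tau \tau_{i,p} + D(t) + \epsilon_{i,p} \tag{2} \]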

The panel data encompass publications in the period 1970–2017 for the 3900 faculty Fi that are connected within the F network, among which 1247 are classified as cross-disciplinary (XD). The cross-disciplinary publications enter the model through the indicator IXD,p, whose coefficient βI provides an estimate for the impact of cross-disciplinarity at the publication level. By using author-specific fixed effects (βi), which capture unobserved time-invariant researcher-specific characteristics, our model effectively compares the publications from the same Fi with IXD,p = 1 using the counterfactual scenario IXD,p = 0 as a baseline, after all other factors are held approximately constant. The other control variables in Eq. 2 include ai,p, measuring the total number of coauthors listed on each publication p; the career age variable τi,p, referring to the number of years since the researcher’s first publication, which controls for the career life cycle; the dummy year variable D(t), controlling for year-specific shocks; and the residual white noise ϵi,p.

The parameters in Eq. 2 are estimated using Huber-White robust SEs, which account for heteroscedasticity and serial correlation within the publication set of each Fi. Table S4 shows the OLS estimates for models with and without Fi fixed effects. The sign and significance of the model variables are robust to the hierarchical specification, that is, with and without Fi fixed effects.

Figure 5A shows the model estimates for the three variables of principal interest. First, and most importantly, we estimate a statistically significant positive relationship between cross-disciplinarity and citation impact (βI = 0.145, P < 0.001), meaning that the average cross-disciplinary publication is more highly cited than the average disciplinary publication authored by the same Fi. To translate the impact expressed by zi,p into the citation premium in ci,p, we calculate the percent change 100 Δcp/cp when IXD,p goes from 0 to 1. Because z is defined in terms of the logarithm of citations and the SD σt is approximately constant over time, this is given by 100 Δcp/cp ≈ 100 × σt × (∂z/∂IXD) = 100 × 1.4 × βI ≈ 20%.
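
The conversion from the normalized score to a percent change in citations can be written out step by step. The derivation below is a sketch: it assumes the normalization given in Materials and Methods (Eq. 4) and the approximation ln(1 + c) ≈ ln c for well-cited publications, and the intermediate steps are one reading of the argument, not a reproduction of the original derivation.

```latex
z_{i,p} = \frac{\ln(1 + c_{i,p}) - \mu_t}{\sigma_t}
\;\Rightarrow\;
\frac{\Delta c_p}{c_p} \approx \Delta \ln c_p
\approx \sigma_t \, \Delta z_{i,p} = \sigma_t \, \beta_I ,
\qquad
100 \times 1.4 \times 0.145 \approx 20.3\% .
```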

Fig. 5 Career panel regression model.

(A and B) Parameter estimates for the three principal explanatory variables included in the fixed effects F career model defined in Eq. 2; see table S4 for the complete list of parameter estimates. (C and D) Robustness check of panel regression model. To test the possibility of spurious correlations leading to the significant estimates for the cross-disciplinary variables in the panel model (table S4), we ran this model using a randomized cross-disciplinary indicator variable IXD,p, implemented by shuffling just that variable across the observations without replacement. (C) For n = 1000 shuffled data sets, we do not observe any (0%) coefficient estimates as large as the empirical value βI = 0.145 corresponding to the dashed vertical blue line [solid vertical blue lines indicate 95% confidence interval (CI); see table S4, third column cluster]. (D) We repeated the same shuffling method for the panel model applied to only the 1247 Fi classified with cross-disciplinary (XD) orientation, and again, we do not observe any (0%) coefficient estimates as large as the empirical value βI reported in table S5 (third column cluster). The levels of statistical significance are as follows: **P ≤ 0.01, ***P ≤ 0.001.

While our model specification does not focus on the effect of team size or author age on citation impact, the associated explanatory variables are also significant and worth discussing. Consistent with previous research (21, 25), we observe a positive relationship between team size and citation impact (βa = 0.31, P < 0.001), which translates to a σt × βa ≈ 0.43% increase in citations associated with a 1% increase in team size (as ai,p enters in ln in our specification). We also observe a negative relationship with increasing career age (βτ = −0.01, P < 0.001), consistent with previous studies using different career data (25, 35), which here translates to a 100 × σt × βτ ≈ −1.3% change in ci,p associated with every additional career year.

We tested the robustness of these results by introducing progressively stricter data selection criteria in two steps. First, we refined the faculty data set to include only the Fi with cross-disciplinary (XD) orientation, that is, we excluded from consideration the core biology (BIO) and computing (CS) faculty of the college (table S5). Second, within this XD subset, we became stricter as to what we considered a fair comparison between their cross-disciplinary and their other publications. Specifically, we used a matching scheme to pair each cross-disciplinary publication p (IXD,p = 1) with a single disciplinary publication p′ (IXD,p′ = 0) from the same faculty profile; the matched pair of publications (p, p′) must also be within 2 years of one another and feature a nearly identical number of coauthors (table S6). This matching procedure allows us to more accurately identify a counterfactual for each cross-disciplinary publication and thus to test the causal link between cross-disciplinarity and scientific impact (see Materials and Methods for further details on our matching procedure).

According to the Rubin causal model and potential outcomes framework (36) (see Materials and Methods), this matching procedure facilitates computing the average treatment effect in terms of cross-disciplinarity. Using the entire set of matched pairs (p, p′), we calculate the mean difference in normalized impact z between p and p′, which corresponds to the mean treatment effect. An additional sign of robustness is that this estimate is consistent with the βI estimates for the three panel scenarios we developed: (i) all faculty F (βI = 0.145), (ii) cross-disciplinary (XD) faculty (βI = 0.112), and (iii) cross-disciplinary (XD) faculty considering only their matched pairs of publications (βI = 0.135). We can also use the matched pairs to estimate the average treatment effect in terms of percent change in citations, which we calculate to be, on average, a 10.6% increase in cp over the counterfactual cp′. Furthermore, by tallying the citation difference across all matched cross-disciplinary publications for each Fi, we calculate the average treatment effect in terms of total net citations to be 630 citations over her/his career.

Figure 5B shows the fixed-effects model estimates for all three approaches: (i) using all faculty Fi, (ii) using only cross-disciplinary (XD) faculty, and (iii) using matched publication subsets for the cross-disciplinary (XD) faculty. All parameter estimates are consistent across the three panel variants, with the exception of βτ, which is positive for specification (iii) and negative for specifications (i) and (ii); this inconsistency is due to the fact that the matched data in (iii) are a subset of the faculty’s longitudinal profile, thus introducing bias in the subset selection with respect to career age.

We also explored the possibility that spurious correlations could give rise to the significance of βI by using a placebo randomization scheme in which we shuffled the IXD,p values across the data set, without replacement, that is, conserving the total number of observations with IXD,p = 1. We ran this placebo regression 1000 times, each time recording the value of βI. Figure 5C shows the distribution of the placebo estimates, P(βI), when the data for all the faculty Fi are used; none (0%) of the placebo estimates were larger than the real estimate βI = 0.145, thereby showing that it is unlikely that we obtained the magnitude and significance of βI by chance alone. Figure 5D shows similar results for the distribution of the placebo estimates, P(βI), when the data for just the XD faculty are used.
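
The placebo scheme can be illustrated with a deliberately simplified Python sketch. It is not the full specification of Eq. 2: as an assumption made for brevity, the only regressor is the cross-disciplinary indicator, and author fixed effects are handled by demeaning within each author. The function names and argument layout are hypothetical.

```python
import random


def within_slope(z, x, author):
    """OLS slope of z on x after demeaning both within each author."""
    sums = {}
    for zi, xi, a in zip(z, x, author):
        s = sums.setdefault(a, [0.0, 0.0, 0])
        s[0] += zi
        s[1] += xi
        s[2] += 1
    num = den = 0.0
    for zi, xi, a in zip(z, x, author):
        sz, sx, n = sums[a]
        dz, dx = zi - sz / n, xi - sx / n
        num += dx * dz
        den += dx * dx
    return num / den if den > 0 else 0.0


def placebo_test(z, x, author, n_shuffles=1000, seed=0):
    """Shuffle the indicator across all observations and re-estimate.

    Shuffling a copy of x conserves the number of observations with x = 1.
    Returns the observed slope and the fraction of placebo slopes that
    are at least as large.
    """
    rng = random.Random(seed)
    observed = within_slope(z, x, author)
    shuffled = list(x)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if within_slope(z, shuffled, author) >= observed:
            exceed += 1
    return observed, exceed / n_shuffles
```

A returned fraction of 0 corresponds to the situation reported here, in which no placebo estimate reaches the empirical value.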

By analyzing publications clustered within careers using fixed effects, we approach the problem differently than the bulk of recent relevant work on quantifying the correlation between interdisciplinarity and impact (37–40). There are relatively few studies we are aware of that use fixed effects to net out unit-level variation (41, 42), and none that uses the Rubin potential outcomes framework. Moreover, most relevant studies use as a proxy for interdisciplinarity the diversity of distinct journals or the diversity of distinct research areas cited within an article’s reference list. While this is a reasonable approach, by itself it would not serve us well in the case of a team science field such as genomics, where we are preoccupied not only with mixed knowledge but also with behaviors in mixed teams.

The genomics story behind the numbers in the biology-computing college

Our study of the biology-computing college in the United States revealed significant cross-disciplinary activity in genomics during and after the HGP, with net career benefits for those involved, and a cultural shift for the cross-disciplinary faculty with computing pedigree. Motivated by these quantitative results, we further explored the collaboration-mediated pathways that trace knowledge transfer from the HGP to the present day. In addition to explicit knowledge pertaining to computing algorithms and biotechnology methods, this knowledge transfer also includes tacit organizational know-how that is fundamental to the management of consortium science—a paradigm in which teams of teams coordinate a common agenda around a single “grand challenge.”

This emergent pattern began with the 2001 publication of the two seminal human genome papers in Nature (43) and Science (44). These parallel efforts, one public and one private, offer valuable insights into the economics of science (8). Yet, more germane to our focus on social and organizational capital formation in science is the intriguing hereditary pattern of the consortium science model, as illustrated in Fig. 6. Namely, several of the authors in these two papers played a quintessential role in knowledge transfer and the evolution of genomics in the 2000s. Every year, starting from the culmination of the HGP circa 2000 and all the way to 2010 and beyond, subsets of the original HGP authorship seeded efforts to decode the genome of important animals, plants, and microorganisms. The faculty network F and its pollinating ecosystem P capture key aspects of this blooming period in genomics. In the left panel of Fig. 6, the faculty nodes and their “hidden” pollinator nodes that were members of the original HGP teams appear. In the middle panel of Fig. 6, some landmark genomics publications that were authored by these scholars appear, including the mouse genome (45), the chicken genome (46), and the dog genome (47). In the right panel of Fig. 6, the faculty nodes that were not in the original HGP teams but contributed in these subsequent genomics efforts appear, thus establishing coauthorship links with the original “HGP cohort” present in the network. Given that the authorship in all these papers was mixed, including authors from both biology and computing, all the coauthorship links presented in the figure are cross-disciplinary. The citations of the original HGP papers as well as the genomics papers that followed in their steps are impressive and testify to their impact. For HGP cohort members of the F network, the centrality and h-index attest to their apostolic role and status, respectively. 
The centrality and h-index of the non-HGP faculty who interacted with the HGP cohort suggest that these “HGP offspring” followed in the footsteps of their scholarly fathers/mothers, developing their own notable standing.

Fig. 6 The knowledge transfer story behind the numbers.

Interactions of the HGP scholars with other faculty in the F network during the 2000s, and some of the landmark publications they produced, powering the genomics revolution. The scholar nodes bear the name initials. In the left panel, one can recognize some well-known HGP scholars, such as Eric Lander (EL) and Bruce Birren (BB). “d” stands for the network degree of a scholar and controls the size of her/his node. “h” stands for the h-index of a scholar. Magenta nodes denote faculty affiliated with computing departments, while green nodes denote faculty affiliated with biology departments.

The sequencing of the human genome is an exemplary case of science for the public good, wherein the culminating achievement extended far beyond the organizational boundaries of the individuals and institutions centrally involved. Yet, in addition to its far-reaching public health impacts, the importance of the consortium approach developed along the way cannot be overstated, as it has subsequently served as the organizational model for the sequencing of other important genomes. Numerous prominent genomics papers in the 2000s are the products of consortia (for example, the “Mouse Genome Consortium” and the “Chicken Genome Consortium”), that is, analogs of the “Human Genome Consortium” first established in the HGP.

Finally, the evolutionary pathway depicted in Fig. 6 is not the only way genomics knowledge evolved through scholarly interactions in the F network. For example, HGP scholars interacted with other faculty on genomic applications to immunology, cancer, and the decoding of ancient DNA, transforming medicine and evolutionary biology. These interactions, indicative of the broad impact of the HGP and the ever-expanding reach of genomics, are also captured in our data set and are lumped in the “Other works” box in Fig. 6.

Cross-disciplinarity as mixed authorship in the genomics literature

Analysis of the career data set above indicates that researchers in the U.S. biology-computing college achieve higher citation impact—both across and within faculty profiles—when they adopt a cross-disciplinary approach. As a consistency and robustness check, we further tested whether cross-disciplinarity, defined in this section as mixed bio-computing authorship in genomic publications, has value that transcends the U.S. biology-computing college.

We proceeded by collecting a comprehensive international data set consisting of 25,466 articles from the WoS using the topic query “Human Genome.” We classified each article as cross-disciplinary (XDg) if its affiliation list included both biology and computing departments, or biology (BIOg) if its affiliation list included only biological sciences departments. The subscript g indicates that the XD and BIO attributes are linked to departmental affiliations of authors globally (not just U.S.-based), who have published in human genomics. As in our panel model, this operationalization establishes a clear counterfactual, that is, an article is either XDg or BIOg, reflecting researcher-level orientations. Consequently, the citation difference between the two publication subsets is associated with cross-disciplinary factors, net of other likely factors, such as funding levels or field size.

We calculated the mean citation impact of the nonoverlapping subsets of cross-disciplinary and biology publications in each year t, denoted here by ⟨c⟩XD(t) and ⟨c⟩BIO(t), respectively (see Materials and Methods). The ratio
\[ r_c(t) = \frac{\langle c \rangle_{XD}(t)}{\langle c \rangle_{BIO}(t)} \tag{3} \]
measures the cross-disciplinary citation premium relative to the baseline established by the intradisciplinary biology publications. The value rc(t) = 1 corresponds to the case in which there is no difference in citation impact between the two publication subsets.

Figure 7A shows the evolution of the citation premium rc(t) associated with cross-departmental collaboration in the international human genomics literature. We estimated the degree to which rc(t) could arise by chance using a random bootstrap sampling method to calculate the distribution of the randomized (null model) test statistic rc,RND(t) and thus to assess the likelihood of type I (false-positive) misestimation. To be specific, for a given year t, we randomly selected as many publications as were classified XDg in that year, independent of their departmental affiliations, and then calculated the mean citation impact for this random subset and for the remaining publications. We combined these two values, as a ratio, to obtain a null model estimate rc,RND(t). We repeated this randomization 10^6 times for each year and calculated the two-tailed 90, 95, and 99% thresholds for each distribution of rc,RND(t). It is worth noting that this null model, which is based on random sampling without replacement from the underlying citation distribution, conserves the overall proportion of publications classified as XDg and also the total citations received by these publications. Thus, by sampling the empirical citation distribution, this randomization scheme demonstrates the range of rc values one could obtain by chance.
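
A minimal Python sketch of this null model for a single year follows. Two points are assumptions made for the example: the mean citation impact is taken as the mean of ln(1 + c), and the thresholds are read off as simple order statistics. The function names are hypothetical, and the default number of draws is kept far below the 10^6 used in the analysis.

```python
import math
import random


def citation_ratio(xd_cites, bio_cites):
    """Ratio of mean log-citation impact, XD subset over BIO subset."""
    def mean_log(cites):
        return sum(math.log(1 + c) for c in cites) / len(cites)
    return mean_log(xd_cites) / mean_log(bio_cites)


def null_ratios(all_cites, n_xd, n_draws=1000, seed=0):
    """Randomly relabel n_xd publications as 'XD', ignoring affiliations.

    Each draw samples without replacement, so the subset sizes and the
    pooled citation distribution are conserved. Returns sorted ratios.
    """
    rng = random.Random(seed)
    pool = list(all_cites)
    ratios = []
    for _ in range(n_draws):
        rng.shuffle(pool)
        ratios.append(citation_ratio(pool[:n_xd], pool[n_xd:]))
    return sorted(ratios)


def two_tailed_bounds(sorted_vals, level=0.95):
    """Lower and upper thresholds enclosing `level` of the null values."""
    tail = (1 - level) / 2
    last = len(sorted_vals) - 1
    return sorted_vals[int(tail * last)], sorted_vals[int(round((1 - tail) * last))]
```

An empirical rc(t) that falls outside the returned bounds would be judged significantly different from 1 at the corresponding level.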

Fig. 7 Cross-disciplinarity beyond the faculty network F.

(A) Cross-disciplinarity XDg as mixed authorship in the human genomics literature: Cross-disciplinarity is measured using the combinations of departmental affiliations on the set of Human Genome publications reported in the WoS. The mean value, weighted according to the publication volume each year, is 1.1. (B) Cross-disciplinarity XDe as mixed methods in NB: Cross-disciplinarity is measured by analyzing the combinations of computational and biological methods used within articles from the journal NB. The mean value, weighted according to the publication volume each year, is 1.22. In both panels, blue dots represent the respective rc(t), calculated using real data to measure the additional citation impact of XD publications. The curves correspond to the respective null model test statistic distribution P(rc,RND(t)), estimated from 1 million bootstrap randomizations, in which the expected value is rc,RND(t) ≡ 1 (that is, no difference between the mean citation impact of the subsets). The red curve and shaded region correspond to the 90% confidence interval for the respective randomized rc,RND(t), and the outer black curves correspond to the 95% (solid) and 99% (dashed) confidence intervals. Thus, empirical data above (or below) the null model confidence intervals are significantly different from the expected value rc = 1 at the given significance level and demonstrate that it is highly unlikely to obtain these large values by chance alone.

Our results indicate a significant citation premium rc(t) stemming from mixed authorship. Specifically, since 1999, the annual rc(t) values are significantly in excess of unity at the P = 0.01 level (false-positive rate; see Fig. 7A), with the mean rc(t) value standing at 1.1. Because rc(t) is calculated using the logarithm of citation values, to temper the impact of outliers, we must convert this ratio to fully appreciate the magnitude of the effect. We can estimate the percent difference in raw citations by drawing on the properties of the log-normal citation distribution. By assuming that cross-disciplinarity only affects the logarithmic mean of the citation distribution, one can estimate from this ratio the percent difference in c for the XDg group as compared to the BIOg group. In these terms, the XDg publications gained, on average, 37% more citations than those in the BIOg group.

Cross-disciplinarity as mixed methods in a genomics journal

To test the value of cross-disciplinarity at the epistemic level of explicit and tacit knowledge, we constructed a third data set by collecting the 3516 research articles published over the period 1996–2014 in Nature Biotechnology (NB), a prestigious genomics-oriented journal. As in the previous analysis, it is important to verify that the cross-disciplinary citation premium persists among research articles of similar perceived novelty, that is, colocated in the same high–impact factor journal but differing in the extent to which they incorporate computational methods.

In this case, we assigned to the cross-disciplinary group XDe those articles that featured computational methods, as specified by the paper’s Medical Subject Headings (MeSH), a controlled thesaurus of keywords implemented by PubMed (48). The remaining articles were assigned to the BIOe group. Thus, in this NB analysis, the classification of articles as cross-disciplinary is based only on epistemic and not on authorship criteria (thus, the subscript e for XD and BIO). Nevertheless, statistical comparison of the two sets of research articles, corresponding to cross-disciplinary (XDe) and biology (BIOe), followed exactly the same method as in the case of the WoS human genomics data set.

Our results indicate a significant citation premium rc(t) stemming from mixed research methods. Specifically, the annual rc(t) values are significant at the P = 0.05 level since 2004 (see Fig. 7B), with the mean rc(t) value standing at 1.22. Translating this ratio, we find that XDe publications gained, on average, roughly 126% more citations than those in the BIOe group. Because we only compare publications within NB, the difference in the citation impact is net of journal-specific factors and represents the added value of computational knowledge and methods in research with genomic applications.

DISCUSSION

The merging trend among techno-scientific disciplines is bound to continue because of the nature of grand challenges faced by society. We know, however, little about what works and why in a cross-disciplinary fusion process. To start unlocking this problem in the context of team science (21), it is imperative to analyze not only scholarly knowledge production but also scholarly interactions that support scientific progress. To this end, the science of team science has contributed greatly to understanding how to accelerate scientific advancement via multiscale collaboration (49).

Less is known, however, about the factors that promote cross-disciplinary collaboration around a central challenge. Even in the case where the goals are well posed, agreeing on the best path forward can become contentious, especially when groups have different social and epistemological backgrounds. Consequently, harnessing the benefits of team science is often not just a matter of bringing primed stakeholders together. Recent work highlights a case of high-risk “gain-of-function” pathogen research, where the differing expertise of biomedical researchers affected their position around this politically charged dilemma (50). The issue of consensus formation acquires new urgency with the proliferation of social media, which sometimes undermine constructive dialogue between groups.

In general, whether between experts or nonexperts, there is a need to understand how to foster cultural bridging around scientific topics and narratives (51). It is within this overarching framework that we pursued mixed analysis for the field of genomics and two of its key constituent disciplines—biology and computing. First, we investigated cross-disciplinary versus disciplinary careers within the biology-computing college in U.S. universities. Strikingly, nearly all cross-disciplinary faculty in this college (~90%) have published research on genomics, and we show that the precipitation of this activity correlates with the onset of the HGP in 1990. Furthermore, we find that scholars with greater orientation toward cross-disciplinary collaboration tend to have higher career citation impact. We use several identification strategies, using publication-level data, to attribute this citation premium to the scholars’ cross-disciplinary activity—net of other factors.

Germane to this discussion is the fact that cross-disciplinary computing scholars exhibit publication patterns that trend in the direction of biologists, with profiles that include papers in high-impact science journals. This is a sign of cultural assimilation, which gives cross-disciplinarity at the career level a fuller meaning.

Cross-disciplinarity, defined as joint authorship by biology and computing scholars, enjoys a premium that transcends the U.S. biology-computing college, being a feature of the international intellectual production in human genomics, as tracked by the WoS. Looking also at cross-disciplinarity as an epistemic fusion in the articles of a well-known genomics-oriented journal, we found that papers with explicit computational content enjoy a significant impact premium over papers without such content. These results are in agreement with recent work documenting the citation advantage that occurs when researchers innovate to form new within-discipline knowledge bridges (52) via measured combinations of novel and traditional concepts (38). The latter represents a strategy that is generalizable to bridge-building in other domains, such as stimulating proactive public discourse (51).

One wonders about the reason behind the higher impact of cross-disciplinary publications in genomics. After all, this is what feeds the career advantage of the XD cohort and likely acts as a talent attractor, although other factors reportedly play a role in the decision to pursue cross-disciplinary collaboration (53).

Fast-paced and application-oriented techno-scientific disciplines, such as genomics, tend to be highly utilitarian. On the basis of this assumption, we can speculate for the moment that cross-disciplinary genomics publications are popular primarily because they are useful. One, however, should not underestimate the significant coordination cost associated with bridging disciplinary gaps within mixed teams (54). The fact that this coordination is done successfully in the biology-computing college suggests some compatibility between the two disciplines, with the assimilation of XD scholars of computing pedigree into biology’s culture being a sign of it.

It is also generally true that mixed teams overcoming disciplinary communication barriers produce publications that are well posed, well framed, and well written, accessible to the union of the corresponding communities; all these likely contribute to higher citation rates (55).

The statistical results reported here can serve as an excellent springboard for science studies into the particular processes, artifacts, and personalities that powered genomics as well as the consortium science organizational model. In this direction, looking behind the numbers, we traced the collaborative pathways captured in our data model, bringing to the fore a key mechanism of the genomics revolution. Following the HGP, the data model points to several research efforts staggered over a decade, which led to the sequencing of the genomes of important animals, plants, and microorganisms. The outcomes of these efforts were impactful publications in iconic journals, such as Nature and Science. The investigative teams included newly arrived faculty from both biology and computing, mixed with key members of the original HGP team in various configurations. Therefore, these projects shaped a new generation of cross-disciplinary researchers and helped them build their networks, their careers, and, along the way, genomics as we know it today. The work and authorship in the relevant genomic papers were structured around consortia, in the image of the HGP, a practice that ushered team science into the teams-of-teams science era.

In conclusion, as funding agencies are increasingly supporting cross-disciplinary investigations [for example, BRAIN initiative (56)] and associated scientific activity is on the rise (42, 57), there is a growing need for insightful quantitative evidence from past cases to aid policy making (8, 10). To this end, our findings show how a timely research initiative helped create cross-disciplinary human capital between two culturally complementary disciplines, and how inherent career incentives perpetuated this capital and contributed to its epistemic dominance. In modeling terms, science policy makers could view this as no different from the elements needed for a flame: a spark, a combustible medium, and a feeding system.

MATERIALS AND METHODS

The assembly of the career data set

We selected 155 biology and computing departments in the United States following the 2014 U.S. News & World Report (table S1). We confirmed that all the departments in the set had active PhD programs since the conception of the HGP in the 1980s. Moreover, the ranking of academic departments is relatively stable, as supported by theoretical and empirical evidence drawn from various other social systems characterized by positive feedback reinforcement mechanisms that temper large rank fluctuations (58). With respect to the latter, we found no significant differences in the ranks of these 155 departments between the 2014 and 2018 U.S. News & World Report rankings (P > 0.05, Wilcoxon test).

We accessed the home pages of these departments and recorded the listed faculty as of spring 2017. In this master list, we identified the faculty Fi that had GS pages, forming a database with their GS IDs, h-indices, departments, department rankings, and bibliometric data. We also indexed their NSF and NIH grant data from the corresponding repositories (59). We then applied a name disambiguation algorithm to the Fi and their coauthors to reconcile their identities within and across Fi profiles (appendix S1). Figure 1 provides a visual example of how we constructed the biology-computing college network from the disambiguated Fi data.

The key motivator behind our data collection methodology for the career data set is the tendency of typical computing researchers to publish the bulk of their work in refereed conferences from where they receive most of their citations. Traditional bibliometric databases, such as Scopus and WoS, do not cover citations from many refereed conference publications, but GS does, thus emerging as the only viable alternative for fair career assessment.

Although the career data set covers a substantial portion of the biology-computing college in the United States, it does not cover it all, and it does not explicitly cover the international biology-computing college. This limitation is tempered by two factors. First, it is important to clarify that, in our analysis, we are not seeking to measure the impact of the HGP on research outcomes, but rather the impact of cross-disciplinarity on research outcomes. Because the HGP had explicit cross-disciplinary alignment, we expect it to have had its strongest and most direct impact on the adoption of cross-disciplinary research orientation in the United States. Second, the construction of the mediated association network considerably expands the reach of the career data set, as it includes not only the faculty members in these 155 departments but all their collaborators, forming an impressive ecosystem. The representational power and validity of this ecosystem find confirmatory evidence in two cases during the course of our analysis: (i) The evolution of the rate of cross-disciplinary collaborations in the U.S. biology-computing college mirrors the rate of cross-disciplinary collaborations gleaned via author affiliations in the human genomics literature at large. (ii) Entrance of faculty in the U.S. biology college crests in the early 2000s, which is consistent with the doubling of NIH research funding in the period 1998–2003 (8, 60).

Citation normalization

The citation normalization of publication p from faculty member i leverages the universal log-normal properties of citation statistics (61), yielding a stationary, normally distributed citation measure $z_{i,p} \sim N(0, 1)$ (fig. S5) that is well suited for identifying longitudinal patterns of citation impact in research careers (25, 35).

To be specific, we disaggregated the articles by publication year and removed the time-dependent trend in the location and scale of the underlying log-normal citation distribution by defining

$$z_{i,p} \equiv \frac{\ln(1 + c_{i,p}) - \mu_t}{\sigma_t} \qquad (4)$$

where $c_{i,p}$ is the citation count of publication p, $\mu_t \equiv \langle \ln(1 + c_{s,t}) \rangle$ is the mean, and $\sigma_t \equiv \sigma[\ln(1 + c_{s,t})]$ is the SD of the citation distribution, after adding 1 to each citation tally (to avoid the divergence of ln 0 associated with uncited publications) and applying the natural logarithm. We calculated $\mu_t$ and $\sigma_t$ within the subset of publications for a given year t and discipline s (BIO or CS). The SD $\sigma_t \approx 1.4$ is approximately constant across time and the two disciplines we analyzed.
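The normalization described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the record layout (`year`, `discipline`, `citations`), the function name, and the use of the population SD are assumptions made for the example.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def normalize_citations(pubs):
    """Return one z-score per publication, in the order of `pubs`.

    `pubs` is a list of dicts with hypothetical keys 'year',
    'discipline' ('BIO' or 'CS'), and 'citations'.
    """
    # Group log-transformed citation tallies by (year, discipline).
    groups = defaultdict(list)
    for pub in pubs:
        key = (pub["year"], pub["discipline"])
        groups[key].append(math.log(1 + pub["citations"]))

    # Location and scale of each group (population SD is an assumption).
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    z_scores = []
    for pub in pubs:
        mu, sigma = stats[(pub["year"], pub["discipline"])]
        x = math.log(1 + pub["citations"])
        # A group with zero spread carries no ranking information.
        z_scores.append((x - mu) / sigma if sigma > 0 else 0.0)
    return z_scores
```

Adding 1 before taking the logarithm keeps uncited publications in the sample, as in Eq. 4.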

Publication matching

We used the Rubin causal model framework (36) to provide additional evidence for a causal link between cross-disciplinary collaboration and increasing citation impact. In potential outcome notation, let $Y_{XD=1} = z_{i,p,1}$ represent the outcome, that is, scientific impact proxied by citations, of a publication drawing on cross-disciplinary collaboration, denoted in our data set by the indicator $I_{XD} = 1$; conversely, the counterfactual $Y_{XD=0} = z_{i,p,0}$ represents the potential outcome of the same publication but without cross-disciplinary collaboration ($I_{XD} = 0$). To obtain counterfactual pairs $(z_{i,p,1}, z_{i,p',0})$ from our data set for each XD publication p of each faculty member i, we searched through just their profile for the most similar p′ to pair with p. More specifically, for each p with $I_{XD} = 1$, we collected all publications from the same profile within ±2 years ($|t_p - t_{p'}| \le 2$). From this potential match set, we then selected the p′ whose number of coauthors $a_{p'}$ was closest to $a_p$, and if $a_{p'}$ was larger or smaller than $a_p$ by more than 20% ($|a_{p'} - a_p|/a_p \ge 0.2$), then we rejected this match and did not include p in the set of matched pairs. We produced matches without replacement so that each p′ was included only once.
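As a rough illustration of this matching step, the following Python sketch pairs each XD publication in a single profile with one non-XD publication. It is not the authors' implementation. The field names, the greedy processing order, the restriction of candidates to non-XD publications, and the use of $a_p$ as the denominator of the 20% rule are all assumptions made for the example.

```python
def match_pairs(profile, max_year_gap=2, max_rel_diff=0.2):
    """Greedy matching without replacement within one faculty profile.

    `profile` is a list of dicts with hypothetical keys 'id', 'year',
    'n_coauthors' (assumed >= 1), and 'xd' (True for XD publications).
    Returns a list of (xd_id, control_id) tuples.
    """
    controls = [p for p in profile if not p["xd"]]
    used = set()  # control ids already matched (no replacement)
    pairs = []
    for pub in profile:
        if not pub["xd"]:
            continue
        # Candidate controls published within the allowed year window.
        candidates = [
            c for c in controls
            if c["id"] not in used
            and abs(c["year"] - pub["year"]) <= max_year_gap
        ]
        if not candidates:
            continue
        # Closest team size among the candidates.
        best = min(
            candidates,
            key=lambda c: abs(c["n_coauthors"] - pub["n_coauthors"]),
        )
        rel_diff = abs(best["n_coauthors"] - pub["n_coauthors"]) / pub["n_coauthors"]
        if rel_diff >= max_rel_diff:
            continue  # team sizes too different: drop this XD publication
        used.add(best["id"])
        pairs.append((pub["id"], best["id"]))
    return pairs
```

Because the matching is greedy, the result can depend on the order in which XD publications are visited; an optimal-assignment variant would remove that dependence at the cost of extra complexity.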

We then combined these matched pairs (p, p′) into an observation subset and ran the same regression model as in Eq. 2 on the set of faculty with $N_{i,XD} \ge 10$ matched data pairs. Table S6 shows the model estimates for the resulting 53 faculty. Using these matched publication pairs, we also estimated the mean cross-disciplinary “treatment effect,” $Y_1 - Y_0 = z_{i,p,1} - z_{i,p',0}$. As such, the average value, $\langle Y_1 - Y_0 \rangle$, is an estimate of the average treatment effect on the treated (ATET). In addition to comparing the outcome according to normalized citation impact, $Y_{XD=1} = z_{i,p,1}$, we also report the ATET calculated using the total citation difference, $Y_1 - Y_0 \equiv c_{i,p} - c_{i,p'}$, and the percent citation difference, $Y_1 - Y_0 \equiv 100(c_{i,p} - c_{i,p'})/c_{i,p'}$.
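Given a list of matched pairs, the ATET is simply the mean of the paired outcome differences. The short Python sketch below shows one way to compute it. The function names are hypothetical, and taking the matched non-XD publication's citation count as the denominator of the percent difference is an assumption.

```python
from statistics import mean


def atet(pairs):
    """Average treatment effect on the treated.

    `pairs` is a list of (treated_outcome, control_outcome) tuples,
    for example normalized citations or raw citation counts.
    """
    return mean(treated - control for treated, control in pairs)


def percent_difference(c_treated, c_control):
    """Percent citation difference relative to the matched control.

    Assumes c_control > 0; uncited controls would need separate handling.
    """
    return 100.0 * (c_treated - c_control) / c_control
```

For example, `atet([(1.0, 0.5), (0.0, 0.5)])` averages the differences 0.5 and -0.5.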

Assembly of the WoS data set

We used the topic keyword “Human Genome” to query the WoS database. After excluding books and editorials, we arrived at a set of 25,466 publications, recording the total number of citations $c_{p,t}$ each publication p received through November 2016. We then defined cross-disciplinarity according to the diversity of departmental affiliations associated with each publication. Publications featuring researchers from both computing and biology departments were classified as $XD_g$, whereas publications featuring researchers from biology departments only were tagged as $BIO_g$.

Assembly of the NB data set

We downloaded from the WoS all publication records for articles published in the journal NB as of December 2015, resulting in a data set of 3516 items. We then used the MeSH of MEDLINE/PubMed, a unified and controlled vocabulary system of article keywords, to separate publications into the complementary $XD_e$ and $BIO_e$ subsets. The typical biomedical publication has roughly 10 to 20 MeSH descriptors assigned by professional MEDLINE experts and algorithms, which can then be used to position publications in a complex conceptual space composed of 16 top-level categories and more than 27,800 MeSH descriptors (48).

We leveraged this detailed ontology by tagging publications with at least one MeSH keyword from the “Information Science” category (the L branch) as $XD_e$ articles. Three examples of MeSH keywords from the L branch are “Human Genome Project” (tree number L01.453.450), “Molecular Sequence Data” (tree number L01.453.245.667), and “Algorithms” (tree number L01.224.050). Seventy-one percent of the NB publications do not contain a single MeSH descriptor keyword belonging to the L branch; we tagged these as $BIO_e$ articles. Thus, there are a significant number of publications with and without explicit computational methods, and we used the latter set as our baseline for comparison.
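The tagging rule amounts to a prefix test on MeSH tree numbers. The Python sketch below illustrates it under the assumption that each article's tree numbers have already been retrieved as plain strings; the function name and the label strings are hypothetical.

```python
def tag_article(tree_numbers):
    """Tag an article by its MeSH tree numbers.

    Returns 'XDe' if at least one tree number belongs to the L branch
    ("Information Science"), and 'BIOe' otherwise.
    """
    in_l_branch = any(number.startswith("L") for number in tree_numbers)
    return "XDe" if in_l_branch else "BIOe"
```

A descriptor can sit in more than one branch of the MeSH tree, so all tree numbers of every descriptor should be passed in, not only the first one listed.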

Calculating rc

In the analysis of both the human genomics and NB articles, we calculated the mean citations per year for the $XD_\bullet$ ($\bullet \equiv g$ or $e$) subset, $\mu_{XD_\bullet,t} = N_{XD_\bullet,t}^{-1} \sum_{p \in XD_\bullet,t} \ln(1 + c_{p,t})$, where $N_{XD_\bullet,t}$ is the total number of articles published in year t within the $XD_\bullet$ group. Similarly, we calculated the mean citations per year for the $BIO_\bullet$ ($\bullet \equiv g$ or $e$) subset, $\mu_{BIO_\bullet,t} = N_{BIO_\bullet,t}^{-1} \sum_{p \in BIO_\bullet,t} \ln(1 + c_{p,t})$. We applied the logarithmic transformation to normalize the citation distribution within each year t. Assuming that citations follow a log-normal distribution and that the only difference between the XD and BIO groups is a multiplicative factor $r_c$ affecting their logarithmic mean $\mu_{LN}$, then the mean citations for the XD group is $\exp(r_c \mu_{LN} + \sigma_{LN}^2/2)$, and for the BIO group is $\exp(\mu_{LN} + \sigma_{LN}^2/2)$, where $\mu_{LN}$ and $\sigma_{LN}$ are the location and scale parameters of the underlying log-normal distribution. Thus, the percent difference between the mean citations is $100[\exp(\mu_{LN}(r_c - 1)) - 1]$.
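The last step can be checked numerically. The Python sketch below assumes one reading of the text: the XD group's logarithmic mean equals $r_c$ times the BIO group's logarithmic mean $\mu_{LN}$, and both groups share the same scale $\sigma_{LN}$. Under that reading, the scale parameter cancels in the ratio of the two log-normal means. The function names are hypothetical.

```python
import math


def lognormal_mean(mu_ln, sigma_ln):
    """Mean of a log-normal distribution with location mu_ln and scale sigma_ln."""
    return math.exp(mu_ln + sigma_ln ** 2 / 2.0)


def percent_difference_from_rc(mu_ln, r_c):
    """Percent difference between XD and BIO mean citations.

    Assumes the XD log-mean is r_c * mu_ln and the BIO log-mean is mu_ln,
    with a common scale parameter that cancels in the ratio.
    """
    return 100.0 * (math.exp(mu_ln * (r_c - 1.0)) - 1.0)
```

With `r_c = 1` the two groups coincide and the percent difference is zero. Values of `r_c` above 1 translate into a percent difference that grows with the baseline logarithmic mean.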

SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/4/8/eaat4211/DC1

Appendix S1. Author name disambiguation.

Appendix S2. Connectivity of the F network.

Fig. S1. Robustness of the F network with respect to link removal.

Fig. S2. F network distributions for direct and mediated associations.

Fig. S3. Three perspectives on the centrality of the cross-disciplinary (XD) faculty in the direct collaboration network.

Fig. S4. Evolution of the nongiant components in the F network.

Fig. S5. Distribution of normalized citation impact by departmental affiliation and time period.

Table S1. Set of 155 biology and computing departments in the United States.

Table S2. Career data set: Pooled cross-sectional model.

Table S3. Career data set: Pooled cross-sectional model—robustness check.

Table S4. Career data set: Panel model on all faculty F.

Table S5. Career data set: Panel model on the cross-disciplinary (XD) faculty.

Table S6. Career data set: Panel model on the cross-disciplinary (XD) faculty with matched pairs.

References (62–64)

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.

REFERENCES AND NOTES

Acknowledgments: Funding: The authors acknowledge funding from the Eckhard-Pfeiffer Distinguished Professorship Fund and from NSF grant 1738163 entitled “From Genomics to Brain Science.” Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. Author contributions: A.M.P. developed methods and performed quantitative data analysis. D.M., K.K., and M.E.A. collected and curated data and also developed the software tools. I.P. designed research and linked epistemic analysis to quantitative results. A.M.P. and I.P. wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. GS and NSF/NIH RePORTER data are openly available online; Impact Factor data were obtained from the Clarivate Analytics Journal Citation Reports. WoS publication and citation data were also obtained from Clarivate Analytics. Scopus data were obtained via calls to the relevant Application Programming Interface. Supporting data are provided through the Open Science Framework repository (https://osf.io/7nb6d/). Additional data related to this paper may be requested from the authors.