Module 2: Analysis of Topic Networks

Faculty Contact: Joao Hespanha, Stacy Hespanha

Concepts: Semantic networks, social network analysis, academic publications, scientometrics, LDA topic modeling, Rao-Stirling diversity

Research Areas:

Datasets: ~400,000 Web of Science records for all articles published between 1996 and late 2014 in 108 top journals in the fields of Ecology, Biodiversity Conservation, Evolutionary Biology, Fisheries, and Forestry; a subset of this corpus (N=~2,500) represents all publications produced by two NSF-funded synthesis centers – NCEAS (nceas.ucsb.edu) and NESCent (nescent.org); LDA topic model data at various resolutions for this corpus; 3-component Rao-Stirling diversity metric (variety, evenness/balance, and disparity) for each document; additional journal article metadata such as publication date, journal of publication, number of citations, etc.

Abstract: Most computational advances in understanding the interdisciplinarity, novelty, and innovation of academic research are based on analysis of citation rates, co-authorship or co-citation networks, or imprecise estimates of semantic content derived from the disciplinary categorization of the journals in which papers are published (see Wagner et al., 2011, for a review). While these approaches have yielded important insights, there remain exciting opportunities to apply approaches that model the thematic content of the texts themselves. Just as scientometricians have typically used citation or authorship data to construct networks, academic publications can be modeled as a dynamic network of objects whose relationships are based on the degree to which they share the same statistically modeled themes, or ‘topics’. Better understanding of how this type of network analysis could be used to identify research innovation, emerging subdisciplines, or ‘hot’ topics in academic publications is the next frontier in scientometrics research, and it is highly relevant to the growing interest in using metrics to inform science policy.
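To make the idea of a topic-based document network concrete, the following is a minimal sketch, not the project's method. It assumes each document is available as a vector of topic proportions that sums to 1, measures how far apart two documents are with the Hellinger distance (one of several reasonable choices for comparing probability distributions), and links two documents when that distance falls below a cutoff. The function names, the `max_distance` cutoff of 0.5, and the three toy documents are all hypothetical.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no overlapping support.
    """
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))


def build_topic_network(doc_topics, max_distance=0.5):
    """Return weighted edges {(doc_i, doc_j): similarity}.

    An edge is kept only when the Hellinger distance between the two
    topic mixtures is at most max_distance (an assumed cutoff); the
    edge weight is 1 - distance, so closer documents get heavier edges.
    """
    edges = {}
    for (i, p), (j, q) in combinations(doc_topics.items(), 2):
        distance = hellinger(p, q)
        if distance <= max_distance:
            edges[(i, j)] = 1.0 - distance
    return edges


# Hypothetical toy corpus: three documents over three topics.
doc_topics = {
    "paper_a": [0.70, 0.20, 0.10],
    "paper_b": [0.60, 0.30, 0.10],
    "paper_c": [0.05, 0.05, 0.90],
}

edges = build_topic_network(doc_topics)
for pair, weight in edges.items():
    print(pair, round(weight, 3))
```

In this toy example only `paper_a` and `paper_b` end up connected, because `paper_c` is dominated by a different topic. Comparing every pair is quadratic in the number of documents (on the order of 8 × 10^10 pairs for 400,000 records), so a corpus of this size would likely need sparsification or approximate nearest-neighbor search instead.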

The proposed project involves the use of probabilistically modeled latent topic data from a corpus of approximately 400,000 titles, abstracts, and keywords from 100+ top journals in the fields of Ecology, Biodiversity Conservation, Evolutionary Biology, Fisheries, and Forestry. The data represent each document as a proportional mixture of the discovered latent topics. These topic model data would be the basis for network analyses, and information such as publication date, journal of publication, number of authors, and previously calculated Rao-Stirling diversity metrics (variety, balance, disparity) could serve as additional node attributes. Problems to be addressed include:

  • Identification of the best methods for constructing a topic-based network for subsequent analyses. Topic model-based estimates of relationships between documents will be multidimensional and continuous rather than discrete. Identifying the best methods could include the following approaches:
    • Compare networks created using a variety of approaches to evaluate the degree of agreement between them
    • Evaluate the convergence of network-based document grouping (e.g., graph segmentation, community detection) with other methods for grouping documents (e.g., k-means)
    • Evaluate relationships between node properties (e.g., various centrality metrics) and Rao-Stirling diversity metrics
    • Examine where key papers from the history of science literature fall in the network, or inspect interesting-looking nodes by reading the papers themselves
  • Integration of temporal component of data (i.e., publication date) in analytic approach
  • Creation of data visualizations that convey interesting properties of the network and results of analyses
  • Description of the degree to which ‘interesting’ papers from the relevant history and sociology of science literature have distinctive properties in the network representation
  • Articulation of how topic-based network analysis could be useful as a metric for science policy, including:
    • Report on the degree to which network-based node properties (e.g., centrality) relate to Rao-Stirling diversity measures based on the same data
    • Ideas for how topic-based analyses could be combined with standard scientometric approaches to refine current approaches to detection of innovation and interdisciplinarity
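One of the comparisons listed above, relating node centrality to Rao-Stirling diversity, can be sketched as follows. This is an illustration under stated assumptions rather than the project's procedure: Rao-Stirling diversity is computed in its basic form as the sum over ordered topic pairs i ≠ j of p_i · p_j · d_ij, the topic-to-topic disparity matrix is assumed to be given, centrality is taken to be weighted degree, and the relationship is summarized with a Pearson correlation. All function names and toy values are hypothetical.

```python
import math


def rao_stirling(p, disparity):
    """Rao-Stirling diversity of one document.

    Sums p_i * p_j * d_ij over ordered topic pairs with i != j, where p
    holds topic proportions and disparity[i][j] is the assumed distance
    between topics i and j.
    """
    n = len(p)
    return sum(p[i] * p[j] * disparity[i][j]
               for i in range(n) for j in range(n) if i != j)


def weighted_degree(edges):
    """Weighted degree centrality: sum of incident edge weights per node."""
    degree = {}
    for (u, v), weight in edges.items():
        degree[u] = degree.get(u, 0.0) + weight
        degree[v] = degree.get(v, 0.0) + weight
    return degree


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy inputs: four documents, three topics.
doc_topics = {
    "a": [0.70, 0.20, 0.10],
    "b": [0.60, 0.30, 0.10],
    "c": [0.05, 0.05, 0.90],
    "d": [0.34, 0.33, 0.33],
}
# Assumed symmetric topic-to-topic disparity matrix (zero diagonal).
disparity = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
# Assumed similarity-weighted edges of a document network.
edges = {("a", "b"): 0.92, ("a", "d"): 0.70, ("b", "d"): 0.74, ("c", "d"): 0.55}

diversity = {doc: rao_stirling(p, disparity) for doc, p in doc_topics.items()}
centrality = weighted_degree(edges)

docs = sorted(doc_topics)
r = pearson([centrality[d] for d in docs], [diversity[d] for d in docs])
print("Pearson r between weighted degree and Rao-Stirling diversity:", round(r, 3))
```

With real data, other centrality measures (betweenness, eigenvector) and a rank-based correlation might be more informative; the structure of the comparison would stay the same.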

Active Quarters:

  • Spring 2015, Elizabeth Forbes, Alex Kulick and Rafael Melendez-Rios
  • Fall 2015, Elizabeth Forbes

Wagner, C.S., J.D. Roessner, K. Bobb, J. Thompson Klein, K.W. Boyack, J. Keyton, I. Rafols, & K. Börner (2011). Approaches to understanding and measuring interdisciplinary scientific research (IDR): A review of the literature. Journal of Informetrics, 5(1), 14-26.