Thesis ID: CBB117865546

Topic Modeling the Reading and Writing Behavior of Information Foragers (2019)


How do individuals create a knowledge base over a lifetime? Charles Darwin left detailed records of every book he read from The Voyage of the Beagle to just after publication of The Origin of Species. Additionally, he left copies of his drafts before publication. I use these records to build a case study of how reading and writing interact to create conceptual novelties, such as the theory of natural selection and modification by descent. The model is extended to cover entire disciplines by bootstrapping reading and writing histories from bibliographies in scientific publications, scaling it to address the question of how we move from individual psychology to society. Two central components from cognitive science shape the proposed models. The first is bounded cognition: people have limited attention, and that attention is further limited by an individual’s information processing ability. The second is information foraging, a framework for managing the trade-off between exploration of new information and exploitation of existing knowledge when searching for information. Most existing work on information foraging and bounded cognition examines short-term information foraging problems, such as formulating web search queries in a laboratory setting with a known information goal. Through the case study of Charles Darwin, I use real-world datasets to explore this problem at a timescale of decades with unknown information goals. The base of the reading model is topic modeling with Latent Dirichlet Allocation (LDA). This method reduces the dimensionality of text by reducing each document to a topic distribution, where each topic is defined as a probability distribution over the words in the collection. With these probability distributions, I am able to apply information-theoretic measures to calculate the divergence between texts.
These divergences characterize a particular reading decision as exploiting the topics exposed by previously read texts or exploring new topics. I do not train these topic models on the records themselves; instead, I identify each volume in the HathiTrust Digital Library and train the topic model on the full text of the books. While Darwin’s reading notebooks and manuscript drafts provide relatively precise information on reading and writing behaviors at a day-level granularity, that type of data is rare. I explore three extensions of the models, dealing with progressively more “fuzzy” data. First, I look at the contents of Darwin’s Library at the time of his death to infer his readings from 1860 to 1882. These readings are used to provide a preliminary analysis of his work on The Descent of Man and the later editions of the Origin of Species. Then, I look at another historical figure: Thomas Jefferson, whose working library formed the basis of the Library of Congress. I examine the bibliography of his retirement library and tie it into his correspondence to find possible evidence for when certain volumes were read. Finally, I scale the model up to the discipline of neuroscience. I extract citation graphs from the Web of Science to infer reading histories for neuroscientists based on the articles they cited. I use the text of the abstracts of these articles to perform an analysis of readings and writings similar to the Darwin case study. These extensions of the model highlight the potential to work with less precise data and illuminate future problems. Throughout the work, I emphasize the notion of multiple realizability and interpretive pluralism. Each model is itself a population of models, and while simpler term-frequency-based models may show many of the same effects as the topic models, I argue for the explanatory power of the topic model with respect to causality.
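
As a rough illustration of the divergence step described in the abstract, the sketch below compares topic distributions for consecutive readings. It is not the thesis's implementation: the abstract does not say which information-theoretic measure is used, so the Kullback-Leibler and Jensen-Shannon divergences here are assumptions, and the three-topic distributions and the names `previous`, `similar`, and `different` are hypothetical stand-ins for what a trained LDA model would output.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q), in bits, between two
    discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:  # terms with p_i = 0 contribute nothing
            # eps guards against division by zero when q_i = 0
            total += pi * math.log2(pi / max(qi, eps))
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric measure built from KL,
    bounded between 0 and 1 when computed in bits."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions over three topics (assumed values,
# not data from the thesis).
previous = [0.70, 0.20, 0.10]   # text read most recently
similar = [0.65, 0.25, 0.10]    # next text stays on the same topics
different = [0.05, 0.15, 0.80]  # next text moves to a new topic

exploit_score = js_divergence(previous, similar)
explore_score = js_divergence(previous, different)
print(f"similar text:   {exploit_score:.4f} bits")
print(f"different text: {explore_score:.4f} bits")
```

Under these assumptions, a low divergence from previously read texts would be read as exploitation and a high divergence as exploration. Where to place the boundary between the two is a modeling choice that this sketch leaves open.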

Citation URI
https://data.isiscb.org/isis/citation/CBB117865546/

Similar Citations

Article Grant Ramsey; Charles H. Pence; (2016)
evoText: A new tool for analyzing the biological sciences (/isis/citation/CBB014385965/)

Article Melinda Baldwin; (2018)
A Perspective from the History of Scientific Journals (/isis/citation/CBB030888393/)

Article Theodore M. Porter; (2018)
Digital Humanism (/isis/citation/CBB751955814/)

Article Anu Masso; Maris Männiste; Andra Siibak; (2020)
‘End of Theory’ in the Era of Big Data: Methodological Practices and Challenges in Social Media Studies (/isis/citation/CBB632756299/)

Article Abraham Gibson; Manfred D. Laubichler; Jane Maienschein; (2019)
Introduction to Focus: Computational History and Philosophy of Science (/isis/citation/CBB323182392/)

Thesis Currier, James David; (2007)
“Greedy for Facts”: Charles Darwin's Information Needs and Behaviors (/isis/citation/CBB001560886/)

Thesis Damerow, Julia; (2014)
A Quadruple-Based Text Analysis System for History and Philosophy of Science (/isis/citation/CBB001567603/)

Article Deryc T. Painter; Bryan C. Daniels; Jürgen Jost; (2019)
Network Analysis for the Digital Humanities: Principles, Problems, Extensions (/isis/citation/CBB443684783/)

Article Kenneth D. Aiello; Michael Simeone; (2019)
Triangulation of History Using Textual Data (/isis/citation/CBB253321424/)

Book Peter Janich; (2018)
What Is Information? (/isis/citation/CBB403064080/)

Article McCarthy, Gavan; (2011)
Mapping the Past: Building Public Knowledge Places to Meet Community Needs (/isis/citation/CBB001251178/)

Chapter Downey, Greg; (2007)
The librarian and the Univac: automation and labor at the 1962 Seattle World's Fair (/isis/citation/CBB001180032/)

Book Antonio Badia; (2019)
The Information Manifold: Why Computers Can't Solve Algorithmic Bias and Fake News (/isis/citation/CBB524320511/)

Article Fosse, Sébastien de la; (2013)
Media and Cognition: The Relationship between Thought Structures and Media Structures (/isis/citation/CBB001201747/)

Thesis Kouper, Inna; (2011)
The Meanings of (Synthetic) Life: A Study of Science Information as Discourse (/isis/citation/CBB001567283/)

Authors & Contributors
Siibak, Andra
Daniels, Bryan C.
Jost, Jürgen
Flis, Ivan
Aiello, Kenneth D.
Pao, Lea
Concepts
Digital humanities
Information theory
Information science
Data analysis
Text mining
Data collection; methods
Time Periods
21st century
20th century, late
20th century
19th century
18th century
Places
United States
Soviet Union