- Laura Noren (New York University)
- Stuart Geiger (UC-Berkeley)
- Gretchen Gano (University of California Berkeley)
- Massimo Mazzotti (University of California, Berkeley)
- Charlotte Mazel-Cabasse (University of California, Berkeley)
- Brittany Fiore-Gartland (University of Washington)
We invite papers investigating data-driven techniques in academic research and analytic industries and the consequences of implementing data-driven products and processes. Papers utilizing computational methods or ethnography with theorization of technology, social power, or politics are encouraged.
Computational methods with large datasets are becoming more common across disciplines in academia (including social sciences) and analytic industries, but the sprawling and ambiguous boundaries of "big data" make it difficult to research. In this track we investigate the relationship between theories, instruments, methods and practices in data science research and implementation. How are such practices transforming the processes of knowledge creation and validation, as well as our understanding of empiricism and the scientific method?
Beyond case studies, we invite connective explorations of emerging theory, machinery, methods, and practices. Papers may examine data collection instruments, software, inscription devices, packages, algorithms and their interaction in sociotechnical systems used to produce, analyse, share, and validate knowledge. Looking at the way these knowledges are objectified, classified, imagined and contested, the aim is to reflect critically on the maturing practices of quantification and their historical, social, cultural, political, ideological, economic, scientific and ecological impacts.
We welcome papers tackling a variety of questions and case studies such as:
- What does it mean to study quantification (including big data) as myth, narrative, ideology, discourse, and power?
- How is instrumentation being used to connect data and theory?
- How well do we understand which domains are being reshaped by these techniques, and what are the consequences of their adoption in those domains and beyond? Is data science linking up to domains that have previously been distinct or dividing fields that had been unified?
This track is closed to new paper proposals.
Scientific Open Data: Questions of Labor and Public Benefit
While “Open data” has become a norm in science and public policy, the costs and benefits of making data open are rarely made explicit. These social costs translate into changes in workforce dynamics, and vary by domain.
Openness of publicly funded scientific data is enforced by policy, and its benefits are normally taken for granted: increasing scientific trustworthiness, enabling replication and reproducibility, and preventing duplication of efforts.
However, when public data are made open, a series of social costs arise. In some fields, such as biomedicine, scientific data have great economic value, and new business models based on the reuse of public data are emerging. In this session we critically analyze the relationship between the potential benefits and social costs of opening scientific data, which translate into changes in the workforce and challenges for current science funding models. We conducted two case studies, one medium-scale collaboration in biomedicine (FaceBase II Consortium) and one large-scale collaboration in astronomy (Sloan Digital Sky Survey, SDSS). We have conducted ethnographic participant observations and semi-structured interviews of SDSS since 2010 and FaceBase since 2015. Analyzing two domains sharpened our focus on each by enabling comparisons and contrasts. The discussion is also based on extensive document analysis.
Our goal is to unpack open data rhetoric by highlighting its relation to the emergence of new mixed private and public funding models for science and changes in workforce dynamics. We show (1) how open data are made open "in practice" and by whom; (2) how public data are reused in private industry; (3) who benefits from their reuse and how. This paper contributes to the Critical Data Studies field for its analysis of the connections between big data approaches to science, social power structures, and the policy rhetoric of open data.
It's the context, stupid: Reproducibility as a scientific communication problem
Context in data-intensive research is often seen as something that can be captured with metadata to extend reproducibility. Based on the varied ways “context” is marshalled in reproducibility practice, we argue for a nuanced view of context and a reframing of reproducibility as a communication problem.
Reproducibility has long been considered integral to scientific research and increasingly must be adapted to highly computational, data-intensive practices. Central to reproducibility is the sharing of data across varied settings. Many scholars note that reproducible research necessitates thorough documentation and communication of the context in which scientific data and code are generated and transformed. Yet there has been some pushback against the generic use of the term context (Nicolini, 2012); for, as Seaver puts it, "the nice thing about context is everyone has it" (2015). Dourish (2004) articulates two approaches to context: representational and interactional. The representational perspective sees context as stable, delineable information; in terms of reproducibility, this is the sort of context that can be captured and communicated with metadata, such as location, time, and size. An interactional perspective, on the other hand, views context not as static information but as a relational and dynamic property arising from activity; something that is much harder to capture and convey using metadata or any other technological fix. In two years of ethnographic research with scientists negotiating reproducibility in their own data-intensive work, we found "context" being marshalled in multiple ways to mean different things within scientific practice and discourses of reproducibility advocates. Finding gaps in perspectives on context across stakeholders, we reframe reproducibility as a scientific communication problem, a move that recognizes the limits of representational context for the purpose of reproducible research and underscores the importance of developing cultures and practices for conveying interactional context.
Condensing Data into Images, Uncovering "the Higgs"
In data-intensive sciences such as particle physics, images become essential sites for evidential exploration and debate through procedures of black-boxing, synthesis, and contrasting. This paper addresses the challenges of data analysis using as an example the Higgs search at the LHC (CERN).
Contemporary experimental particle physics is amongst the most data-intensive sciences and thus provides an interesting test case for critical data studies. Approximately 30 petabytes of data produced at CERN's Large Hadron Collider (LHC) annually need to be controlled and processed in multiple ways before physicists are ready to claim novel results: data are filtered, stored, distributed, analyzed, reconstructed, synthesized, etc., involving collaborations of 3000 scientists and heavily distributed work. Adopting a science-as-practice approach, this paper focuses on the associated challenges of data analysis using as an example the recent Higgs search at the LHC, based on a long-term qualitative study. In particle physics, data analysis relies on statistical reasoning. Physicists thus use a variety of standard and advanced statistical tools and procedures. I will emphasize that, and show how, the computational practice of data analysis is inextricably tied to the production and use of specific visual representations. These "statistical images" constitute "the Higgs" (or its absence) in the sense of making it "observable" and intelligible. The paper puts forward two main theses: (1) that images are constitutive of the prime analysis results due to the direct visual grasp of the data that they afford within large-scale collaborations and (2) that data analysis decisively relies on the computational and pictorial juxtaposition of "real" and "simulated data", based on multiple models of different kinds. In data-intensive sciences such as particle physics, images thus become essential sites for evidential exploration and debate through procedures of black-boxing, synthesis, and contrasting.
Data Pedagogy: Learning to Make Sense of Algorithmic Numbers
Focusing on data analytic pedagogy, this paper shows how students learn to make sense of algorithmic output in relation to data, code, and prior knowledge. I showcase this by drawing out the relation and contrast between human and machine understanding of algorithmically outputted numbers.
This paper conceptualizes data analytics as a situated process: one that necessitates iterative decisions to adapt prior knowledge, code, contingent data, and algorithmic output to each other. Learning to master such forms of iteration, adaptation, and discretion then is an integral part of being a data analyst. In this paper, I focus on the pedagogy of data analytics to demonstrate how students learn to make sense of algorithmic output in relation to underlying data and algorithmic code. While data analysis is often understood as the work of mechanized tools, I focus instead on the discretionary human work required to organize and interpret the world algorithmically, explicitly drawing out the relation between human and machine understanding of numbers, especially in the ways in which this relationship is enacted through class exercises, examples, and demonstrations. In a learning environment, there is an explicit focus on demonstrating established methods, tools, and theories to students. Focusing on data analytic pedagogy, then, helps us to not only better understand foundational data analytic practices, but also explore how and why certain forms of standardized data sensemaking processes come to be. To make my argument, I draw on two sets of empirics: participant-observation of (a) two semester-long senior/graduate-level data analytic courses, and (b) a series of three data analytic training workshops taught/organized at a major U.S. East Coast university. Conceptually, this paper draws on research in STS on social studies of algorithms, sociology of scientific knowledge, sociology of numbers, and professional vision.
Big Data or Big Codata? Flows in Historical and Contemporary Data Practices
This paper develops an empirical distinction between the aspects of “volume” and “velocity” currently conflated in theorizations of “big data”. The contrasting concept of “big codata” emphasizes streaming flows of events, contrasting data science practice with traditional social-scientific methodology.
Presently existing theorizations of "big data" practices conflate observed aspects of both "volume" and "velocity" (Kitchin 2014). The practical management of these two qualities, however, has a comparably disjunct, if interwoven, computational history: on one side, the use of large (relational and non-relational) database systems, and on the other, the handling of real-time flows (the world of dataflow languages, stream and event processing, and message queues). While the commercial data practices of the late 20th century were predicated on an assumption of comparably static archival (the site-specific "mining" of data "warehouses"), much of the novelty and value of contemporary "big data" sociotechnics is in fact predicated on the harnessing and processing of vast flows of events generated by the conceptually-centralized/physically-distributed datastores of Google, Facebook, LinkedIn, etc. These latter processes, which I refer to as "big codata", have their origins in IBM's mainframe updating of teletype message switching, were adapted for Wall Street trading firms in the 1980s, and have a contemporary manifestation in distributed "streaming" databases and message queues like Kafka and StormMQ, in which one differentially "subscribes" to brokered event streams for real-time visualization and analysis. Through ethnographic interviews with data science practitioners in various commercial startup and academic environments, I will contrast these technologies and techniques with those of traditional social-scientific methods, which may begin with empirically observed and transcribed "codata" but typically subject the resultant inert "dataset" to a far less real-time sequence of material and textual transformations (Latour 1987).
Talking to Non-Experts about Data: Translating and Synthesizing Modeling Data in Design Teams
How do engineers translate data to teams? How do these teams synthesize these data into design & construction decisions? We studied 14 hospital projects to find the mechanisms and strategies of communicating data. We learn more about the "last mile" of data science: integration into team decisions.
Teams often struggle to incorporate data into their decisions. We studied energy modeling in the design and construction of health care projects. How do engineers translate complex data to teams of architecture, engineering, & construction professionals? How do these teams synthesize these data into design & construction decisions? We studied 14 hospital projects to find the mechanisms and strategies of communicating data for informing better collaboration in the design process. Successful energy engineers made data meaningful to people who may not have expertise with data. We find:
1) Communication Shapes Data: The goals of the project and the concerns of the design shape what kinds of energy analyses engineers do and how they share the data. The team's understanding of the modeling software's technical capacities and constraints influences the analysis and how they use it for design decisions.
2) Data Are Facts, Probabilities & Negotiations: Engineers must convince other people who have different data skills. Non-engineers can see a model's data as a hard fact or truth or as a range of probabilities. The assumptions and construction of the energy model are negotiated and debated during team meetings and altered to represent the needs of design team members.
3) Engineers Bundle Data to Navigate Data's Competing Needs: Successful engineers use data communication strategies that anticipate the needs of their audience and reflect their own goals. For presentation to non-experts, they bundle data with analyses and stories based on what the team wants and what they judge from experience to be optimal.
Emerging Practices of Data-Driven Accountability in Healthcare: Individual Attribution of C-Sections
Through ethnographic research on obstetrical care, I describe a change in scale from performance measurement of hospitals to individual clinicians, and attendant dilemmas related to data quality management and tradeoffs between professional discretion and accountability.
This paper examines the implementation and consequences of data science in a specific domain: evaluation and regulation of healthcare delivery. Recent iterations of data-driven management expand the dimensions along which organizations are evaluated and utilize a growing array of non-financial measures to audit performance (i.e. adherence to best practices). Abstract values such as "quality" and "effectiveness" are operationalized through design and implementation of certain performance measurements: it is not just outcomes that demonstrate the quality of service provision, but the particular practices engaged during service delivery.
Recent years have seen the growth of a controversial new form of data-driven accountability in healthcare: application of performance measurements to the work of individual clinicians. Fine-grained performance measurements of individual providers were once far too resource intensive to undertake, but expanded digital capacities have made provider-level analyses feasible. Such measurements are being deployed as part of larger efforts to move from "volume-based" to "value-based" or "pay for performance" payment models.
Evaluating individual providers, and deploying pay for performance at the individual (rather than the organizational) level is a controversial idea. Critics argue that the measurements reflect a tiny sliver of any clinician's "quality," and that such algorithmic management schemes will lead professionals to focus on only a small number of measured activities. Despite these and other concerns, such measurements are on the horizon. I will discuss early ethnographic findings on implementation of provider-level cesarean section measurements, describing tensions between professional discretion and accountability and rising stakes of data quality in healthcare.
The (in)credibility of data science methods to non-experts
This paper explores the quantification practices through which models and algorithms are created, maintained and contested. It draws on data collected in the analytical industry and government in the UK and the Netherlands to illustrate how non-experts evaluate the credibility of highly technical objects.
The rapid development and dissemination of data science methods, tools and libraries allow for the development of ever more intricate models and algorithms. Such digital objects are simultaneously the vehicle and outcome of quantification practices and may embody a particular world-view with associated norms and values. More often than not, a set of specific technical skills is required to create, use or interpret these digital objects. As a result, the mechanics of the model or algorithm may be virtually incomprehensible to non-experts.
This is of consequence for the process of knowledge creation because it may introduce power asymmetries and because successful implementation of models and algorithms in an organizational context requires that all those involved have faith in the model or algorithm. This paper contributes to the sociology of quantification by exploring the practices through which non-experts ascertain the quality and credibility of digital objects as myths or fictions. By considering digital objects as myths or fictions, the codified nature of these objects comes into focus.
This permits the illustration of the practices through which experts and non-experts develop, maintain, question or contest such myths. The paper draws on fieldwork conducted in government and analytic industry in the form of interviews, observations and documents to illustrate and contrast the practices which are available to non-experts and experts in bringing about the credibility or incredibility of such myths or fictions. It presents a detailed account of how digital objects become embedded in the organisations that use them.
Big data and the mythology of algorithms
Big data relies on algorithms, which are typically presented as objective and unbiased. They are not. As they become more deeply entangled in our lives, it is important to understand the implications of the roles they are playing. This paper critically analyzes this mythology of algorithms.
There are no big data without algorithms. Algorithms are sociotechnical constructions and reflect the social, cultural, technical and other values embedded in their contexts of design, development, and use. The utopian "mythology" (boyd and Crawford 2011) about big data rests, in part, on the depiction of algorithms as objective and unbiased tools operating quietly in the background. As reliable technical participants in the routines of life, their impartiality provides legitimacy for the results of their work. This becomes more significant as algorithms become more deeply entangled in our online and offline lives, where we generate the data they analyze. They create "algorithmic identities," profiles of us based on our digital traces that are "shadow bodies," emphasizing some aspects and ignoring others (Gillespie 2012). They are powerful tools that use these identities to dynamically shape the information flows on which we depend in response to our actions and decisions made by their owners.
Because this perspective tends to dominate the discourse about big data, thereby shaping public and scientific understandings of the phenomenon, it is necessary to subject it to critical review as an instance of critical data studies. This paper interrogates algorithms as human constructions and products of choices that have a range of consequences for their users and owners; issues explored include:
- The epistemological implications of big data algorithms
- The impacts of these algorithms in our social and organizational lives
- The extent to which they encode power and the ways in which this power is exercised
- The possibility of algorithmic accountability
Infrastructuring data analysis in Digital methods with digital data and tools
The presentation draws on ethnographic research to describe data practices of appropriating the web for social research in Digital methods as layers of infrastructuring. The web is mediated by community infrastructures to support iterative assembling of the local infrastructure of a knowledge space.
The Digital methods approach seeks the strategic appropriation of digital resources on the web for social research. I apply grounded theory to theorize how data practices in Digital methods are entangled with the web as a socio-technical phenomenon. My account draws on public sources of Digital methods and ethnographic research of semester-long student projects based on observations, interviews and project reports. It is inspired by Hutchins' call for understanding how people "create their cognitive powers by creating the environments in which they exercise those powers". The analysis draws on the lens of infrastructuring to show that making environments for creativity in Digital methods is a distributed process, which takes place on local and community levels with distinct temporalities. Digital methods is predicated on creating its local knowledge space for social analysis by pulling together digital data and tools from the web, and this quick local infrastructuring is supported by layers of slower community infrastructures which mediate the digital resources of the web for a Digital methods style analysis by means of translation and curation. Overall, the socially distributed, infrastructural style of data practice is made possible by the web as a socio-technical phenomenon predicated on openness, sharing and reuse. On the web, new digital resources are readily available to be incorporated into the local knowledge space, making way for an iterative, exploratory style of analysis, which oscillates between infrastructuring and inhabiting a local knowledge space. The web also serves as a socio-technical platform for community practices of infrastructuring.
"An afternoon hack" Enabling data driven scientific computing in the open
Scientific computing, or e-science, has enabled the development of large data-driven scientific initiatives. The research focuses on the socio-technical conditions of the development of free and reproducible computational scientific tools and the system of values that supports them.
Scientific computing, or e-science, has enabled the development of large data-driven scientific initiatives. A significant part of these projects relies on the software infrastructures and tool stacks that make it possible to collect, clean and compute very large data sets.
Based on anthropological research among a community of open developers and/or scientists contributing to SciPy, the open source Python library used by scientists to enable the development of technologies for big data, the research focuses on the socio-technical conditions of the development of free and reproducible computational scientific tools and the system of values that supports them.
Entering the SciPy community for the first time is entering a community of learners. People who are convinced that for each problem there is a function (and if there is not, one should actually create one), who think that everybody can (and probably should) code, who have been living between at least two worlds (sometimes more) for a long time: academia and the open software community, and for some, different versions of the corporate world.
Looking at the personal trajectories of these scientists turned open software developers, this paper will investigate the way in which a relatively small group of dedicated people has been advancing a new agenda for science, defined as open and reproducible, through carefully designed data infrastructures, workflows and pipelines.
Playing with educational data: the Learning Analytics Report Card (LARC)
The field of ‘learning analytics’ is gaining significant traction in education, often driven by uncritical government and corporate research agendas. This paper describes the ‘LARC’: an interdisciplinary project investigating critical and student-focused educational data analysis.
Education has become an important site for computational data analysis, and the burgeoning field of 'learning analytics' is gaining significant traction, motivated by the proliferation of online courses and large enrolment numbers. However, while this 'big data' and its analysis continue to be hyped across academic, government and corporate research agendas, critical and interdisciplinary approaches to educational data analysis are in short supply. Driven by narrow disciplinary areas in computer science, learning analytics is not only 'blackboxed', in other words subject to a propensity to 'focus only on its inputs and outputs and not on its internal complexity' (Latour 1999, p304), but also abstracted and distanced from the activities of education itself. This methodological estrangement may be particularly problematic in an educational context where the fostering of critical awareness is valued. The first half of this paper will describe three ways in which we can understand this 'distancing', and how it is implicated in enactments of power within the material conditions of education: the institutional surveilling of student activity; the mythologizing of empirical objectivity; and the privileging of prediction. The second half of the paper will describe the development of a small scale and experimental learning analytics project undertaken at the University of Edinburgh that sought to explore some of these issues. Entitled the Learning Analytics Report Card (LARC), the project investigated playful ways of offering student choice in the analytics process, and the fostering of critical awareness of issues related to data analysis in education.
Data science / science studies
Inside universities, data science is practically co-located with science studies. How can we use that proximity to shape how data science gets done? This paper reports on experiments in data science research and organizational/strategic design.
Inside universities, data science is practically co-located with science studies. How can we use that proximity to shape how data science gets done? Drawing on theorizations of collaboration as a research strategy, embedded ethnography, critical technical practice, and design intervention, this paper reports on experiments in data science research and organizational/strategic design. It presents intellectual tools for working on data science (conceptual distinctions such as data science as specialty, platform, and surround; temporal narratives that capture practitioners' conjoint sense of prospect and dread) and explores modes of using these tools in ways that get uptake and do work. Finally, it draws out possible consequences of the by now sometimes well-anchored situation of science studies/STS inside universities, including having science studies scholars in positions of institutional leverage.
Critical Information Practice
A pedagogical model grounded in interpretive learning experiences: collecting data from messy sources, processing data with an eye towards what algorithms occlude, and presenting data through creative forms like narrative and sculpture.
Big Data has been described as a death knell for the scientific method (Anderson, 2008), a catalyst for new epistemologies (Floridi, 2012), a harbinger for the death of politics (Morozov, 2014), and "a disruptor that waits for no one" (Maycotte, 2014). Contending with Big Data, as well as the platitudes that surround it, necessitates a new kind of data literacy. Current pedagogical models, exemplified by data science and data visualization, too often introduce students to data through sanitized examples, black-boxed algorithms, and standardized templates for graphical display (Tufte, 2001; Fry, 2008; Heer, 2011). Meanwhile, these models overlook the social and political implications of data in areas like healthcare, journalism and city governance. Scholarship in critical data studies (boyd and Crawford, 2012; Dalton and Thatcher, 2014) and critical visualization (Hall, 2008; Drucker 2011) has established the necessary foundations for an alternative to purely technical approaches to data literacy. In this paper, we explain a pedagogical model grounded in interpretive learning experiences: collecting data from messy sources, processing data with an eye towards what algorithms occlude, and presenting data through creative forms like narrative and sculpture. Building on earlier work by the authors in the area of 'critical making' (Ratto), this approach, which we call critical information practice, offers a counterpoint for students seeking reflexive and materially-engaged modes of learning about the phenomenon of Big Data.
Actor-Network VS Network Analysis VS Digital Networks: Are We Talking About the Same Networks?
Our contribution discusses the differences and affinities among three types of networks (namely Actor-Network, Network Analysis and Digital Networks) that are playing an increasingly important role in digital STS.
In the last few decades, the idea of 'network' has slowly but steadily colonized broad strands of STS research. This colonization started with the advent of actor-network theory, which provided a convenient set of notions to describe the construction of socio-technical phenomena. Then came network analysis, and scholars who imported into STS the techniques of investigation and visualization developed in the tradition of social network analysis and scientometrics. Finally, with the increasing 'computerization' of STS, scholars turned their attention to digital networks as a way of tracing collective life.
Many researchers have more or less explicitly tried to link these three movements in one coherent set of digital methods for STS, betting on the idea that actor-network theory can be operationalized through network analysis thanks to the data provided by digital networks. Yet, to be honest, little proves the continuity among these three objects besides the homonymy of the word 'network'. Are we sure that we are talking about the same networks?
Data scientists construct and navigate data spaces. Where critical data studies has focused on flaws in these spaces' construction, this paper examines their navigation. Studies of navigation illuminate key features of data science, particularly the interrelation of maps, spaces, plans, and action.
Data scientists summon space into existence. Through gestures in the air, visualizations on screen, and loops in code, they locate data in spaces amenable to navigation. Typically, these spaces embody a Euro-American common sense: things near each other are similar to each other. This principle is evident in the work of algorithmic recommendation, for instance, where users are imagined to navigate a landscape composed of items arranged by similarity. If you like this hill, you might like the adjacent valley. Yet the topographies conceived by data scientists also pose challenges to this spatial common sense. They are constantly reconfigured by new data and the whims of their minders, subject to dramatic tectonic shifts, and they can be more than 3-dimensional. In highly dimensional spaces, data scientists encounter the "curse of dimensionality," by which human intuitions about distance fail as dimensions accumulate. Work in critical data studies has conventionally focused on the biases that shape these spaces. In this paper, I propose that critical data studies should not only attend to how representative data spaces are, but also to the techniques data scientists use to navigate them. Drawing on fieldwork with the developers of algorithmic music recommender systems, I describe a set of navigational practices that negotiate with the shifting, biased topographies of data space. Recalling a classic archetype from STS and anthropology, these practices complicate the image of the data scientist as rationalizing, European map-maker, resembling more closely the situated interactions of the ideal-typical Micronesian navigator.