
Crowdsourcing Performance Evaluations of User Interfaces

Steven Komarov, Katharina Reinecke, Krzysztof Z. Gajos
Intelligent Interactive Systems Group
Harvard School of Engineering and Applied Sciences
33 Oxford St., Cambridge, MA 02138, USA
{komarov, reinecke, kgajos}@seas.harvard.edu

ABSTRACT
Online labor markets, such as Amazon’s Mechanical Turk (MTurk), provide an attractive platform for conducting human subjects experiments because the relative ease of recruitment, low cost, and a diverse pool of potential participants enable larger-scale experimentation and a faster experimental revision cycle compared to lab-based settings. However, because the experimenter gives up direct control over the participants’ environments and behavior, concerns about the quality of the data collected in online settings are pervasive. In this paper, we investigate the feasibility of conducting online performance evaluations of user interfaces with anonymous, unsupervised, paid participants recruited via MTurk. We implemented three performance experiments to re-evaluate three previously well-studied user interface designs. We conducted each experiment both in lab and online with participants recruited via MTurk. The analysis of our results did not yield any evidence of significant or substantial differences in the data collected in the two settings: All statistically significant differences detected in lab were also present on MTurk, and the effect sizes were similar. In addition, there were no significant differences between the two settings in the raw task completion times, error rates, consistency, or the rates of utilization of the novel interaction mechanisms introduced in the experiments. These results suggest that MTurk may be a productive setting for conducting performance evaluations of user interfaces, providing a complementary approach to existing methodologies.

Author Keywords
Crowdsourcing; Mechanical Turk; User Interface Evaluation

ACM Classification Keywords
H.5.2. Information Interfaces and Presentation (e.g. HCI): User Interfaces - Evaluation/Methodology

INTRODUCTION
Online labor markets, such as Amazon’s Mechanical Turk (MTurk), have emerged as an attractive platform for human subjects research. Researchers are drawn to MTurk because the relative ease of recruitment affords larger-scale experimentation (in terms of the number of conditions tested and the number of participants per condition), a faster experimental revision cycle, and potentially greater diversity of participants compared to what is typical for lab-based experiments in an academic setting [15, 24].

The downside of such remote experimentation is that the researchers give up the direct supervision of the participants’ behavior and the control over the participants’ environments. In lab-based settings, the direct contact with the experimenter motivates participants to perform as instructed and allows the experimenter to detect and correct any behaviors that might compromise the validity of the data.
Because remote participants may lack the motivation to focus on the task, or may be more exposed to distraction than lab-based participants, concerns about the quality of the data collected in such settings are pervasive [11, 19, 23, 26].

A variety of interventions and filtering methods have been explored to either motivate participants to perform as instructed or to assess the reliability of the data once it has been collected. Such methods have been developed for experiments that measure visual perception [8, 12, 3], decision making [20, 27, 14], and subjective judgement [11, 19, 21].

Missing from the literature and practice are methods for remotely conducting performance-based evaluations of user interface innovations (such as evaluations of novel input methods, interaction techniques, or adaptive interfaces) where accurate measurements of task completion times are the primary measure of interest. Such experiments may be difficult to conduct in unsupervised settings for several reasons. First, poor data quality may be hard to detect: For example, while major outliers caused by a participant taking a phone call in the middle of the experiment are easy to spot, problems such as a systematically slow performance due to a participant watching TV while completing the experimental tasks may not be as easily identifiable. Second, the most popular quality control mechanisms used in paid crowdsourcing, such as gold standard tasks [1, 18], veri…
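The contrast above between a one-off interruption and sustained slowness can be made concrete with a simple screening rule. The sketch below is not from the paper; the median/MAD-based rule, the threshold k, and the example timings are illustrative assumptions. It shows that a per-trial outlier screen readily flags a single very long trial (e.g., a phone call) but reports nothing unusual for a participant who is uniformly slow throughout.

# Minimal sketch (illustrative assumption, not the paper's method): a per-trial
# outlier screen based on the median and the median absolute deviation (MAD).
import statistics

def flag_outlier_trials(times, k=3.0):
    """Return trials more than k MADs above the median completion time."""
    med = statistics.median(times)
    mad = statistics.median([abs(t - med) for t in times]) or 1e-9
    return [t for t in times if (t - med) / mad > k]

# Participant A: one interrupted trial (e.g., a phone call) among fast trials.
a_times = [1.2, 1.3, 1.1, 1.4, 45.0, 1.2, 1.3]
# Participant B: distracted throughout, so every trial is uniformly slow.
b_times = [2.6, 2.8, 2.7, 2.9, 2.5, 2.8, 2.7]

print(flag_outlier_trials(a_times))  # [45.0] -- the interruption is flagged
print(flag_outlier_trials(b_times))  # []     -- systematic slowness goes undetected

Catching the second case would require comparing a participant against other participants rather than screening individual trials, which is why such problems "may not be as easily identifiable."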
