Benchmarking measures the current usability of a system to provide a baseline against which future systems can be compared.
Although measures are most often of user performance, to cover all aspects of usability, they should also include measures of satisfaction. The method used for benchmarking is usually summative usability testing.
Berkun, S (2003) The art of usability benchmarking
Keijzers, J., Ouden, E.d., Lu, Y. (2008) Usability benchmark study of commercially available smart phones: cell phone type platform, PDA type platform and PC type platform. Mobile HCI 2008: 265-272.
NIST (2007) Usability Performance Benchmarks For the Voluntary Voting System Guidelines Prepared at the direction of the HFP Subcommittee of the TGDC
Benefits, Advantages and Disadvantages
Benchmarking a product or website against competitors is useful for identifying features that enhance usability and, when used to compare one version of a product/website with a later one, can provide proof of usability improvement. If the benchmark study shows that your product/site is deficient, management can be motivated to fund design improvements and additional usability research. If the benchmark study shows that your product is superior, your organization's product marketing can use that information in marketing campaigns.
Benchmark studies are similar to summative usability studies in that they test usability of a functioning product, application, or website. The goal is to collect data along specific measures such as error rate, number of clicks, success/failure, and satisfaction ratings. If you conduct summative studies, you already know how to design a benchmark study.
The challenge with benchmark studies is in their size. The number of sessions you need to run must be sufficient for performing a meaningful statistical analysis. This number can vary from 24 to 60 sessions (or more) per product/site, depending on the difference in performance you expect (and ultimately find), the level of confidence desired in the results, and the margin of error that is acceptable. See [the statistics section] for more information on power analysis, confidence intervals, and margin of error.
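As a rough illustration of how the expected difference, confidence level, and power drive the session count, the sketch below estimates sessions per product for comparing two task success rates. It assumes a two-sided z-test for two independent proportions with a normal approximation; the function name, the default alpha of 0.05, and the default power of 0.80 are illustrative choices, not part of the method described here. A statistician should confirm the calculation for a real study.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sessions_per_product(p1, p2, alpha=0.05, power=0.80):
    """Approximate sessions needed per product to detect a difference
    between expected success rates p1 and p2 (two-sided z-test,
    normal approximation, equal group sizes)."""
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("success rates must be strictly between 0 and 1")
    if p1 == p2:
        raise ValueError("expected success rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # confidence level
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    p_bar = (p1 + p2) / 2                          # pooled rate under H0
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical expectation: 60% success on one product, 90% on the other.
    print(sessions_per_product(0.60, 0.90))
    # A smaller expected difference requires more sessions.
    print(sessions_per_product(0.70, 0.90))
```

Under these assumptions, the smaller the difference you expect between products, the more sessions each product needs, which is why the expected difference belongs in the planning discussion.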
In addition, the preparation required to design a benchmark study is greater than that for a "one-off" summative test. The study must be designed for repeatability without change. The user profiles, tasks performed, and measures used (and the definition of those measures) must be consistent for each round of research.
Benchmark studies are more expensive to conduct than formative or summative usability studies because of the additional preparation and the greater number of sessions required. The return is worth the investment when it justifies usability improvements or provides favorably comparative statements that product marketing can use.
To minimize the expense of a benchmark study and increase its cost effectiveness, consider conducting it as an automated study [see the automated study section]. Assuming no one reviews recordings after the fact, this approach supports the greater number of sessions the method requires at a lower cost per participant (mainly the incentive). The downside is the lack of first-hand observation of the participant's behavior, which is critical for descriptive error coding.
Otherwise, the best way to minimize the expense of a benchmark study, as with any study, is to identify the focus of the study clearly and conduct only the tasks for that focus, minimizing session time and collecting only the data that answers the key issues of the study.
Benchmark testing that includes error rates, click counts, and success/failure data is most appropriately used with a product/website that supports full user interactivity and natural user behavior. A partially functional prototype is not appropriate, as it introduces variables whose effect cannot be measured. For example, a user who clicks on user interface elements and receives no response because the product/site is a mockup may start hesitating to click in general.
For a benchmark study of user attitudes and opinions, a less than fully functional prototype can be used, with a moderator who drives the user through the experience and asks questions.
The moderator chosen to run a benchmark study must be well-trained in facilitation techniques to be 100% consistent from one session to the next, through all sessions conducted.
See the procedure for conducting a summative usability study.
Decide how many sessions you need to run per product/website variation.
Decide what constitutes an error, and develop error codes and definitions.
If you are not a statistician, bring in a statistician early in the process to review the study definition, ask questions, and advise on how to make sure the data collected will be suitable for comparison.
Recruit participants who match the desired profile. Ensure balance among characteristics for each product/website tested to avoid introducing variation that could affect the results.
Before finalizing the protocol, create the file in which you will ultimately record the data (usually a spreadsheet). Define your column headings (variables) for the statistician. Work with the statistician to ensure the format will allow easy import into the statistics program for analysis. This process often introduces changes to the protocol.
Conduct multiple pilot tests to make sure the facilitator is comfortable using the session protocol and that the equipment is working properly.
Run all sessions with the same facilitator, if possible. Some schedules do not allow a single facilitator. In this case, conduct cross-facilitator review of the pilot tests to ensure consistency.
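The steps on error codes and the data recording file can be sketched as follows. This is a minimal illustration, assuming one row per participant per task in a CSV file; the error codes, column names, and helper functions are hypothetical and would be replaced by the definitions agreed with the team and the statistician.

```python
import csv
import io
from enum import Enum


class ErrorCode(Enum):
    """Hypothetical error codes; a real study defines its own."""
    WRONG_NAVIGATION = "E1"   # participant went to the wrong page or screen
    WRONG_CONTROL = "E2"      # participant used the wrong control
    DATA_ENTRY = "E3"         # participant entered incorrect data


# Column headings (variables) agreed with the statistician before the pilots.
COLUMNS = [
    "participant_id",
    "product",
    "task_id",
    "success",        # 1 = success, 0 = failure
    "clicks",
    "error_codes",    # semicolon-separated codes, empty if no errors
    "satisfaction",   # rating on the scale defined in the protocol
]


def make_row(participant_id, product, task_id, success, clicks, errors, satisfaction):
    """Build one spreadsheet row from a single task attempt."""
    return {
        "participant_id": participant_id,
        "product": product,
        "task_id": task_id,
        "success": 1 if success else 0,
        "clicks": clicks,
        "error_codes": ";".join(code.value for code in errors),
        "satisfaction": satisfaction,
    }


def write_sessions(rows, stream):
    """Write the header and all rows to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    rows = [
        make_row("P01", "A", "T1", True, 7, [], 6),
        make_row("P01", "A", "T2", False, 15,
                 [ErrorCode.WRONG_NAVIGATION, ErrorCode.DATA_ENTRY], 3),
    ]
    write_sessions(rows, buffer)
    print(buffer.getvalue())
```

Fixing the columns and codes in one place before the pilot tests makes it easier to keep the definitions identical across later rounds of the benchmark.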
With proper preparation, common problems can be avoided. Multiple pilot tests ensure that your stakeholders agree with the study design as implemented. Be prepared to make changes between pilot tests, and continue conducting pilot tests until everyone is satisfied. You cannot change anything about the study once the actual sessions begin.
The stakeholders may say they only want quantitative results, and you may design the study either as moderated with no think-aloud, or as unmoderated, to meet this requirement cost-effectively. However, it is best to make sure the stakeholders will ultimately be satisfied with only quantitative data. After the fact, they may ask you for the reasons behind the results, which will not be available with a quantitative-only study.
Data Analysis and Reporting
Unless you are trained in statistical analysis, you will be entering your data for a statistician to analyze. You have already made sure your spreadsheet is structured appropriately. Once you show the actual data to the statistician, you may still need to make some adjustments, such as data formats or additional coding. At the same time the statistician works on the analysis of success rates, error rates, and other measures, if your study includes qualitative data, you will write that supporting analysis. If you decide to work in parallel with the statistical analysis, you may be making assumptions about the significance of results that may be incorrect once you receive the statistician's report.
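To give a sense of what a reported confidence interval for a task success rate involves, the sketch below computes a Wilson score interval. The choice of the Wilson method and the 95% default are assumptions made for illustration; the statistician on a given study may use a different interval or test.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, sessions, confidence=0.95):
    """Wilson score confidence interval for a task success rate."""
    if sessions <= 0 or not 0 <= successes <= sessions:
        raise ValueError("need sessions > 0 and 0 <= successes <= sessions")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / sessions                      # observed success rate
    denom = 1 + z ** 2 / sessions
    centre = (p_hat + z ** 2 / (2 * sessions)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / sessions + z ** 2 / (4 * sessions ** 2))
    ) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical result: 27 of 30 participants completed the task.
    low, high = wilson_interval(27, 30)
    print(f"{low:.3f} to {high:.3f}")
```

The interval narrows as the number of sessions grows, which connects the analysis back to the session count chosen during planning.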
Costs and Scalability
The first study of a benchmark series is the most expensive, because it includes the whole study design and the pilot tests to validate the design. Once the design is set, future studies can begin with already prepared session materials. Usually some adjustments for the specific user interface need to be incorporated into the script for each version tested.
Sources and contributors: Nigel Bevan, Chauncey Wilson, Laurie Kantner