UX Research

Using Inter-Rater Reliability Rates in UX Research

We’ve all conducted excellent research only to have it tossed aside when other people in our organization choose to do whatever they were already planning to do. Executives, product managers, project managers, developers, and even UX researchers (in other words, all the humans) dig into a position and become unwilling to accept a point of view that contradicts it.

One common objection we hear when presenting research is that a researcher introduced bias in the way a question was asked, that the researcher’s tone influenced the response, or even that different researchers used different techniques. These same objections were once used to avoid making policy decisions in the face of academic research, and so researchers looked for ways to make their work harder to refute. One way was to increase the reliability of that research.

Any quantitative research is better with interpretation from a qualitative researcher, and the same is true in the other direction. This goes beyond surveys or analyses of behavioral patterns to the people actually doing that work: it covers both the methods and the rigor with which those methods are applied.

We want our research to be reliable, and research is considered reliable when it is consistent. One form of consistency is consistency across the people leading the research. Tests that vary from researcher to researcher are inconsistent and can’t be trusted as a basis for inference. The term “inter-rater reliability” refers to the degree of consistency in the data collected across the people doing the collecting. Assessing it means evaluating the mechanisms used to gather information, as well as the procedures applied, in order to confirm their stability and reliability.

Inter-rater reliability is a form of reliability assessed by comparing scores from two or more independent judges across a number of subjects, which determines how consistent their estimations or scoring are. In the context of User Experience (UX) Research, this means looking at findings from multiple people conducting the same research and testing whether bias shows up in the differences between researchers.

Some forms of research require that inter-rater reliability be reported alongside the findings, often with a confidence interval. Many a software developer will have been exposed to this type of research academically, if not professionally. To show that agreement between raters is not simply a matter of chance, statistical measures can be applied, either to each research effort or to a random sample of projects across a larger department, where budget allows. This instills confidence by reducing the level of subjectivity in research efforts.
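For two researchers scoring the same items, Cohen’s kappa is one common chance-corrected agreement statistic; the article doesn’t name a specific measure, so take this as a minimal sketch in Python with made-up session labels:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, based on each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical data: two researchers tag the same ten usability sessions.
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # ~0.47 on this made-up data
```

A kappa near 1 means the raters agree far more often than chance would predict; a value near 0 means they agree about as often as random labeling would.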

An acceptable inter-rater reliability rate for a given research initiative then demonstrates objectivity, or at least a measure of consensus across researchers. One way to derive this reliability is to have a number of researchers score the same piece of research. For example, let’s say we’re scoring responses on a scale of 1 to 10. Each researcher rates the same set of responses, and we correlate the scores between them. Provided each researcher rates a statistically meaningful set of responses, the ratings can be compared to determine the level of inter-rater reliability. In its simplest form, the rate is the percentage of the time the researchers agree on a finding: if 4 out of 5 researchers agree, that is an 80% rating.
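As a minimal sketch of that simple agreement rate, assuming each researcher has scored the same findings (the scores below are invented for illustration):

```python
from collections import Counter

def agreement_rate(ratings):
    """Share of researchers who agree with the most common rating for one finding."""
    counts = Counter(ratings)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(ratings)

# Hypothetical data: five researchers score the same finding on a 1-10 scale.
finding_scores = [7, 7, 7, 7, 4]          # 4 of 5 researchers agree on a 7
print(agreement_rate(finding_scores))      # 0.8, i.e. an 80% rating

# Averaging across several findings gives an overall rate for the study.
study = [[7, 7, 7, 7, 4], [3, 3, 3, 3, 3], [9, 8, 9, 9, 9]]
print(round(sum(agreement_rate(f) for f in study) / len(study), 2))  # 0.87
```

Percent agreement is easy to explain to stakeholders, but it doesn’t correct for chance, which is where a statistic like the kappa sketch above comes in.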

Calculating the inter-rater reliability rate isn’t that hard for a given initiative, and providing that rate should help dispel objections executives or developers might have about the validity of our research. But we don’t need to calculate the rate for every initiative. In fact, we can’t if the sample size or the number of research subjects isn’t large enough, or if we have to turn some research around by tomorrow (let’s face it – we’ve all been there). And none of us have unlimited budgets; we are constantly stretched to do more with less. As more teams in the organization use our research and our value helps the organization home in on what customers want, our backlog of work increases.

Running additional projects to study our own activities, like inter-rater reliability checks, may seem like extra busywork. But consider the couple of people in an organization who continue to do whatever they want without taking your research into account: you have two options for dealing with them. You can complain about them until one of you leaves the organization, or you can take their objections head-on and supplement your efforts to disprove them. The objection that research is biased can then be dispelled by running inter-rater reliability studies routinely.

Every team is different, with a different composition, different stakeholders, and varied expectations. A few other options for dealing with this objection from those still not willing to accept research include:

  • Add an inter-rater reliability component to a random sampling of 1 out of 10 studies and average the findings across all research initiatives (see the sketch after this list).
  • Run an inter-rater reliability study quarterly and include the findings on the front page of the UX Research internal site, linking to the ways you plan to keep improving the score.
  • Hold annual clinics that provide training on bias in research, possibly bringing in an academic researcher from a local university.
  • Bring in an external researcher from an academic institution to run an inter-rater reliability study, which reduces the load on a likely overburdened UX Research team while providing an external level of validation.
  • Have a canned response or a page on your internal website that addresses bias in research. This might lay out future plans for studies specifically geared towards resolving issues with bias “when we have the budget to do so.”
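
To make the first option concrete, here is a minimal sketch of sampling roughly 1 in 10 studies and averaging their agreement rates; the study IDs and the measure_agreement callback are hypothetical placeholders for whatever re-scoring process your team uses:

```python
import random

def sampled_reliability(study_ids, run_reliability_check, sample_rate=0.1, seed=None):
    """Randomly select about `sample_rate` of the studies, run the reliability
    check on each, and return the average agreement plus the studies checked."""
    rng = random.Random(seed)
    sampled = [s for s in study_ids if rng.random() < sample_rate]
    scores = [run_reliability_check(s) for s in sampled]
    average = sum(scores) / len(scores) if scores else None
    return average, sampled

# Hypothetical usage: measure_agreement stands in for a process that re-scores
# a study with a second researcher and returns an agreement rate from 0 to 1.
def measure_agreement(study_id):
    return 0.8  # placeholder value

average, checked = sampled_reliability(
    [f"study-{i}" for i in range(40)], measure_agreement, seed=7)
print(average, checked)
```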

All of the above options are easier if your research lives in a repository that’s available to everyone in your organization. Using a tool like https://www.handrailux.com provides transparency across the organization. Having all of the research in a single, indexed location gives other researchers access to the information, lets you pull up historical research quickly, and allows you to overlay studies by analyzing the information gathered in support of these types of initiatives.

Ultimately, our goal is to provide a better experience and to build the right features, the way customers want them, the first time. An inter-rater reliability score isolates bias in a scientific manner and lets you report a rate to your organization that builds confidence that your ResearchOps practice is maturing in many of the same ways your DevOps and Product Management practices are. This type of data resonates with developers who are judged on burn-down charts, executives who are immersed in spreadsheets, and product managers who are being asked for quantitative measures to prove the return on investment of their epics.

Adding research techniques from other disciplines isn’t just about objection handling for other teams, though. These techniques help us validate our findings and give us more confidence in the integrity of our own research, much as similar rigor helped disciplines like psychology and sociology gain wider acceptance as they matured into their respective branches of science. Bringing academic inter-rater reliability rates into your UX Research practice gives you yet another arrow in your quiver against those who choose not to accept research. There are many others, and the more you can implement, the harder the research becomes to refute. We’ve provided a few options in this article, but we’d love to hear what other organizations have done, so please feel free to leave us a comment with your ideas!