Different types of testing and research

A distinction can be made between testing that is carried out as part of the design process and testing on finished products.

Formative evaluations, i.e. before finalising the design, can inform design decisions by either detecting problems with some aspects of a single design (e.g. type is too small) or indicating which of two or more versions is easier to read.
This form of testing is described as diagnostic testing when pinpointing specific problems, and is ideally used as part of an iterative design process. Once a problem has been detected, it is resolved and the design is re-tested.
User testing or user research compares different versions and this may be carried out as a formative evaluation, to determine which version to develop further.
If user testing is carried out as a summative evaluation, i.e. testing the final product, the results may provide recommendations for the design of future similar products. However, this practical guidance will be limited if it is not possible to determine why one version was better than another.
Research studies make comparisons between different versions whilst controlling how they vary. From these results, it should be possible to say, for example, which typographic variable affects speed of reading. The research is therefore generalisable to other design situations and can be considered robust research, if carried out appropriately.

Question: Consider whether you have used a formative evaluation as part of your design process. For example, have you asked colleagues or friends for feedback about aspects of your design?

Challenges

Key criteria

The methods used for the first three types of testing above can be less formal than those used for research studies. In some circumstances, it may be unnecessary to meet all of the criteria listed below, or they may be less relevant. Nevertheless, it is helpful to know the main challenges in carrying out robust research that will be of value and relevance to both researchers and designers.

Although the three criteria are listed separately, they do interrelate. Finding a solution to one challenge may conflict with another so a judgement must be made as to what to prioritise.

The key criteria in designing a study are:

Sensitivity: finding a method to measure performance of some aspect of reading that is sensitive enough to pick up differences when typography is varied.
Reliability: ensuring that the results you get are repeatable. If you were to do the same study again, would you get the same outcome? One solution is to increase the amount of data collected. You can do this by using a sufficiently large number of participants in the study and, where practical, giving participants multiple examples of each condition of the experiment. These requirements present their own challenges: finding enough participants and fitting the experiment into a reasonable length of time.
Validity: determining that the study measures what it is intended to measure. Of most relevance to legibility research, and the designer’s perspective, is ecological validity, a form of ‘external validity’. This describes the extent to which a study approximates typical conditions and is also referred to as ‘face validity’. In our context, this can mean a natural reading situation and suitable reading material. Another form of validity is ‘internal validity’ which describes the relationship between the outcomes of the study and the object of study. This is explained further below.
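
The link between reliability and the amount of data can be made concrete with the standard error of a mean, which shrinks with the square root of the number of observations. The following is a minimal sketch, assuming independent participants; the spread of reading speeds (40 words per minute) is an invented figure used only for illustration.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of a mean based on n independent observations."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sd / math.sqrt(n)


# Hypothetical spread of reading speeds across participants (words per minute).
ASSUMED_SD_WPM = 40.0

for participants in (10, 40, 160):
    se = standard_error(ASSUMED_SD_WPM, participants)
    print(f"{participants} participants: standard error {se:.2f} wpm")
```

Under these assumptions, four times as many participants are needed to halve the uncertainty, which is one reason why finding enough participants is a practical challenge.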

Reading conditions

Ecological validity is not only a concern of design practitioners but also of psychologists doing applied research. However, reading situations in experiments are frequently artificial and do not resemble everyday reading practice. As mentioned in Chapter 2, research has often looked at individual letters or words, rather than reading of continuous text. The letter or word is often displayed for only a short time and the participants in the studies may be required to respond quickly. Context is also removed which means:

If testing individual letters, there are no cues from other letters which might help identification. Panel 4.1 provides an example of how the stylistic characteristics of a particular font, or style of handwriting, may help us identify letters.
If testing words, there is no sentence context.

Clearly these are not everyday reading conditions, but there are compelling reasons for carrying out a study in this way. These techniques can be necessary to detect quite small differences in how we read because skilled readers can recognise words very quickly (within a fraction of a second). Any differences in legibility need to be teased out by focusing on a part of the reading process and making that process sufficiently difficult to detect change. This is a way of making the measure sensitive (one of the three criteria described above), but at the expense of ecological validity. Although some research does use full sentences and paragraphs, these may not always reveal differences or may be testing different aspects of the reading process.

Designers, in particular, can also be critical of studies measuring speed of reading, claiming that how fast we read is not an important issue for them. Speed of reading and speed of responding to a single letter or word are also techniques used to detect small differences, and may be chosen because they are reasonably sensitive measures. It is not the speed itself which is important but what this reveals, e.g. ease of reading or recognition.

Material used in studies

Another criticism relating to artificial conditions in experiments is the poor choice of typographic material, e.g. the typeface or way in which the text is set (spacing, length of line etc.). The objection to such material is that designers would never create material in this form and therefore it is pointless to test; the results will not inform design practice. In some cases, there is no reason for the poor typography of material used in a study, except the researcher’s lack of design knowledge. The researcher may not be aware that it is not typical practice. In other cases, the researcher may need to control the design of the typographic material to ensure that the results are internally valid. If I am interested in the effect of line length I could:

Compare two line lengths and also vary the line spacing (see Figure 4.2). Experienced typographic designers increase the space between lines when lines are longer. But if I set the text in this way I cannot be sure if the line length or the spacing, or both, have influenced my results. The line spacing is a confounding variable.
Compare two line lengths and not vary the line spacing (see Figure 4.3). But designers will say that they would not create material which looks like this.

**Figure 4.2:** Comparison of line lengths of around 50 and 100 characters per line (cpl) with adjustments to line spacing. The shorter line length is 10 point type with 12 point line spacing; the longer line is 10 point type with 14 point line spacing.

In these two examples, there is a conflict between the internal validity, ensuring that the study is planned correctly, and ecological validity. See Panel 4.2 for further detail of experiment design.

Question: Are you convinced by the reasons I have given for using unnatural conditions and test material? If not, what are your concerns?

Panel 4.2: Explanation of interacting typographic variables in psychology experiments

Typographic and graphic designers learn to make decisions about type size, line length, and line spacing in relation to each other. These typographic variables are considered to be inter-related. In psychology experiments, this inter-relationship can be demonstrated by finding interactions between the variables. In the example of line lengths and line space (Figures 4.2 and 4.3), if the type size remains constant, we might expect to find that optimal legibility for a longer line length has larger line space and optimal legibility for a shorter line length has a smaller line space.

In Figure 4.4 I have plotted some data from Paterson and Tinker, reproduced in Tinker (1963, p95). The study used 10 point type and I have selected three line lengths (around 40, 54 and 90 characters per line) with line spacing starting from 10 point and increasing to 11, 12 and 14 point. At all three line lengths, 10 point line spacing slows down reading and the line length has very little effect. However, the results regarding optimum combinations of line length and line spacing are not as I predicted above: the optimum line spacing for the longer line length (90 cpl) is 12 point; this is also the optimum for the two shorter line lengths (40 and 54 cpl).

Nevertheless, this is an example of an interaction between line length and line spacing. The effect on reading speed of the amount of spacing depends on the line length. We can see this from the graph as the three lines representing the line lengths are different shapes, indicating a different pattern of data. The consequence of this difference is that if I had chosen not to adjust line spacing as line length varied (as in Figure 4.3), but instead tested all line lengths with 11pt line spacing, I would have concluded that:

a line length of 40 cpl is read fastest
90 cpl is quite a bit slower
but 90 cpl is read faster than 54 cpl

If I had chosen 12pt line spacing, I would have reached a different conclusion:

lines of 40 and 54 cpl are read at the same (fast) speed
lines of 90 cpl are read slower
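
The way the conclusion changes with the chosen line spacing can be sketched in a few lines of code. The reading speeds below are invented (arbitrary units, higher is faster); they are not Paterson and Tinker's data and were chosen only so that they reproduce the pattern described in this panel.

```python
# Invented reading speeds, NOT the values reported in Tinker (1963).
# Outer key: line spacing in points; inner key: line length in characters per line.
SPEED = {
    10: {40: 300, 54: 301, 90: 299},
    11: {40: 330, 54: 305, 90: 312},
    12: {40: 335, 54: 335, 90: 318},
    14: {40: 325, 54: 328, 90: 314},
}


def ranking(spacing_pt):
    """Line lengths ordered from fastest to slowest at one line spacing."""
    speeds = SPEED[spacing_pt]
    return sorted(speeds, key=speeds.get, reverse=True)


def best_spacing(line_length_cpl):
    """Line spacing that gives the fastest reading for one line length."""
    return max(SPEED, key=lambda spacing: SPEED[spacing][line_length_cpl])


def spacing_effect(line_length_cpl, from_pt, to_pt):
    """Change in speed when the spacing changes, for one line length."""
    return SPEED[to_pt][line_length_cpl] - SPEED[from_pt][line_length_cpl]


print("Ranking at 11pt:", ranking(11))
print("Ranking at 12pt:", ranking(12))
for cpl in (40, 54, 90):
    print(cpl, "cpl: best spacing", best_spacing(cpl),
          "| gain from 11pt to 12pt:", spacing_effect(cpl, 11, 12))
```

An interaction shows up as unequal values of `spacing_effect` across line lengths: the same change in spacing helps one line length much more than another, so a study that tests only one spacing can rank the line lengths differently from a study that tests another.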

Figure 4.4: Graph showing the relationship between two typographic variables (line spacing and line length) and how this affects legibility measured by reading speed. The graph is based on a subset of data reported in Tinker (1963).

This selective use of data is employed only to illustrate how to translate designers’ respect for the relationship between typographic variables into experiment design. It is unwise to regard these specific results as a guide to design practice. Chapter 5 and Chapter 6 review a wider range of research which is more representative of the findings and therefore a better guide.

The data in Figure 4.4 was taken from a huge series of studies in which the experimenters included all combinations of line lengths, line spacing and different type sizes. This scale of testing would not be carried out today as it would not be considered a feasible or efficient approach. Instead, the options would be limited to those shown in Figures 4.2 and 4.3:

adjusting the spacing to suit each line length
keeping the line spacing constant across all line lengths
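
The difference in scale between testing all combinations and the two restricted options can be illustrated by counting conditions. This is a sketch only: the line lengths and line spacings are those mentioned in this panel, but the extra type sizes and the "designer-adjusted" pairing of spacing to line length are hypothetical values chosen for illustration.

```python
from itertools import product

line_lengths_cpl = [40, 54, 90]
line_spacings_pt = [10, 11, 12, 14]
type_sizes_pt = [8, 10, 12]  # hypothetical; only 10 point is mentioned in the text

# Every combination of the three variables.
full_factorial = list(product(type_sizes_pt, line_lengths_cpl, line_spacings_pt))

# Option 1: spacing adjusted to suit each line length (hypothetical pairing).
adjusted_spacing = {40: 11, 54: 12, 90: 14}
option_adjusted = [(10, cpl, adjusted_spacing[cpl]) for cpl in line_lengths_cpl]

# Option 2: one line spacing held constant across all line lengths.
option_constant = [(10, cpl, 12) for cpl in line_lengths_cpl]

print("Full factorial conditions:", len(full_factorial))
print("Adjusted spacing conditions:", len(option_adjusted))
print("Constant spacing conditions:", len(option_constant))
```

Each added variable multiplies the number of conditions, and every condition needs enough participants and trials to give reliable results, which is why the full design quickly becomes impractical.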

Question: If you were asked to advise a researcher who was interested in finding the optimum line length for reading from screen, which of the two options above would you recommend? Why?

Comparing typefaces

An even greater problem arises when more than one type of variation is built into the test material. The classic example is the comparison of a serif and a sans serif typeface. If a difference in reading speed is found this could be due to the presence or absence of serifs but also could be due to other ways in which the two typefaces differ (e.g. contrast between thick and thin strokes). Researchers may be insensitive to the confounding variables (that also change along with the variable of interest) but their existence may invalidate the inferences that can be drawn. If we are less concerned about which stylistic feature of the typeface contributes to legibility and more interested in the overall effect, the results may be valid.

Numerous studies have compared the legibility of different typefaces despite potential difficulties in deciding how to make valid comparisons. As a typeface has various stylistic characteristics, which have been shown to affect legibility, comparisons need to consider:

How to equate for size. Although this may seem straightforward to many people, those with typographical knowledge are aware that typefaces appear to be different sizes depending on the height of the ascenders and capitals, the x-height, and the size of the counters (space within letters). Making sure that the typefaces are matched for their x-height, not point size, helps to make them appear similar in size (see Figure 4.5).
How to control for differences in weight and width, stroke contrast, and serifs.

**Figure 4.5:** The word 'hand' set in different typefaces. The pair on the left compares 24 point Georgia with 24 point Garamond; Georgia appears to be quite a lot larger. To make both appear a similar size, Garamond needs to be increased to about 29 point (pair on the right).
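
Matching by x-height is a simple proportion, sketched below. The x-height ratios (x-height as a fraction of the point size) are assumed values picked so that the result is close to the sizes quoted for Figure 4.5; for a real study they would need to be measured from the font files being compared.

```python
def matched_point_size(reference_pt, reference_x_ratio, target_x_ratio):
    """Point size at which the target typeface has the same x-height as the reference."""
    if target_x_ratio <= 0:
        raise ValueError("x-height ratio must be positive")
    return reference_pt * reference_x_ratio / target_x_ratio


# Assumed, illustrative ratios; not measured from the actual fonts.
GEORGIA_X_RATIO = 0.48
GARAMOND_X_RATIO = 0.40

garamond_pt = matched_point_size(24, GEORGIA_X_RATIO, GARAMOND_X_RATIO)
print(f"Garamond size to match 24 point Georgia: about {garamond_pt:.0f} point")
```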

Collaborations across disciplines have resulted in experimental modifications of typefaces by type designers (Box 4.1). This approach would appear to provide the ideal solution, but requires a significant contribution from type designers.

Box 4.1: Experimental modifications of typefaces

Morris, Aquilante, Yager, and Bigelow (2002) compared a serif and sans serif version of Lucida (Figure 4.6), designed by Bigelow and Holmes:

…the designers produced a seriffed and sans-serif pair whose underlying forms are identical in stem weights, character widths, character spacing and fitting, and modulation of thick to thin. The only difference is the presence or absence of serifs, and the slight increase of black area in the seriffed variant. (p245)

Figure 4.6: Lucida Bright and Lucida Sans.

Beier has designed various typefaces specifically for testing (Beier and Larson, 2010, 2013; Beier and Dyson, 2014; Dyson and Beier, 2016). Figure 4.7 shows the fonts used in Dyson and Beier (2016).

Figure 4.7: The fonts designed by Beier which control the variation by adding stylistic features to the first font (top): italic, weight, contrast, and width.

Illustrating test material

Graphic designers work with visual material and can be frustrated to find that many studies reported in journals do not illustrate the material used in the studies. Consequently, we are left to figure out what was presented to the participants. This may reflect the researchers giving priority to the results of the study (illustrating data in graphs). However, some printed journals have imposed constraints, due to economic considerations. Many journals now publish online and include interactive versions of articles, which allow for additional supporting material. This has resulted in the inclusion of more illustrations and greater transparency in reporting the methods, materials and procedures used in the study.

Familiarity

Chapter 1 introduced the view, held by some, that legibility results reflect our familiarity with the test material. According to this view, we will find it easier to read something which we have been accustomed to reading. This seems to make a lot of sense as we do improve with practice. However, this also creates a significant challenge for experimenters. How can we test a newly designed typeface against existing typefaces, or propose an unusual layout, without disadvantaging the novel material? More fundamentally, when legibility research confirms existing practices, based on traditional craft knowledge, can we be sure that these practices are optimal? Might they instead be the forms which we are most used to reading? This conundrum was raised by Dirk Wendt in writing about the criteria for judging legibility (Wendt, 1970, p43).

Some research by Beier and Larson (2013), described more fully in Chapter 7, examines familiarity directly, rather than as a confounding variable which causes problems. This research aims to address how we might improve on existing designs, and not be constrained by what we have read in the past.

Methods

The tools used to measure legibility have understandably changed over time, primarily from mechanical to computer-controlled devices. The older methods are summarised in Spencer (1968) and described in more detail in Tinker (1963, 1965) and Zachrisson (1965). Despite the changes in technology, many of the underlying principles have remained the same, but we now use different ways to capture the data. There are two broad categories of methods:

objective, measuring behaviour or physical responses
subjective, asking readers for opinions

As described in Chapter 1, when reading we first need to be able to experience the sensation of images (letters) on our retina. We also know that we read by identifying letters which we then combine into words (Chapter 2). With this knowledge, it makes sense to measure how easy it is to identify letters or words and we can vary the typographic form (e.g. different typefaces or sizes). One technique used is the threshold method, which aims to measure the first point at which we can detect and identify the letter or word. This might be the greatest distance away or the smallest contrast, or the smallest size of type.

More recently, a newer chart (logMAR) has been introduced into clinical practice, having been used initially as a research tool (Bailey and Lovie, 1976). The chart is designed to ensure that the letters are of almost equal legibility, each row has the same number of letters, and consistent letter and line spacing. These adjustments to the Snellen chart reflect the researchers’ knowledge of the influence of crowding. Other differences relate to the scaling of letter size.
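
The scaling of letter size can be illustrated with a short calculation. This sketch assumes the usual definition of logMAR as the base-10 logarithm of the minimum angle of resolution in minutes of arc, with 20/20 corresponding to 1 minute of arc, and the commonly described design in which rows differ by 0.1 log unit. The function names are made up for this example.

```python
import math


def snellen_to_logmar(test_distance, letter_distance):
    """Convert a Snellen fraction (e.g. 20/40) to a logMAR score."""
    return math.log10(letter_distance / test_distance)


def relative_letter_size(logmar):
    """Letter size relative to the 0.0 logMAR (20/20) row."""
    return 10 ** logmar


for fraction in ((20, 20), (20, 40), (20, 200)):
    print(f"{fraction[0]}/{fraction[1]} -> logMAR {snellen_to_logmar(*fraction):.2f}")

# With 0.1 log unit steps, each row is a constant multiple of the next.
print(f"Size ratio between adjacent rows: {relative_letter_size(0.1):.3f}")
```

A constant ratio between rows means that the difficulty changes by the same proportion at every step of the chart, under the assumptions stated above.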

The SLOAN font (see below) is used in both the Snellen and logMAR charts. Louise Sloan designed ten letters (CDHKNORSVZ), a set of optotypes (Sloan, 1959).

The SLOAN letters above come from the font file created by Denis Pelli based on Sloan’s specifications. Pelli includes the complete uppercase alphabet, not just the 10 letters. The height and width of letters are equal to the nominal point size (11 point in this example) and adjoining characters touch.

The font file can be downloaded from https://github.com/denispelli/Eye-Chart-Fonts

Eye tests are typically carried out in a similar way, obtaining a distance threshold measurement. When having our eyes tested, we may be asked to read from a Snellen chart where the letters decrease in size as we go down the chart (Figure 4.8). We stop at the point when we can no longer decipher the letter and we have reached our threshold. This is letter acuity as the test uses unrelated letters and unconstrained viewing time.

**Figure 4.8:** An example of the Snellen eye chart, named after the Dutch ophthalmologist who developed it in 1862. The smallest letters that can be read accurately indicate the visual acuity of that eye (each eye is tested separately). The bottom row (9) corresponds to 20/20 vision meaning the letters can be read at a distance of 20 feet (about 6 metres).

The eye test uses a similar principle to distance thresholds except the size of type is varied, and we remain seated in our chair at the same distance from the chart. The visual angle is changed in both cases as the visual angle depends on size and distance (see Figure 3.2). In the eye test procedure the visual angle decreases until we can no longer read the letters; distance threshold measures work in the opposite direction with increases in visual angle until we are able to identify the image.
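
The dependence of visual angle on both size and distance can be checked with a small calculation. This is a sketch using the standard geometric formula for the angle subtended by an object viewed straight on; the letter height of 8.73 mm is an assumed value, chosen because a 20/20 letter is conventionally taken to subtend about 5 minutes of arc at 6 metres.

```python
import math


def visual_angle_arcmin(size, distance):
    """Visual angle in minutes of arc; size and distance must use the same unit."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return math.degrees(2 * math.atan(size / (2 * distance))) * 60


# Assumed letter height (mm) viewed from 6 metres (6000 mm).
print(f"{visual_angle_arcmin(8.73, 6000):.2f} arcmin")

# Halving the letter size at a fixed distance (eye test) ...
smaller_letter = visual_angle_arcmin(4.365, 6000)
# ... gives the same visual angle as doubling the distance (distance threshold).
further_away = visual_angle_arcmin(8.73, 12000)
print(f"{smaller_letter:.2f} arcmin vs {further_away:.2f} arcmin")
```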

Question: Explain why the distance threshold measure needs to start with an image that is too far away to identify and is then moved closer. If you are not sure, read on to find the answer.

The accounts of older methods to test legibility include descriptions of tools which measured thresholds and more general approaches to using thresholds:

The visibility meter used filters to vary the contrast between the image and the background. The aim was to identify the smallest contrast that still preserves legibility. This has been used to measure the relative legibility of different typefaces using letters or words.
The focal variator used a similar principle to the visibility meter with a blurred image projected onto a ground glass screen and a measurement was made of the distance at which the image becomes recognisable. This device was limited to using letters.
A more general method of measuring distance thresholds, which is still in use, is simply to find out how far away something can still be recognised by starting at a great distance and gradually moving the material closer to the participant. The answer to the question above is that it is necessary to do the test in this direction as we cannot accurately report when we can no longer see something because we have already identified it. The method is appropriate for testing signs or other material that would normally be read at a distance but is also applied in other contexts. (See Chapter 5 and Chapter 6)
A similar principle is applied when measuring how far out into the periphery an object (e.g. letter) can be placed and still be recognised. Participants are asked to fixate on a specific point, so that they do not move their eyes to focus on the object. Our visual acuity for letters in peripheral vision decreases with eccentricity (i.e. distance from the fovea).

Panel 4.3 describes a sophisticated means of using the threshold to take account of differences among readers.

Panel 4.3: Setting a level of difficulty for each person

The threshold approach can also be applied more flexibly to control how easy it is for a participant in a study to identify letters or words, which improves the sensitivity of the measure. The technique adjusts the presentation for each person by varying either the viewing distance or the exposure time. Rather than just measuring the threshold, the measurement is used to set the level of difficulty at a certain level above threshold, so that participants do not get 100% correct or close to 0%. For example, if the task of identifying letters is too easy, any effects of typographic form will not be apparent: even letters that are slightly harder to identify will still be identified. Similarly, if the task is too difficult, we either cannot provide answers or guess and get most answers wrong. If we can set the difficulty so that some letters can be identified and some cannot, this should help in revealing differences.

People vary, not only in terms of the more obvious characteristics such as eyesight (visual acuity) and reading ability, but also attention, motivation, fatigue, confidence, and anxiety when taking part in an experiment. Consequently it is useful to be able to set a level for each person. This technique may be particularly valuable in relation to inclusive design as it enables testing of participants with a larger range of abilities than some other techniques because the level of difficulty can be adjusted for each participant. The disadvantage of this approach is that additional time needs to be spent before the main experiment can start.
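
One way of setting a level for each person, common in psychophysics, is an adaptive staircase. The panel does not specify a particular procedure, so the sketch below is an assumption: it uses a "1-up/2-down" rule, which is usually described as settling at roughly 71% correct, and a simulated participant with an invented threshold of 80 ms.

```python
from typing import Callable, List


def staircase(respond: Callable[[float], bool],
              start_ms: float = 200.0,
              step_ms: float = 10.0,
              trials: int = 40,
              floor_ms: float = 10.0) -> List[float]:
    """Return the exposure time used on each trial.

    Two consecutive correct answers shorten the exposure (harder);
    any error lengthens it (easier).
    """
    exposure = start_ms
    correct_run = 0
    history = []
    for _ in range(trials):
        history.append(exposure)
        if respond(exposure):
            correct_run += 1
            if correct_run == 2:
                exposure = max(floor_ms, exposure - step_ms)
                correct_run = 0
        else:
            exposure += step_ms
            correct_run = 0
    return history


def simulated_participant(exposure_ms: float) -> bool:
    """Hypothetical participant: correct whenever exposure is at least 80 ms."""
    return exposure_ms >= 80.0


history = staircase(simulated_participant)
print("Exposure times on the last ten trials:", history[-10:])
```

A real participant responds less consistently than this simulated one, but the principle is the same: the exposure time drifts towards the region where some items are identified and some are not, and the main experiment can then be run at that level.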

The short exposure method can be used to measure the threshold (how long is needed to identify a letter or word) or to set a suitable level of difficulty for participants. Before computers were routinely used in experiments, a tachistoscope controlled fixation time by presenting and then removing the image. This is now typically computer-controlled and an example of one form of short exposure presentation is Rapid Serial Visual Presentation (RSVP). Single words are displayed sequentially on screen in the same position which means we don’t need to make eye movements (saccades).

RSVP has been used in reading research since 1970, but has recently emerged as a practical technique for reading from small screens as the sequential presentation takes up less space. RSVP has also been developed into apps promoted as a technique for increasing reading speed. The value of RSVP as a research method for testing legibility is that the experimenter can adjust the rate of presenting a series of words, which can form sentences. However, as with some of the other techniques above, it is only possible to investigate typographic variables at the letter and word level (e.g. typeface, type variant, type size, letter spacing).
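
The rate of presentation translates directly into a display time per word. The sketch below assumes that every word is shown for the same duration (some implementations vary the duration with word length) and produces a timing schedule only; it does not display anything. The function name is made up for this example.

```python
def rsvp_schedule(text, words_per_minute):
    """Return (word, onset in ms) pairs for showing one word at a time."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    duration_ms = 60_000 / words_per_minute
    return [(word, index * duration_ms) for index, word in enumerate(text.split())]


for word, onset in rsvp_schedule("the quick brown fox", words_per_minute=300):
    print(f"{onset:6.0f} ms  {word}")
```

At 300 words per minute each word is on screen for 200 ms; an experimenter can raise or lower the rate to make the task harder or easier.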

The above methods related to threshold measures typically ask the participant to identify what they see (e.g. a letter or word). These responses either comprise the results (e.g. number of correct responses) or the distance/exposure time/eccentricity is recorded which corresponds to a certain level of correct answers.

Speed and accuracy measures

As mentioned in Chapter 3 and earlier in this chapter, speed of reading is a common way of measuring ease of reading, even though the primary concern of designers may not be to facilitate faster reading. If the letters are difficult to identify, we make more eye fixations (pauses) and pause for longer, which slows down reading; more effort is also likely to be expended.

Measures of speed are often combined with some measure of accuracy. This might be accuracy of:

identifying isolated letters or words
reading words in sentences and continuous text
proofreading
remembering (often referred to as recall)
understanding (comprehension)

Accuracy can therefore go beyond getting the letters or words correct to measures of recall or comprehension. If letter or word recognition is tested, accuracy may be measured together with exposure time. As we can substitute speed for accuracy when we read, some researchers combine these two measures. If I decide to read very quickly, I am likely to remember and understand less of the text because I am trading off speed and accuracy. If continuous text is read, a test of comprehension is important to check that a certain level of understanding is obtained.

Question: Do you think recall or understanding is more important than speed of reading? Are there any circumstances when speed might be more important?

Measuring legibility by the speed of reading continuous text can be similar to the more usual reading situation. Both silent reading and reading aloud have been used by researchers, though silent reading tends to be more common. If reading aloud, the number of words correctly identified can be measured. Comprehension measures for silent reading include:

summaries of what has been read
identifying an error in a sentence, which affects the meaning
cloze procedure where words are omitted at regular intervals within a text and a suitable word must be inserted into the gap
open-ended or short answer questions
multiple choice questions
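
Preparing cloze material is mechanical, as the following sketch shows. It makes simplifying assumptions: words are separated by spaces, any punctuation attached to a deleted word is removed with it, and the interval of five words is an arbitrary choice for the example.

```python
def make_cloze(text, every=5, gap="_____"):
    """Replace every `every`-th word with a gap.

    Returns the gapped text and the list of deleted words (the answer key).
    """
    if every < 1:
        raise ValueError("every must be at least 1")
    words = text.split()
    answers = []
    for index in range(every - 1, len(words), every):
        answers.append(words[index])
        words[index] = gap
    return " ".join(words), answers


gapped, answers = make_cloze("one two three four five six seven eight nine ten")
print(gapped)
print(answers)
```

Scoring is less mechanical, because a decision is needed on whether to accept only the exact word deleted or also reasonable alternatives.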

As a researcher, I have made decisions as to which comprehension measure to use. In doing so, I have weighed up the difficulty of preparing the test material against the difficulty of scoring the results. Table 4.1 summarises my assessment of each of the measures in terms of these two considerations. Panel 4.4 explains the reasons for my assessment and some pointers to good practice when carrying out a study.

Table 4.1: What to consider when choosing a method for testing comprehension

| | Easy to prepare | Reasonably easy to prepare | Quite difficult to prepare | Difficult to prepare |
|---|---|---|---|---|
| Easy to score | | Identifying errors | | Multiple-choice |
| Reasonably easy to score | Cloze procedure | Open-ended questions | Short-answer questions | |
| Difficult to score | Summaries | | | |

Panel 4.4: Considerations when planning comprehension tests

Summaries require no preparation of questions but the accuracy and completeness of the responses are the most difficult to assess. Decisions need to be made as to whether responses are 100% correct, or partially correct. This difficulty reduces the reliability of the scores.
This is true to a lesser extent with open-ended questions, as the responses will be more focused and constrained and therefore a little easier to score.
The cloze procedure is similar to summaries in terms of preparation as it is straightforward to delete words but the responses require judgements as to what are acceptable synonyms as the precise word will not always be inserted.
Short-answer questions can be more targeted, removing some ambiguity from the assessment.
Multiple-choice questions are straightforward to assess.
The general trend is that the easier the responses are to score, the more difficult the preparation. The exception is identifying an error in a sentence, which has the advantage of being relatively easy both to prepare and to score.

Why are specific questions difficult to create? As with all measures, these questions need to be sufficiently sensitive to detect different levels of comprehension. If the texts are factual, you also need to consider whether participants might know the answers before reading the text. This may require a test of prior knowledge, such as a pre-test (before the main study). The score then becomes the difference between the pre- and post-test, the latter taking place after reading the text. The most difficult questions to generate are multiple-choice as the incorrect alternative answers need to be plausible to make the questions sufficiently difficult.

It is good practice to pilot questions that will be used in a study to detect any problems, such as questions that are too easy or too difficult, ambiguities, or misleading or confusing elements. A pilot is a small-scale study, with maybe only 2 or 3 people, and need not include all aspects of the experiment.

When comparing results across different texts, with different content, the questions on each text need to be at a similar level of difficulty and the answers located in similar regions of the texts. Likewise, when identifying errors, the particular words changed, their position, and how they are changed require careful attention. Various standardised tests have been developed which address these issues:

Nelson-Denny test (1981), originally developed in 1929, is a multiple-choice test.
Chapman-Cook Speed of Reading test (1923) has 30 items of 30 words each. In each item there is one word that spoils the meaning and the reader is asked to cross out this word. There is a time limit of 1.75 minutes.

Question: Which is the word that spoils the meaning in the item below?

If father had known I was going swimming he would have forbidden it. He found out after I returned and made me promise never to skate again without telling him.

Tinker Speed of Reading test (1947) is similar to Chapman-Cook but with 450 items of 30 words each. The time limit is 30 minutes.

Question: Which is the word that spoils the meaning in the item below?

We wanted very much to get some good pictures of the baby, so in order to take some snapshots at the picnic grounds, we packed the stove into the car.

Some authors refer to speed of reading as ‘rate of work’. This more generic term can cover other types of reading such as scanning text for particular words (as you might in a dictionary or if you are looking for a particular paragraph in a printed text), skim reading or filling in a form.

Physiological measures

In the methods described above the measure is the participant’s response, or how fast they respond, or some aspect related to the material (e.g. exposure time, distance from material). Another approach is to take physical measurements of the participants which have included pulse rate, reflex (involuntary) blink rate, and eye movements. These have been described as unconscious processes (Pyke, 1926, p30) which are automatic, whereas we are conscious of threshold, speed, and accuracy measures. An increased pulse rate is supposed to indicate that the participant is working harder. Similarly, an increase in blink rate is assumed to mean that legibility is reduced. However, in both cases, other (confounding) factors may be influencing the measure.

Eye movement measurements, also described as eye tracking, have survived as a technique and now use far more sophisticated technology than the original work around the beginning of the twentieth century (see Chapter 3: Historical perspective). The most widely used current technique records movements by shining a beam of invisible light onto the eye which is reflected back to a sensing device. From this, it is possible to calculate where the person is looking. Typical measurements include:

frequency or number of fixations (pauses)
duration of fixations
number of regressions

The advantage of looking at these individual measures, rather than overall reading speed, is that there may be a trade-off between the number of fixations and their duration. We may make lots of fixations, but for a very short time; conversely, we may make fewer, longer fixations. Both may result in the same overall reading time. Regressions indicate a difficulty in identifying letters or words, requiring back-tracking to re-fixate on the relevant part of the text. Another advantage of this technique is that we can measure reading of continuous text in a reasonably natural situation. It is not entirely natural as participants commonly need to wear devices strapped to their head. Eye tracking is also used to explore specific regions of interest (ROI) in advertisements or web pages to see what attracts attention.
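The trade-off between the number and duration of fixations can be illustrated with a minimal sketch. The fixation durations below are invented for illustration and are not taken from any study, the function name is hypothetical, and the total deliberately ignores the time spent moving between fixations.

```python
def summarise_fixations(durations_ms):
    """Return (number of fixations, mean duration in ms, total fixation time in ms)."""
    count = len(durations_ms)
    total = sum(durations_ms)
    return count, total / count, total

# Hypothetical reader A: many short fixations.
reader_a = [150] * 12
# Hypothetical reader B: fewer, longer fixations.
reader_b = [300] * 6

print(summarise_fixations(reader_a))  # (12, 150.0, 1800)
print(summarise_fixations(reader_b))  # (6, 300.0, 1800)
```

Both invented readers spend the same total time fixating, so a measure of overall reading time alone would not distinguish them, whereas the individual measures do.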

Although introduced to measure readers’ emotions, changes in facial expression may also indicate the degree of effort exerted and therefore ease of reading (Larson, Hazlett, Chaparro and Picard, 2006). Facial electromyography (EMG) measures tiny changes in the electrical activity of muscles. The activity of the muscle which controls eye smiling, for example, is thought to be more of an unconscious process and may therefore reflect emotion or effort which might not be reported (see Subjective judgements below).

As mentioned above when describing how we read different typefaces (Chapter 2), electroencephalography (EEG) technology has recently been applied in research looking at letter recognition. Although the objectives of this research were not to investigate legibility issues, differences in the level of neural activity were found for low and high legibility typefaces. This method may therefore have potential as a means of measuring brain activity to infer how typographic variables influence legibility.

Subjective judgements

This procedure asks people what they think of different examples of material in relation to a particular criterion. Visual fatigue has been measured in this way, by asking people to rate their fatigue on a scale from no discomfort to extreme discomfort. Mental or perceived workload has also been assessed using the NASA Task Load Index (NASA-TLX). As these estimates can be influenced by other factors, a more reliable measure is to test visual fatigue objectively (as a physiological measurement). This has been done using equipment which can simultaneously measure pupillary change, focal accommodation, and eye movements.

A common way of employing subjective judgements in a study is to ask participants which material they think is easiest to read, or which they prefer. These judgements are quite often combined with other methods, such as speed and accuracy of reading. The procedure can vary from asking the participant to rank or rate a number of alternatives to asking them to make comparisons of pairs (see Panel 4.5).

Panel 4.5: Different ways of collecting subjective judgements

Ranking

Ranking asks a participant to put a number of examples of material (e.g. 8) in an order where 1 may indicate the easiest to read and 8 the most difficult to read. This method is suitable if there aren’t too many examples to rank. It becomes rather difficult to make comparisons of this nature if there are about 10 or more examples.

Rating

Rating can be easier than ranking with many examples as the participant gives a rating for each individual sample, rather than comparing all the samples together. Participants may make some comparisons when rating, but these are not a requirement. The rating scale can be various lengths, e.g. from 1 to 5, or 1 to 7, where 1 might indicate ‘very easy to read’ and 5 (or 7) might indicate ‘very hard to read’. This technique differs from ranking, even though the judgement appears very similar, because there is no need to place the examples in an order.

We should realise that participants will vary as to how they use a rating scale. Some people may use the whole scale, e.g. from 1 to 7; others may not use the extremes so that the example they think is the easiest to read may be given a 2 or 3, because it is not thought to be ‘very easy to read’. For this reason, researchers sometimes encourage participants to use the full scale.

If the scale has a range which is an odd number (i.e. 5 or 7) this allows for a middle neutral rating which is ‘neither easy nor difficult to read’ or ‘OK’. Some researchers prefer to use a rating scale with an even number to avoid a neutral rating, perhaps because it seems like responding ‘Don’t know’. A middle rating isn’t quite the same as ‘Don’t know’. As long as distinctions are being made between the examples (i.e. given different ratings), the rating scale is serving its purpose. The results are collated for all participants to see whether they agree.
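As a rough sketch of how ratings might be collated, the example below averages the ratings each sample receives and checks whether participants order the samples in the same way. The ratings, the participant and sample labels, and the function names are all invented for illustration. Averaging is only one simple way of collating results; it is an assumption here, not a method prescribed by the text.

```python
# Hypothetical ratings on a 1 to 7 scale
# (1 = very easy to read, 7 = very hard to read).
ratings = {
    "participant_1": {"A": 1, "B": 4, "C": 7},  # uses the whole scale
    "participant_2": {"A": 3, "B": 4, "C": 5},  # avoids the extremes
}

def mean_ratings(all_ratings):
    """Average the ratings given to each sample across participants."""
    samples = sorted(next(iter(all_ratings.values())))
    n = len(all_ratings)
    return {s: sum(p[s] for p in all_ratings.values()) / n for s in samples}

def order_of_samples(one_participant):
    """Samples ordered from easiest to hardest according to one participant."""
    return sorted(one_participant, key=one_participant.get)

print(mean_ratings(ratings))
for name, given in ratings.items():
    print(name, order_of_samples(given))
```

In this invented data the two participants use the scale differently but still place the samples in the same order, which is the kind of agreement the collated results are meant to reveal.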

A semantic differential scale is a specific type of scale where adjectives can be used to rate the appropriateness of typefaces for certain purposes (see Figure 4.9). The two ends of the scale (of 5 or 7 points) are labelled with opposite meanings, for example 1 indicating strong and 7 weak; 1 indicating cheap and 7 expensive. A set of scales using quite a lot of different paired adjectives is given to participants and a statistical technique (factor analysis) determines a smaller number of concepts which underpin all the other adjectives’ ratings. These describe the nature of the typefaces.

Figure 4.9: Semantic differential scales for two dimensions. The participant is asked to select the circle which best represents their judgement.

Paired comparisons

Another way of making the task of comparing a large number of samples easier for participants is to compare pairs, rather than comparing the whole set at once (ranking). Each sample is compared with every other one, which makes quite a lot of comparisons. However, it is easier to be more confident in saying A is easier to read than B, B is easier to read than C, etc., than putting a large set in a ranked order. This method also detects any uncertainty or inconsistency. For example, if a participant responds:

A is easier to read than B
B is easier to read than C
C is easier to read than A

they are being inconsistent and this might mean that they don’t have any strong views about the differences. It may be tempting as an experimenter to include the option of ‘Don’t know’ when using paired comparisons. I advise against this as inconsistencies will reveal this uncertainty without giving participants the ability to opt out with ‘Don’t know’. As a participant, it may be rather tempting to use ‘Don’t know’ a bit too often. With paired comparisons, as opposed to a rating scale, it is unhelpful to have ‘Don’t know’ responses as they are missing data.
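The number of comparisons, and the kind of inconsistency described above, can be sketched in a few lines. This is an illustrative sketch only: the function names and the way judgements are stored (as pairs meaning "the first is easier to read than the second") are assumptions, and the judgements are invented.

```python
from itertools import combinations

def all_pairs(samples):
    """Every sample paired once with every other sample."""
    return list(combinations(samples, 2))

def inconsistent_triples(samples, easier):
    """Find sets of three samples whose judgements form a circle.

    `easier` is a set of (x, y) tuples, each meaning
    'x was judged easier to read than y'.
    """
    found = []
    for a, b, c in combinations(samples, 3):
        # The judgements may form a circle in either direction.
        one_way = (a, b) in easier and (b, c) in easier and (c, a) in easier
        other_way = (b, a) in easier and (c, b) in easier and (a, c) in easier
        if one_way or other_way:
            found.append((a, b, c))
    return found

samples = ["A", "B", "C"]
# The inconsistent responses from the example above.
judgements = {("A", "B"), ("B", "C"), ("C", "A")}

print(len(all_pairs(range(8))))                   # 28 pairs for 8 samples
print(inconsistent_triples(samples, judgements))  # [('A', 'B', 'C')]
```

With 8 samples there are already 28 pairs to judge, which shows why the method involves quite a lot of comparisons. Counting the inconsistent triples gives an indication of a participant's uncertainty without offering a ‘Don’t know’ option.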

Summary

Having a range of methods to test legibility can be viewed as positive, as they may have different applications, or may be combined within the same study. However, concerns have been raised as to whether studies of single letters or words can tell us anything about everyday reading. It may be tempting to dismiss results from threshold measures of individual characters but we should remember that reading starts with identifying individual characters. If individual characters cannot easily be identified, there is likely to be a problem in reading. Also, it is frequently easier to find differences when using threshold measurements, than when using measures which are closer to the everyday reading process. It is rather pointless to argue for using a method which will probably not be sufficiently sensitive to detect differences in legibility, assuming they exist. Also, it is not feasible to study the complete natural reading experience which will be influenced by numerous variables.

We do, however, need to be aware of the limitations of methods which do not involve reading continuous text. By showing letters or words individually, the reading environment is changed and the effects of many typographic variables cannot be assessed. We are unable to test the effects of changes to word spacing, line length, line spacing, number of columns, alignment, margins, and headings. If we wish to investigate these aspects of typography, we will probably need to more closely approximate natural reading conditions.

The objectives of the study will also guide the choice of method. We should make a clear distinction between testing alternatives as part of the design process and research studies which are intended to inform researchers and designers. In evaluating the value, appropriateness, validity and reliability of any study, the context will determine how and what we measure.
