Let’s say you have a collection of photographs: each was taken by a user, who also added a tag and a location. Let’s say you’d like to understand some aspect of these photographs, such as which tags are associated with which locations. However, there’s a problem: a small number of users took a large number of photographs. To what extent will your understanding of the relationships between tags and locations be biased by how these users tagged these photographs?

The problem, which Jakob Nielsen calls participation inequality, is discussed in this article by Ross Purves, Alistair Edwardes, and Jo Wood. In the article, the authors use term profiles to judge the effects of participation inequality; that is, to detect the bias that results from a small number of users taking a large number of photographs. However, I couldn’t detect the bias in the authors’ term profiles, so I decided to simulate two collections of 100 photographs, one unbiased and one biased, and to construct two term profiles.

What is a term profile?

A term profile relates some photographs in a collection—the photographs that were tagged with a term that interests us—to all photographs in a collection, according to how many photographs each user has taken; that is, according to prolificness. A term profile has three components: a bar chart, a line chart, and a coefficient of variation.

The bar chart

To construct the bar chart, we order the photographs in the collection by prolificness. We then group the photographs and compute the percentage of photographs in each group that were tagged with the term that interests us.

For example, let’s say a collection contains 100 photographs: 10 were taken by user A and 90 were taken by user B. We order the photographs by prolificness, so user B’s photographs are first and user A’s photographs are last. We then group the photographs into 10s, so the first nine groups contain user B’s photographs and the last group contains user A’s photographs. Let’s say the term that interests us is road, so we compute the percentage of photographs in each group that were tagged with road.
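To make these steps concrete, here’s a minimal sketch in Python. The Photo type, the function name, and the group_size parameter are illustrative rather than taken from the notebook linked below.

```python
from typing import NamedTuple


class Photo(NamedTuple):
    user: str
    tags: set


def bar_chart_percentages(photos, term, group_size=10):
    """Percentage of photographs in each group that were tagged with `term`,
    after ordering the photographs by the prolificness of their users."""
    # Prolificness: how many photographs each user has taken.
    counts = {}
    for photo in photos:
        counts[photo.user] = counts.get(photo.user, 0) + 1

    # Primary order: the most prolific user's photographs first.
    ordered = sorted(photos, key=lambda photo: counts[photo.user], reverse=True)

    # Group the ordered photographs and compute the percentage tagged with `term`.
    percentages = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        tagged = sum(1 for photo in group if term in photo.tags)
        percentages.append(100 * tagged / len(group))
    return percentages
```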

If a random sample of 50% of each user’s photographs was tagged with road, then our bar chart might look like this:

Notice the pattern: the first nine groups, which contain user B’s photographs, are similar to the last group, which contains user A’s photographs. The pattern suggests there isn’t bias in the use of road, which is correct.

If a random sample of 10% of user A’s photographs and 90% of user B’s photographs was tagged with road, then our bar chart might look like this:

Notice the pattern: the first nine groups, which contain user B’s photographs, are different to the last group, which contains user A’s photographs. The pattern suggests there is bias in the use of road, which is correct.
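Here’s how the two collections could be simulated, reusing Photo and bar_chart_percentages from the sketch above. The sampling details are my own, so the exact percentages depend on the random sample.

```python
import random

random.seed(0)  # for reproducibility


def simulate_collection(users, term="road"):
    """Build a collection where a random sample of each user's photographs,
    in the given proportion, is tagged with `term`."""
    photos = []
    for user, (n_photos, proportion) in users.items():
        tagged = set(random.sample(range(n_photos), round(n_photos * proportion)))
        for i in range(n_photos):
            photos.append(Photo(user=user, tags={term} if i in tagged else set()))
    return photos


# Unbiased: 50% of each user's photographs are tagged with "road".
unbiased = simulate_collection({"A": (10, 0.5), "B": (90, 0.5)})
# Biased: 10% of user A's and 90% of user B's photographs are tagged with "road".
biased = simulate_collection({"A": (10, 0.1), "B": (90, 0.9)})

print(bar_chart_percentages(unbiased, "road"))
print(bar_chart_percentages(biased, "road"))
```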

The line chart and the coefficient of variation

The line chart and the coefficient of variation are closely related to the bar chart because both are derived from the percentages that the bar chart shows; that is, the percentage of photographs in each group that were tagged with the term that interests us.

The line chart shows the z-score for each group; that is, the group’s percentage minus the mean percentage across the groups, divided by the standard deviation of the percentages.

The line chart for the 50%/50% random sample would look like this:

Again, notice the pattern: the first nine groups are similar to the last group.

The line chart for the 10%/90% random sample would look like this:

Again, notice the pattern: the first nine groups are different to the last group.

The coefficient of variation, which is the standard deviation of the percentages divided by their mean, shows the degree to which the groups vary: the larger the number, the more they vary. Whilst the bar charts and the line charts are different, they’re not very different. Consequently, the coefficients of variation are similar: 0.30 for the 50%/50% random sample and 0.32 for the 10%/90% random sample.
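Here’s a sketch of both calculations, again with illustrative names. I’m using the population standard deviation, which may differ from the authors’ choice, so the exact values needn’t match 0.30 and 0.32.

```python
import statistics


def z_scores(percentages):
    """Each group's percentage, minus the mean, divided by the standard deviation."""
    mean = statistics.mean(percentages)
    stdev = statistics.pstdev(percentages)
    if stdev == 0:  # every group has the same percentage
        return [0.0] * len(percentages)
    return [(p - mean) / stdev for p in percentages]


def coefficient_of_variation(percentages):
    """The standard deviation of the percentages divided by their mean."""
    return statistics.pstdev(percentages) / statistics.mean(percentages)


print(coefficient_of_variation(bar_chart_percentages(unbiased, "road")))
print(coefficient_of_variation(bar_chart_percentages(biased, "road")))
```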

Discussion

In the article, the authors construct term profiles for two collections of photographs: Geograph, with roughly 910,000 photographs; and Flickr, with roughly 760,000 photographs. Whilst I couldn’t detect the bias in real-world collections of nearly one million photographs, I could detect it in simulated collections of 100 photographs. Furthermore, simulating two collections of 100 photographs and constructing two term profiles made me think about the secondary order, the group size, and the nature of participation inequality.

The secondary order

To construct the bar chart, we order the photographs in the collection by prolificness. In other words, prolificness is the primary order. Whilst the least prolific user’s photographs will probably be contained within one group, the most prolific user’s photographs will probably be contained within many groups. To what extent does the allocation of these photographs to these groups—the secondary order—influence the pattern?

For example, let’s say we order the 50%/50% random sample by prolificness (the primary order) and by whether the photographs were tagged with road (the secondary order). Groups 1–4 will be 100%; group 5 will be 50%; groups 6–9 will be 0%; and group 10 will be 50%. The pattern suggests there is bias in the use of road, which is incorrect.

Whilst the effect is smaller for less prolific users, whose photographs span fewer groups, I think it’s important to experiment with the secondary order, to judge the stability of the pattern.
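One way to experiment is to reorder each user’s photographs before grouping: because the sketch above applies the primary order with a stable sort, the secondary order survives. Here’s a sketch, reusing the earlier names; the with_secondary_order function is my own.

```python
import random


def with_secondary_order(photos, key=None, seed=None):
    """Reorder each user's photographs (the secondary order), leaving the
    primary order, prolificness, to be applied by bar_chart_percentages."""
    by_user = {}
    for photo in photos:
        by_user.setdefault(photo.user, []).append(photo)

    rng = random.Random(seed)
    reordered = []
    for user_photos in by_user.values():
        user_photos = list(user_photos)
        if key is None:
            rng.shuffle(user_photos)  # a random secondary order
        else:
            user_photos.sort(key=key)  # an explicit secondary order
        reordered.extend(user_photos)
    return reordered


# The worked example above: within each user, tagged photographs come first.
tagged_first = with_secondary_order(unbiased, key=lambda photo: "road" not in photo.tags)
print(bar_chart_percentages(tagged_first, "road"))

# A few random secondary orders, to judge the stability of the pattern.
for seed in range(3):
    print(bar_chart_percentages(with_secondary_order(unbiased, seed=seed), "road"))
```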

The group size

As with the secondary order, I think it’s important to experiment with the group size, to judge the stability of the pattern. However, the speed of computation—and of experimentation—decreases as the size of the collection increases. For example, to compute a term profile for a simulated collection of 100 photographs takes milliseconds; to compute a term profile for a real-world collection of nearly one million photographs takes minutes.

To experiment with the group size for a real-world collection of nearly one million photographs, we could pre-compute term profiles with different group sizes. We could also explore the relationship between the group sizes and the coefficients of variation. Would this relationship show the ‘best’ group size? I don’t know.
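Here’s a sketch of that experiment for the simulated collections, reusing the earlier names; the group sizes are arbitrary choices that divide 100 evenly.

```python
# How does the coefficient of variation change with the group size?
for group_size in (5, 10, 20, 25, 50):
    percentages = bar_chart_percentages(biased, "road", group_size=group_size)
    print(group_size, round(coefficient_of_variation(percentages), 2))
```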

The nature of participation inequality

Jakob Nielsen frames participation inequality according to the number of users and the number of contributions, arguing that 90% of users don’t contribute, 9% of users contribute a little, and 1% of users contribute a lot. I framed participation inequality in terms of the number of contributions: user A contributed a little with 10 photographs; user B contributed a lot with 90 photographs. I hope that the example was easier to understand than one where nine users contributed one photograph each and one user contributed 90 photographs.

Conclusion

Whilst I couldn’t detect the bias in real-world collections of nearly one million photographs, I could detect it in simulated collections of 100 photographs. Consequently, by simulating two collections of 100 photographs, constructing two term profiles, and noticing the patterns, I attempted to validate the technique.

The word technique is important. In this article, Tamara Munzner distinguishes between a technique and a design study, where the former emphasises the design and analysis of an algorithm and the latter the design and analysis of visual encodings and interactions. A term profile isn’t really a bar chart, a line chart, and a coefficient of variation; that is, visual encodings and interactions. A term profile is really an algorithm that orders and groups the photographs in the collection, and computes the percentage of photographs in each group that were tagged with the term that interests us. Indeed, because a term profile is an algorithm, it makes more sense to think about the secondary order and the group size, and less sense to think about, for example, the layout of the bar chart and the line chart.

Thinking about a term profile as an algorithm also helps us to identify other situations where we could use the technique. For example, rather than a collection of photographs, let’s say you have a collection of tweets. Again, each was sent by a user, who also added a hashtag. Let’s say you’d like to understand which hashtags are associated with more positive sentiment and which hashtags are associated with more negative sentiment. To what extent will your understanding of the relationships between hashtags and sentiment be biased by how a small number of users sent a large number of tweets?
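As a sketch of how the algorithm might carry over, the statistic computed for each group could be the mean sentiment of the tweets that carry the hashtag, rather than a percentage. The tuple layout and the names are my own.

```python
import statistics


def sentiment_profile(tweets, hashtag, group_size=10):
    """Mean sentiment, per prolificness group, of the tweets tagged with `hashtag`.
    `tweets` is a list of (user, hashtags, sentiment) tuples."""
    # Prolificness: how many tweets each user has sent.
    counts = {}
    for user, _, _ in tweets:
        counts[user] = counts.get(user, 0) + 1

    # Primary order: the most prolific user's tweets first.
    ordered = sorted(tweets, key=lambda tweet: counts[tweet[0]], reverse=True)

    means = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        sentiments = [s for _, hashtags, s in group if hashtag in hashtags]
        means.append(statistics.mean(sentiments) if sentiments else None)
    return means
```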

In the next post, we’ll move from simulated collections of 100 photographs to a real-world collection of nearly six million photographs.

Code

For more information, see:

https://github.com/iaindillingham/geograph-sandbox/blob/v0.0.2/notebooks/Simulated%20Term%20Profiles.ipynb