<p>The Notebook, http://dillingham.me.uk/ (updated 2018-10-26T21:13:24+01:00)</p>
<h1>Scaffolding the visualisation design space</h1>
<p>2018-10-26, http://dillingham.me.uk/2018/10/26/scaffolding-the-visualisation-design-space</p>
<p>It’s often useful to think about the visualisation design space before thinking about point designs.
However, the visualisation design space is large.
By scaffolding it, we can reassure ourselves that we haven’t missed an obvious point design.</p>
<p>This post is a Jupyter Notebook.
View it using <a href="https://nbviewer.jupyter.org/github/iaindillingham/factotum/blob/master/notebooks/Scaffolding%20the%20visualisation%20design%20space.ipynb">nbviewer</a> or <a href="https://mybinder.org/v2/gh/iaindillingham/factotum/master?filepath=notebooks/Scaffolding%20the%20visualisation%20design%20space.ipynb">Binder</a>.</p>
<h1>Geograph term profiles</h1>
<p>2018-10-02, http://dillingham.me.uk/2018/10/02/geograph-term-profiles</p>
<p>In <a href="/2018/09/28/term-profiles.html">the previous post</a>, we used term profiles to judge the effects of participation inequality in simulated collections of 100 photographs.
In this post, we will use them to judge the effects of participation inequality in Geograph, a collection of nearly six million photographs of Great Britain and Ireland.</p>
<h2 id="geograph">Geograph</h2>
<p><a href="http://www.geograph.org.uk/">Geograph</a> is a collection of nearly six million photographs of Great Britain and Ireland.
Each photograph is associated with a 1km² location on either <a href="https://www.ordnancesurvey.co.uk/resources/maps-and-geographic-resources/the-national-grid.html">The National Grid</a> or the <a href="https://www.osi.ie/resources/reference-information-2/irish-grid-reference-system/">Irish Grid Reference System</a>, a description, and several tags.</p>
<h2 id="term-profiles">Term profiles</h2>
<p>Following <a href="http://journals.uic.edu/ojs/index.php/fm/article/view/3710/3035">this article</a> by Ross Purves, Alistair Edwardes, and Jo Wood, I constructed three term profiles showing the percentage of photographs in each group that were tagged with the terms <em>engine</em>, <em>hill</em>, and <em>road</em>.</p>
<p><img src="/assets/prop_has_term_engine_hill_road_2005-2018.png" alt="" /></p>
<p>I wasn’t able to reconstruct the term profiles from the article, so we shouldn’t make comparisons between these term profiles and those from the article.
Nevertheless, for <em>engine</em> and <em>hill</em>, there <strong>doesn’t</strong> seem to be a relationship between the percentage of photographs and ranked prolificness.
However, there <strong>does</strong> seem to be a relationship between the percentage of photographs and ranked prolificness for <em>road</em>.
The percentage of photographs is larger where ranked prolificness is smaller.</p>
<p>We can see the relationship for <em>road</em> more clearly when we compare the z-scores to the numbers of users in each group.</p>
<p><img src="/assets/z_score_num_users_road_2005-2018.png" alt="" /></p>
<p>For roughly the first 75% of groups, the range of z-scores is larger and the numbers of users in each group are smaller.
For roughly the last 25% of groups, the range of z-scores is smaller and the numbers of users in each group are larger.
The pattern suggests there <strong>is bias</strong> in the use of <em>road</em>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I began <a href="/2018/09/28/term-profiles.html">the previous post</a> by admitting that I couldn’t judge the effects of participation inequality using term profiles.
However, by taking the technique apart and putting it back together again, I feel better able to detect the bias that is the result of a small number of users taking a large number of photographs.</p>
<p>I think it’s worth repeating that term profiles are a technique that can be applied to any collection of documents, not just a collection of photographs; that is, term profiles are a tool in the <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a> toolbox.</p>
<h2 id="code">Code</h2>
<p>For more information, see:</p>
<p><a href="https://github.com/iaindillingham/geograph-sandbox/blob/v0.0.3/notebooks/Term%20Profiles.ipynb">https://github.com/iaindillingham/geograph-sandbox/blob/v0.0.3/notebooks/Term%20Profiles.ipynb</a></p>
<h1>Judging the effects of participation inequality with term profiles</h1>
<p>2018-09-28, http://dillingham.me.uk/2018/09/28/term-profiles</p>
<p>Let’s say you have a collection of photographs: each was taken by a user, who also added a tag and a location.
Let’s say you’d like to understand some aspect of these photographs, such as which tags are associated with which locations.
However, there’s a problem: a small number of users took a large number of photographs.
To what extent will your understanding of the relationships between tags and locations be biased by how these users tagged these photographs?</p>
<p>The problem, which Jakob Nielsen calls <em><a href="https://www.nngroup.com/articles/participation-inequality/">participation inequality</a></em>, is discussed in <a href="http://journals.uic.edu/ojs/index.php/fm/article/view/3710/3035">this article</a> by Ross Purves, Alistair Edwardes, and Jo Wood.
In the article, the authors use term profiles to judge the effects of participation inequality; that is, to detect the bias that is the result of a small number of users taking a large number of photographs.
I couldn’t detect the bias, however, so I decided to simulate two collections of 100 photographs—one unbiased, one biased—and construct two term profiles.</p>
<h2 id="what-is-a-term-profile">What is a term profile?</h2>
<p>A <em>term profile</em> relates some photographs in a collection—the photographs that were tagged with a term that interests us—to all photographs in a collection, according to how many photographs each user has taken; that is, according to <em>prolificness</em>.
A term profile has three components: a bar chart, a line chart, and a coefficient of variation.</p>
<h3 id="the-bar-chart">The bar chart</h3>
<p>To construct the bar chart, we order the photographs in the collection by prolificness.
We then group the photographs and compute the percentage of photographs in each group that were tagged with the term that interests us.</p>
<p>For example, let’s say a collection contains 100 photographs:
10 were taken by user A and 90 were taken by user B.
We order the photographs by prolificness, so user B’s photographs are first and user A’s photographs are last.
We then group the photographs into 10s, so the first nine groups contain user B’s photographs and the last group contains user A’s photographs.
Let’s say the term that interests us is <em>road</em>, so we compute the percentage of photographs in each group that were tagged with <em>road</em>.</p>
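<p>The steps above can be sketched in a few lines of Python. This is a minimal sketch, not the code from the notebook: the function name <code>group_percentages</code> and the representation of a photograph as a <code>(user, tag)</code> pair are assumptions, and the tags alternate rather than being randomly sampled so that the result is predictable.</p>

```python
from collections import Counter


def group_percentages(photos, term, group_size):
    """Percentage of photographs in each group that were tagged with `term`.

    `photos` is a list of (user, tag) pairs.
    """
    counts = Counter(user for user, _ in photos)
    # Primary order: prolificness, most prolific user first.
    # sorted() is stable, so each user's photographs keep their input order.
    ordered = sorted(photos, key=lambda photo: (-counts[photo[0]], photo[0]))
    groups = [
        ordered[i:i + group_size] for i in range(0, len(ordered), group_size)
    ]
    return [100 * sum(tag == term for _, tag in g) / len(g) for g in groups]


# User A took 10 photographs and user B took 90; every other photograph
# is tagged "road" (a deterministic stand-in for a 50% random sample).
photos = [("A", "road" if i % 2 == 0 else "other") for i in range(10)]
photos += [("B", "road" if i % 2 == 0 else "other") for i in range(90)]

percentages = group_percentages(photos, "road", 10)
print(percentages)
```

<p>The first nine values belong to user B’s photographs and the last value to user A’s.</p>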
<p>If a random sample of 50% of each user’s photographs was tagged with <em>road</em>, then our bar chart might look like this:</p>
<p><img src="/assets/tp_50_50_prop_has_term.png" alt="" /></p>
<p>Notice the pattern: the first nine groups, which contain user B’s photographs, are <strong>similar</strong> to the last group, which contains user A’s photographs.
The pattern suggests there <strong>isn’t bias</strong> in the use of <em>road</em>, which is correct.</p>
<p>If a random sample of 10% of user A’s photographs and 90% of user B’s photographs was tagged with <em>road</em>, then our bar chart might look like this:</p>
<p><img src="/assets/tp_10_90_prop_has_term.png" alt="" /></p>
<p>Notice the pattern: the first nine groups, which contain user B’s photographs, are <strong>different</strong> to the last group, which contains user A’s photographs.
The pattern suggests there <strong>is bias</strong> in the use of <em>road</em>, which is correct.</p>
<h3 id="the-line-chart-and-the-coefficient-of-variation">The line chart and the coefficient of variation</h3>
<p>The line chart and the coefficient of variation are closely related to the bar chart because both are derived from the percentage of photographs in each group that were tagged with the term that interests us.</p>
<p>The line chart shows the <a href="https://en.wikipedia.org/wiki/Standard_score">z-score</a> for each group.</p>
<p>The line chart for the 50%/50% random sample would look like this:</p>
<p><img src="/assets/tp_50_50_z_score.png" alt="" /></p>
<p>Again, notice the pattern: the first nine groups are similar to the last group.</p>
<p>The line chart for the 10%/90% random sample would look like this:</p>
<p><img src="/assets/tp_10_90_z_score.png" alt="" /></p>
<p>Again, notice the pattern: the first nine groups are different to the last group.</p>
<p>The <a href="https://en.wikipedia.org/wiki/Coefficient_of_variation">coefficient of variation</a> shows the degree to which the groups vary: the larger the number, the more they vary.
Whilst the bar charts and the line charts are different, they’re not very different.
Consequently, the coefficients of variation are similar: 0.30 for the 50%/50% random sample and 0.32 for the 10%/90% random sample.</p>
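<p>Both quantities can be derived from the group percentages in a few lines. The sketch below assumes the population standard deviation (<code>statistics.pstdev</code>); the article may use the sample standard deviation instead, which would change the numbers slightly. The input is an idealised 10%/90% collection with no sampling noise, so the values it produces are not the ones quoted above.</p>

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standard score of each value, relative to the mean of all values."""
    mu, sigma = mean(values), pstdev(values)
    # If every group is identical, sigma is zero and every z-score is zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# Idealised 10%/90% collection: nine groups of user B's photographs at 90%,
# one group of user A's photographs at 10%.
percentages = [90.0] * 9 + [10.0]

zs = z_scores(percentages)
cv = coefficient_of_variation(percentages)
print(zs, cv)
```

<p>Here the last group sits three standard deviations below the mean, whilst the first nine groups sit a third of a standard deviation above it.</p>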
<h2 id="discussion">Discussion</h2>
<p>In the article, the authors construct term profiles for two collections of photographs: <a href="http://www.geograph.org.uk/">Geograph</a>, with roughly 910,000 photographs; and <a href="https://www.flickr.com/">Flickr</a>, with roughly 760,000 photographs.
Whilst I couldn’t detect the bias in real-world collections of nearly one million photographs, I could detect it in simulated collections of 100 photographs.
Furthermore, simulating two collections of 100 photographs and constructing two term profiles made me think about the secondary order, the group size, and the nature of participation inequality.</p>
<h3 id="the-secondary-order">The secondary order</h3>
<p>To construct the bar chart, we order the photographs in the collection by prolificness.
In other words, prolificness is the primary order.
Whilst the least prolific user’s photographs will probably be contained within one group, the most prolific user’s photographs will probably be contained within many groups.
To what extent does the allocation of these photographs to these groups—the secondary order—influence the pattern?</p>
<p>For example, let’s say we order the 50%/50% random sample by prolificness (the primary order) and by whether the photographs were tagged with <em>road</em> (the secondary order).
Groups 1–4 will be 100%; group 5 will be 50%; groups 6–9 will be 0%; and group 10 will be 50%.
The pattern suggests there is bias in the use of <em>road</em>, which is incorrect.</p>
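<p>We can check the arithmetic with a short sketch. Each photograph is reduced to a boolean (tagged with <em>road</em> or not), and the lists are already in the order in which they will be grouped; the function name <code>group_percentages</code> is an assumption for illustration.</p>

```python
def group_percentages(ordered_flags, group_size):
    """Percentage of True values in each consecutive group."""
    groups = [
        ordered_flags[i:i + group_size]
        for i in range(0, len(ordered_flags), group_size)
    ]
    return [100 * sum(g) / len(g) for g in groups]


# User A: 10 photographs, 5 tagged. Always grouped last.
user_a = [True] * 5 + [False] * 5

# Secondary order 1: user B's 45 tagged photographs come before the 45 untagged.
user_b_tagged_first = [True] * 45 + [False] * 45
tagged_first = group_percentages(user_b_tagged_first + user_a, 10)

# Secondary order 2: user B's tagged and untagged photographs alternate.
user_b_alternating = [i % 2 == 0 for i in range(90)]
alternating = group_percentages(user_b_alternating + user_a, 10)

print(tagged_first)
print(alternating)
```

<p>The photographs are the same in both cases; only the secondary order differs, yet the first pattern looks biased and the second looks flat.</p>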
<p>Whilst the effect diminishes with prolificness, I think it’s important to experiment with the secondary order, to judge the stability of the pattern.</p>
<h3 id="the-group-size">The group size</h3>
<p>As with the secondary order, I think it’s important to experiment with the group size, to judge the stability of the pattern.
However, the speed of computation—and of experimentation—decreases as the size of the collection increases.
For example, to compute a term profile for a simulated collection of 100 photographs takes milliseconds; to compute a term profile for a real-world collection of nearly one million photographs takes minutes.</p>
<p>To experiment with the group size for a real-world collection of nearly one million photographs, we could pre-compute term profiles with different group sizes.
We could also explore the relationship between the group sizes and the coefficients of variation.
Would this relationship show the ‘best’ group size?
I don’t know.</p>
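<p>As a starting point, here is a minimal sketch of that experiment on a simulated collection. The collection size, the group sizes, and the fixed random seed are all assumptions chosen for illustration, and the population standard deviation is used.</p>

```python
import random
from statistics import mean, pstdev


def coefficient_of_variation(flags, group_size):
    """Coefficient of variation of the group percentages for one group size."""
    groups = [flags[i:i + group_size] for i in range(0, len(flags), group_size)]
    percentages = [100 * sum(g) / len(g) for g in groups]
    return pstdev(percentages) / mean(percentages)


# Simulated collection (hypothetical): 1,000 photographs, each tagged
# with the term of interest with probability 0.5.
rng = random.Random(0)
flags = [rng.choice([True, False]) for _ in range(1000)]

cv_by_group_size = {
    size: coefficient_of_variation(flags, size) for size in (10, 20, 50, 100)
}
print(cv_by_group_size)
```

<p>One caveat: for a purely random tag, sampling noise alone makes the group percentages less variable as the groups grow, so a coefficient of variation that falls with group size isn’t, by itself, evidence for or against bias.</p>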
<h3 id="the-nature-of-participation-inequality">The nature of participation inequality</h3>
<p>Jakob Nielsen frames participation inequality according to the number of users <em>and</em> the number of contributions, arguing that 90% of users don’t contribute, 9% of users contribute a little, and 1% of users contribute a lot.
I framed participation inequality in terms of the number of contributions: user A contributed a little with 10 photographs; user B contributed a lot with 90 photographs.
I hope that the example was easier to understand than one where nine users contributed one photograph each and one user contributed 90 photographs.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Whilst I couldn’t detect the bias in real-world collections of nearly one million photographs, I could detect it in simulated collections of 100 photographs.
Consequently, by simulating two collections of 100 photographs, constructing two term profiles, and noticing the patterns, I attempted to validate the technique.</p>
<p>The word <em>technique</em> is important.
In <a href="http://www.cs.ubc.ca/labs/imager/tr/2008/pitfalls/">this article</a>, Tamara Munzner distinguishes between a <em>technique</em> and a <em>design study</em>, where the former emphasises the design and analysis of an algorithm and the latter the design and analysis of visual encodings and interactions.
A term profile isn’t really a bar chart, a line chart, and a coefficient of variation; that is, visual encodings and interactions.
A term profile is really an algorithm that orders and groups the photographs in the collection, and computes the percentage of photographs in each group that were tagged with the term that interests us.
Indeed, because a term profile is an algorithm, it makes more sense to think about the secondary order and the group size, and less sense to think about, for example, the layout of the bar chart and the line chart.</p>
<p>Thinking about a term profile as an algorithm also helps us to identify other situations where we could use the technique.
For example, rather than a collection of photographs, let’s say you have a collection of tweets.
Again, each was sent by a user, who also added a hashtag.
Let’s say you’d like to understand which hashtags are associated with more positive sentiment and which hashtags are associated with more negative sentiment.
To what extent will your understanding of the relationships between hashtags and sentiment be biased by how a small number of users sent a large number of tweets?</p>
<p>In <a href="/2018/10/02/geograph-term-profiles.html">the next post</a>, we’ll move from simulated collections of 100 photographs to a real-world collection of nearly six million photographs.</p>
<h2 id="code">Code</h2>
<p>For more information, see:</p>
<p><a href="https://github.com/iaindillingham/geograph-sandbox/blob/v0.0.2/notebooks/Simulated%20Term%20Profiles.ipynb">https://github.com/iaindillingham/geograph-sandbox/blob/v0.0.2/notebooks/Simulated%20Term%20Profiles.ipynb</a></p>