The geovisualization week

Completed

No more writing code! Four applications are ready to be evaluated, although the evaluation methodology has somewhat changed.

Priority

The design study (of which 1000 words have been written) forms the second evaluation component. This ties in with writing up the requirements.

Schedule

On schedule but out of sync. The evaluation won’t be complete by the end of the week, as originally intended, although other parts of the write-up should be. This means more time next week to return to the evaluation. Capiche?

Issues

I got very dizzy the other night having spent too much time in front of the computer screen. It was about four pints dizzy, actually. Does that count as an issue?

Changes

I’m taking regular breaks now.

Posted in Geovisualization, MSc | Leave a comment

Be careful what you ask for. Especially with SQL.

N., D. and S. have been nice enough to read this blog. But I know, I know, you don’t understand what I’m going on about! It’s not you. It’s just that I don’t really write these things for a general audience.

Anyhow, to make amends, this is a description of something I’ve been working on this morning, for you (if you’re still interested!) and for me (because I’ll doubtless forget this when I come to writing my dissertation). Welcome to SQL, or Structured Query Language, a language used to access data stored in databases.

The problem

Imagine a flat area of land, measuring 30km by 30km, with a town every 10km. A map of Flatland would look something like the image below (although you should be aware that the pesky Flatlanders use a coordinate system with an origin to the top-left, rather than the bottom-left).

[Figure: map of Flatland]

Incidents happen in each town in Flatland. It doesn’t matter, really, what the incidents are: maybe an incident is the beer supply running low. (Maybe they speak Czech in Flatland?) Each incident has an identification number (incidentID), a start date and time (start), an end date and time (end), the town in which the incident happened, expressed as a point in space (point), and the severity of the beer shortage (severity).

Flatland’s residents created a database to store beer shortage incidents. It looked like this:

[Figure: Initial database]

Interesting. There are nine beer shortage incidents, in the nine towns of Flatland. Three happen between 12:00 and 13:00, three between 13:00 and 14:00 and three between 14:00 and 15:00. Flatland is very orderly. (Maybe they speak Swiss-German, rather than Czech?)

Note how each row has a unique updateID, which is different from the incidentID. This is because the residents of Flatland made a decision when creating their database to “model” beer shortages in this way. The upshot is that an incident can be recorded over several rows, each row representing an update to the incident.

Look! Some fresh incident data has come in…

[Figure: Updated database]

Crikey! There are three new rows in the database: the beer shortages with incidentID 1, 2 and 3 have two rows each. This could cause a problem: which rows best characterise the beer shortages with incidentID 1, 2 and 3? We could “collapse” the duplicated rows into one single row, using the SQL GROUP BY clause:

[Figure: Badly grouped rows]

However, note the subtle error: the beer shortages with incidentID 1, 2 and 3 end at 13:00. This should be 14:00.

What’s happened? Well, when two or more rows with the same incidentID are grouped, the values from only one row are retained (here, the first, although strictly speaking MySQL makes no promise about which row it picks) whilst the values in the other rows are lost. The Flatlanders require some more sophisticated SQL which makes an intelligent assessment about which values to retain and which values to lose. In this case, it would make sense to retain the earliest recorded start, the latest recorded end, and the maximum severity value. (Flatlanders are worst-case-scenario thinkers: an example of the Chicken Licken effect.)
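For the curious, the grouping described above can be sketched in a few lines. This is only an illustration, not the Flatlanders’ actual query: it uses Python’s built-in sqlite3 module rather than their database, the table name incident_updates is made up, the columns are called start_time and end_time (because “end” is a reserved word in SQLite), and the three rows are invented.

```python
import sqlite3

# In-memory database with a hypothetical table of incident updates.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE incident_updates (
           updateID   INTEGER PRIMARY KEY,
           incidentID INTEGER,
           start_time TEXT,
           end_time   TEXT,
           severity   INTEGER
       )"""
)

# Invented rows: incident 1 has two updates, incident 2 has one.
rows = [
    (1, 1, "2009-01-01 12:00", "2009-01-01 13:00", 1),
    (2, 2, "2009-01-01 13:00", "2009-01-01 14:00", 2),
    (3, 1, "2009-01-01 12:00", "2009-01-01 14:00", 3),
]
conn.executemany("INSERT INTO incident_updates VALUES (?, ?, ?, ?, ?)", rows)

# One row per incident: earliest start, latest end, maximum severity.
# Text timestamps in this year-month-day format sort chronologically.
query = """
    SELECT incidentID,
           MIN(start_time) AS start_time,
           MAX(end_time)   AS end_time,
           MAX(severity)   AS severity
    FROM incident_updates
    GROUP BY incidentID
    ORDER BY incidentID
"""
collapsed = conn.execute(query).fetchall()
for row in collapsed:
    print(row)
```

Every column in the SELECT is either the grouping column or wrapped in MIN or MAX, so nothing is left for the database to choose on its own.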

[Figure: Grouped rows]

So, there we have it. A slightly better characterisation of a beer shortage incident in Flatland. There are still some issues to consider (is it right to use the maximum incident severity? would a mean (average) be more appropriate?) but they can wait.

If you replace “beer shortage” with “traffic”, multiply the number of incidents by a few thousand and move them about a bit then you’ve a better idea about part of what I’m up to!

The geovisualization week

Completed

This was a depressing geovisualization week, spent opening cans of worms.

[Figure: Incident Typology] Incidents are depicted as solid horizontal lines. The time window is bounded by a start and end time (each a vertical dotted line). Incidents in red fall outside the time window; incidents in black fall inside the time window, to a greater or lesser degree. I’m sure a) I’ve seen this before somewhere; b) I’m being particularly dense in not getting it sooner.

Last week’s priority issues were first on the agenda. Normalising the dataset proved to be a morning’s worth of thinking without implementation: I investigated the chi-square statistic but ran into problems computing expected traffic incidents. Checking the application proved more productive and highlighted a glaring error in the data acquisition process. As such, another morning was spent checking (and double checking!) that the SQL query used to fetch traffic incident data for a given time period (e.g. a day, an hour) was accounting for all incidents in the time period, rather than only those which started in the time period. An upshot of this was an incident typology (although I think I’ve come across something similar in the literature) and a suite of tests.
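The difference between “started in the period” and “active in the period” can be sketched as an interval-overlap test. This is a minimal illustration rather than the query I actually ran: the function names, the half-open treatment of the window (an incident ending exactly when the window opens counts as outside) and the example times are all assumptions.

```python
from datetime import datetime


def overlaps_window(start, end, window_start, window_end):
    """True if the incident shares any time with the window.

    Equivalent SQL condition (column names assumed):
        WHERE start < :window_end AND end > :window_start
    """
    return start < window_end and end > window_start


def starts_in_window(start, window_start, window_end):
    """The narrower test: only incidents that begin inside the window."""
    return window_start <= start < window_end


def t(hour, minute=0):
    """Helper for building invented timestamps on a single day."""
    return datetime(2009, 1, 1, hour, minute)


window = (t(12), t(13))

# One invented incident per case in the typology.
incidents = {
    "before the window": (t(9), t(10)),
    "after the window": (t(14), t(15)),
    "starts before, ends inside": (t(11), t(12, 30)),
    "entirely inside": (t(12, 15), t(12, 45)),
    "starts inside, ends after": (t(12, 30), t(14)),
    "spans the whole window": (t(11), t(14)),
}

active = [name for name, (s, e) in incidents.items()
          if overlaps_window(s, e, *window)]
started = [name for name, (s, e) in incidents.items()
           if starts_in_window(s, *window)]

print(len(active), "active;", len(started), "started in the window")
```

The narrower test misses the incidents that began before the window opened but were still running during it.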

Last week I described how the application aggregated to 10km grid cells. I wanted to reduce this to 5km or 2km. When meeting with BM it also became apparent that although getting data into the application was relatively straightforward, it could be easier. The result was more refactoring. I attempted to replace the OGR and MySQL steps, which translate lat/lon into OSGB and then into screen coordinate space, with an on-the-fly conversion using the LandSerf API. This was a disaster! In its new guise the application ground to a halt.

In other news I also tried, and failed, to implement animation: I wanted zoom to mean “increased scale” rather than “enlarged graphics” and got stuck using the arrow keys to trigger a zoom/pan operation. I had to think hard about changes to the evaluation process (see below) and documented my workflow, which proved useful. Even the good news was tinged with bad: checking the application demonstrated some amusing errors in the dataset.

I think there was more but I can’t remember it. I took some time off.

Priority

Re-evaluate the evaluation. The six weeks’ development time is up and unforeseen circumstances may necessitate changing the original insight-based methodology. After six weeks’ looking at NetBeans, I’m looking forward to opening Zotero again.

Schedule

On schedule but not happy!

Issues

Discussed.

Changes

Discussed.

The geovisualization week

Completed

I’ve built out the ‘standard’ heat map into a more developed application. This involved refactoring the Processing code (and adding colours and zoom functionality from the giCentre’s Processing Utilities) and finding a more time-efficient and less error-prone coordinate transformation approach: creating an XML description of the incident dataset, processing with OGR and reading into a MySQL database.

The application currently contains traffic incident data for January 2009. In its default state, the number of unplanned incidents is aggregated by date (i.e. 24 hour period) and 100km grid cell. The up and down arrow keys increase and decrease the grid fineness/coarseness: 100km, 50km, 20km and 10km cells are available. Right and left arrow keys cycle through the dates from 1st to 31st January. The date is displayed in a selector to the right, which is also “clickable”: clicking a date “button” displays the traffic incident data for that date immediately.
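As a rough illustration of aggregating by date and grid cell, here is a minimal sketch that bins points by integer division of their coordinates. It is not the application’s code (the real aggregation goes through the database): the function names are made up, coordinates are assumed to be OSGB eastings and northings in metres, and the four incidents are invented.

```python
from collections import Counter


def grid_cell(easting, northing, cell_size_m):
    """Column and row index of the square cell containing the point."""
    return (int(easting // cell_size_m), int(northing // cell_size_m))


def count_by_date_and_cell(incidents, cell_size_m):
    """Count incidents per (date, cell) pair."""
    counts = Counter()
    for date, easting, northing in incidents:
        counts[(date, grid_cell(easting, northing, cell_size_m))] += 1
    return counts


# Invented incidents: (date, easting in metres, northing in metres).
incidents = [
    ("2009-01-01", 530000, 180000),
    ("2009-01-01", 541000, 181000),
    ("2009-01-02", 530000, 180000),
    ("2009-01-01", 430000, 430000),
]

coarse = count_by_date_and_cell(incidents, 100_000)  # 100km cells
fine = count_by_date_and_cell(incidents, 10_000)     # 10km cells

print(coarse[("2009-01-01", (5, 1))])
print(fine[("2009-01-01", (53, 18))], fine[("2009-01-01", (54, 18))])
```

Changing the cell size only changes the divisor, which is why switching between 100km, 50km, 20km and 10km grids is cheap once the coordinates are in metres.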

Zooming and panning are not currently implemented, but can be without much effort. Alternative traffic incident data (e.g. for one day split into 30 minute segments) can be loaded with relative ease: I’ve written a short Bash script to query the database and dump the results to a series of TSV files. The application should update automatically to reflect the change (should — because I haven’t tested this yet!).

I haven’t — as yet — added a map backdrop to the heat map for performance reasons. I also haven’t had the time to convert the OS Meridian 2 coastline polylines into a simplified polygon.

Priority

Normalising the dataset. And also checking that the application is doing what I’ve claimed it’s doing!

Schedule

On schedule.

Issues

Too many to discuss! But nothing which will hinder the evaluation stage.

Changes

None.

The geovisualization week

A slightly different format this week. I made some notes for the weekly progress meeting and rather than edit, I’m simply going to reproduce them here.

1. Compiled GDAL/OGR; split OS Meridian 2 layers into 100km cells; combine layers and generate images.

Why?

For efficient processing. Load layer into MySQL, measure length of road network in each cell. MySQL doesn’t have an intersection function, so layer has to be split by cell prior to loading. Could have written intersection/measure function in Java, although based on LandScript believed the function would have been inefficient (too much time to run, too much time to write).

For displaying in ‘zoomed’ prototype.

What happened?

GDAL/OGR compiled although doesn’t recognise MySQL (maybe it needs to be compiled against MySQL source rather than binary?). So, can split (wrote shell script) but cannot load. Could potentially use GDAL/OGR Windows binary from a Windows virtual machine (tested and works, but not ideal).

GDAL/OGR won’t generate images without existing image files. Compiled development version, which should generate images, although doesn’t.

And…?

Paused.

2. Added ‘zooming’ function to last week’s prototype (matrix heat map).

Why?

To investigate larger-scale patterns (smaller areas).

What happened?

Image problems (see above).

Querying database for smaller areas (10km cells) inefficient (too much time to run — point-in-polygon test for 9100 polygons because used MySQL’s geometry data types).

Some code problems, ‘linking’ interaction with country map to region map. Although partially-solved (some interaction inconsistencies) and tested with synthetic dataset.

And…?

Prototype developed. Paused.

3. Produced new prototype (‘standard’ heat map).

Why?

To add animation and dynamic selection to BM’s work on heat maps.

What happened?

Started with source data. Wished to have incidents which started in 2009, rather than incidents timestamped in 2009. Attempted to select from source data using command-line tools, but was inefficient. Loaded source data into database, without transformations and selected with SQL.

Built prototype in stages, based on very simple, synthetic dataset.

Wrote coordinate transformation function in Java (WGS84 to OSGB). (Good but not perfect — throws exceptions.) (Investigated coordinate transformation with GDAL/OGR.)

Data-wrangled (extract, transform, load, aggregate, extract, load) — scope for error.

And…?

Prototype developed. Paused.

4. Anything else?

I know more about…

  • LandSerf and LandScript
  • GDAL/OGR
  • PostGIS (binary and source releases)
  • JTS (Java Topology Suite) — some interesting tree structures
  • GEOS (Geometry Engine, Open Source — C++ port of JTS)
  • Bash scripting
  • Java, Processing, SQL (see MySQL’s Spatial Extensions, especially the omissions!)

5. And now?

1. I’d like to build out the two prototypes (‘standard’ and matrix heat maps) into more developed applications, ‘zooming’ country-region-district. I need to refactor the Processing code and find a more time-efficient and less error-prone coordinate transformation approach. I’d also like to heat map both the number of incidents and their severity.

2. I’d like to investigate alternative cell sizes (tree structures) but place this second.

The geovisualization week

Completed

The first prototype is complete and was demonstrated on Thursday (22nd July). The initial state is shown below, with map and matrix “resting” at the top-left cell. The “n” values refer to the total number of unique, unplanned traffic incidents in the first working week of January, 2009 in the selected (clicked) map or matrix cell.

[Figure: Matrix heat map prototype in its initial state]

Not surprisingly, nothing traffic-related happened in the Atlantic Ocean in the first working week of January, 2009. However, the image below shows London and South East England in the same time period. The matrix is much more interesting.

[Figure: Matrix heat map prototype with London and South East England selected]

Map and matrix support highlighting (mouse-over) and selecting (mouse-click). A highlighted cell’s “n” value is displayed below the map or matrix. A clicked cell’s “n” value is displayed below the map or matrix when the mouse is no longer over the map or matrix. Clicking (selecting) a map cell displays the corresponding matrix and fills the cell with an opaque red. Clicking (selecting) a matrix cell adds a green border to the cell, useful for making comparisons between matrices.

Priority

There are three areas to address:

Time. At present, incidents don’t have duration. They happen (or are displayed as happening) at a moment in time. Clearly, incidents should have duration. I need to write an SQL query to extract this from the database. Not so hard.

Scale. Each map cell represents a 100km by 100km square. This is a large area (smaller-scale). I need to add a larger-scale map (smaller area) and aggregate incidents by, for example, 10km by 10km map cells. Moderately hard.

Scale, network length, normalisation. Really, the application shows nothing more than where most roads (and most people) are. It would be interesting to segment the area by road network density and possibly road network type (motorway, “A” road etc.); and to normalise the number of incidents by the network length. Hard!
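The normalisation idea in the last point can be sketched very simply: divide each cell’s incident count by the length of road in that cell. This is a sketch under assumptions, not something the prototype does yet: the function name, the cell labels and all the figures are invented, and cells with no recorded road length are returned as None rather than guessed at.

```python
def incidents_per_km(incident_counts, network_length_km):
    """Incidents per kilometre of road for each cell.

    Cells with no (or zero) recorded network length get None,
    since a rate cannot be computed for them.
    """
    rates = {}
    for cell, count in incident_counts.items():
        length = network_length_km.get(cell, 0.0)
        rates[cell] = count / length if length > 0 else None
    return rates


# Invented figures for three hypothetical cells.
incident_counts = {"A": 120, "B": 30, "C": 2}
network_length_km = {"A": 2400.0, "B": 300.0}

rates = incidents_per_km(incident_counts, network_length_km)
for cell, rate in rates.items():
    print(cell, rate)
```

In this made-up example cell A has four times as many incidents as cell B but half the rate per kilometre, which is the kind of reversal that raw counts hide.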

Schedule

Slowing. However, I have something to evaluate and a second prototype on the way (see the “Scale” point above).

Issues

Data issues, rather than visualisation issues, are putting the brakes on. However, I’ve been using the command-line GDAL/OGR tools to reduce processing time for basic operations (rather than LandSerf’s LandScript).

Changes

None.
