Posts Tagged ‘science’

Python, R, Qt, peewee, bokeh, pandas, SQLite plus a couple of sleepless nights, and here you are: a cute app for environmental monitoring needs )))

Main window of the application

It is an amusing coincidence that another MOOC I took this week (Geospatial Intelligence & the Geospatial Revolution) mentioned [natural] disasters. For the other course, see my recent Disasters: Myth or the Reality post.

In Geospatial Intelligence they gave a weird assignment: one needs to mark the location on the world map where the next international natural disaster will occur O_o. This is not an easy task by any means, and the lecturer suggested using one’s ‘gut feeling’ if one’s knowledge is insufficient (I suppose it is close to impossible to find someone who can make such a prediction taking into account all the types of disasters). A link to the International Disasters Database was given, though, so I accepted the challenge (to make a data-driven prediction). To predict the exact location of the next disaster one would need a lot of data – far more than you can get out of that database – so my goal was to make a prediction at the country level. (BTW the graphs from my post about disasters seem to be based on data from this database – I saw one of them at that site.)

I passed a query to the database and saved the output to process it with R. The dataframe looks like this:

year | country | continent | occurrence | deaths | injured | homeless | total_affected | total_damage
Example of disasters dataset

So how to predict the country where the next disaster will strike? I came up with the idea of calculating the cumulative average occurrence of disasters per country per year and plotting it to see the trends. If I simply calculated the average occurrence of disasters per country over the whole observation period, I would have a hard time choosing between countries with close numbers. Besides, the overall average per year can be misleading by itself: it can be high because of a large number of disasters at the beginning of the 20th century even if the numbers in the 21st are relatively low.

The formula I used for calculating the cumulative average for a given year was:

Cumulative_Average = Total_Occurrences / ( Given_Year – (Starting_Year – 1) ) ,

where Total_Occurrences is the sum of occurrences of disasters for the given country in the time interval between the starting year and the given year (inclusive).
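To make the formula concrete, here is a tiny sketch of mine with made-up numbers (not part of the original analysis) that computes the same cumulative average for a single country with cumsum(); note that years without recorded disasters still count in the denominator:

# toy data for one country only; the numbers are invented for illustration
toy <- data.frame(country    = rep('China', 4),
                  year       = c(1950, 1951, 1953, 1954),
                  occurrence = c(2, 1, 3, 2))

starting_year <- min(toy$year)

# Cumulative_Average = Total_Occurrences / (Given_Year - (Starting_Year - 1))
toy$cumulative_average <- cumsum(toy$occurrence) / (toy$year - (starting_year - 1))
toy
#   country year occurrence cumulative_average
# 1   China 1950          2                2.0
# 2   China 1951          1                1.5
# 3   China 1953          3                1.5
# 4   China 1954          2                1.6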

Here is the plot I got for the short-list countries (plotting the results for all 180 countries from the dataset makes the plot unreadable):

Cumulative average number of disasters

It is clear that China and Indonesia are the two most likely candidates for the next disaster to strike, with China in the lead. I’m not ready to provide insight into the reasons for the increasing number of natural disasters in the countries on the plot (especially Turkey and Iran). Maybe events simply get documented more often?… It should be investigated further.

The code

Here is the code used to create the plot above. The ‘sqldf’ package was really helpful for separating the data for the short-list countries from the rest of the 180 countries.


library(ggplot2)
library(sqldf)
library(grid)
#library(gridExtra)

# Load natural disasters data ---------------------------------------------

dis <- read.csv("~/R/Disasters/Natural_disasters.csv")

# Create data frame with average number of disasters per year -------------

average_events <- data.frame(country = character(),
                             year = numeric(),
                             disasters_per_year = numeric(),
                             stringsAsFactors = F)

countries <- unique(dis$country)

starting_year <- min(dis$year) - 1 # subtract 1 year so that the denominator is always greater than 0

for (country in countries) {
  data <- dis[dis$country == country, ] # we need data for one country at a time
  disasters_count <- 0
  years <- unique(data$year)

  for (year in years) {
    total_years <- year - starting_year
    y_data <- data[data$year == year, ]
    n_disasters <- sum(y_data$occurrence)
    disasters_count <- disasters_count + n_disasters
    average_disasters <- disasters_count / total_years
    row <- data.frame(country = country, year = year, disasters_per_year = average_disasters)
    average_events <- rbind(average_events, row)
  }
}

# Plot data about average number of disasters per country per year --------
# Data for 180 countries is hard to plot, so let's filter the most affected.
# Use SQL to query the data: subset countries that had at least 0.6 disasters per year
# in any year after 2000
danger <- sqldf('SELECT * FROM average_events WHERE country IN
                (SELECT DISTINCT country FROM average_events WHERE disasters_per_year >= 0.6 AND year > 2000)')

p <- ggplot(danger, aes(x = year, y = disasters_per_year)) +
  geom_line(size = 1.2, aes(colour = country, linetype = country)) +
  labs(title = 'Cumulative average number of disasters per year',
       x = 'Year',
       y = 'Average number of disasters cumulative') +
  guides(colour = guide_legend(keywidth = 3, keyheight = 1)) +
  theme(axis.text.x = element_text(angle = 0, hjust = NULL),
        axis.title = element_text(face = 'bold', size = 14),
        title = element_text(face = 'bold', size = 16),
        legend.position = 'right',
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.key.width = unit(1.5, 'cm'),
        legend.key.height = unit(1, 'cm'))

plot(p)

The Travelling Salesman Problem (TSP) is a well-known computational challenge. A lot of algorithms have been developed to solve it or its special cases.

I came across an article by Fang Liu, ‘A dual population parallel ant colony optimization algorithm for solving the travelling salesman problem’. In it he proposed a modification of the Ant Colony System algorithm for solving the TSP and presented the results obtained by his algorithm. In the results table everything looked fine – the algorithm appeared to provide very good solutions for the TSP instances from TSPLIB (the common testing ground for TSP algorithms).

So the researcher presented good results… it seems. But then he decided to show the best routes his algorithm was able to find and annotated them with the corresponding route costs. Let’s take a look at one of them. Here is his best route for the ‘att48’ instance from TSPLIB:

Route that claims to be optimal (but the cost is very wrong)

The optimal route for ‘att48’ and its cost are well known (this applies to all TSPLIB instances). Its cost is approximately 33523 (there are different approaches to rounding the distances between points). So what we see in the picture above should be the optimal route (or something extremely close to it). But dear reader, do you think that you see an optimal route? Humans are able to provide very good solutions to TSP instances that consist of not too many points. I bet you could draw a far better route yourself. The route in this picture is 1, 8, 46, 33, 20, 17, 43, 27, 19, 37, 6, 30, 36, 28, 7, 18, 44, 31, 38, 9, 40, 15, 12, 11, 47, 21, 13, 25, 14, 23, 3, 22, 16, 41, 34, 2, 29, 5, 48, 39, 32, 24, 42, 10, 45, 35, 4, 26, 1, and its cost is 41052, which is a whopping 22% away from the optimum! The same goes for the other illustration in the article.

Here, take a look at the optimal route, whose cost really is 33523:

Actually optimal route for ‘att48’ with Cost = 33523
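By the way, such claims are easy to check. Here is a minimal sketch of mine (not code from the article) for computing the cost of a tour on the att48 instance, assuming coords is a 48 x 2 matrix of the x/y values read from the TSPLIB file att48.tsp (not reproduced here); the ATT pseudo-Euclidean distance follows the TSPLIB documentation:

att_dist <- function(p1, p2) {
  # TSPLIB ATT pseudo-Euclidean distance between two points
  r <- sqrt(sum((p1 - p2)^2) / 10)
  t <- round(r)
  if (t < r) t + 1 else t
}

tour_cost <- function(coords, tour) {
  # 'tour' is a closed route given as a vector of 1-based city indices,
  # e.g. the route quoted above: c(1, 8, 46, ..., 4, 26, 1)
  total <- 0
  for (i in seq_len(length(tour) - 1)) {
    total <- total + att_dist(coords[tour[i], ], coords[tour[i + 1], ])
  }
  total
}

Any published permutation can be checked against the TSPLIB coordinates this way.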

So what can we conclude? I do believe that the routes demonstrated in the article are the best routes found by the given algorithm, but the costs are fabricated, both for the routes and for the table of results. I think the author developed an algorithm that wasn’t able to find good solutions and padded the article with a fraudulent table of test results. And clearly this article wasn’t reviewed by a scientist with knowledge of the TSP field, because these plots are so obviously flawed that you can’t overlook it!

Authors usually don’t show plots of the routes their algorithms find. I wonder how many other modern algorithms come with fabricated results?

For my own project I needed to create a graph based on a Delaunay triangulation using the NetworkX Python library, with the special condition that all the edges must be unique. The points for the triangulation are stored in a .shp file. I was lucky enough to find this thread on Delaunay triangulation with NetworkX graphs. I made a nice function out of it for processing NetworkX point graphs. This function preserves node attributes (which are lost after triangulation) and calculates the lengths of the edges. It can be further improved (and most likely will be), but even in its current state it is very handy.

Example of use:

import networkx as nx
import scipy.spatial
import matplotlib.pyplot as plt

path = '/directory/'
f_path = path + 'filename.shp'
G = nx.read_shp(f_path)

GD = createTINgraph(G, show = True)

Code for the function:

import networkx as nx
import scipy.spatial
import matplotlib.pyplot as plt
from math import sqrt

def createTINgraph(point_graph, show = False, calculate_distance = True):
  '''
  Creates a graph based on a Delaunay triangulation

  @param point_graph: a NetworkX point graph, e.g. one made by read_shp()
  @param show: whether or not the resulting graph should be shown, boolean
  @param calculate_distance: whether the length of the edges should be calculated
  @return - a graph made from a Delaunay triangulation

  @Copyright notice: this code is an improved (by Yury V. Ryabov, 2014, riabovvv@gmail.com) version of
                    Tom's code taken from this discussion
                    https://groups.google.com/forum/#!topic/networkx-discuss/D7fMmuzVBAw
  '''

  TIN = scipy.spatial.Delaunay(point_graph.nodes())  # triangulate the node coordinates
  edges = set()
  # for each Delaunay triangle
  for n in xrange(TIN.nsimplex):
      # for each edge of the triangle
      # sort the vertices
      # (sorting avoids duplicated edges being added to the set)
      # and add to the edges set
      edge = sorted([TIN.vertices[n,0], TIN.vertices[n,1]])
      edges.add((edge[0], edge[1]))
      edge = sorted([TIN.vertices[n,0], TIN.vertices[n,2]])
      edges.add((edge[0], edge[1]))
      edge = sorted([TIN.vertices[n,1], TIN.vertices[n,2]])
      edges.add((edge[0], edge[1]))


  # make a graph based on the Delaunay triangulation edges
  graph = nx.Graph(list(edges))

  #add nodes attributes to the TIN graph from the original points
  original_nodes = point_graph.nodes(data = True)
  for n in xrange(len(original_nodes)):
    XY = original_nodes[n][0] # X and Y tuple - coordinates of the original points
    graph.node[n]['XY'] = XY
    # add other attributes
    original_attributes = original_nodes[n][1]
    for i in original_attributes.iteritems(): # for tuple i = (key, value)
      graph.node[n][i[0]] = i[1]


  # calculate the Euclidean length of edges and write it as an edge attribute
  if calculate_distance:
    edges = graph.edges()
    for i in xrange(len(edges)):
      edge = edges[i]
      node_1 = edge[0]
      node_2 = edge[1]
      x1, y1 = graph.node[node_1]['XY']
      x2, y2 = graph.node[node_2]['XY']
      dist = sqrt( pow( (x2 - x1), 2 ) + pow( (y2 - y1), 2 ) )
      dist = round(dist, 2)
      graph.edge[node_1][node_2]['distance'] = dist


  # plot graph
  if show:
    pointIDXY = dict(zip(range(len(point_graph)), point_graph))
    nx.draw(graph, pointIDXY)
    plt.show()

  return graph
 
“Wow!” – I said to myself after reading the R Helps With Employee Churn post – “I can create interactive plots in R?!!! I have to try it out!”

 

I quickly came up with the idea of creating an interactive plot for my simple model for assessing the profitable ratio between the volume of waste that could be illegally disposed of and the costs of illegal disposal [Ryabov Y. (2013) Rationale of mechanisms for the land protection from illegal dumping (an example from the St.-Petersburg and Leningrad region). Regional Researches. №1 (39), p. 49-56]. The conditions for profitable illegal dumping can be described as follows:

 

Here: k – the probability of being fined for illegal disposal of waste;

P – maximum fine for illegal disposal of waste (illegal dumping);

V – volume of waste to be [illegally] disposed by the waste owner;

E – costs of illegal disposal of waste per unit;

T – official tax for waste disposal per unit.

The conditions for profitable landfilling can be described as follows:

Here: V1 – total volume of waste that is supposed to be disposed at illegal landfill;

Tc – tax for disposal of waste at illegal landfill per unit;

P1 – maximum fine for illegal landfilling;

E1 – expenditures of the illegal landfill owner for disposal of waste per unit.
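A plausible reading of these two conditions (my assumption, not a quotation of the exact inequalities; see the cited paper for the exact model) is that illegal dumping pays off for the waste owner when the costs of illegal disposal plus the expected fine stay below the official tax, and that an illegal landfill pays off for its owner when the revenue from accepted waste exceeds the disposal costs plus the expected fine. A minimal sketch under these assumptions:

# Sketch under my assumptions only (not the exact inequalities from the paper).
# The fine probability for the landfill owner is taken to be the same k,
# which is an additional assumption of this sketch.

dumping_is_profitable <- function(k, P, V, E, T) {
  # T here is the official tax of the model (it shadows R's TRUE shorthand only inside this function)
  E * V + k * P < T * V
}

landfill_is_profitable <- function(k, P1, V1, E1, Tc) {
  Tc * V1 > E1 * V1 + k * P1
}

# toy numbers, for illustration only
dumping_is_profitable(k = 0.1, P = 250000, V = 1000, E = 100, T = 400)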

Let’s plot the graphs (with some random numbers, except for the fines, for a nice-looking representation) to get an idea of how it looks.

 

Note that there is a footnote (this post provides nice examples of how to do it) with the values used for plotting – it is important to have this kind of indication if we want to create a series of plots.

Now I will show you the result and then provide the code and some tips.

Playing with the plot

Tips and Tricks

Before I show you the code, I want to share my hard-earned knowledge about the nuances of the manipulate library. There are several ways to get a static plot like that using ggplot, but some of them will fail to be interactive with manipulate.

  1. All the data for the plot must be stored in one dataframe.
  2. All data for plots must be derived from the dataframe (avoid passing single variables to ggplot).
  3. Do not use geom_hline() for the horizontal line – generate values for this line, store them inside the dataframe, and draw it as a regular graph.
  4. To create a footnote (to know exactly which parameters were used for the current graph) use the arrangeGrob() function from the gridExtra library.
  5. Always use $ inside aes() to address the columns of your dataframe if you want the plots to be interactive (see the sketch right after this list).
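Here is a minimal toy sketch of mine (not the code of this post, which follows below) that obeys these rules: all values live in a single data frame, the ‘horizontal line’ is stored as an ordinary column, and the columns are addressed with $ inside aes(), so the plot stays interactive under manipulate (which requires RStudio):

library(ggplot2)
library(manipulate)

manipulate({
  # rules 1-2: everything the plot needs is derived from one data frame
  df <- data.frame(x = 0:100)
  df$cost <- a * df$x + b
  df$threshold <- rep(threshold, nrow(df)) # rule 3: the 'horizontal line' is just another column

  # rule 5: columns are addressed with $ inside aes()
  p <- ggplot(df, aes(x = df$x)) +
    geom_line(aes(y = df$cost), colour = 'blue') +
    geom_line(aes(y = df$threshold), colour = 'red', linetype = 'dashed')
  print(p)
},
a = slider(1, 10, initial = 2),
b = slider(0, 100, initial = 10),
threshold = slider(0, 1000, initial = 200))

Note that the plot has to be printed explicitly inside the manipulate() expression, otherwise nothing will appear.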

The Code

library(ggplot2)
library(grid)
library(gridExtra)
library(manipulate)
library(scales)
library(reshape2)

## Ta --- official tax for waste utilisation per tonne or cubic metre.
## k --- probability of the waste owner getting fined for illegal dumping (0

(more…)

Today I received a copy of the proceedings of the conference I participated in. A peculiar moment is that my article about using the Random Forest algorithm to forecast illegal dumping sites is the very first article in my section (as well as in the whole book), and it was placed out of the alphabetical order of the authors’ family names (that order is correct for all other authors in all sections).

My presentation and speech were remarkable indeed – the director of my scientific research centre later called it “the speech of a guru” (actually, not a “guru”; there is just no suitable equivalent in English for the word used). Also, the extended version of this article for one of the journals of the Russian Academy of Sciences received extremely positive feedback from the reviewers. So I suppose the position of my article is truly a sign of respect for the research and presentation, and not a random editorial mistake.

Now I should overcome procrastination and make a post (or most likely two) about this research of mine.

 

Ok, it’s time to finish the story about land monitoring in Sverdlovskaya region. In this post I would like to demonstrate some of the most unpleasant types of land use.

Let’s begin with illegal dumping. This dump (note the smoke from the burning waste) is located right next to a potato field (mmm… those potatoes must be tasty). The ground here was intentionally excavated for dumping waste. Obviously this dump is exploited by the agricultural firm that owns the land, but who cares…

Panorama of freshly burnt illegal dump

The next stop is peat cutting. Huge biotopes are destroyed for no good reason (I can’t agree that the use of peat as an energy source is a good one). In the picture below you can see a peat cutting with an area of 1402 ha. There are dozens of them in the study area…

Peat Cutting (RapidEye, natural colours)

But the ugliest scars on the Earth’s surface are left by mining. There is a town called Asbestos in Sverdlovskaya region, named after the asbestos that is mined there. The quarry has an area of 1470 ha and its depth is over 400 meters. Its slag-heaps cover another 2500 ha… The irony is that this quarry gives the town its jobs and is killing it at the same time. You see, if you want to dig deeper you have to make the quarry wider accordingly. The current depth is 450 m and the projected depth is over 900 m, but the quarry is already right next to residential buildings. So the quarry is going to consume the town… By the way, the local cemetery was already consumed. Guess what happened to the human remains? Well, it is Russia, so they were dumped into the nearest slag-heap.

Here is the panorama of the quarry. You may try to locate the BelAZ trucks down there 😉

Asbestos quarry

Here is a part of the biggest slag-heap:

A slag-heap

That’s how it looks from space:

Asbestos town area (imagery – RapidEye, NIR-G-B pseudo-colour composition)

And in the end I will show you a very basic schema of the disturbed land in the study area (no settlements or roads included). Terrifying, isn’t it?

Basic schema of disturbed land

Open GIS! is a conference for users and developers of open-source GIS. It will take place on November 17-18, 2012 in Moscow, Russia. The really cool thing is that participation in this crowd-sourced conference is free of charge!!! Of course, modest donations will be appreciated, as will sponsorship, but it is not mandatory – we want this event to be available to everyone who is interested in open geospatial software and data. Download the leaflet!

The call for papers will open in August, but you may already propose topics for your talk or master class to the committee.

Please, spread the word!

P.S. If you are one of those rare people interested in illegal dumping research (from the geospatial point of view), come to this conference to hear my talk on it!

A Bit More on Ignorance

Posted: June 8, 2012 in GIS

By chance I found an article that has some relation to my previous post. The article has an intriguing name: “The Influence of Map Design on Resource Management Decision Making”. Unfortunately it is not open access, so I wasn’t able to read it. Also, the abstract omits the conclusions… And I would prefer to see a study of real cases… Nevertheless, here you are:

  Abstract
The popular use of GIS and related mapping technologies has changed approaches to map-making. Cartography is no longer the domain of experts, and the potential for poorly designed maps has increased. This trend has raised concerns that poorly designed maps might mislead decision makers. Hence, an important research question is this: Can different cartographic displays of one data set so influence decision makers as to alter a decision that relies on the mapped data? This question was studied using a spatial decision problem typical for decision makers in a resource management agency in the United States (the USDA Forest Service). Cartographic display was varied by constructing three hypothetical map treatments from the same data set. Map treatments and other test materials were mailed to Forest Service District Rangers. All District Rangers received the same decision problem, but each received only one of the three map treatments. The participants were asked to make a decision using the decision problem and map treatment. Information about the decision and the influence of each map treatment was obtained through a questionnaire. The research and its implications for map-based decision making are presented and discussed.

This post is something of a reply to this one.

So our goal is to determine whether our point process is random or not. We will use R, and the spatstat package in particular. spatstat provides a very handy function for this, which uses the K-function combined with Monte Carlo tests. I will spare you the rambling about the theory behind it – the necessary links were already provided. Let’s get straight to the action.

In this example I will test data about the locations of my “favourite” illegal dumps in St. Petersburg and Leningrad region.

# we will need:
library(maptools)
library(rgdal)
library(spatstat)

# import data for analysis
S <- readShapePoints("custom_path/dump_centroids.shp",
                     proj4string = CRS("+proj=tmerc +lat_0=0 +lon_0=33 +k=1 +x_0=6500000 +y_0=0 +ellps=krass +towgs84=23.92,-141.27,-80.9,-0,0.35,0.82,-0.12 +units=m +no_defs"))
SP <- as(S, "SpatialPoints")
P <- as(SP, "ppp")

# perform the test itself with 100 simulations
E <- envelope(P, Kest, nsim = 100)
plot(E, main = NULL)

And here is what we get in the end: a fancy graph which demonstrates that our data (Kobs) significantly deviates from a random process (Ktheo):