Zalando Challenge
Introduction¶
I recently came across this little data challenge, which was posted by Zalando (one of the top fashion retailers in Europe) as a teaser for data scientists/analysts. The challenge is quite straightforward and is a good opportunity to show how to deal with this kind of analysis using the standard tools of python and the interactive notebook.
For data analysis, the community is in two minds between between python
and R
, but for spatial data it looks like the ecosystem has taken a bet on python
. There are useful python libraries for all stages of a geoprocessing pipeline, from data handling (shapely
, GDAL/ogr
, pyproj
, ...) to analysis (shapely, (geo)pandas, PySal, numpy/scipy, sklearn, etc) to plotting and visualisation (matplotlib, descartes, cartopy, pyQGIS). I've always found the last stage (visualisation) to be the most negelected, and trying to join up the data and analysis threads with interactive web-based visualisations (like and D3
and mapping libraries like leaflet
) is usually a pain. The ipython notebook promises interactivity, and newer packages like Plotly and Bokeh are really pushing it along fast.
I will use Shapely
for dealing with the geographic data, pyproj
for projections and scipy
for optimisation routines. It's nice to have interactive maps in the notebook, and for this I render the geojson geometries with google maps
(although leaflet
would be a fine choice too).
On to the challenge.
Problem Statement¶
The Zalando Data Intelligence Team is searching for a new top analyst. We already know of an excellent candidate with top analytical and programming skills. Unfortunately, we don’t know her exact whereabouts but we only have some vague information where she might be. Can you tell us where to best send our recruiters and plot an easy to read map of your solution for them?
This is what we could extract from independent sources:
The candidate is likely to be close to the river Spree. The probability at any point is given by a Gaussian function of its shortest distance to the river. The function peaks at zero and has 95 of its total integral within ±2730 m.
A probability distribution centered around the Brandenburg Gate also informs us of the candidate’s location. The distribution’s radial profile is log-normal with a mean of 4700m and a mode of 3877m in every direction.
A satellite offers further information: with 95% probability she is located within 2400m distance of the satellite’s path (assuming a normal probability distribution)
Earth radius
6371km
Brandenburg Gate GPS coordinates
{
"geometry": {
"coordinates": [13.377689, 52.516288],
"type": "Point"
},
"type": "Feature"
}
Satellite path is a great circle path between coordinates
{
"geometry": {
"coordinates": [[13.39915, 52.590117], [13.553989, 52.437385]],
"type": "LineString"
},
"type": "Feature"
}
River Spree can be approximated as piecewise linear between the following coordinates:
{
"geometry": {
"coordinates": [[13.274099, 52.529198],
[13.29234, 52.531835], [13.298541, 52.522116],
[13.317349, 52.520569], [13.322434, 52.524877],
[13.329, 52.522788], [13.332075, 52.517056],
[13.340743, 52.522514], [13.356665, 52.517239],
[13.372158, 52.523063], [13.379453, 52.519198],
[13.392328, 52.522462], [13.399703, 52.520921],
[13.406054, 52.515333], [13.416354, 52.514863],
[13.435923, 52.506034], [13.461587, 52.496473],
[13.483216, 52.487641], [13.491456, 52.488739],
[13.503386, 52.464011]],
"type": "LineString"
},
"type": "Feature"
}
This information is given in the data file.
Solution Outline¶
How to solve this problem?
In this case, the probability distributions are simple enough and we either have their parameters or can easily find them given the information. The probability distributions can be written as:
- Lognormal distribution centered on the Brandenburg Gate:
- Normal distribution of the shortest distance to a segment of the River Spree
- Normal distribution centered on a satellite path, given as a great-circle arc between two endpoints
These are independent observations of the probability of finding the person, and the solution will be the position ${\mathbf{x}}$ which maximises the total joint probability. Hopefully along the way we can shed some light on how to deal with spatial projections, distance calculations, optimisation and web mapping (all directly in a python
notebook.
These are the imports I have used throughout the notebook and I've grouped them together for convenience.
from __future__ import division
from functools import partial
import numpy as np
import jinja2
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import seaborn as sns
import scipy.stats as stats
import scipy.special as special
import scipy.optimize as opt
import shapely.geometry as geom
import shapely.ops as ops
import pyproj
from geojson import Feature, FeatureCollection
from geopy.geocoders import Nominatim
from IPython.display import display, HTML, Javascript
from plot_geoms import *
sns.set_context("paper", font_scale=1.5, rc={"figure.figsize": (18,12),
"lines.linewidth": 0.2,
"axes.grid": False,
"grid.linewidth": 0, })
geod = pyproj.Geod(ellps='WGS84')
%matplotlib inline
js = """window.gmaps_ready = function() { console.log('initialised gmaps');} ;
if (typeof google === 'object' && typeof google.maps === 'object' && typeof google.maps.Map === 'function') { initialize(); } else {
$.getScript("https://maps.googleapis.com/maps/api/js?v=3.exp&callback=gmaps_ready");
}"""
#display(Javascript(js))
#ignore
The coordinates of the Spree are given in the data file and we use them here to make a Shapely LineString
object.
The satellite path is given as the start and end coordinates of a segment of a great circle.
sat_beg, sat_end = (13.39915, 52.590117), (13.553989,52.437385)
bb_gate_coords = (13.377689, 52.516288)
spree_coords = [(13.274099, 52.529198), (13.29234, 52.531835),(13.298541, 52.522116),
(13.317349, 52.520569),(13.322434, 52.524877),
(13.329, 52.522788),(13.332075, 52.517056),
(13.340743, 52.522514),(13.356665, 52.517239),
(13.372158, 52.523063),(13.379453, 52.519198),
(13.392328, 52.522462),(13.399703, 52.520921),
(13.406054, 52.515333),(13.416354, 52.514863),
(13.435923, 52.506034),(13.461587, 52.496473),
(13.483216, 52.487641),(13.491456, 52.488739),
(13.503386, 52.464011)]
spree = geom.LineString(spree_coords)
satellite_path = geom.LineString([sat_beg, sat_end])
bb_gate = geom.Point(bb_gate_coords)
Feature(geometry=bb_gate)
tpl = jinja2.Template(open('./gmap.tpl', 'r').read())
div_id = 'map-canvas-{}'.format("1")
extra_css = "#{div_id} {{height: 600px; width:100%; margin:0; padding:0;}}".format(div_id=div_id)
center, zoom = {"lng":13.45525234770495, "lat":52.51148144124993}, 12
spree_c, sat_c, bb_c = '#19B5FE', '#F39C12', '#2ECC71'
feat_spree = Feature(geometry=spree, properties={'title':"River Spree", 'strokeColor': spree_c, 'strokeWeight':5, 'strokeOpacity':0.9})
feat_sat = Feature(geometry=satellite_path, properties={'title': "Satellite Path", 'strokeColor': sat_c, 'strokeWeight':5, 'strokeOpacity':0.9})
feat_bb_gate = Feature(geometry=bb_gate, properties={'strokeColor': bb_c, 'title':'Brandenburg Gate'})
tpl_html = tpl.render(div_id=div_id, suffix=div_id.replace('-', '_'), center=center, zoom=zoom, extra_css=extra_css,
feature_collection=FeatureCollection([feat_spree, feat_sat, feat_bb_gate]))
display(HTML(tpl_html))
The probability distribution around the Brandenburg Gate is lognormal in every direction. A lognormal distribution is a continuous probability distribution of a random variable (mean $\mu$ and variance $\sigma$) whose log is normally distributed. The mean of a lognormal is given by $\exp(\mu + \sigma^{2}/2)$ and the mode by $\exp(\mu - \sigma^{2})$.
Knowing the mean($4700m$) and mode($3877m$), we can solve for $\mu$ and $\sigma$:
$$ \begin{eqnarray} \sigma &=& \sqrt{\frac{2}{3} \left( \log(mean) - \log(mode)\right)} &=& 0.358237211939\\ \mu &=& \frac{2 \log(mean) + \log(mode)}{3} &=& 8.39115083769 \end{eqnarray} $$bbgate_mean, bbgate_mode = 4700, 3877
bbgate_mu = (2.0*np.log(bbgate_mean) + np.log(bbgate_mode))/3.0
bbgate_sigma = np.sqrt((2.0/3.0)*(np.log(bbgate_mean) - np.log(bbgate_mode)))
print bbgate_mu, bbgate_sigma
For a normal distribution, the probability that an observation is within a given number of standard deviations is
$$ \mu \pm x \sigma = {\mathrm{erf}}\left(\frac{x}{\sqrt{2}}\right) $$and the commonly used 'two-standard deviations' is almost $95\%$ likely:
$$P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) = {\mathrm{erf}}(2/\sqrt{2}) = 0.95449973610364158$$With exactly $95\%$ probability we are within almost $1.96$ standard deviations:
$$\sqrt(2) {\mathrm{erfinv}}(0.95) = 1.95996398454$$and so we can find the standard deviation of the normal probability distributions given the $95\%$ confidence interval:
$$\sigma = \frac{x_{95\%} } {\sqrt{2} {\mathrm{erfinv}}(0.95)}$$spree_sigma = 2730 / (special.erfinv(0.95) * np.sqrt(2))
sat_sigma = 2400 / (special.erfinv(0.95) * np.sqrt(2))
print spree_sigma, sat_sigma
And now we have all the parameters we need:
$\mu_{BBGate} = 8.39115083769$ $\sigma_{BBGate} = 0.358237211939$
$\sigma_{Spree} = 1392.8827374$
$\sigma_{Sat} = 1224.51229662$
norm_spree = stats.norm(0, spree_sigma)
norm_sat = stats.norm(0, sat_sigma)
lognorm_bbgate = stats.lognorm(bbgate_sigma, scale=np.exp(bbgate_mu))
v = np.linspace(0, 10000, 100)
lw=3
plt.figure(figsize=(12,6))
plt.plot(v, norm_spree.pdf(v), alpha=0.8, color='#26A65B', linewidth=lw, label='Spree')
plt.plot(v, norm_sat.pdf(v), alpha=0.8, color='#446CB3', linewidth=lw, label='Satellite path')
plt.plot(v, lognorm_bbgate.pdf(v), alpha=0.8, color='#D35400', linewidth=lw, label='Brandenburg Gate')
plt.ylabel('Probability')
plt.xlabel('distance (m)')
_ = plt.legend()
The coordinates we are given in the problem are standard latitude, longitude geographic coordinates (EPGSG:4326), on an oblate spheroid with major radius $R=6378137m$ at the equator. Finding the distances between points in geographic coordinates involves spherical distance calculations (great-circle or Vincenty), which we would prefer to avoid. It is convenient to project the points into an orthogonal coordinate system, for which EPSG:3068 looks suitable. In this coordinate system, the distance unit is metres and our standard euclidean distance routines will provide good approximations to the spherical calculations (as long as the distances are fairly small).
For handling the geographic coordinates we will use Shapely which is based on the geos library. Shapely does not do coordinate system transformation and all operations assume the features are in the same Cartesian plane. The coordinate system transformations are handled by pyproj.
# create a transform function using pyproj Projections
p_4326, p_3068 = pyproj.Proj(init='epsg:4326'), pyproj.Proj(init='epsg:3068')
p_4326_3068 = partial(pyproj.transform, p_4326, p_3068)
p_3068_4326 = partial(pyproj.transform, p_3068, p_4326)
bb_gate_t = ops.transform(p_4326_3068, bb_gate)
spree_t = ops.transform(p_4326_3068, spree)
satellite_path_t = ops.transform(p_4326_3068, satellite_path)
# Here we compute the distance by geodetic or great circle methods (using pyproj) in epsg:4326
# and compare with the cartesian distance in epsg:3068
# two points in Berlin
p1 = geom.base.geom_from_wkt('POINT (13.37927947157996 52.51928993794976)')
p2 = geom.base.geom_from_wkt('POINT (13.274099 52.529198)')
print 'EPSG:4326 spherical distance |p2-p1|', geod.inv(p1.x, p1.y, p2.x, p2.y)[2]
print 'EPSG:3068 cartesian distance |p2-p1|', ops.transform(p_4326_3068, p1).distance(ops.transform(p_4326_3068, p2))
We see that cartesian distances in EPSG:3068
are a very good approximation to the spherical distances, at least for areas covering Berlin, and we can safely use the EPSG:3068
projection. In the challenge statement, Zalando provide a simple transformation to an orthogonal coordinate system, which should be a good approximation to EPSG:3068
- because of the convenience of the pyproj transformations I will use them exclusively.
This is how we compute the joint probability distribution:
def compute_prob(p, origin_geom=None, prob_dist=None):
return prob_dist.pdf(p.distance(origin_geom))
def prob(x,y):
p = geom.Point(x, y)
p_bb_gate = compute_prob(p, origin_geom=bb_gate_t, prob_dist=lognorm_bbgate)
p_sat = compute_prob(p, origin_geom=satellite_path_t, prob_dist=norm_sat)
p_spree = compute_prob(p, origin_geom=spree_t, prob_dist=norm_spree)
return (p_bb_gate * p_spree * p_sat)
Let's make a grid and take a look at the individual probability distributions:
N=100
(minx, miny, maxx, maxy) = spree_t.bounds
xs, ys = np.linspace(minx, maxx, N), np.linspace(miny, maxy, N)
xg, yg = np.meshgrid(xs, ys)
fig = plt.figure(figsize=(18,12))
ax = fig.add_subplot(111,projection="3d")
alpha=0.4
offset = 0
cmap = sns.cubehelix_palette(12, start=0.5, rot=-.65, as_cmap=True)
for (g, prob_dist, color) in [(satellite_path_t, norm_sat, sat_c), (spree_t, norm_spree, spree_c), (bb_gate_t, lognorm_bbgate, 'black')]:
zg_geom = np.array([compute_prob(geom.Point(x,y), origin_geom=g, prob_dist=prob_dist) for x,y in zip(np.ravel(xg), np.ravel(yg))])
zg_geom = zg_geom.reshape(xg.shape)
ax.plot_surface(xg, yg, zg_geom, rstride=1, cstride=1, alpha=alpha, cmap=cmap,lw=0.2, edgecolor=color, antialiased=True)
plot_geom(spree_t, ax=ax, color=spree_c, linewidth=3)
plot_geom(satellite_path_t, ax=ax, color=sat_c, linewidth=3)
plot_geom(bb_gate_t, ax=ax, size=200, color='black')
ax.view_init(elev=28)
ax.azim = 360-90+10
ax.set_xlim3d(xmax=35000)
#_ = ax.set_zlim(zmin=offset)
_ = ax.set_xlabel('x')
_ = ax.set_ylabel('y')
_ = plt.title('the individual probability distributions')
#_ = plt.tight_layout()
If we look at the this on a map we see there is only a small area where all 3 probability distributions have a significant overlap. The polygons below extend out to the $95\%$ confidence interval of each distribution.
# 95% confidence intervals
level = 0.975
dilated_spree = spree_t.buffer(norm_spree.ppf(level), cap_style=3, resolution=32)
dilated_sat = satellite_path_t.buffer(norm_sat.ppf(level), cap_style=3, resolution=32)
dilated_bb_gate = bb_gate_t.buffer(lognorm_bbgate.ppf(level), cap_style=1, resolution=32)
overlap = dilated_spree.intersection(dilated_sat).intersection(dilated_bb_gate)
feat_overlap = Feature(geometry=ops.transform(p_3068_4326, overlap), properties={'strokeOpacity': 0.7, 'strokeColor': '#F1A9A0', 'fillColor': '#F1A9A0', 'zIndex':10, 'fillOpacity':0.4, 'title': 'Area of Overlap 95% confidence'})
feat_dilated_spree = Feature(geometry=ops.transform(p_3068_4326, dilated_spree), properties={'strokeOpacity': 0.7, 'strokeColor': spree_c, 'fillColor':spree_c, 'fillOpacity':0.05, 'zIndex':9})
feat_dilated_sat = Feature(geometry=ops.transform(p_3068_4326, dilated_sat), properties={'strokeOpacity': 0.7, 'strokeColor': sat_c, 'fillColor': sat_c, 'zIndex':9, 'fillOpacity':0.05})
feat_dilated_bbgate = Feature(geometry=ops.transform(p_3068_4326, dilated_bb_gate), properties={'strokeOpacity': 0.7, 'strokeColor': bb_c, 'fillColor':bb_c, 'zIndex':9, 'fillOpacity':0.05})
tpl = jinja2.Template(open('./gmap.tpl', 'r').read())
div_id = 'map-canvas-{}'.format("2")
extra_css = "#{div_id} {{height: 600px; width:100%; margin:0; padding:0;}}".format(div_id=div_id)
center, zoom = {"lng":13.45525234770495, "lat":52.51148144124993}, 11
tpl_html = tpl.render(div_id=div_id, suffix=div_id.replace('-', '_'), center=center, zoom=zoom, extra_css=extra_css,
feature_collection=FeatureCollection([feat_overlap,
feat_dilated_spree,
feat_dilated_sat, feat_dilated_bbgate]))
display(HTML(tpl_html))
What does the joint probability look like?
def draw_prob_3d(alpha_surface=0.4):
fig = plt.figure(figsize=(18,12))
ax = fig.gca(projection='3d')
ax.view_init(elev=18)
offset = 0
zg = np.array([(prob(x,y)) for x,y in zip(np.ravel(xg), np.ravel(yg))])
zg = zg.reshape(xg.shape)
ax.plot_surface(xg, yg, zg, rstride=1, cstride=1, alpha=alpha_surface, cmap=cmap,lw=0.,
edgecolor='#ABB7B7', antialiased=True)
cset = ax.contour(xs, ys, zg, zdir='z', linewidths=(2,), alpha=0.95, offset=offset, cmap=cmap)
ax.set_zlim(zmin=offset)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.azim = 270
plot_geom(spree_t, ax=ax, linewidth=3, color=spree_c, offset=offset)
plot_geom(satellite_path_t, ax=ax, linewidth=3, color=sat_c, offset=offset)
#plot_geom(geom.Point(p_opt_t), ax=ax, color='red', size=100)
_ = ax.set_xlim3d(24000, maxx)
_ = ax.set_ylim3d(16000, maxy)
_ = plt.title("Joint probability")
return ax
ax = draw_prob_3d()
It looks like we have only 1 candidate for the most likely position. Since we already have a grid its straighforward to find the maximum. There are a couple of options in scipy
- the grid search method is likely to be slow, but very reliable: scipy.optimize.brute
. If we are sure there is only one maximum, we could choose a starting point (close enough to the position of the maximum) and call scipy.optimize.fmin
. The scipy.optimize.differential_evolution
global optimisation routine seems to do a very good job at finding the most likely position given the full bounds and this method is fast too, converging after $9$ iterations in $170ms$. We explicity call fmin
to polish
off this approximate result.
%%time
res = opt.differential_evolution(lambda x: -prob(x[0], x[1]), [(minx,maxx), (miny,maxy)],
maxiter=1000, polish=False)
print res
print ops.transform(p_3068_4326, geom.Point(res.x))
p_opt_t = opt.fmin(lambda x: -prob(x[0], x[1]), res.x, maxiter=1000, ftol=1e-10)
print p_opt_t
p_opt = ops.transform(p_3068_4326, geom.Point(p_opt_t[0], p_opt_t[1]))
print p_opt
ax = draw_prob_3d(alpha_surface=0.25)
plot_geom(geom.Point(p_opt_t), ax=ax, color='red', size=150)
The position where the probability is a maximum is found to be (red dot on above plot):
$(x,y)_{P_{max}} = (13.4552523466314, 52.51148144154961)$
Reverse geocode the coordinates to find the address:
geolocator = Nominatim()
location = geolocator.reverse("{lat}, {lng}".format(lat=p_opt.y, lng=p_opt.x))
print(location.address)
green_dot = "http://maps.google.com/mapfiles/ms/icons/green-dot.png"
fc = [Feature(geometry=p_opt, properties={'title': "You'll find me here"})]
for (p, title) in [ ((13.446692,52.505535), "Zalando SETamara-Danz-Str. 1, 10243 Berlin"),
((13.41545,52.526460), "Zalando SE Mollstraße 1, D-10178 Berlin"),
((13.471004,52.506680), "Zalando SE neue bahnhofstraße 11-17, berlin"),
((13.46323,52.487550), "Zalando SE Am Treptower Park 28 - 30, 12435 Berlin"),
((13.43297,52.505580), "Zalando SE Köpenicker Straße 20, D-10997 Berlin"),
((13.471004,52.506680), "Zalando Customer Care neue bahnhofstraße 11-17, berlin"),
((13.436794,52.508725), "Zalando Content Creation Straße der Pariser Kommune 8 10247 Berlin")]:
fc.append(Feature(geometry=geom.Point(p), properties={'title':title,
'icon':green_dot}))
tpl = jinja2.Template(open('./gmap.tpl', 'r').read())
div_id = 'map-canvas-{}'.format("3")
extra_css = "#{div_id} {{height: 600px; width:100%; margin:0; padding:0;}}".format(div_id=div_id)
center, zoom = {"lng":13.45525234770495, "lat":52.51148144124993}, 14
tpl_html = tpl.render(div_id=div_id, suffix=div_id.replace('-', '_'), center=center, zoom=zoom, extra_css=extra_css,
feature_collection=FeatureCollection(fc))
display(HTML(tpl_html))
There are several Zalando offices in that small area of Berlin, so we have to guess the analyst they are looking for is not too far away. You might want to re-run the notebook using Zalando's coordinate transformation- I expect it will lead to a slightly different location (but not too far away from this one).
Wrap up¶
There were a few things that didn't work out when I was putting this notebook together. I couldn't get Cartopy
to work on OSX
. Cartopy
allows you to use matplotlib on a base map layer(like google tiles). This would be nice for static plots of the features- but thee is no interactivity. Folium allows you to create interactive maps and layer data on top of them, but it didn't work for me in the notebook. Mplleaflet tries to do matplotlib plotting on a live web map in the notebook, but it didn't work for me either. Since google maps (v3) understands geojson
objects, I created a simple template and used the google maps API to render the geojson, but I will keep an eye on Mplleaflet
which should be a better solution.
#ignore %watermark -unz -a 'Aonghus Lawlor' -p numpy,scipy,shapely,pyproj