By Ross Lawley, MongoEngine maintainer and Scala Engineer at 10gen
Earlier in the year I gave a talk at MongoDB London about the different aggregation options with MongoDB. The topic recently came up again in conversation at a user group, so I thought it deserved a blog post.
I wanted to give a more interesting aggregation talk than the standard “counting words in text”, and as the aggregation framework gained shiny 2dsphere geo support in 2.4, I figured I’d use that. I just needed a topic…
Two things immediately sprang to mind: weather and beer.
I opted to focus on something close to my heart: beer :) But what to aggregate about beer? Then I remembered an old pub quiz favourite…
What is the most popular pub name in the UK?
I knew there was some great open data, including a wealth of information on pubs, available from the awesome OpenStreetMap project. I just needed to get at it, and happily the Overpass API provides a simple “xapi” interface for OSM data. All I needed was anything tagged with amenity=pub within the bounds of the UK, and with the xapi interface this is as simple as a wget:
http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
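If you would rather build that request in Python than type it by hand, a minimal sketch could look like the following. The helper name `xapi_pub_url` is made up for illustration and is not part of the loading script; the bounding box is given in the order west, south, east, north.

```python
def xapi_pub_url(west, south, east, north):
    """Build an Overpass xapi URL for everything tagged amenity=pub in a bounding box.

    Hypothetical helper: the bbox order is west, south, east, north.
    """
    bbox = ",".join(str(value) for value in (west, south, east, north))
    return "http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=" + bbox + "]"


# The same bounding box as the wget URL above.
UK_PUBS_URL = xapi_pub_url(-10.5, 49.78, 1.78, 59)

# To download the data you could then call, for example:
#   urllib.request.urlretrieve(UK_PUBS_URL, "pubs.osm")
# ("pubs.osm" is just an example filename.)
```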
Once I had an OSM file I used the imposm Python library to parse the XML and then converted each pub into a document with a GeoJSON location, in the following format:
{
  "_id" : 451152,
  "amenity" : "pub",
  "name" : "The Dignity",
  "addr:housenumber" : "363",
  "addr:street" : "Regents Park Road",
  "addr:city" : "London",
  "addr:postcode" : "N3 1DH",
  "toilets" : "yes",
  "toilets:access" : "customers",
  "location" : {
    "type" : "Point",
    "coordinates" : [-0.1945732, 51.6008172]
  }
}
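To make the shape of that document concrete, here is a rough sketch of the conversion step. It is not the code from osm2mongo.py: the function name `node_to_document` and its arguments are assumptions, standing in for whatever the parser hands you for each OSM node (an id, a dict of tags and a longitude/latitude pair).

```python
def node_to_document(osm_id, tags, lon, lat):
    """Turn one parsed OSM node into a MongoDB-ready document.

    Hypothetical helper: the OSM id becomes _id, the tags are copied
    as-is, and the position is stored as a GeoJSON Point.
    """
    doc = {"_id": osm_id}
    doc.update(tags)
    # GeoJSON coordinates are always [longitude, latitude], in that order.
    doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
    return doc


example = node_to_document(
    451152,
    {"amenity": "pub", "name": "The Dignity", "addr:city": "London"},
    -0.1945732,
    51.6008172,
)
```

Note the coordinate order: longitude first, then latitude, which is the opposite of how most people say it out loud.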
Then it was a case of simply inserting it as a document into MongoDB. I quickly noticed that the data needed a little cleaning, as I was seeing duplicate pub names, for example: “The Red Lion” and “Red Lion”. Because I wanted to make a wordle I normalised all the pub names.
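As an illustration of the kind of cleaning involved, the sketch below collapses stray whitespace and drops a leading “The”, so that “The Red Lion” and “Red Lion” end up as the same name. This is an assumption about what a reasonable normalisation looks like, not the exact rules in the loading script, and `normalise_name` is a made-up name.

```python
def normalise_name(name):
    """Normalise a pub name so near-duplicates group together.

    Illustrative rules only: collapse whitespace and drop a leading
    "The" (any case), unless that is the whole name.
    """
    words = name.split()
    if len(words) > 1 and words[0].lower() == "the":
        words = words[1:]
    return " ".join(words)
```

With this, `normalise_name("The Red Lion")` and `normalise_name("Red Lion")` both give `"Red Lion"`.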
If you want to know more about the importing process, the full loading code is available on github: osm2mongo.py
It turns out finding the most popular pub names is very simple with the aggregation framework. Just group by the name and then sum up all the occurrences. To get the top five most popular pub names we sort by the summed value and then limit to 5:
db.pubs.aggregate([
  {"$group": {"_id": "$name", "value": {"$sum": 1}}},
  {"$sort": {"value": -1}},
  {"$limit": 5}
]);
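If the pipeline stages feel abstract, this is roughly what they compute, written as plain Python over a list of documents held in memory. It is only an emulation to show the logic and does not talk to MongoDB; `top_pub_names` and the sample data are invented for the example.

```python
from collections import Counter


def top_pub_names(pubs, limit=5):
    """In-memory equivalent of the $group / $sort / $limit pipeline."""
    # $group with {"$sum": 1}: count the documents per name
    # (documents without a name are grouped under None).
    counts = Counter(pub.get("name") for pub in pubs)
    # $sort by value descending, then $limit.
    return [{"_id": name, "value": value} for name, value in counts.most_common(limit)]


# Made-up sample data, not real results.
sample = [
    {"name": "Red Lion"},
    {"name": "Crown"},
    {"name": "Red Lion"},
    {"name": "Royal Oak"},
    {"name": "Red Lion"},
    {"name": "Crown"},
]
top_two = top_pub_names(sample, limit=2)
```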
For the whole of the UK this returns:
At MongoDB London I thought that was too easy, so I filtered to find the top pub names near the conference, showing off some of the geo functionality that became available in MongoDB 2.4. To limit the result set, add a $match stage that ensures the location is within a 2 mile radius by using $centerSphere. Just provide the coordinates [ <long>, <lat> ] and a radius of roughly 2 miles expressed in radians (3959 is approximately the radius of the earth in miles, so divide 2 by it):
db.pubs.aggregate([
  { "$match" : { "location": { "$within": { "$centerSphere": [[-0.12, 51.516], 2 / 3959] }}} },
  { "$group" : { "_id" : "$name", "value" : { "$sum" : 1 } } },
  { "$sort" : { "value" : -1 } },
  { "$limit" : 5 }
]);
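The 2 / 3959 in that query is just a distance in miles converted to radians. The sketch below spells out the arithmetic and uses the haversine formula to check by hand whether a point falls inside the circle. It is a pure-Python approximation on a spherical earth for illustration, not how MongoDB implements $centerSphere, and the function names are my own.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # approximate mean radius of the earth


def miles_to_radians(miles):
    """Convert a distance on the earth's surface to an angle in radians."""
    return miles / EARTH_RADIUS_MILES


def angular_distance(lon1, lat1, lon2, lat2):
    """Great-circle angle between two points in radians (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * math.asin(math.sqrt(a))


def within_miles(point, centre, miles):
    """True if point lies within the given radius of centre; both are [long, lat]."""
    angle = angular_distance(point[0], point[1], centre[0], centre[1])
    return angle <= miles_to_radians(miles)


CONFERENCE = [-0.12, 51.516]   # the centre used in the query above
THE_DIGNITY = [-0.1945732, 51.6008172]  # the example pub document from earlier
```

For example, `within_miles(THE_DIGNITY, CONFERENCE, 2)` is false: that pub is several miles north of the centre, so it would not be counted by the filtered query.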
At the conference I looked at the most popular pub name near the venue. That’s great if you happen to live in the centre of London, but what about everyone else in the UK? So for this blog post I decided to update the demo code and make it dynamic based on where you live.
See: pubnames.rosslawley.co.uk
Apologies to those outside the UK - the demo app doesn’t have data for the whole world, though it’s surely possible to do.
All the code is available in my repo on github including the bson file of the pubs and the wordle code - so fork it and start playing with MongoDB’s great geo features!
Original article: The Most Popular Pub Names. Thanks to the original author for sharing.