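This code is useful for anyone who wants to match a portfolio that has geographic data against any other geography in order to calculate the travel time and distance between two points. It was inspired by a task I was assigned at work: helping a funder understand how close approved projects were to one another after an enquiry about the geographic spread of applicants.
This article walks through how to use API calls together with built-in and custom functions to match a list of charities (Point A) to their nearest train station (Point B), and to calculate the distance in miles and the driving time in minutes.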
Other use cases include, for example:
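Packages: numpy, pandas, requests, json, haversine
Resources used in this article: the postcodes.io API, Project OSRM, the charity list (charity_list.xlsx) and the UK train stations dataset (uk-train-stations.csv)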
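The steps discussed here may look intricate, but the end result is a template that you can reuse and reformat to suit your needs whenever you have to calculate the geographic distance between Point A and Point B across many rows of data.
For example, suppose you are working with 100 charities. You want to know how close these charities are to nearby train stations as part of a wider analysis of their geographic locations. You might want to map this data visually, or use it as a starting point for further analysis, such as looking at how accessible a charity is to people travelling from further away.
Whatever the use case, if you wanted to do this manually, the steps would be as follows:
This might work for a handful of charities, but after a while the process becomes time-consuming, tedious and prone to human error.
By using Python for this task, we can automate these steps. With only a few additions required from the user, all that is left is to run our code at the end.
Let's break the task down into steps. The steps we need are as follows:
To complete Step 1, we will use Python to:
1 - Import packages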
    # data manipulation
    import numpy as np
    import pandas as pd

    # http requests
    import requests

    # handling json
    import json

    # calculating distances
    import haversine as hs
    from haversine import haversine, Unit
2 - Import and clean the data

    # import as a pandas dataframe, specifying which columns to import
    charities = pd.read_excel('charity_list.xlsx', usecols='A, C, E')
    stations = pd.read_csv('uk-train-stations.csv', usecols=[1,2,3])

    # renaming stations columns for ease of use
    stations = stations.rename(columns={'station_name':'Station Name','latitude':'Station Latitude', 'longitude':'Station Longitude'})
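The variable holding the charity dataset (named 'charities') will be our main dataframe, which we will use when merging with the data we extract.
Our next step is to create a function for extracting the longitude and latitude of each charity postcode.

3 - Convert the postcodes to a list for the matching function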
    charities_pc = charities['Charity Postcode'].tolist()

4 - Create a function that takes postcodes, makes a request to postcodes.io, records the latitude and longitude, and returns the data ready to be loaded into a new dataframe
For more information, consult the postcodes.io documentation.

    def bulk_pc_lookup(postcodes):
        # set up the api request
        url = "https://api.postcodes.io/postcodes"
        headers = {"Content-Type": "application/json"}

        # specify our input data and response, specifying that we are working with data in json format
        data = {"postcodes": postcodes}
        response = requests.post(url, headers=headers, data=json.dumps(data))

        # specify the information we want to extract from the api response
        if response.status_code == 200:
            results = response.json()["result"]
            postcode_data = []

            for result in results:
                postcode = result["query"]
                if result["result"] is not None:
                    latitude = result["result"]["latitude"]
                    longitude = result["result"]["longitude"]
                    postcode_data.append({"Charity Postcode": postcode, "Latitude": latitude, "Longitude": longitude})

            return postcode_data
        else:
            # setting up a fail safe to capture any errors or results not found
            print(f"Error: {response.status_code}")
            return []

5 - Pass our list of charity postcodes into the function to extract the required results
    # specify where the postcodes are
    postcodes = charities_pc

    # save the results of the function as output
    output = bulk_pc_lookup(postcodes)

    # convert the results to a pandas dataframe
    output_df = pd.DataFrame(output)
    output_df.head()

Please note:
- If your Point B data (in this case, the UK rail stations) does not already contain latitude and longitude, you will also need to perform steps 3 to 5 on the Point B data.
- postcodes.io allows bulk lookup requests for up to 100 postcodes at a time. If your dataset contains more than 100 postcodes, you will need to either manually create new Excel sheets containing only 100 rows per sheet, or write a function to break your dataset into batches of the required length for the API call.
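For the second note, a batching helper could look roughly like the sketch below. The names chunk_postcodes and lookup_all are hypothetical (they are not part of the original code), and the lookup function is passed in as an argument so that the batching logic can be tried without calling the API.

```python
from typing import Callable, Iterator, List

BATCH_SIZE = 100  # bulk lookup limit quoted in the note above


def chunk_postcodes(postcodes: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield successive slices of `postcodes`, each at most `size` items long.

    Hypothetical helper, not part of the original article's code.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(postcodes), size):
        yield postcodes[start:start + size]


def lookup_all(postcodes: List[str], lookup: Callable[[List[str]], list],
               size: int = BATCH_SIZE) -> list:
    """Call `lookup` once per batch and combine the per-batch result lists.

    `lookup` is assumed to take a list of postcodes and return a list,
    which is how bulk_pc_lookup behaves.
    """
    combined = []
    for batch in chunk_postcodes(postcodes, size):
        combined.extend(lookup(batch))
    return combined
```

Under these assumptions, step 5 would become output = lookup_all(charities_pc, bulk_pc_lookup), with the rest of the step unchanged.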
6 - We can now either merge our output_df with our original charity dataset or, to leave our original data untouched, create a new dataframe holding our extracted results, which we will use for the rest of the project

    charities_output = pd.merge(charities, output_df, on="Charity Postcode")
    charities_output.head()

Step 1 Complete
We now have two dataframes which we will use for the next steps:
- Our original stations dataframe containing the latitude and longitude of the UK train stations
- Our new charities_output dataframe containing the original charity information and the new latitude and longitude information extracted from our API call
Step 2 - Calculate the distance between Point A (charity) and Point B (train station), and record the nearest result for Point A
In this section, we will be using the haversine distance formula to:
- check the distance between a charity and every UK train station
- match the nearest result i.e. the UK train station with the minimum distance from our charity
- loop over our charities dataset to find the nearest match for each row
- record our results in a dataframe
Please note, for further information on using the haversine module, consult the documentation
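To make it clearer what the haversine module calculates, here is a minimal pure-Python sketch of the haversine formula. It is for illustration only: the function name haversine_miles and the Earth radius constant are assumptions of this sketch, and the library may use a slightly different radius, so results can differ in the later decimal places.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumed value)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees.

    Illustrative helper only; the article itself uses the haversine package.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    # haversine of the central angle between the two points
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)

    # convert the central angle to a distance along the sphere
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

The result is a straight-line distance over the Earth's surface, which is why Step 3 adds driving time separately.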
1 - create a function for calculating the distance between Point A and Point B
    def calc_distance(lat1, lon1, lat2, lon2):
        # specify data for location one, i.e. Point A
        loc1 = (lat1, lon1)

        # specify the data for location two, i.e. Point B
        loc2 = (lat2, lon2)

        # calculate the distance and specify the units as miles
        dist = haversine(loc1, loc2, unit=Unit.MILES)
        return dist

2 - create a loop that calculates the distance between Point A and every row in Point B, and match the result where Point B is nearest to Point A
    # create an empty dictionary to store the results
    results = {}

    # begin with looping over the dataset containing the data for Point A
    for index1, row1 in charities_output.iterrows():
        # specify the location of our data
        charity_name = row1['Charity Name']
        lat1 = row1['Latitude']
        lon1 = row1['Longitude']

        # track the minimum distance between Point A and every row of Point B
        min_dist = float('inf')
        # as the minimum distance i.e. nearest Point B is not yet known, create an empty string for storage
        min_station = ''

        # loop over the dataset containing the data for Point B
        for index2, row2 in stations.iterrows():
            # specify the location of our data
            lat2 = row2['Station Latitude']
            lon2 = row2['Station Longitude']

            # use our previously created distance function to calculate the distance
            dist = calc_distance(lat1, lon1, lat2, lon2)

            # check each distance - if it is lower than the last, this is the new low. this will repeat until the lowest distance is found
            if dist < min_dist:
                min_dist = dist
                min_station = row2['Station Name']

        results[charity_name] = {'Nearest Station': min_station, 'Distance (Miles)': min_dist}

    # convert the results dictionary into a dataframe
    res = pd.DataFrame.from_dict(results, orient="index")
    res.head()

3 - merge our new information with our charities_output dataframe
    # as our dataframe output has used our charities as an index, we need to re-add it as a column
    res['Charity Name'] = res.index

    # merging with our existing output dataframe
    charities_output = charities_output.merge(res, on="Charity Name")
    charities_output.head()

Step 2 Complete
We now have all our information in one place, charities_output, containing:
- Our charity information
- The nearest station to each charity
- The distance in miles
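As an optional variation on Step 2, the running-minimum bookkeeping in the inner loop can be handed to Python's built-in min. The sketch below is one possible way to do that, not the approach used above: nearest_station is a hypothetical name, stations are assumed to be supplied as (name, latitude, longitude) tuples, and the distance function is passed in as an argument.

```python
from typing import Callable, Iterable, Tuple

Station = Tuple[str, float, float]  # (name, latitude, longitude), an assumed layout


def nearest_station(lat: float, lon: float, stations: Iterable[Station],
                    distance: Callable[[float, float, float, float], float]):
    """Return (station name, distance) for the station closest to (lat, lon).

    Hypothetical helper. `distance` takes (lat1, lon1, lat2, lon2), like
    calc_distance above. min() raises ValueError if `stations` is empty.
    """
    name, s_lat, s_lon = min(stations, key=lambda s: distance(lat, lon, s[1], s[2]))
    return name, distance(lat, lon, s_lat, s_lon)
```

With the dataframes from this article, the station tuples could be built once with list(stations[['Station Name', 'Station Latitude', 'Station Longitude']].itertuples(index=False, name=None)) and the helper called per charity as nearest_station(lat1, lon1, station_tuples, calc_distance). Every charity is still compared against every station, so this tidies the code rather than speeding it up.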
Step 3 - Calculate the driving time for travel
Our final step uses Project OSRM to find the driving time between each of our charities and its nearest station. This is helpful because distance in miles does not always reflect how long a journey takes: in a city like London, for example, a 1 mile journey might take as long as a 5 mile journey in a rural area.
To prepare for this step, we must have one dataframe containing the following information:
- charity information: name, longitude, latitude, nearest station, distance in miles
- station information: name, longitude, latitude
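Before building that dataframe, it may help to see the shape of the Project OSRM request and response that these columns feed into. The sketch below builds the route URL (note that OSRM expects longitude before latitude) and reads the duration from a response. The helper names build_route_url and drive_minutes are hypothetical, and the sample payload is hand-written for illustration rather than a real API response.

```python
OSRM_ROUTE_URL = (
    "http://router.project-osrm.org/route/v1/driving/"
    "{lon1},{lat1};{lon2},{lat2}"  # longitude comes first in each pair
)


def build_route_url(lat1, lon1, lat2, lon2):
    """Return the route URL for a journey from point 1 to point 2 (hypothetical helper)."""
    return OSRM_ROUTE_URL.format(lat1=lat1, lon1=lon1, lat2=lat2, lon2=lon2)


def drive_minutes(payload):
    """Return driving time in whole minutes, or None if no route was returned.

    Assumes a successful response has "code" equal to "Ok" and a "routes"
    list whose entries carry "duration" in seconds.
    """
    if payload.get("code") != "Ok" or not payload.get("routes"):
        return None
    return round(payload["routes"][0]["duration"] / 60)


# hand-written sample payload, not real API output
sample = {"code": "Ok", "routes": [{"duration": 754.3, "distance": 6120.5}]}
print(build_route_url(51.5, -0.12, 51.53, -0.13))
print(drive_minutes(sample))
```

Returning None for a missing route is a design choice of this sketch: it lets one failed lookup show up as a missing value in the results instead of stopping the whole run with an error.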
1- create a data frame with the above information
    drive_time_df = pd.merge(charities_output, stations, left_on='Nearest Station', right_on='Station Name')
    drive_time_df = drive_time_df.drop(columns=['Station Name'])
    drive_time_df.head()

2 - now that our dataframe is ready, we can set up our function for calculating drive time using Project OSRM
Please note: for further information, consult the Project OSRM documentation.

    url = "http://router.project-osrm.org/route/v1/driving/{lon1},{lat1};{lon2},{lat2}"

    # function
    def calc_driveTime(row):
        # extract lat and lon
        lat1, lon1 = row['Latitude'], row['Longitude']
        lat2, lon2 = row['Station Latitude'], row['Station Longitude']

        # request
        response = requests.get(url.format(lat1=lat1, lon1=lon1, lat2=lat2, lon2=lon2))

        # parse response
        data = json.loads(response.content)

        # drive time in seconds
        drive_time_sec = data["routes"][0]["duration"]

        # convert to minutes
        drive_time = round((drive_time_sec) / 60, 0)
        return drive_time

3 - pass our data into our new function to calculate driving time in minutes
    # apply the above function to our dataframe
    driving_time_res = drive_time_df.apply(calc_driveTime, axis=1)

    # add dataframe results as a new column
    drive_time_df['Driving Time (Minutes)'] = driving_time_res
    drive_time_df.head()

Step 3 Complete
We now have all our desired information in one compact dataframe. For layout purposes, and depending on what we want to do next with our data, we can create one final dataframe as output, containing the following information:
- Charity Name
- Nearest Station
- Distance (Miles)
- Driving Time (Minutes)
    final_output = drive_time_df.drop(columns=['Charity Number', 'Charity Postcode', 'Latitude', 'Longitude', 'Station Latitude', 'Station Longitude'])
    final_output.head()

Thank you for reading! I hope this was helpful. Please check out my website if you are interested in my work.