A summary presentation by András Benczúr (head of the MTA Big Data - Momentum research group).
The MOL Bubi public bike-sharing system Analytics Challenge was organized by the "Big Data - Momentum" research group of the Hungarian Academy of Sciences (MTA SZTAKI) and the Centre for Budapest Transport (BKK) at the end of 2015.
The Municipality of Budapest decided in 2008 to establish a public bike-sharing scheme. The MOL Bubi system went live in September 2014 with 1100 bicycles and 76 docking stations. You can use MOL Bubi day and night, with annual, semi-annual, and quarterly passes or one-, three- and seven-day tickets.
In the challenge, you are given 5 months of data to predict usage patterns in three tasks: two for predictive analytics and one for open research. The data contains information about every MOL Bubi trip during this period, such as the source and destination stations and the start and end times. The data for some of the days is set aside for evaluation. Further information can be found in the data section, where we introduce the MOL Bubi dataset along with the format of the submission files.
Predict the busiest routes (pairs of source and destination docking stations) for the given evaluation days. You should submit the 100 routes with the highest predicted frequencies as a ranked list of (source, destination) pairs r_1, r_2, ... . The top list will be evaluated by NDCG using the following formula:
where traffic(r) is the number of bikes taking route r on the given evaluation day, and t_i is the i-th element of the list of traffic(r) values in decreasing order.
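The NDCG evaluation described above can be sketched in Python. This is only a minimal sketch: it assumes the common form NDCG = DCG / IDCG with DCG = sum over i of traffic(r_i) / log2(i + 1) and IDCG = sum over i of t_i / log2(i + 1); the exact discount used by the organizers is an assumption. The function name `ndcg_at_k` and the sample station IDs are hypothetical.

```python
import math

def ndcg_at_k(predicted_routes, traffic, k=100):
    """NDCG of a ranked route list against observed per-route traffic.

    predicted_routes: ranked list of (source, destination) pairs r_1, r_2, ...
    traffic: dict mapping (source, destination) -> number of trips that day
    The log2(rank + 1) discount is an assumption, not taken from the challenge.
    """
    # DCG: observed traffic of each predicted route, discounted by its rank.
    dcg = sum(
        traffic.get(route, 0) / math.log2(rank + 1)
        for rank, route in enumerate(predicted_routes[:k], start=1)
    )
    # Ideal DCG: the k largest traffic values t_1 >= t_2 >= ... in that order.
    ideal = sorted(traffic.values(), reverse=True)[:k]
    idcg = sum(t / math.log2(rank + 1) for rank, t in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up traffic counts for one evaluation day (station IDs are invented).
traffic = {("A", "B"): 5, ("B", "A"): 3, ("A", "C"): 1}
perfect = ndcg_at_k([("A", "B"), ("B", "A"), ("A", "C")], traffic)
swapped = ndcg_at_k([("B", "A"), ("A", "B"), ("A", "C")], traffic)
```

Here `perfect` scores the ideal ordering, while `swapped` exchanges the two busiest routes and therefore scores lower. The sketch does not check for duplicate routes in the submitted list.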
See the data section for the special rules concerning routes that span two days. The submission file format is also described there.
We calculate the demand of a docking station at a given time as the number of bikes taken away from the station minus the number of bikes docked at it since midnight. For each day, we reset the demand to 0 at midnight.
For each day in the evaluation set, submit a predicted value of the daily maximum demand for each docking station. Note that the maximum demand is always non-negative. Predictions will be evaluated by root-mean-squared error.
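As a small illustration of the evaluation metric, root-mean-squared error can be computed as below. This sketch assumes that the error is pooled over all (day, station) pairs in a single list; how the organizers aggregate the errors is not stated here. The function name `rmse` and the sample values are made up.

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error between two equally long lists of numbers."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up daily maximum demands for three (day, station) pairs.
score = rmse([3, 0, 5], [4, 0, 2])  # errors: -1, 0, 3
```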
Since entire routes are considered either for training or for evaluation, we exclude from the demand calculation the relatively rare events when a bike is taken away before midnight and returned after it. See the data section for the special rules concerning routes that span two days.
In this task, you may participate by submitting a very brief research plan. Goals may include, but are not limited to, visualization or the use of additional external data. You are expected to keep in regular contact with the organizers to share and discuss your findings.
Submit a very short research plan explaining how you want to use the data and what new insights you expect from the analysis. Give sketches of the visualizations and descriptions of any external data, including its availability, if applicable.
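The demand definition and the midnight rule can be sketched as follows. This is a minimal sketch, assuming trips are already parsed into `datetime` objects; the function name `daily_max_demand`, the tuple layout and the sample trips are hypothetical. When a return and a departure share the same timestamp, the sketch processes the return first, which is also an assumption.

```python
from collections import defaultdict
from datetime import datetime

def daily_max_demand(trips):
    """Return {(date, station): maximum demand}, where
    demand = bikes taken away - bikes docked, counted from midnight."""
    events = defaultdict(list)  # (date, station) -> [(time, +1 or -1)]
    for start, end, src, dst in trips:
        if start.date() != end.date():
            continue  # trips spanning midnight are excluded
        events[(start.date(), src)].append((start, +1))  # bike taken away
        events[(end.date(), dst)].append((end, -1))      # bike docked
    result = {}
    for key, station_events in events.items():
        demand = peak = 0  # demand starts at 0 each day, so the peak is >= 0
        # Sorting tuples puts -1 before +1 at identical timestamps.
        for _, delta in sorted(station_events):
            demand += delta
            peak = max(peak, demand)
        result[key] = peak
    return result

# Made-up trips: (start_time, end_time, start_location, end_location).
trips = [
    (datetime(2015, 4, 1, 8, 0), datetime(2015, 4, 1, 8, 20), "S1", "S2"),
    (datetime(2015, 4, 1, 8, 5), datetime(2015, 4, 1, 8, 30), "S1", "S2"),
    (datetime(2015, 4, 1, 9, 0), datetime(2015, 4, 1, 9, 10), "S2", "S1"),
    (datetime(2015, 4, 1, 23, 50), datetime(2015, 4, 2, 0, 10), "S1", "S2"),
]
demand = daily_max_demand(trips)
```

In this example, station S1 reaches a demand of 2 after two departures, while S2 only receives bikes before its single departure, so its maximum stays at 0. Stations with no trips on a day do not appear in the result; their maximum demand is 0 by definition.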
Research plans and solutions should be sent to bubiilab.sztaki.hu
We expect teams of B.Sc., M.Sc. or Ph.D. students of computer science, mathematics, physics, transportation, etc., but participation is open* to all individuals and teams. Student teams have priority for the cash prizes.
*Staff of the Centre for Budapest Transport and T-Systems Hungary, as well as members and 2015 summer interns of the SZTAKI Big Data - Momentum Research Group are excluded.
We reserve the right not to award prizes for one or more tasks; in that case, we will reallocate the prizes to other tasks.
The core of the dataset contains trips from 01/01/2015 to 31/05/2015.
Data from every second day of April and May is withheld for testing the 1st and 2nd challenge tasks.
For the remaining days, train.csv contains the following information for each trip:
bicycle_id- unique ID of the bicycle
start_time- start time of the trip
end_time- end time of the trip
start_location- source station ID
end_location- target station ID
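As an illustration of how these columns might be used, the sketch below counts route frequencies and returns the most frequent (source, destination) pairs, a naive starting point for the first task. It assumes that train.csv is comma-separated with a header row carrying the column names listed above; the timestamp format, the sample rows and the function name `top_routes` are made up. Station IDs are kept as strings.

```python
import csv
import io
from collections import Counter

def top_routes(csv_file, k=100):
    """Return the k most frequent (start_location, end_location) pairs."""
    reader = csv.DictReader(csv_file)
    counts = Counter(
        (row["start_location"], row["end_location"]) for row in reader
    )
    return [route for route, _ in counts.most_common(k)]

# Made-up sample in the assumed layout; real files may differ.
sample = io.StringIO(
    "bicycle_id,start_time,end_time,start_location,end_location\n"
    "1,2015-01-01 08:00:00,2015-01-01 08:10:00,0101,0202\n"
    "2,2015-01-01 09:00:00,2015-01-01 09:15:00,0101,0202\n"
    "1,2015-01-01 10:00:00,2015-01-01 10:12:00,0202,0101\n"
)
routes = top_routes(sample)
```

With a real file, the same function could be called as `top_routes(open("train.csv", newline=""))`, provided the header assumption holds.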
station_data.csv contains information about the docking stations. The corresponding columns are:
place_id- ID of the station (string, not integer!)
place_name- name of the station
lat- GPS latitude
lon- GPS longitude
num_of_rack- capacity (the number of bikes that the station can officially accept)
datetime_start- starting date of the period when the station was at this location
datetime_end- end date of the period when the station was at this location
For the entire period, weather_data.csv contains the following information at thirty-minute intervals:
time- date and time
tempm- temperature in °C
hum- humidity in %
wspdm- wind speed in kph
wdird- wind direction in degrees
wdire- wind direction description (e.g. SW, NNE)
pressurem- pressure in mbar
vism- visibility in km
windchillm- wind chill in °C
fog- 1 in case of fog, 0 otherwise
rain- 1 in case of rain, 0 otherwise
snow- 1 in case of snow, 0 otherwise
hail- 1 in case of hail, 0 otherwise
thunder- 1 in case of thunder, 0 otherwise
The submitted solution file should contain the top 100 predicted routes for each testing day. It should have three comma-separated columns:
The submitted solution file should contain, for each testing day and each station, the predicted maximum decrease in the number of bicycles at the station. It should have three comma-separated columns:
Contact us at bubiilab.sztaki.hu