MOL Bubi

The MOL Bubi Challenge has ended. Congratulations to the winners!

A summary presentation by András Benczúr (head of the MTA Big Data - Momentum research group).

Note: the following presentations are in Hungarian.

Task 1 : Busiest Route Prediction

Task 2 : Docking Station Demand Prediction

Task 3 : Open Research

The MOL Bubi Challenge (2015)

The MOL Bubi public bike-sharing system Analytics Challenge was organized by the "Big Data - Momentum" research group of the Hungarian Academy of Sciences (MTA SZTAKI) and the Centre for Budapest Transport (BKK) at the end of 2015.

The Municipality of Budapest decided in 2008 to establish a public bike-sharing scheme. The MOL Bubi system went live in September 2014 with 1100 bicycles and 76 docking stations. You can use MOL Bubi day and night, with annual, semi-annual, and quarterly passes or one-, three-, and seven-day tickets.


In the challenge, you are given 5 months of data to predict usage patterns in three tasks: two for predictive analytics and one for open research. The data contains information about every MOL Bubi trip during this period, such as the source and destination stations and the start and end times. Data from some of the days is set aside for evaluation. Further information can be found in the data section, where we introduce the MOL Bubi dataset along with the format of the submission files.

Challenge tasks

Task 1: Busiest Route Prediction

Predict the busiest routes (pairs of source and destination docking stations) for each evaluation day. You should submit the top 100 routes with the highest predicted frequencies as a ranked list of (source, destination) pairs r_1, r_2, ..., r_100. The top list will be evaluated by NDCG = DCG / IDCG, using the following formulas:

DCG = \sum_{i=1}^{100} traffic(r_i) / log_2 (1+i)

IDCG = \sum_{i=1}^{100} t_i / log_2 (1+i)

where traffic(r) is the number of bikes taking route r on the given evaluation day, and t_i is the i-th element of the list of traffic(r) values in decreasing order.
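
The evaluation above can be sketched in Python as follows. This is an illustration only, not the organizers' evaluation script: the function name `ndcg_at_k`, the dictionary layout, and the station IDs are invented for the example. It assumes that the submitted routes are distinct and that a route with no observed trips has traffic 0.

```python
import math

def ndcg_at_k(ranked_routes, traffic, k=100):
    """NDCG of a ranked route list against observed per-route traffic.

    ranked_routes: list of (source, destination) pairs, best first.
    traffic: dict mapping (source, destination) -> observed trip count.
    """
    # DCG: observed traffic of each submitted route, discounted by its rank.
    dcg = sum(
        traffic.get(route, 0) / math.log2(1 + i)
        for i, route in enumerate(ranked_routes[:k], start=1)
    )
    # IDCG: the same sum over the k largest observed traffic values.
    ideal = sorted(traffic.values(), reverse=True)[:k]
    idcg = sum(t / math.log2(1 + i) for i, t in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data with made-up station IDs.
traffic = {("A", "B"): 5, ("B", "C"): 3, ("C", "A"): 1}
perfect = ndcg_at_k([("A", "B"), ("B", "C"), ("C", "A")], traffic)
swapped = ndcg_at_k([("B", "C"), ("A", "B"), ("C", "A")], traffic)
```

A ranking that matches the true order scores 1, and any other ordering of the same routes scores lower.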

See the data section for the special rules concerning routes that span two days. Submission files are also detailed there.

Task 2: Docking Station Demand Prediction

We calculate the demand of a docking station at a given time by subtracting the number of bikes docked to the station from the number of bikes taken away. For each day, we reset the demand to 0 at midnight.

For each day in the evaluation set, submit a predicted value of the daily maximum demand for each docking station. Note that the maximum demand is always non-negative, because the demand starts at 0 at midnight. Predictions will be evaluated by root-mean-squared error.

Since entire routes are considered either for training or for evaluation, we exclude from the demand calculation the relatively rare events when a bike is taken away before midnight and returned after it. See the data section for the special rules concerning routes that span two days.
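
A minimal sketch of the daily maximum demand calculation, under stated assumptions: the function name `daily_max_demand` and the tuple layout are hypothetical, the station IDs are invented, and events with identical timestamps are processed with returns before pick-ups, which the task description does not specify.

```python
from collections import defaultdict
from datetime import datetime

def daily_max_demand(trips):
    """Return {(date, station): maximum demand reached on that day}.

    trips: iterable of (start_time, end_time, start_station, end_station),
    with datetime objects for the two times.
    """
    events = defaultdict(list)  # (date, station) -> [(time, +1 or -1)]
    for start, end, src, dst in trips:
        if start.date() != end.date():
            continue  # trips spanning midnight are excluded
        events[(start.date(), src)].append((start, +1))  # bike taken away
        events[(end.date(), dst)].append((end, -1))      # bike docked
    result = {}
    for key, evs in events.items():
        demand = peak = 0  # demand is reset to 0 at midnight
        for _, delta in sorted(evs):
            demand += delta
            peak = max(peak, demand)
        result[key] = peak
    return result

# Toy trips with made-up station IDs.
trips = [
    (datetime(2015, 4, 2, 8, 0), datetime(2015, 4, 2, 8, 10), "S1", "S2"),
    (datetime(2015, 4, 2, 8, 5), datetime(2015, 4, 2, 8, 20), "S1", "S2"),
    (datetime(2015, 4, 2, 8, 30), datetime(2015, 4, 2, 8, 40), "S2", "S1"),
    (datetime(2015, 4, 2, 23, 50), datetime(2015, 4, 3, 0, 10), "S1", "S2"),
]
demand = daily_max_demand(trips)
```

In this toy example, station S1 loses two bikes before one is returned, so its maximum demand is 2. Station S2 receives bikes first, so its demand never rises above 0.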

Task 3: Open research task

In this task, you may participate by submitting a very brief research plan. Goals may include, but are not limited to, visualization or the use of additional external data. You are expected to keep in regular contact with the organizers to share and discuss your findings.

Submit a very short research plan explaining how you want to use the data and the new insights that you expect from the analysis. Give sketches of the visualizations and descriptions of external data, including availability, if applicable.

Research plans and solutions should be sent to


We expect teams of B.Sc., M.Sc. or Ph.D. students of computer science, mathematics, physics, transportation, ..., but participation is open* to all individuals and teams. Student teams have priority for the monetary prizes.

*Staff of the Centre for Budapest Transport and T-Systems Hungary, as well as members and 2015 summer interns of the SZTAKI Big Data - Momentum Research Group are excluded.


We reserve the right not to award prizes for one or more tasks; in this case we will reallocate the prizes to other tasks.

MOL Bubi Challenge Data


A - Traveling history

The core of the dataset contains trips from 01/01/2015 until 31/05/2015. Data from every second day of April and May is withheld for testing the 1st and 2nd challenge tasks. For the remaining days, train.csv contains the following information for each trip:

  1. bicycle_id - unique ID of the bicycle
  2. start_time - start time of the trip
  3. end_time - end time of the trip
  4. start_location - source station ID
  5. end_location - target station ID

Note that:

B - Additional datasets

Docking stations

station_data.csv contains information about the docking stations. The corresponding columns are:

  1. place_id - ID of the station (string, not integer!)
  2. place_name - name of the station
  3. lat - GPS latitude
  4. lon - GPS longitude
  5. num_of_rack - capacity (the number of bikes that the station can officially accept)
  6. datetime_start - starting date of the period when the station was at this location
  7. datetime_end - end date of the period when the station was at this location

Note that:

Weather information

For the entire period, weather_data.csv contains the following information for every thirty minutes:

  1. time - date and time
  2. tempm - temperature in C
  3. hum - humidity in %
  4. wspdm - wind speed in kph
  5. wdird - wind direction in degrees
  6. wdire - wind direction description (ie. SW, NNE)
  7. pressurem - pressure in mBar
  8. vism - visibility in Km
  9. windchillm - wind chill in C
  10. fog - 1 in case of fog, 0 otherwise
  11. rain - 1 in case of rain, 0 otherwise
  12. snow - 1 in case of snow, 0 otherwise
  13. hail - 1 in case of hail, 0 otherwise
  14. thunder - 1 in case of thunder, 0 otherwise

Note that:

C - Submission files

In both the first and second task, the testing days are the even days in April and May 2015. The solution files should not contain headers.

Submission file for Task 1

The submitted solution file should contain the top 100 predicted routes for each testing day. It should have 3 columns, separated by a comma:

  1. The date of the day in format 2015-05-15
  2. The ID of the link in format SID-DID, where SID corresponds to the source station, and DID to the destination station
  3. The predicted relevance (used for ranking, higher is better)
Note that the ranking is computed by us based on the predicted relevances (higher is better). A sample is included below:
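
As an illustration of the three-column format (not the organizers' sample), the following sketch writes Task 1 rows from a dictionary of predicted scores. The function name `write_task1_rows`, the dictionary layout, and the station IDs and scores are all invented for the example.

```python
import csv
import io

def write_task1_rows(predictions, out):
    """Write header-less Task 1 rows: date, SID-DID, predicted relevance.

    predictions: {date string: {(source_id, dest_id): score}}
    out: any writable text file object.
    """
    writer = csv.writer(out, lineterminator="\n")
    for day in sorted(predictions):
        scores = predictions[day]
        # Keep at most the 100 highest-scoring routes for the day.
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:100]
        for (src, dst), score in top:
            writer.writerow([day, f"{src}-{dst}", score])

# Made-up station IDs and scores for one testing day.
buf = io.StringIO()
write_task1_rows(
    {"2015-04-02": {("0101", "0102"): 12.5, ("0102", "0101"): 7.0}}, buf
)
lines = buf.getvalue().splitlines()
```

Station IDs are kept as strings so that any leading zeros survive in the SID-DID column.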


Submission file for Task 2

The submitted solution file should contain, for each testing day and each station, the predicted maximum decrease in the number of bicycles at the station. It should have 3 columns, separated by a comma:

  1. The date of the day in format 2015-05-15
  2. The ID of the docking station
  3. The predicted decrease in the number of bicycles. It should be 0 if you predict an increase.
A sample is included below:



Contact us at