Data Riddle

    1 February, 2005

    2003-2005. Funded by the Hungarian Government for the development of usage analysis tools for large-scale server logs. Key result: the Data Mining Pipeline.


    The great promise of the digital economy is that a rich store of information on customer behavior is available, enabling far more accurate and efficient planning than was previously possible. Companies and organizations present on the web can learn about their customers by analyzing usage logs. These analyses provide statistics on how many people visited their sites and, by mapping the most frequent access routes and applying various monitoring techniques, can also help identify profiles of users with similar interests and consumer behavior.

    The aim of this application is to present a system suitable for analyzing log files. The quantity of data to be analyzed at the Internet service provider members of the consortium exceeds the volume that commercial analyzer software can handle, so custom modules need to be developed. These modules feature statistical methods; refined data mining algorithms capable of handling billions of records; database size reduction procedures based on random sampling and algebraic methods; episode mining and Fourier analysis to identify sequential patterns and recurring time periods; and spectral decomposition, discriminant analysis and graph theory for the cluster analysis of usage patterns.
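
    As an illustration of how Fourier analysis can expose a recurring time period in usage data, the following minimal sketch computes a naive discrete Fourier transform over a series of request counts and reports the period of the strongest frequency. It is not the project's implementation: the function name dominant_period and the synthetic hourly series are assumptions made for this example, and a real pipeline working on billions of records would use an FFT library rather than this quadratic-time loop.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant frequency.

    Hypothetical helper for illustration: a naive O(n^2) DFT, not an FFT.
    Returns None if the series has no variation at all.
    """
    n = len(series)
    mean = sum(series) / n
    # Remove the mean so the zero-frequency term does not dominate.
    centered = [x - mean for x in series]

    best_k, best_power = None, 0.0
    # Only frequencies up to n/2 are distinct for real-valued input.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power

    if best_k is None:
        return None
    return n / best_k


# Synthetic example (assumed data): ten days of hourly request counts
# with a daily cycle.
hourly_counts = [100 + 50 * math.cos(2 * math.pi * t / 24) for t in range(240)]
period = dominant_period(hourly_counts)
print(period)
```

    On the synthetic series, the strongest frequency corresponds to a period of 24 samples, that is, the daily cycle built into the data.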

    We pay special attention to two key development issues. First, we build and test the analyzer software on various platforms to make it architecture-independent. Second, we base the analysis on anonymous user identifiers to protect personal data.
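
    The text does not say how the anonymous identifiers are produced. One common approach, shown here purely as an assumed sketch, is to replace each raw identifier in the log with a keyed hash, so that the same user always maps to the same token while the original value cannot be read from the analyzed data. The function name pseudonymize, the key and the sample identifiers below are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Map a raw user identifier to a stable, opaque token.

    Hypothetical helper: uses HMAC-SHA256 so that tokens cannot be
    recomputed by someone who does not hold the secret key.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Assumed example values; in practice the key would be kept outside the
# analysis environment.
key = b"example-key-not-for-production"
token_a1 = pseudonymize("subscriber-0001", key)
token_a2 = pseudonymize("subscriber-0001", key)
token_b = pseudonymize("subscriber-0002", key)

print(token_a1 == token_a2)  # the same user yields the same token
print(token_a1 == token_b)   # different users yield different tokens
```

    Because the mapping is stable, sessions and access routes of one user can still be linked across log entries, which is what profile and cluster analysis needs, without the analysis working on the personal identifier itself.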


    ELTE, BME, MTA SZTAKI, T-Online (Axelero),