Project information

Flickr Dataset Clustering

Context

This project was part of the Data Mining course at INSA Lyon. The goal was to identify points of interest in Lyon based on intense photo-taking activity. The dataset comprised geolocated photos from Flickr, with an older set of 80,000 photos and a more recent set of 400,000 photos. The project was implemented using KNIME, a data analytics platform for data mining and machine learning.

Actions

  • Data Comprehension: My first step was to understand the data and its attributes.
  • Data Cleaning: I removed inconsistencies, duplicates, and errors to ensure data quality.
  • Data Visualization: I created visual representations like plots, graphs, and maps to reveal patterns and trends.
  • Statistical Analysis: I applied statistical methods to extract meaningful insights and descriptive information from the data.
  • Data Mining with Clustering: I used clustering algorithms such as K-means, DBSCAN, and hierarchical clustering to discover groups of similar data points.
  • Mining Patterns for Cluster Understanding: I mined frequent tag sets to characterize each cluster and identify the point of interest it represents. I also derived association rules between tags and clusters to find which tags correlate with which types of clusters.
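
The project itself was built as a KNIME workflow, so the following is only a rough Python illustration of the last step, not the actual implementation. It is simplified to single-tag rules of the form "tag -> cluster" rather than full frequent tag sets. The sample photos, the function name `tag_cluster_rules`, and the thresholds are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical input: one (cluster_id, tags) pair per photo.
photos = [
    (0, {"fourviere", "basilique", "lyon"}),
    (0, {"fourviere", "lyon"}),
    (0, {"fourviere", "basilique"}),
    (1, {"bellecour", "lyon"}),
    (1, {"bellecour", "place"}),
    (1, {"lyon"}),
]


def tag_cluster_rules(photos, min_support=0.2, min_confidence=0.8):
    """Return rules tag -> cluster as (tag, cluster, support, confidence).

    support    = share of all photos that carry the tag AND lie in the cluster
    confidence = share of photos carrying the tag that lie in the cluster
    """
    n = len(photos)
    tag_count = Counter()
    pair_count = Counter()
    for cluster, tags in photos:
        for tag in tags:
            tag_count[tag] += 1
            pair_count[(tag, cluster)] += 1

    rules = []
    for (tag, cluster), both in pair_count.items():
        support = both / n
        confidence = both / tag_count[tag]
        if support >= min_support and confidence >= min_confidence:
            rules.append((tag, cluster, support, confidence))
    # Order by cluster, then by decreasing support, then by tag name.
    return sorted(rules, key=lambda r: (r[1], -r[2], r[0]))


rules = tag_cluster_rules(photos)
for tag, cluster, support, confidence in rules:
    print(f"{tag} -> cluster {cluster} (support={support:.2f}, confidence={confidence:.2f})")
```

In this toy data, a generic tag such as "lyon" appears in both clusters, so its confidence is too low to produce a rule, while location-specific tags pass the thresholds. That is the property that makes such rules useful for naming a cluster.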

Results

This project demonstrated that points of interest and events can be discovered from Flickr data with a KNIME workflow.

Several algorithms were explored, each with its strengths and limitations. K-means struggled with varying cluster densities and non-spherical cluster shapes, hierarchical clustering faced scalability challenges, and DBSCAN gave the best results.
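
To illustrate why a density-based method suits this task, here is a minimal pure-Python sketch of DBSCAN on geolocated points. It is not the KNIME node used in the project. The coordinates are approximate, made-up positions around two places in Lyon, and the parameters `eps_m` and `min_pts` are assumptions chosen for the example. The quadratic neighbour search is fine for a handful of points but would need a spatial index for hundreds of thousands of photos.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))


def dbscan(points, eps_m, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force search; includes the point itself.
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nb)
    return labels


# Hypothetical photo positions (lat, lon): two dense spots and one isolated photo.
points = [
    (45.7623, 4.8226), (45.7624, 4.8228), (45.7622, 4.8225), (45.7625, 4.8227),
    (45.7578, 4.8320), (45.7579, 4.8322), (45.7577, 4.8319),
    (45.7800, 4.8700),
]
labels = dbscan(points, eps_m=100, min_pts=3)
print(labels)
```

Unlike K-means, this approach does not need the number of clusters in advance and leaves isolated photos unassigned as noise, which matches the idea of a point of interest as a place of intense photo-taking activity.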

The project's main limitations stemmed from data quality: it is unclear how representative the user-supplied tags are, and the geospatial coordinates attached to the photos may not be accurate.