Wikiclick is a simple interface for exploring Wikipedia clickstream data. Wikipedia serves millions of requests every month, and through Wikiclick you can discover the numbers behind the information giant! Get info on where people are clicking to or from various pages.
Wikiclick was inspired by a hackathon project that mapped out which people were related to whom via links on Wikipedia pages. Combing pages for links and cross-checking them against a database of pages that belonged to a person quickly became inefficient. Towards the end of the project, we unearthed Wikipedia's Clickstream dataset. This project then became a way to give people easy access to the data in JSON format, rather than having them comb through gigabytes of .tsv files.
A set of bash and SQL scripts downloads the .tsv files from Wikipedia's website and converts them into efficiently indexed tables in a MySQL-compatible MariaDB database. The scripts can be run every month to download the previous month's complete data.
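To give a feel for the data those scripts load, here is a minimal JavaScript sketch that parses clickstream rows into objects. This is not the project's bash + SQL pipeline, just an illustration. It assumes the four-column, tab-separated layout (prev, curr, type, n) with no header row; the function names and the sample rows are made up for the example.

```javascript
// Sketch only: parse Wikipedia clickstream TSV text into plain objects.
// Assumption: each row is "prev<TAB>curr<TAB>type<TAB>n" with no header row.

// Parse one line; returns null for rows that do not match the layout.
function parseClickstreamLine(line) {
  const fields = line.split("\t");
  if (fields.length !== 4) return null; // wrong column count, skip it
  const [prev, curr, type, rawCount] = fields;
  const n = Number.parseInt(rawCount, 10);
  if (Number.isNaN(n)) return null; // count column is not a number
  return { prev, curr, type, n };
}

// Parse a whole chunk of TSV text, dropping blank and malformed lines.
function parseClickstream(text) {
  return text
    .split("\n")
    .map((line) => line.replace(/\r$/, "")) // tolerate Windows line endings
    .filter((line) => line.length > 0)
    .map(parseClickstreamLine)
    .filter((row) => row !== null);
}

// Hypothetical sample rows (the counts are invented for illustration).
const sample =
  "other-search\tLondon\texternal\t1000\n" +
  "England\tLondon\tlink\t250\n" +
  "this row is malformed\n";

console.log(parseClickstream(sample));
```

In the real pipeline the same four columns would map onto table columns, with indexes on the page-title columns so that lookups by page stay fast.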
To display the data, there is a simple web server built with Express.js that gives users fascinating info about click data!
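As a rough idea of what the server could hand back for a page, the sketch below builds a JSON-friendly summary of the top sources and destinations for one title. The function name `summarizePage`, the response shape, and the sample rows are all assumptions for illustration, not the project's actual code. It works on an in-memory array, whereas the real server would query the database.

```javascript
// Sketch only: summarize where clicks come from and go to for one page.
// Assumption: rows look like { prev, curr, type, n }, as in the clickstream dump.

function summarizePage(rows, title, limit = 5) {
  const byCountDesc = (a, b) => b.n - a.n;

  // Pages (or external sources) that people clicked from to reach `title`.
  const incoming = rows
    .filter((row) => row.curr === title)
    .sort(byCountDesc)
    .slice(0, limit)
    .map((row) => ({ page: row.prev, clicks: row.n }));

  // Pages that people clicked to after visiting `title`.
  const outgoing = rows
    .filter((row) => row.prev === title)
    .sort(byCountDesc)
    .slice(0, limit)
    .map((row) => ({ page: row.curr, clicks: row.n }));

  return { title, incoming, outgoing };
}

// Hypothetical rows (the counts are invented for illustration).
const exampleRows = [
  { prev: "England", curr: "London", type: "link", n: 250 },
  { prev: "other-search", curr: "London", type: "external", n: 1000 },
  { prev: "London", curr: "River_Thames", type: "link", n: 80 },
  { prev: "London", curr: "Big_Ben", type: "link", n: 120 },
];

console.log(JSON.stringify(summarizePage(exampleRows, "London"), null, 2));
```

In an Express route handler, an object like this could be passed to `res.json(...)` to send it to the client.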
While the hackathon project was done with a friend of mine, this venture into databases and web servers was something I did myself.
This project taught me an incredible amount about databases and Linux servers. Setting up a database on AWS and interacting with it through an intermediary EC2 instance that runs the scripts behind the scenes was a huge learning process. But now I can spin this process up in minutes!
I'd like to keep the server up for as long as I can afford to, and I hope to provide JSON API endpoints so that users can use the data in their own projects as well!