Popular data science meetup groups in Warsaw have over 1000 members, and their events and afterparties are an amazing opportunity to meet new, interesting people working with data in various industries. That also means we already have a significant source of valuable information full of undiscovered insights.
I was attending one of the data science meetups when I started to wonder… How many of the attendees in this room actually work as data scientists, and what are the backgrounds of everyone else?
I decided to find an answer using R and Python. I chose the largest Warsaw meetups and gathered their attendee data for further analysis. In alphabetical order, these were: Big Data Warsaw, Data Science Warsaw, Machine Learning Warsaw, PyData, Qlik, Warsaw R Enthusiasts and Warsaw Hadoop User Group.
I prepared my results as a Shiny dashboard published here.
After getting familiar with the gathered meetup data I decided to classify attendees by their public professional profiles on social media. I chose to create the following classes of attendees:
- Dev - software developers (Java/iOS/JS/.NET/Scala/…), tech-leads, webdevs, QA, etc
- Business - HR, managers, business consultants, owners, founders, PR, marketing, etc
- Data Scientist - people whose job descriptions mention areas like data science, data analysis, big data, machine learning and similar. In particular, this group includes software developers who deal with data science/machine learning on a daily basis.
- Academic - people working at universities, PhD students, professors, etc
- Other - unable to classify to any other group / spam / no public profile data.
It’s common for someone to belong to multiple groups, like a student working as a data science developer, but I had to assign each person to only one of the categories.
How many Data Scientists are there?
The results are interesting, but not surprising:
Attendee proportions differ considerably between meetups. For example, Warsaw Hadoop User Group targets mainly developers. Let’s see what these proportions look like for each meetup:
The meetups with the largest proportion of developers are Hadoop User Group, Machine Learning Warsaw and PyData Warsaw. There, almost 1 in 2 attendees is a developer.
1 in 5 attendees of the Warsaw R Enthusiasts meetup works professionally as a data scientist.
The meetup with the largest proportion of business people is Qlik. Unfortunately, the group of classified people is relatively small, but such a result is intuitively expected.
Now let’s see how many attendees were classified for each selected meetup:
Here’s what the data gathering steps looked like:
Collect raw data from the meetup website using a Chrome plugin, DataMiner. This tool ships with ready-made XPath sets for popular websites, so it was easy to get the raw data, even when it was available only to logged-in users.
Drop names that looked like nicknames or were anonymized (like “John D.”). A simple rule was effective enough:
number of name parts >= 2 (at least name and surname) && number of letters in each name part >= 3
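The rule above can be sketched in Python roughly as follows. This is a minimal illustration, not the code used for the analysis: the function name `looks_like_full_name` and the sample names are made up, and counting only alphabetic characters (so that “D.” counts as one letter) is an assumption about how "letters" were counted.

```python
def looks_like_full_name(name: str) -> bool:
    """Return True if the name has >= 2 parts and every part has >= 3 letters."""
    parts = name.split()
    if len(parts) < 2:
        # A single token is treated as a nickname.
        return False
    # Count only alphabetic characters, so an initial like "D." has one letter.
    return all(sum(ch.isalpha() for ch in part) >= 3 for part in parts)


# Hypothetical sample data, for illustration only.
names = ["John D.", "datafan99", "Anna Kowalska", "Al Nowak"]
kept = [name for name in names if looks_like_full_name(name)]
print(kept)
```

Note that such a rule also drops real people with very short name parts (like the hypothetical “Al Nowak”), which is the price of keeping it simple.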
Classify attendees based on keywords found in gathered job titles and descriptions. This process was supported by manual selection of common useful keywords. My intention was to answer my initial question, not to create a state-of-the-art classifier.
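A keyword-based classification of this kind might look like the sketch below. The keyword patterns, the `RULES` name and the rule order are all assumptions made for illustration, since the actual keyword lists are not given here. The rules are checked in order and the first match wins, which is one possible way to assign each person to exactly one class; putting Data Scientist first keeps developers who work with machine learning in that class, as described in the class definitions above.

```python
import re

# Hypothetical keyword patterns; the real lists were selected manually
# and are not reproduced here. Order matters: the first match wins.
RULES = [
    ("Data Scientist", r"data scien|machine learning|big data|data analy"),
    ("Academic", r"\bphd\b|professor|university"),
    ("Dev", r"developer|software engineer|tech lead|\bqa\b"),
    ("Business", r"\bhr\b|manager|founder|consultant|marketing"),
]


def classify(description: str) -> str:
    """Assign a job title or description to exactly one class."""
    text = description.lower()
    for label, pattern in RULES:
        if re.search(pattern, text):
            return label
    # No keyword matched (or the profile was empty).
    return "Other"


print(classify("Senior Java Developer"))
print(classify("Machine Learning Engineer / Python developer"))
print(classify(""))
```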
I assumed that the collected names are unique, that is, that the same name refers to one and the same person across all meetups. This doesn’t have to be true.
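Under that assumption, counting unique attendees across meetups reduces to a set union over names. The sketch below uses made-up attendee lists, and the `normalize` helper (collapsing whitespace and ignoring case) is an extra assumption that is not described in the post.

```python
# Hypothetical attendee lists, for illustration only.
attendees_by_meetup = {
    "PyData": ["Anna Kowalska", "Jan Nowak"],
    "Warsaw R Enthusiasts": ["anna  kowalska ", "Piotr Zielinski"],
}


def normalize(name: str) -> str:
    """Collapse repeated whitespace and ignore case when comparing names."""
    return " ".join(name.split()).casefold()


# The same normalized name is treated as the same person in every meetup.
unique_names = {
    normalize(name)
    for names in attendees_by_meetup.values()
    for name in names
}
print(len(unique_names))
```

Two different people who share a name would be merged into one entry here, which is exactly the limitation mentioned above.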
In total, I managed to gather 1873 unique names from the meetup groups. 70% of them were not anonymous, and 60% of them could be assigned to one of the classes defined above.