Citizen Data Scientists: the role of AutoML

Examine the impact of AutoML tools on citizen data scientists, exploring how these technologies democratize data science and enhance productivity.

Kaggle
Competitions
Data Visualization
Author

Jacopo Repossi

Published

November 28, 2021

Keywords

citizen data scientist, automl kaggle, tableau in python

Introduction

I recently took part in Kaggle’s annual Machine Learning and Data Science Survey competition together with a colleague of mine, Alessia Musio.

The Survey, as stated on the Kaggle page, presents a comprehensive view of the state of data science and machine learning and was live from 09/01/2021 to 10/04/2021. After cleaning the data, the organizers ended up with 25,973 responses!

The topic we decided to explore was the role of Automated Machine Learning and how it can impact business strategies. For the full overview of the work, please refer to the notebook we published on Kaggle, since this post is only a quick overview of our findings.

Problem statement

At the Data Science Salon 2020 in Austin, Indeed.com revealed that job postings on its site for Data Scientists had more than tripled since December 2013.
While this is great for data science professionals, the same study revealed that the supply is still way lower than the demand.
In a world increasingly dominated by data, companies are trying to bridge this gap to keep up with the competition. When it comes to hiring data scientists, top salaries may not be enough, so some companies are adapting and exploiting new methods that could gradually solve this issue.
Automated machine learning (ML) tools, or AutoML, are designed to automate many steps in the data science process; these methods have been proliferating over the past few years, making it easier to create machine learning models by removing repetitive tasks without requiring the expertise of many data scientists.

The dilemma seems to have already found a solution in the birth of a new professional figure: the Citizen Data Scientist, a concept first introduced by Gartner years ago.
Suddenly, thanks to AutoML methods, other technical members of the organization with deep domain knowledge, such as BI analysts, data analysts and business analysts, can also become valuable contributors to an organization’s development of ML and AI models.

Should data scientists be worried about these methods? How will our role evolve? But most importantly, are we aware of this trend?

In the survey we analyzed, we studied people’s traits and behavior towards these tools to understand if, as a community, we are ready to embrace what could become a new way of doing data science in the future.

Automated Machine Learning: A short overview

Automated machine learning, also referred to as automated ML or AutoML, is the process of automating the time-consuming, iterative tasks of machine learning model development.[1]
Several major AutoML libraries have become very popular since 2013, starting with Auto-Weka. The aim is always the same: to automate one or more phases of the classic machine learning pipeline, making it easier for non-experts to create machine learning models or allowing expert users to build models more quickly and efficiently.
In general, the main components of the pipeline that can be automated are: the initial data preparation and feature engineering, hyperparameter optimization and model evaluation, and neural architecture search.[2]
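To make the idea of automating hyperparameter optimization and model evaluation more concrete, here is a minimal sketch in plain Python. It is not how any specific AutoML library works: the two toy model families, the function names (`fit_knn`, `fit_threshold`, `auto_select`) and the synthetic data are all assumptions made for illustration. It only shows the core loop of trying candidate configurations and keeping the one that scores best on held-out data, and it leaves out data preparation, feature engineering and neural architecture search.

```python
import random

def fit_knn(train, k):
    """Return a k-nearest-neighbours classifier for 1-D points (toy model)."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if votes * 2 > len(nearest) else 0
    return predict

def fit_threshold(train, t):
    """Return a classifier that predicts 1 when x exceeds a fixed threshold t."""
    return lambda x: 1 if x > t else 0

def accuracy(model, data):
    """Fraction of (x, y) pairs that the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def auto_select(train, valid, n_trials=30, seed=0):
    """Random search over model family and hyperparameter, scored on held-out data."""
    rng = random.Random(seed)
    best = (-1.0, None, None)  # (score, model name, hyperparameter)
    for _ in range(n_trials):
        if rng.random() < 0.5:
            name, param = "knn", rng.choice([1, 3, 5, 7])
            model = fit_knn(train, param)
        else:
            name, param = "threshold", rng.uniform(0.0, 1.0)
            model = fit_threshold(train, param)
        score = accuracy(model, valid)
        if score > best[0]:
            best = (score, name, param)
    return best

# Synthetic data (an assumption for this sketch): the label is 1 when x > 0.6.
data_rng = random.Random(42)
points = [(x, int(x > 0.6)) for x in (data_rng.random() for _ in range(200))]
train, valid = points[:150], points[150:]

score, name, param = auto_select(train, valid)
print(name, param, round(score, 2))
```

Real AutoML tools replace the random search with smarter strategies and a much larger space of models, but the contract is the same: the user supplies data and a metric, and the tool returns the best configuration it found.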

Below, an image showing the areas heavily affected by AutoML, adapted from [3] and [4].

One of the main advantages of AutoML platforms, therefore, is a true democratization of Data Science[5], in other words enabling a larger and more diverse group of users to contribute to the data science process.
With the economic uncertainty of these times, creating a new class of AI/ML developers with minimal investment allows maintaining or increasing competitive advantage.

Having said that, we are ready to dive into the data and analyze the results of the survey.

Exploratory Data Analysis

Based on the message we wanted to convey, creating a traditional notebook was not the best choice. Rather, we wanted our visual project to be interactive as well as eye-catching and strongly distinguished on a graphic level. For this reason, we have decided to display the CVs of our Personas, recalling a desktop-file interaction that anyone among us is familiar with. The metaphor used is relevant both to the topic of our analysis (we are in fact talking about how the working landscape in data science will transform) and to the type of data displayed (education, work experience, tools, skills, etc.). The next challenge was to understand how to do it (with all the limitations of the case).

That is why we opted for a Tableau Dashboard.

Our starting point was to explore the overall behavior of the survey respondents, about 25k people, but also to familiarize ourselves with the tool by exploring some macro results.
In the dashboard, this is called Miscellaneous.
To look at all the respondents, you need to view all files at once. You can achieve that in two ways:
- you can unselect the file you are currently looking at (by clicking on the file name again). This is the fastest way.
- you can select all the files by clicking them while holding CTRL/COMMAND.

You can also see Miscellaneous if you select 2 Personas at a time. This is still considered Miscellaneous because it’s a mixture of people.

After selecting all files, let’s examine the information together.
In General Information, it is immediately obvious that the average age is relatively low, with about 56% of respondents being under 30 years old.
Geographically, the countries with the most respondents are India and the USA, which represent 28.7% and 10.2% of the total, respectively.
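Shares like these are simple frequency counts over a single survey column. As a rough illustration, here is a minimal sketch in plain Python. The helper `share_by_category`, the column names `age` and `country`, and the rows themselves are made up for the example; they are not the real survey schema or data.

```python
from collections import Counter

def share_by_category(rows, column):
    """Return {category: percentage of answered rows}, rounded to one decimal."""
    counts = Counter(row[column] for row in rows if row.get(column))
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}

# Made-up rows for illustration only, not survey data.
rows = [
    {"age": "18-21", "country": "India"},
    {"age": "25-29", "country": "India"},
    {"age": "30-34", "country": "USA"},
    {"age": "22-24", "country": "Italy"},
]

country_share = share_by_category(rows, "country")
age_share = share_by_category(rows, "age")

# Sum the age bands that fall below 30 to get an "under 30" share.
under_30_bands = {"18-21", "22-24", "25-29"}
under_30 = sum(pct for band, pct in age_share.items() if band in under_30_bands)

print(country_share)
print(under_30)
```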

Our hypothetical CV then moves from Education&Work on to Skills: in the former, the most frequent answers are reported, while in the latter, there is a heatmap comparing ML Experience & Coding Experience.
Generally speaking, we see that most of the respondents are students (26%) or data scientists (14%) and, if workers, employed in the Computers/Technology (25%) or Education (20%) sectors.
Almost 36% of people, on a daily basis, analyze and explore data to influence business decisions. This is followed by 20% who apply ML to explore new areas. The young age coincides with little experience both in coding and in Machine Learning, with more than 50% of respondents having less than 3 years.
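A heatmap like the one comparing ML Experience and Coding Experience is, underneath, a cross-tabulation of two columns. The sketch below shows one way to build such a grid in plain Python. The function `cross_tab`, the experience labels in `LEVELS` and the rows are assumptions for illustration; the dashboard itself was built in Tableau, not with this code.

```python
from collections import Counter

# Hypothetical, simplified experience bands (not the survey's exact answers).
LEVELS = ["<1y", "1-3y", "3y+"]

def cross_tab(rows, row_key, col_key, levels):
    """Count rows for every (row_key, col_key) pair, laid out as a grid."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    return [[counts[(r, c)] for c in levels] for r in levels]

# Made-up respondents for illustration only.
rows = [
    {"coding": "<1y", "ml": "<1y"},
    {"coding": "1-3y", "ml": "<1y"},
    {"coding": "1-3y", "ml": "1-3y"},
    {"coding": "3y+", "ml": "1-3y"},
    {"coding": "3y+", "ml": "3y+"},
    {"coding": "<1y", "ml": "<1y"},
]

grid = cross_tab(rows, "coding", "ml", LEVELS)

# Print the grid: one line per coding-experience band, one column per ML band.
for label, line in zip(LEVELS, grid):
    print(f"{label:>5}", line)
```

Each cell of `grid` is the number of respondents with that combination of answers; a heatmap simply maps those counts to colors.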

A key aspect on which we would like to reflect, however, is this: in the AutoML tools section on the right, it is interesting to note that AutoML is either not used (5%) or is familiar to very few respondents. In fact, those who are familiar with AutoML mostly selected Google, Microsoft and Amazon, which together account for only 6%.

Conclusions

As stated before, we published a notebook on Kaggle where we further analyzed the findings and told them through Personas, each one with its own story and its own approach to data analysis and the tools currently available.

Of course, not all data science challenges can be solved using AutoML tools. At the moment, the most suitable use cases are those in which the use of black-box models is allowed. In this case it is possible to take advantage of the simplifications that the tools provide, allowing you to focus more on other aspects of the pipeline.[3]
Models that require more in-depth skills, or for which data modeling is particularly difficult, still require the experience of qualified data scientists. In this case, one is very likely to rely on Partial AutoML rather than Full AutoML techniques, to keep greater control over the decision-making steps along the design workflow.

In any case, we believe that it is time to adapt and give value to the knowledge of AutoML tools, thus favoring the proliferation of Citizen Data Scientists.
The advantage, as mentioned, is twofold and affects both less experienced professionals and experts. On the one hand, it becomes more efficient and economically beneficial to carry out many of the standard Data Science activities, a trend that will be even more prevalent in the future as these tools improve. At the same time, experienced data scientists will be free to take on more technically demanding tasks, allowing them to use their skills more efficiently and innovate faster, while increasing their job satisfaction. This benefits both workers and companies seeking to maximize their production and employee retention, as correctly pointed out in [3].

To conclude, it is our hope that AutoML will lead to a true Data Science democratization, allowing more diverse and numerous user groups to actively contribute.

For the more curious cats, we have created a chapter specifically to explain the background of this analysis plus a little more. We suggest you take a look at the Breakdown!

ALESSIA AND JACOPO

References

1. What is automated machine learning (AutoML)?

2. Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence

3. Rethinking AI talent strategy as automated machine learning comes of age

4. Taking the Human out of Learning Applications: A Survey on Automated Machine Learning

5. AutoML 2.0: Is The Data Scientist Obsolete?
