Is it worth learning about data science?

Everyone wants data scientists. Universities offer courses. Coursera and the like are overflowing with data science offerings. Supposedly you can learn data science in a month. And that's important! Because data is the new oil. Without data, and without the data scientists who turn it into gold, the future is bleak; everyone agrees on that. Even if you don't have any exciting data, a data scientist may be able to conjure gold dust out of what little there is. And you may have no idea what you could do with the data anyway, but once you have data scientists, everything will be fine. We're still not at the top of the hype cycle, but it won't be long before data science descends into the valley of disillusionment (and then climbs to the plateau of productivity). Several misunderstandings are to blame for this.

There is no one-size-fits-all definition of data science

So anyone who wants to can call themselves a data scientist. And you can also name a course or a degree program after it, simply because it is trendy right now. That happens far too often at the moment.

Data science is the interplay of data mining, statistics and machine learning. And that's exactly what I offer in my courses. Just so that we understand each other correctly: one semester is far too little for that. That's why we don't even call it data science, but data analytics or something similar. We get a taste of data science. But in the 60 hours of a semester, I don't turn anyone into a data scientist.

In principle, you would have to teach statistics for at least one semester before going any further. Then students would have to learn a programming language properly, be it R or Python. Only then would you start with machine learning. Along the way, you would explain how to use Linux/Unix, databases and cloud technology. You can certainly fill an entire degree program with that.

But often it's just an introduction to Python with a little scikit-learn. As described above, that doesn't matter, because the term is not protected anyway. And hardly anyone notices, because who is supposed to judge that?

There is still insufficient training

I recently sampled a data science course on Udemy (which, by the way, still only costs a few euros for a few hours). The young man in his gamer chair could talk well, but he couldn't go deep. Then again, it depends on how you define depth. The low point in terms of content was reached when he said that you don't have to understand certain things mathematically, for example whether you divide by n or by n-1 when computing a variance. Wow.
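To make the n versus n-1 point concrete, here is a minimal sketch using only Python's standard library. The sample values are made up for illustration. Dividing by n gives the variance of the data treated as a complete population; dividing by n-1 (Bessel's correction) gives the usual unbiased estimate of the population variance from a sample.

```python
import statistics

# Hypothetical sample, made up purely for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(sample)
mean = sum(sample) / n
squared_deviations = sum((x - mean) ** 2 for x in sample)

# Divide by n: the data is treated as the entire population.
var_n = squared_deviations / n

# Divide by n - 1: the data is a sample and we estimate the
# variance of the population it was drawn from.
var_n_minus_1 = squared_deviations / (n - 1)

# The standard library exposes both variants under different names.
print(var_n, statistics.pvariance(sample))
print(var_n_minus_1, statistics.variance(sample))
```

With only eight values the two results differ noticeably; with thousands of values the difference shrinks, which is probably why it is tempting to wave it away.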

I have also had several students of computer science and related subjects, from the University of Hamburg among others, working with me. Apart from the fact that they lack basic knowledge (“What is a CSV file?”), they have learned a few techniques that they also list in their applications (“Experience in ML”), but they have not really understood what they are doing. So k-means gets fired at everything, even at data that is not numeric (you can simply convert it, and then it is numeric). That this rarely makes sense when you are calculating Euclidean distances, oh well. If you only have a hammer, everything looks like a nail.
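A small sketch of why "just convert it to numbers" is a problem for Euclidean distances. The fruit categories and their integer codes are hypothetical; one-hot encoding is shown only for contrast, as one commonly used alternative, and it does not by itself make k-means a good choice for such data.

```python
import math

# Hypothetical categorical feature, integer-encoded in alphabetical order.
codes = {"apple": 0, "banana": 1, "cherry": 2}


def label_distance(a, b):
    """Euclidean distance between two categories after integer encoding."""
    return math.dist([codes[a]], [codes[b]])


def one_hot(label):
    """One-hot vector for a category, one position per known category."""
    return [1.0 if label == name else 0.0 for name in codes]


def one_hot_distance(a, b):
    """Euclidean distance between two categories after one-hot encoding."""
    return math.dist(one_hot(a), one_hot(b))


# Integer codes invent an order: apple ends up twice as far from cherry
# as from banana, although nothing in the data says so.
print(label_distance("apple", "banana"), label_distance("apple", "cherry"))

# With one-hot vectors, all distinct categories are equally far apart.
print(one_hot_distance("apple", "banana"), one_hot_distance("apple", "cherry"))
```

The distances from the integer codes depend entirely on the arbitrary order in which the categories were numbered, and that order is what k-means would then cluster on.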

But if the training is sub-optimal, how are data scientists supposed to turn data into gold? Such training will not be enough for the really hard problems. Either rubbish gets delivered or the project never ends. That reminds me a bit of the New Economy, when suddenly everyone could build HTML pages. Only those who could do more than HTML had a chance of a job after the crash. And too many companies went bust simply because they had hired poorly trained people.

Not every problem needs a data scientist

Many problems can be solved without a data scientist. In fact, many methods have long been well covered in statistics, from regression analysis to Bayesian inference. Classification and clustering also existed long before the data science age. Support vector machines are a bit older too (the 60s!). The only new thing is that there are many more libraries that anyone can use. But you don't have to think of data science straight away when these topics come up, because then you pay a hype premium on top.

And before any of these methods are used, the data has to be analyzed first. This is the skill that is most lacking. We don't primarily need more data scientists; we need more people who don't run away from a column of numbers and who manage to draw the right conclusions from it. And if you don't know how to get to a solution, you can always ask a specialist. The most common problems I see are not data science problems but data analysis tasks. Ideally, these tasks are not carried out by dedicated data analysts, but by the colleagues themselves, who are the experts on their topic.

What, if not data science, will be important?

Of course, working with data will not become any less important in the future. On the contrary. But it is to be feared that the current hype is not doing this young field any good. Since there is a lot of money to be made, talented people whose previous focus was not necessarily on mathematics-related subjects are jumping in. Anyone can complete a Udemy course one way or another. But the quality is not the same for every course. Accordingly, this type of training, like the rote learning of methods at university, does not help to drive data science forward. As a result, data science is more likely to disappoint and slide into the valley of disillusionment, because not all expectations can be met.

The focus should be on working with data, not on data science. The acquisition. The analysis. Data scientists get bored if they are only used as better-paid data analysts. And the user, who cannot articulate their needs and problems at all (if there is a problem at all, and they are not just asking for the latest “hot shit”), no longer understands the world when the data scientists then leave to look for a more exciting task. We need users and data scientists who first of all understand the problem to be solved and who have also analyzed the corresponding data. We have to give more people the competence to analyze data themselves.