Technical Skills Every Data Scientist Must Know 🔬

As a Data Science novice who aspires to become a master, I find it important to figure out which skills are worth learning and how to learn them. In this post I cover only the technical skills that every Data Scientist must know, aggregating advice from several resources and authors that I consider authoritative in the field.

Communication skills, domain knowledge and other soft skills are just as important for any Data Scientist, but they are harder to define, so I will leave them for another blog post.

Also, please note that these skills are listed in no particular order of importance. Furthermore, some skills may be more important in one type of organisation and completely unnecessary in another. Thus, use this article only as a general guideline for where to focus your efforts when learning the required skills.

Coding skills

Coding skills are a must for any Data Scientist. There are many different languages and libraries that can be used, such as R, Python, SAS, Julia, C++, JavaScript and many more. Each one of them serves a different purpose and should be used accordingly.

Luckily, being from a Computer Science background, I have no trouble with this one. Once you learn several different languages from different programming paradigms, learning a new language does not seem so scary anymore.

Data Acquisition

Data comes from a variety of sources. Some of it is accessible through APIs, so one should know how to work with those; if you already know how to code, this should not be hard. Other data is harder to dig out, and you may have to scrape public resources such as web pages to acquire it.

It's quite likely that you won't have to do these things as a Data Scientist (the same goes for the data wrangling discussed below), especially if you work for a larger organisation, since data will come already prepared for you. However, a basic familiarity with how to acquire data is good to have, especially for investigations of your own, such as journalistic investigations backed by data.
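
As a rough sketch of what scraping involves, the snippet below pulls link targets out of an HTML page using only Python's standard library. The HTML string, the LinkCollector class and the extract_links helper are all made up for illustration. A real scraper would first download the page and should respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page content; in practice this would be downloaded from a site.
SAMPLE_HTML = """
<html><body>
  <a href="/reports/2015.csv">2015 report</a>
  <a href="/reports/2016.csv">2016 report</a>
  <p>No link here</p>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML)
print(links)
```

Data served through an API usually arrives as JSON instead, in which case the parsing step tends to be simpler than this.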

Data Wrangling

Unfortunately, data in the real world is really messy and comes in many different shapes and forms. Try to imagine all the different ways that a simple date can be written down. Now try to fit all these different formats into a single one so that the data can be processed.

According to Dataist, sed, awk and grep are good enough for small tasks, while Python or another programming language should be able to handle the rest.
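
To make the date example concrete, here is a minimal sketch of normalizing a handful of date spellings into a single format in Python. The list of formats and the normalize_date helper are assumptions made for illustration. Real data will need a longer list, and ambiguous inputs such as 01/03/2016 force you to decide whether the day or the month comes first (this sketch assumes the day). Month names are parsed using the default English locale.

```python
from datetime import datetime

# Formats we assume might appear in the raw data; extend as needed.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%b %d, %Y"]


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # Try the next format.
    return None


messy = ["2016-03-01", "01/03/2016", "1 March 2016", "Mar 1, 2016", "sometime in spring"]
print([normalize_date(d) for d in messy])
```

Returning None for values that cannot be parsed keeps the bad records visible so they can be inspected rather than silently dropped.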

Querying Structured Data (SQL)

Structured data is essentially relational data. It is the most common way to store data. I do not know the statistics, but I would bet that 99.9% or more of queries for relational data are built with SQL. Thus, knowing SQL is a must.

SQL could be added to the coding skills section, but I feel it deserves its own section because it's a type of declarative programming, where you tell the computer what you want to achieve rather than how. Thus, coding and writing SQL queries are similar in some ways but very distinct in others.
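
The difference between "what" and "how" can be sketched with Python's built-in sqlite3 module. The transactions table and its contents are made up for illustration. The SQL query only describes the desired result, while the loop underneath spells out each step of the same computation.

```python
import sqlite3

# In-memory database with a hypothetical "transactions" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
rows = [("alice", 20.0), ("bob", 5.0), ("alice", 12.5), ("bob", 7.5)]
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

# Declarative: describe the result and let the database decide how to compute it.
sql_totals = dict(
    conn.execute(
        "SELECT customer, SUM(amount) FROM transactions "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
)

# Imperative: spell out every step of the same computation.
loop_totals = {}
for customer, amount in rows:
    loop_totals[customer] = loop_totals.get(customer, 0.0) + amount

print(sql_totals)
print(loop_totals)
conn.close()
```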

Handling Unstructured Data

Due to its nature, unstructured data is stored and retrieved differently from structured data. Its analysis also differs, and SQL generally cannot be used for these purposes.

Unlike in the structured data world, there are many approaches and philosophies of how unstructured data should be handled. The available implementations that can be used for the purposes described above are numerous, e.g. MongoDB, CouchDB, Cassandra, and so on.

Any good Data Scientist should know what unstructured data is, which tools are available and how they differ, and be able to choose the appropriate tool to tame such data.
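
To give a feel for why such data does not fit neatly into tables, here is a small sketch that uses plain Python and JSON as a stand-in for a document store. It does not use the API of MongoDB or any other product, and the records and the get_path helper are invented for illustration. Each document has its own shape, so a query has to cope with fields that may be missing.

```python
import json

# Hypothetical records: each document has its own shape, unlike rows in a table.
raw_documents = [
    '{"user": "alice", "tags": ["sql", "python"], "profile": {"city": "Vilnius"}}',
    '{"user": "bob", "tags": ["d3"]}',
    '{"user": "carol", "profile": {"city": "London", "age": 31}}',
]
documents = [json.loads(doc) for doc in raw_documents]


def get_path(document, path, default=None):
    """Follow a dotted path such as 'profile.city'; return default if missing."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Look up a nested field that only some documents contain.
cities = {doc["user"]: get_path(doc, "profile.city") for doc in documents}
print(cities)
```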

Big Data Processing Platforms

While it's important not to lock yourself into any particular technology or platform, it is still important to be familiar with data processing frameworks such as Hadoop, Spark and Flink.

These technologies come in handy when SQL or regular unstructured data solutions can no longer handle your data due to its size. Hadoop, for example, is more complex and expensive to set up, and in many cases may be overkill, but it allows you to work with incredible amounts of data.

As an example, while with SQL you may be able to store and analyze each customer's transaction, with Hadoop you would be able to store and analyze each customer's every single click.
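
Hadoop's classic processing model is MapReduce, and the idea behind it can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop's API. The click log and the three phase functions are made up, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

# Hypothetical click log: (customer, page) pairs.
clicks = [
    ("alice", "/home"), ("bob", "/home"), ("alice", "/pricing"),
    ("alice", "/checkout"), ("bob", "/pricing"),
]


def map_phase(records):
    """Emit a (customer, 1) pair for every click."""
    for customer, _page in records:
        yield customer, 1


def shuffle_phase(pairs):
    """Group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


clicks_per_customer = reduce_phase(shuffle_phase(map_phase(clicks)))
print(clicks_per_customer)
```

Because each map call looks at one record and each reduce call looks at one key, the work can be split across machines without the pieces needing to know about each other.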

Statistics

Statistics is a vital skill for any Data Scientist. At the very least, one should be familiar with statistical tests, distributions, maximum likelihood estimators, confidence intervals, p-values and correlation.

Statistics also plays a huge role in Machine Learning. Thus, learning statistics is mandatory.
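
Two of the concepts listed above, confidence intervals and correlation, can be sketched with Python's standard library alone. The data is made up, and the interval uses a normal approximation, which is an assumption made for brevity. For a sample this small, a t-distribution would be the more careful choice.

```python
import math
import statistics
from statistics import NormalDist

# Made-up paired measurements, e.g. advertising spend and sales.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equally long samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    covariance = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    spread_a = math.sqrt(sum((p - mean_a) ** 2 for p in a))
    spread_b = math.sqrt(sum((q - mean_b) ** 2 for q in b))
    return covariance / (spread_a * spread_b)


low, high = mean_confidence_interval(y)
r = pearson_correlation(x, y)
print(f"95% CI for the mean of y: ({low:.2f}, {high:.2f})")
print(f"correlation between x and y: {r:.3f}")
```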

Data Visualization

Data Visualization is a subset of a more general skill: communication. It just so happens that data visualization is the most effective way to share insights derived from data.

In some software packages, such as Tableau, visualization is built in, which makes producing graphs rather simple. In such cases, the only big decision to make is choosing the appropriate graph to display the data.

In other cases, where the data is more complex or you want something more visually stunning, JavaScript libraries such as D3.js come in handy. However, D3.js is not trivial to learn, and it takes some dedicated effort to master it and really take advantage of all its capabilities.

Machine Learning

There are already many out-of-the-box implementations of Machine Learning algorithms. To use them effectively to solve problems with data, you must be familiar with how these algorithms differ and with the basic principles behind their inner workings.

Thus, a Data Scientist should know what k-nearest neighbors, random forests, ensemble methods and many other Machine Learning buzzwords mean.
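
To show that these buzzwords are less mysterious than they sound, here is a minimal sketch of k-nearest neighbors in plain Python. The training points, the labels and the knn_predict helper are made up for illustration. Library implementations add refinements such as efficient neighbor search and distance weighting.

```python
import math
from collections import Counter

# Made-up training data: (feature vector, label).
training = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"), ((2.0, 1.5), "small"),
    ((6.0, 6.5), "large"), ((7.0, 6.0), "large"), ((6.5, 7.5), "large"),
]


def knn_predict(training_data, point, k=3):
    """Label a point by majority vote among its k nearest training points."""
    by_distance = sorted(training_data, key=lambda item: math.dist(item[0], point))
    nearest_labels = [label for _features, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


print(knn_predict(training, (1.2, 1.4)))
print(knn_predict(training, (6.8, 6.9)))
```

The whole algorithm is "find the k closest known examples and let them vote", which is why the choice of k and of the distance measure matters so much in practice.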

Multivariable Calculus and Linear Algebra

Multivariable Calculus and Linear Algebra, along with Statistics, form the basis of Machine Learning. If you want to understand how Machine Learning algorithms really work, or to implement your own, knowing these two branches of Mathematics is a prerequisite.
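
As one small illustration of where the mathematics shows up, the sketch below fits a straight line to made-up data with gradient descent. The partial derivatives of the mean squared error come from multivariable calculus, and the sums are the dot products that linear algebra would write in matrix form. The data, the learning rate and the number of steps are assumptions chosen so that this toy example converges.

```python
# Made-up data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_line(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```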

Resources: