One of the most important decisions you’re going to make on the road towards understanding data science better is deciding what language to use.
Almost any language can be used to solve your Big Data problems, but it is going to be a lot easier using a mature solution that has the right combination of speed, tooling, libraries and community support.
Before you make that decision, however, it’s important to understand its impact and what you give up by not choosing another language. Every choice involves a trade-off, but some languages cost you far less than others.
What kind of speed do you need?
When speed comes up in programming, two separate discussions tend to follow: speed of execution and speed of development.
Speed of execution
Technically speaking, the fastest programming languages out there are those that get closest to machine code. Assembly and C/C++ are generally the clearest winners, with Rust, Go and Pascal following close behind. However, C and Assembly aren’t the easiest languages to learn, nor are they suitable for the vast majority of teams out there.
It would be akin to trying to beat nails into a piece of wood using a mallet. It can be done, but there are better ways to do it.
Not to mention, most developers don’t have the kind of expertise needed to optimize their code for it to run at top benchmark speeds, especially for something as potentially large as a Data Science project.
We have to look beyond bare metal speed if we’re going to find the most efficient language for our application.
Speed of development
The fastest language when it comes to development will depend on the individual and the team’s preferences, the tooling available, and the kind of community around the language.
Languages that are well supported by editors and IDEs like VS Code, Brackets and IntelliJ are generally easier to work with. The right IDE streamlines the development process thanks to the extra utility it provides. For instance, a built-in terminal, auto-suggestion, linting and Git integration all make development faster.
Languages that have been around for a while tend to have the largest communities around them. This works especially well if the language has the backing of a big company like Google or Facebook. If you run into a problem, finding a solution with the community’s help will be much faster.
Lastly, some languages have better support than others when it comes to the kind of libraries available. You don’t want to be in a position where you wish you had used another language because it’s the only one with support for an important library you need. Sure, there are wrappers and alternatives, but nothing quite beats the real thing.
The best programming languages for data science
Clearly, then, we need a language that offers a pleasant development experience and is at least decently fast at execution (and at compiling, where that applies). These are as close as it gets:
Go
Go (or Golang) came out of Google, where engineers wanted a language that does the kind of work C++ does while being much easier to use. The result is a systems language that has found a home in much of the infrastructure behind today’s big data projects, Kubernetes and Docker to name two.
Go has many advantages, but the largest is that it has a surprisingly gentle learning curve and is easy to maintain. On top of that, its built-in concurrency support (goroutines and channels) makes it a strong choice for distributed programming and parallel processing, both integral to machine learning systems.
R
R has enjoyed long-standing popularity, growing alongside Python as a go-to data science option. It’s commonly referred to as the ‘love language of statistics’ because it comes with tools that simplify the process of building data models.
It helps that it comes with a large number of libraries for basically any purpose out there. It’s also simple to integrate with essential big data tools like Hadoop for distributed storage or Spark for in-memory processing.
The one thing that might turn developers away from R is that R code often has to be translated into another language before it can be deployed, which is why it’s often used alongside Python. It’s not a general-purpose programming language.
It’s better suited to a fixed set of problems, e.g. data modeling, than to doing a bit of everything the way Java can, for instance.
Scala
Scala was born as an effort to create a ‘better Java’ – a language that runs on the JVM and doesn’t have some of the biggest shortcomings of the popular Oracle-owned programming language. Its popularity was boosted even further by the fact that some of the most important Big Data tools out there – Spark and Kafka – were written, at least in part, using it.
Since it runs on the JVM, it’s interoperable with Java and all its cousins, giving it access to the large variety of libraries available to JVM-based languages. It definitely manages to get rid of the verbosity that Java developers detest so much while still being fast, but it has quite a learning curve.
Python
No list about Big Data programming languages is complete without mentioning Python. It has taken over the programming world in the last five years or so because of how well suited it is to big data programming.
Python is the perfect example of why it’s so difficult to judge a programming language based solely on time-to-execute. Few languages out there have the same kind of depth when it comes to libraries and community as Python does.
Libraries like NumPy, SciPy, Pandas and a multitude more have few comparable alternatives in other languages. Even the most popular machine learning tools out there have at least some bits written in Python, e.g. TensorFlow.
The downside? Python is slow. A lot of people prefer to use other languages as a result. Many of them still use Python at some point in their workflow, however, because, being a scripting language, it’s the perfect solution for prototyping and for developing systems that are not speed-intensive.
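To get a feel for that trade-off, here is a minimal sketch that uses only the standard library. The function names (loop_total, builtin_total, time_call) are made up for illustration, and the timings will vary from machine to machine, so treat it as a rough experiment, not a benchmark. It adds up the same numbers twice: once with an explicit Python loop and once with the built-in sum(), which does its looping in C on the standard CPython interpreter.

```python
import timeit


def loop_total(n):
    """Add up 0..n-1 with an explicit loop run by the Python interpreter."""
    total = 0
    for i in range(n):
        total += i
    return total


def builtin_total(n):
    """Same result, but the loop runs inside the C-implemented sum()."""
    return sum(range(n))


def time_call(func, n, repeats=5):
    """Best-of-`repeats` wall-clock time, in seconds, for one call to func(n)."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeats))


if __name__ == "__main__":
    n = 1_000_000
    # Both approaches must agree before comparing their speed.
    assert loop_total(n) == builtin_total(n)
    print(f"python loop : {time_call(loop_total, n):.4f} s")
    print(f"built-in sum: {time_call(builtin_total, n):.4f} s")
```

On CPython the built-in version usually comes out ahead, because the interpreter does far less work per element. Libraries like NumPy rely on the same idea: the heavy loops run in compiled code while you keep writing Python.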