Tuesday, April 4, 2017

Introduction to R and Python

R is arguably the world's most popular programming language for data analysis. It is an implementation of the S language, which was developed at Bell Laboratories for statistics and data analysis. S became very popular, and R grew out of it as an open-source implementation, first released in August 1993.

The popularity of R is due to its open-source nature and the huge community contributing packages to it that handle all types of common data analysis work. You can read more about R at http://www.r-project.org/ and a more detailed history of R at http://www.r-project.org/about.html

Python, on the other hand, is a general-purpose programming language. It is used for all types of programming, from building websites and web applications to desktop programs and data analysis. It was created by Guido van Rossum and first released in 1991. It is also an open-source language, benefiting hugely from a large community actively contributing to it. You can read more about Python at https://www.python.org/about/


R and Python are the two most popular languages for data analysis, and anyone serious about becoming a data scientist must be proficient in at least one of them. I recommend that you gain a working knowledge of both and then become an expert in one.

Both languages have libraries, more often called packages in R. These are pre-built collections of functions and algorithms that help you achieve specific tasks. Think of them as being like Excel formulas, though far more robust. You load them into your R or Python workspace and can then access the functions they provide.
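To make the idea of loading a library concrete, here is a minimal Python sketch. It uses the statistics module that ships with Python itself, and the scores are made-up numbers for illustration only. In R, the equivalent loading step is library(packagename).

```python
# Load a library (called a module in Python) into the current session.
import statistics

# Made-up sample data, for illustration only.
scores = [72, 85, 90, 66, 85]

# Once the library is loaded, its functions are available to call.
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

print("Mean:", mean_score)
print("Median:", median_score)
```

The import line plays the same role as picking a formula in Excel: it makes ready-made functionality available so you do not have to write it yourself.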

A common need you will encounter is creating graphs and charts. In R, the most commonly used package for that is ggplot2; in Python, you would typically use Matplotlib.
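As a rough illustration of what charting with Matplotlib can look like, the sketch below draws a bar chart and saves it as an image file. The months and sales figures are made up, the save_bar_chart helper and the output file name are hypothetical names chosen for this example, and the script assumes Matplotlib may or may not be installed.

```python
import os
import tempfile

try:
    import matplotlib
    matplotlib.use("Agg")  # draw to image files; no display window needed
    import matplotlib.pyplot as plt
except ImportError:
    plt = None  # Matplotlib is not installed in this environment

# Made-up figures, for illustration only.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 140, 110]


def save_bar_chart(labels, values, path):
    """Draw a simple bar chart and save it as an image file (hypothetical helper)."""
    if plt is None:
        raise RuntimeError("Matplotlib is required to draw the chart")
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_title("Monthly sales (made-up data)")
    ax.set_xlabel("Month")
    ax.set_ylabel("Units sold")
    fig.savefig(path)
    plt.close(fig)  # free the figure once it has been written
    return path


if plt is not None:
    chart_path = os.path.join(tempfile.gettempdir(), "monthly_sales.png")
    save_bar_chart(months, sales, chart_path)
    print("Chart saved to", chart_path)
```

Inside an interactive tool you would usually drop the "Agg" line and call plt.show() instead of saving, so that the chart appears on screen.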

To start using R and Python, you will need to install them. 

For R, you can download the installer at https://cran.r-project.org/ and it is recommended that you also install RStudio to make using R enjoyable. RStudio is an IDE (integrated development environment) and can be downloaded from https://www.rstudio.com/. With those two installations, you are set to begin analysing data with R. Both work whether you have a Windows PC, a Mac or a Linux machine.

For Python, it is recommended that you download Anaconda at https://www.continuum.io/downloads. It is regarded as the best distribution of Python for data analysis work. As for an IDE, there is no obvious best choice the way RStudio is for R. Some people are die-hard fans of Jupyter Notebook (formerly IPython Notebook), which luckily comes pre-installed with Anaconda. Others love Spyder, which also comes pre-installed with Anaconda. There is also PyCharm, which you will have to install separately from https://www.jetbrains.com/pycharm/. For this training series, I will be using Rodeo, downloadable at https://www.yhat.com/products/rodeo. It is an RStudio lookalike, so you won't have to stress yourself too much getting used to two different IDEs. Once you become familiar with RStudio, you will be comfortable with Rodeo, and vice versa.
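Once Python is installed, a quick way to confirm that everything works is to run a short script like the sketch below in whichever IDE you picked. The list of packages checked is my own choice of common data analysis libraries, and is_installed is a hypothetical helper name used only for this example.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if Python can find the named package (hypothetical helper)."""
    return importlib.util.find_spec(package_name) is not None


# Show which version of Python is running.
print("Python version:", sys.version.split()[0])

# Assumed list of packages worth checking; adjust it to your own needs.
packages_to_check = ["numpy", "pandas", "matplotlib"]

for name in packages_to_check:
    status = "installed" if is_installed(name) else "not found"
    print(name + ":", status)
```

If a package shows as not found, it can usually be added with the package manager that came with your Python distribution.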

In the next sections we will dig deeper into carrying out simple data analysis tasks in both R and Python.