Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

Tuesday, April 18, 2017

Webinar: Excel vs Python vs R for Data Analysis. Making the right pick.



There is a lot of textbook arguments on Excel vs Python vs R. In fact, a few weeks back I used to be part of that argument. I took sides with Excel. And I had my solid reasons.

But now I don't take sides anymore. I focus on a bigger picture, a picture I became aware of only after I transitioned for reading textbooks and doing video tutorials on Python and R to carrying out live interesting projects with them.

In this webinar I will be sharing my experience on using them all and what practical insight you too need to have to move from those textbook arguments to seeing the big picture. And most importantly, I will be showing you that big picture.

You shouldn't miss this webinar.

Time: 3:00pm to 4:00pm
Date: Tuesday, 21 March 2017
Venue: YouTube Live

You should set it up in your calendar with reminders so you won't miss this special edition.

See you!

Note: The webinar already took place but you can view the YouTube recorded session at https://www.youtube.com/watch?v=YLoOynh8iNQ 

Tuesday, April 4, 2017

Introduction to R and Python

R is, arguably, the world's most popular programming language for data analysis. It is an adaptation of S language developed by Bells Laboratories strictly for statistics and data analysis. S language became very popular and R grew out of it as the open-source implementation of S language in August 1993.

The popularity of R is due to its open-source nature and the huge community contributing packages to it that handle all types of common data analysis work. You can read more about R at http://www.r-project.org/ and a more detailed history of R at http://www.r-project.org/about.html

Python, on the other hand, is a general purpose programming language. It is used for all types of programming from building websites (web applications) to computer programs and data analysis. It was created by Guido van Rossum in 1991. It is also an open-source language, benefiting hugely from a large community actively contributing to it. You can read more about Python at https://www.python.org/about/ 

image: medium.com

R and Python are the two most popular languages for data analysis. And anyone serious in becoming a data scientist must be proficient in at least one of the two. I recommend that you have average knowledge in both and then become an expert in one.

Both languages have libraries. More often called packages in R. They are already built algorithms that help you achieve specific tasks. More like Excel formulas, though way more robust in nature. You load them into your R or Python work space and can access the functions they provide.

A common need you will encounter is creating graphs/charts. In R, the most commonly used package for that is ggplot2 and in Python you would use Matplotlib.

To start using R and Python, you will need to install them. 

For R, you can download R at https://cran.r-project.org/ and it is recommended to install RStudio to make using R enjoyable. RStudio is an IDE (integrated development environment) and can be installed at https://www.rstudio.com/  With those two installations, you are set to begin analysing data with R. And they work whether you have a Windows PC, or Mac or Linux.

For Python, it is recommended that you download Anaconda at https://www.continuum.io/downloads. It is regarded as the best distribution of Python for data analysis work. And as for an IDE to use, there is no obvious best as RStudio is for R. Some people are a die-hard fan of Jupyter Notebook (formerly IPython Notebook), luckily it comes pre-installed with Anaconda. Others love Spyder (again, comes pre-installed with Anaconda). And there is PyCharm, you will have to install that at https://www.jetbrains.com/pycharm/. For this training series, I will be using Rodeo downloadable at https://www.yhat.com/products/rodeo, it is an RStudio lookalike. That way you won't have to stress yourself too much in getting used to the IDEs. Once you become familiar with RStudio, you will become more comfortable with Rodeo and vice-versa.

In the next sections we will dig deeper into carrying out simple data analysis tasks in both R and Python.

Tuesday, March 28, 2017

Sentiment Analysis of Today's Tweets About President Buhari Using Python's Vader and Sentiwordnet


After a long weekend crisscrossing Lagos, Kaduna and Abuja to deliver our quarterly Business Data Analysis and In-depth Excel Training in Abuja, I rested today and decided to do an interesting data analysis task. 

Do a sentiment analysis on today's tweets about the President, Buhari.

It was a very interesting task. Made me get familiar with Python's Tweepy library. I got some help from Marco Bonzanini. You should check out his Mastering Social Media Mining with Python book.

Here is the breakdown of the steps in achieving the goal:

  1. Created the needed credentials to access Twitter API (it's at https://apps.twitter.com/app/new)
  2. Use Tweepy to search in realtime for tweets about Buhari
  3. Save the tweets as a JSON file
  4. Handle emoticons and strings peculiar to Twitter (like @, # and so on)
  5. Exclude stop words (words with no significant, for sentiment analysis, meaning) like to, be, is etc
  6. Apply Vader to each tweet text and calculate the entire tweet stream sentiment
  7. Apply Sentiwordnet, an alternative to Vader, to do the same stream sentiment analysis
Below are the screenshots of some of the steps.

Twitter developer setup

Twitter developer setup
Vader Sentiment Analysis results

Sentiwordnet sentiment analysis result
And the results?

Well let's just say if the tweets are a mirror of what people genuinely feel about President Buhari, he should be terribly worried. Most people were very emotional in their tweets about him and overwhelmingly negative.


Vader classified the tweet stream as 54.67% negative vs 17.76% positive and 27.57% neutral.


Sentiwordnet gave a more damning result: 92.06% negative vs 7.48% positive and 0.47% neutral.



Both Vader and Sentiwordnet are rule-based supervised classification algorithms. In a context like Nigeria where we have our own Naija words and peculiar word mix, machine learning algorithm, supervised or non-supervised, might be more reliable.

Monday, February 27, 2017

Tutorial On Creating A Frequency Distribution Chart With Microsoft Excel, R and Python

I got to join this amazing community of Data Scientists in Nigeria. We are a mix of experts and beginners. Today, I created a tutorial for the beginners to see how to do a common task like frequency distribution plot in both Python and R, also decided to include my dearest Microsoft Excel as a control.

The sample data is a fictionalized data for Dominos Pizza Nigeria. One day sales data for their Lekki branch. You can download the practice along raw data file here: https://dl.dropboxusercontent.com/u/28140414/Dominos%20Pizza.csv 


So the business question we want to tackle is: Is there a pattern in the quantities each customer buys? To be more specific, we want to examine the frequency distribution of the quantities purchased per sales transaction.

In Excel, it is extremely straightforward. Just plot a histogram on the quantity field.


Now let's head to doing same with R

I use R 3.3.2 and RStudio.

First, I import the csv file into RStudio.




Though not necessary for what we want to do, but I like doing it for any data I bring into R, I run the summary command on the dataframe/table. > summary(Dominos_Pizza)


Again, not a required step. I check out the standard plot graph on the Quantity field. > plot(Dominos_Pizza$Quantity)


Finally, I do the histogram chart on the Quantity field. > hist(Dominos_Pizza$Quantity)



For now I don't bother customizing the graph elements (labels, color, title, etc.)

It is Python time.

I use Rodeo IDE and Anaconda. 


I import Pandas and use it to read in the csv file. 




And here is the plot graph, like we did in R.


Finally, I create the histogram.



I will try to follow up with more tutorials of complex tasks, and some that are best suited to R and others that are best suited to Python. As per Excel, it is in a completely different class. It is a spreadsheet application. 

Got any particular task you will like me to create a tutorial around? Ask away!