Tuesday, May 9, 2017

Cluster and Word Cloud Analysis of Tweets About Buhari Today Using R


Last week I wrote about doing a sentiment analysis of tweets about the President using SentiWordNet and VADER in Python. Today, I switched to R and did a cluster analysis and a word cloud of the tweets about President Buhari.



Below are the steps I took:
  1. I imported all the necessary libraries.
  2. I connected to Twitter and created a search stream to gather tweets about Buhari.
  3. I saved the results in a CSV file with append set to true so I can keep piling up the search results from different times of the day. Then I removed punctuation and stopwords.
  4. I extracted the most frequently used words and created a word cloud from them. Lastly, I did a clustering of the words.
The require statements that are struck out are for libraries I didn't use but forgot to take out before the screenshot.
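
For readers who cannot see the screenshot, here is a rough sketch of steps 3 and 4 in base R only. It is not the code from the screenshot: the sample tweets, the tiny stopword list and the helper name clean_tweet are all made up for illustration, and the Twitter connection is left out entirely.

```r
# Hypothetical sample tweets standing in for the saved search results
tweets <- c(
  "Buhari returns to Abuja today!",
  "The President returns, Abuja welcomes Buhari.",
  "Is the economy improving under the President?",
  "Economy news: the naira is improving today"
)

# A tiny illustrative stopword list (an assumption, not a standard list)
stop_words <- c("the", "to", "is", "under")

# Lowercase, strip punctuation, split into words and drop stopwords
clean_tweet <- function(text) {
  text <- tolower(gsub("[[:punct:]]", "", text))
  words <- strsplit(text, "\\s+")[[1]]
  words[!(words %in% stop_words) & nzchar(words)]
}
word_lists <- lapply(tweets, clean_tweet)

# Word frequencies across all tweets, most frequent first
word_freq <- sort(table(unlist(word_lists)), decreasing = TRUE)

# Word-by-tweet count matrix: one row per word, one column per tweet
terms <- names(word_freq)
term_matrix <- sapply(word_lists, function(w) table(factor(w, levels = terms)))
rownames(term_matrix) <- terms

# Hierarchical clustering of the words by the tweets they appear in
word_clusters <- hclust(dist(term_matrix))
# plot(word_clusters) would draw the cluster dendrogram
```

The word_freq table is the kind of input a word cloud is drawn from, and the hclust result is what gets plotted as the cluster diagram.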


Below is a screenshot of the scraped tweets.



Tuesday, May 2, 2017

Data Types and Data Structures In R

R recognizes four main data types:
  1. Numeric values: These are number values which can have decimal parts. Examples are 55, 27.8 and 100.255
  2. Integer values: These are number values with no decimal parts. To differentiate them from Numeric values, when manually inputting them in R you append the letter "L" to the number. So you'll write 5 as 5L, 10 as 10L and 50 as 50L to make R recognize them as Integer values rather than treat them as Numeric values.
  3. Character values: These are text values. You surround them with quotes when manually inputting them into R. You should note that if you input number values in R but surround them with quotes, R will recognize them as Character and not Numeric values. Examples are "Michael", "Data" and "200".
  4. Logical values: These are TRUE and FALSE (must always be in CAPS). You enter them in R without quotes, unlike Character values. Also, when you carry out comparison operations in R (often called logical operations), the results are logical values.
Besides these four common ones that you have to be very familiar with and will extensively use in your data analysis work in R, there are two other less common data types: complex values and raw values. I won't bother discussing them because I don't see much real life practical use for them.
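
A quick way to check the four types above is R's class() function. The snippet below is a small sketch; the variable names are just made up for illustration.

```r
price  <- 27.8         # has a decimal part, so Numeric
count  <- 5L           # the L suffix makes it an Integer
plain  <- 5            # no L suffix, so R treats it as Numeric
code   <- "200"        # quoted, so Character even though it looks like a number
is_big <- price > 10   # a comparison gives a Logical value

class(price)    # "numeric"
class(count)    # "integer"
class(plain)    # "numeric"
class(code)     # "character"
class(is_big)   # "logical"
```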

Above the layer of data types, we have data structures in R. These are the different standard ways you can organize your data in R, and there are six of them.



  1. Vectors: These are the most basic data structure in R and the first you should be familiar with. Most of the other data structures are built on top of vectors, so a proper understanding of vectors provides a fundamental advantage when using the ones built on them. A vector is a collection of values of the same data type. A very common way to create vectors in R is to use the combine function c(). For example, c(2,6,7,9) creates a vector that holds the values 2, 6, 7 and 9. You can call out an element by providing its position number from the left in square brackets. So to call out the value 6, if I assign the previous vector to a variable called sample_vector, I can write sample_vector[2].
  2. Factors: These are vectors that hold categorical values. They hold what in statistics is called nominal data: data that represents different categories. R keeps track of the distinct categories as the factor's levels. An example of a factor is factor(c("Lagos","New York","Sydney","London")).
  3. Lists: These are a more advanced data structure than vectors and factors. They allow for storage of values of different data types and allow you to give each value a name you can reference. Examples of a list are list("Michael",21,"Lagos",FALSE) and list(name="Michael", age=21,city="Lagos",married=FALSE). It is also possible to create a list of lists.
  4. Data Frames: These are tabular representations of data, very similar to the way regular databases (SQL tables) and spreadsheets (Excel tables) present data. They allow you to reference the values by row and column address. They are a powerful data structure often used for analysis of large records. An example is data.frame(name = c("John","Michael","Tunde"), age=c(32,21,28), city = c("Abuja","Lagos","Kaduna"), married = c(TRUE,FALSE,TRUE)).
  5. Matrices: Matrices are two-dimensional representations of values of the same data type. The values can also be addressed by their row and column position. An example of a matrix is matrix(c(4,5,6,7,8,9), nrow=3).
  6. Arrays: Arrays are multi-dimensional tables. They are not limited to two dimensions like the matrix; they can take as many dimensions as desired. An example of an array is array(c(1,2,3,4,5,6,7,8), dim=c(2,2,2)). This is a three-dimensional array.
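
To tie the six structures together, here is a short sketch that builds each example from the list above and pulls one value out of it. Apart from sample_vector, the variable names are my own choice for illustration, and stringsAsFactors = FALSE is set explicitly so the text columns stay as Character values on older versions of R too.

```r
# 1. Vector: values of the same type, indexed by position from the left
sample_vector <- c(2, 6, 7, 9)
sample_vector[2]        # 6

# 2. Factor: categorical data, with each distinct category stored as a level
cities <- factor(c("Lagos", "New York", "Sydney", "London"))
nlevels(cities)         # 4

# 3. List: mixed types, each value with a name you can reference
person <- list(name = "Michael", age = 21, city = "Lagos", married = FALSE)
person$age              # 21

# 4. Data frame: a table addressed by row and column
people <- data.frame(name = c("John", "Michael", "Tunde"),
                     age = c(32, 21, 28),
                     city = c("Abuja", "Lagos", "Kaduna"),
                     married = c(TRUE, FALSE, TRUE),
                     stringsAsFactors = FALSE)
people[2, "city"]       # "Lagos"

# 5. Matrix: filled column by column, so column 1 is 4,5,6 and column 2 is 7,8,9
sample_matrix <- matrix(c(4, 5, 6, 7, 8, 9), nrow = 3)
sample_matrix[2, 2]     # 8

# 6. Array: three dimensions here (row, column, layer)
sample_array <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 2, 2))
sample_array[1, 2, 2]   # 7
```

Note that matrix() and array() fill their values down the columns first, which is why the 8 ends up in row 2, column 2 of the matrix.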