As the number of novel coronavirus cases keeps rising around the world, I thought: why not use data analysis to look at the COVID-19 cases in different countries? This project uses two datasets to see whether there is any relationship between the spread of the virus in a country and how happy the people living in that country are.
Exploratory Data Analysis
So, first things first: the tools and libraries we are going to use in this project are NumPy, pandas, Seaborn, and Matplotlib. We are also going to use two datasets. The first is the COVID-19 dataset published by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), which contains the cumulative number of confirmed cases per day in each country (up to July 21, 2020). The second is the World Happiness Report 2019, published by the Sustainable Development Solutions Network, which contains various life factors scored by the people living in each country around the globe. Both datasets are available in my GitHub repository. I’ll provide the link at the end of the post.
So, we’ll start by importing all the libraries. If any of the libraries aren’t installed on your machine, use the pip or conda command to do it and then import them.
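As a minimal sketch, the imports could look like the block below. The aliases (np, pd, sns, plt) are just the usual community conventions, not something the datasets require.

```python
# Install first if needed (run in a terminal, not inside Python):
#   pip install numpy pandas seaborn matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Printing the versions is a quick way to confirm the imports worked
print(np.__version__, pd.__version__, sns.__version__)
```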
We then import our first dataset of confirmed COVID-19 cases using the read_csv function in pandas. We also check the shape of the dataset, which turns out to be (266, 186), i.e. 266 rows and 186 columns.
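Here is a rough sketch of this step. To keep it runnable without the real file, it reads a tiny made-up sample laid out the way I assume the JHU file is (one row per region, one column per date). The column labels, the placeholder country names, and the file name in the comment are all assumptions for illustration.

```python
import io
import pandas as pd

# Tiny made-up sample; the numbers and names are placeholders, not real data
sample_csv = io.StringIO(
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20\n"
    ",Country A,10.0,20.0,0,0,1\n"
    "Region B1,Country B,30.0,110.0,5,7,12\n"
    "Region B2,Country B,40.0,115.0,1,2,4\n"
)

# With the real dataset you would pass the path of the CSV file instead,
# e.g. pd.read_csv("confirmed_cases.csv")  # hypothetical file name
corona_df = pd.read_csv(sample_csv)

# shape is a (rows, columns) tuple
print(corona_df.shape)
```

On the real file the same shape check is what gives (266, 186).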
Our next task is to clean the data. First of all, we delete columns such as Lat and Long, which are of no use to us. After that, we aggregate the rows by country, because our other dataset is organized by country. After aggregating, we check the shape of the new dataset, and now it’s (188, 182).
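The cleaning step can be sketched like this on a small made-up frame. The column labels are assumed, and dropping the Province/State text column explicitly is my own choice here so that only numeric columns get summed.

```python
import pandas as pd

# Made-up sample in the assumed layout; names and numbers are placeholders
corona_df = pd.DataFrame({
    "Province/State": [None, "Region B1", "Region B2"],
    "Country/Region": ["Country A", "Country B", "Country B"],
    "Lat": [10.0, 30.0, 40.0],
    "Long": [20.0, 110.0, 115.0],
    "1/22/20": [0, 5, 1],
    "1/23/20": [0, 7, 2],
    "1/24/20": [1, 12, 4],
})

# Drop the columns we do not need for a per-country analysis
cleaned = corona_df.drop(columns=["Lat", "Long", "Province/State"])

# One row per country: regions of the same country are added together,
# and the country name becomes the index
by_country = cleaned.groupby("Country/Region").sum()

print(by_country.shape)
```

After this step only the date columns remain as columns, since the country name has moved into the index.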
After this, we visualize the data for some countries such as India, China, and the US, and find the maximum infection rate for every country. To do that, we take the first derivative of each country’s cumulative curve (the day-to-day change in confirmed cases) and then apply the max method. Then, we create another data frame that has only the Maximum Infection Rate for each country.
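A small sketch of the "first derivative, then max" idea, using pandas diff on made-up cumulative counts. The column name max_infection_rate is a hypothetical label I picked for the new data frame.

```python
import pandas as pd

# Made-up cumulative case counts, one row per country, one column per day
by_country = pd.DataFrame(
    {"1/22/20": [0, 6], "1/23/20": [0, 9], "1/24/20": [1, 16], "1/25/20": [4, 20]},
    index=pd.Index(["Country A", "Country B"], name="Country/Region"),
)

# First derivative of the cumulative curve: new cases per day.
# The first day has no previous value, so it becomes NaN.
daily_new = by_country.diff(axis=1)

# Largest single-day increase per country (NaN is skipped by default)
max_infection = daily_new.max(axis=1).to_frame(name="max_infection_rate")

print(max_infection)
```

To look at a single country before aggregating, by_country.loc["Country A"].plot() draws its cumulative curve, and adding .diff() before .plot() draws the daily new cases instead.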
At this point, we are done cleaning our first dataset of confirmed COVID-19 cases and can move on to the second dataset, which we first import using the read_csv method. Then we drop some unnecessary columns such as Overall Rank, Score, Generosity, and Perceptions of corruption. We also change the index to Country or region using the set_index method.
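Roughly, the preparation of the happiness data could look like the sketch below. The frame is a made-up stand-in for the real file, and the column labels are my assumption about how the CSV spells them; since labels are case-sensitive, it is worth copying them from happiness_df.columns on the real data.

```python
import pandas as pd

# Made-up stand-in for the happiness report; all values are placeholders
happiness_df = pd.DataFrame({
    "Overall rank": [1, 2, 3],
    "Country or region": ["Country A", "Country B", "Country C"],
    "Score": [7.5, 7.1, 6.8],
    "GDP per capita": [1.3, 1.2, 0.9],
    "Social support": [1.5, 1.4, 1.2],
    "Healthy life expectancy": [0.9, 0.9, 0.7],
    "Freedom to make life choices": [0.6, 0.5, 0.4],
    "Generosity": [0.2, 0.3, 0.1],
    "Perceptions of corruption": [0.4, 0.3, 0.1],
})

# Columns we do not need for this analysis
useless_cols = ["Overall rank", "Score", "Generosity", "Perceptions of corruption"]

# Drop them and index the rows by country so the two datasets line up
happiness = happiness_df.drop(columns=useless_cols).set_index("Country or region")

print(happiness.head())
```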
So, at this point we have our final datasets, which are ready to be merged and visualized. The final datasets now look like this:
Now we join the two datasets using the join method and compute the correlation matrix. A correlation matrix is a table of correlation coefficients between variables: each cell shows the correlation between two of them.
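A minimal sketch of the join and the correlation matrix on made-up data. Using how="inner" is an assumption on my part: it keeps only the countries that appear in both tables, which seems the natural choice when the country lists do not match exactly.

```python
import pandas as pd

# Made-up inputs; both are indexed by country name
max_infection = pd.DataFrame(
    {"max_infection_rate": [3, 7, 40, 120, 9]},
    index=pd.Index(
        ["Country A", "Country B", "Country C", "Country D", "Country E"],
        name="Country/Region",
    ),
)
happiness = pd.DataFrame(
    {
        "GDP per capita": [0.4, 0.7, 1.0, 1.4, 1.1],
        "Social support": [0.8, 1.0, 1.3, 1.5, 1.2],
    },
    index=pd.Index(
        ["Country A", "Country B", "Country C", "Country D", "Country F"],
        name="Country or region",
    ),
)

# Join on the index; countries missing from either side are dropped
data = max_infection.join(happiness, how="inner")

# Pairwise Pearson correlation between all numeric columns
corr = data.corr()

print(corr)
```

The diagonal of the result is always 1, since every variable is perfectly correlated with itself, and the table is symmetric.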
Our analysis isn’t finished until we visualize the results as figures and graphs, so that everyone can understand what we got out of it. Here I am showing the visualization of GDP per capita vs. Maximum Infection Rate. The other graphs can be found in my GitHub repository. For the visualizations, we used the Seaborn library.
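One way such a plot could be drawn with Seaborn is sketched below on made-up numbers. Taking the log of the infection rate is my assumption, made because the rates can span several orders of magnitude; regplot is used because it draws the scatter points together with a fitted line, which makes the slope easy to read.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up joined data; the values are placeholders, not real results
data = pd.DataFrame({
    "GDP per capita": [0.4, 0.8, 1.0, 1.3, 1.5],
    "max_infection_rate": [50, 400, 900, 6000, 20000],
})

x = data["GDP per capita"]
# Log scale (assumption): compresses the very wide range of infection rates
y = np.log(data["max_infection_rate"])

# Scatter plot plus a fitted regression line
ax = sns.regplot(x=x, y=y)
ax.set_xlabel("GDP per capita")
ax.set_ylabel("log(max infection rate)")
```

In a notebook the figure shows up on its own; in a plain script, call plt.show() or plt.savefig(...) afterwards.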
We can clearly see that the graph above has a positive slope: countries with a higher GDP per capita tend to have a higher maximum infection rate. So, we came to the conclusion that people living in more developed countries are more prone to getting infected by the novel coronavirus than those living in less developed countries. However, this pattern may also be due to the lack of coronavirus tests in less developed countries, where fewer cases get confirmed. To check whether that is the case, we can perform a similar analysis on a dataset of the cumulative number of deaths.