Thursday, December 17, 2015

Is Data Science a Buzzword? aka: My first Coursera Course

Data science and data scientists are all the rage right now in the information technology space. Every company wants one; every job candidate claims to be one. But what does that actually mean to companies and potential employees? I decided to take a course on data science to see if I could find out!

My co-worker, Gabriella Melki, recommended the Coursera Data Science specialization by the Johns Hopkins Bloomberg School of Public Health. The entire specialization contains a set of 9 courses, but you can take each one individually. I started with the first course, called "The Data Scientist's Toolbox". Over the four-week timeframe, I was able to view lectures and complete the assignments at my own pace. Below are my thoughts on the course and what I learned about data science.

Week 1: Introduction to Data Science
Data science is about data, specifically using it to answer questions, and about science, following a method to discover an answer. A data scientist is the person who uses data to answer questions. Data scientists are in such demand because people either can't find the data they need or have so much data that it is difficult to wade through it all. There are a variety of tools available to help with this task. A common one is the R language, which is what the rest of the course used. The course teaches students to accomplish tasks on their own, for example by suggesting good places to research R functions (, anyone?) and recommending online forums like Stack Overflow for help with confusing errors. The first week wrapped up with an overview of all the courses in the specialization and a few questions to make sure you were paying attention.

Week 2: Learning the Tools
I used to program about 10 years ago but have focused more on SQL in recent years, so I had a bit of a learning curve when it came to the tools needed to work with R programs. Week 2 of the course walked the students through installing each of these tools and the basics of using them. Specifically, I learned about Git Bash, Git, GitHub, R packages, and RStudio. Although it seemed a little overwhelming at first, the course went step-by-step through each installation and explained the commands needed to use each tool. At the end of Week 2, we ran some of those commands to show that we had mastered them.

Week 3: Understanding Questions, Data, and Approach
As we learned in Week 1, data science is all about trying to answer questions with data. Based on the data you have and the answer you need, you may ask different questions and use different approaches. For example, you may just want to describe the data, make a prediction from it, or more. Data comes in all shapes and sizes: qualitative versus quantitative, large versus small, confounding versus predicting. The most important thing is to ask the right question first, design a logical experiment, and then investigate the data to find the answer. Be careful not to force your results through spurious correlations or data dredging! You don't want just any answer - you want the most accurate one. At the end of the third week, we completed a project to show our full understanding of the process and tools that we had learned over the past three weeks.
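To make the warning about data dredging concrete, here is a small sketch in Python (not from the course, which used R). It generates a purely random "target" column plus many equally random candidate columns, then reports the strongest correlation it can find. The function names, sample size, and number of candidates are all made up for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def dredge(n_samples=20, n_candidates=200, seed=0):
    """Return the strongest correlation found among pure-noise columns.

    Hypothetical illustration: every column is independent random noise,
    so any correlation that turns up is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson(candidate, target)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    print(f"Strongest correlation found in pure noise: {dredge():.2f}")
```

With only 20 samples and 200 candidates to pick from, the "best" correlation tends to look impressive even though nothing real is there, which is exactly why the question should come before the search.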

Week 4: Completing the Project
In Week 4, the last week of the course, we checked everyone else's work. This grading exercise was also interesting to me. They asked every student to check over the work of at least three other students. Then they assigned a grade based on those results. I'm guessing they have some way to verify the information - maybe kick out the highest/lowest values and average the rest? Or kick out the students' results with the most variety in their scores? Sounds like a good data science problem. I'm not sure of the answer, but I completed the course with a 100%!
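Just for fun, here is what my first guess (kick out the highest and lowest values and average the rest) might look like as a few lines of Python. This is only a sketch of my guess, not how Coursera actually grades; the function name and the fallback for fewer than three scores are my own assumptions.

```python
def trimmed_mean(scores):
    """Average peer scores after dropping one highest and one lowest value.

    Assumption for illustration: with fewer than three scores there is
    nothing sensible to drop, so the plain average is returned instead.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if len(scores) < 3:
        return sum(scores) / len(scores)
    kept = sorted(scores)[1:-1]  # drop the single lowest and single highest
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Four hypothetical peer grades, one of them an outlier.
    print(trimmed_mean([100, 90, 40, 95]))
```

In the made-up example, the harsh 40 and the generous 100 are both dropped, leaving the average of 90 and 95.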

If you're interested in taking the data science course described above, visit here:
