Data Science

Retro Game Retrieval Engine Design

I’ve got a new Shiny web app that I’ve embedded on another site where I’m doing some experimental things, and I wanted to talk generally about how I created it. The web app, available at the following link, lets the user interactively search for similar classic home console games from what are generally known as the third generation (NES, Sega Master System) through the sixth generation (Wii, PS2, Xbox).

Continue reading

A New Introduction to Spark 2.1 Dataframes with Python and MLlib

A couple of years ago, when I was in the midst of my rookie year as a data scientist, I wrote a blog post and tutorial about using the Python Spark API to build a simple model from housing data with Spark dataframes. Despite the simple nature of the model (a straight train-test split with multivariate linear regression), it was one of the more challenging tutorials I’ve ever written for this blog.

Continue reading

Minivan Price Comparison With R

With my family growing once again and my 13-year-old Mazda Protégé on the fritz, I recently decided it was time to go minivan shopping. A frugal shopper (some might say cheap), I quickly set my focus on the used, domestic market and found that there are only two competitors here, the Dodge Grand Caravan and the Chrysler Town and Country. Two questions immediately came to mind: As these two minivans are, for all practical purposes, identical (manufactured at the same facility, same internals, just different branding), if one compared them with a similar set of features, does one name carry a price premium over the other?

Continue reading

Databricks Review

Not too long ago, I did my first post on Apache Spark, a Spark dataframes tutorial. I’ve continued to experiment with Spark since taking my first tentative steps with it just a few months ago. One of the challenges with Spark is that it has a reputation for being difficult to deploy at scale. Stepping in to try to solve that problem is Databricks. Databricks offers the ability for corporations to deploy an optimized Spark via the cloud with some very nice extra bells and whistles.

Continue reading

Spark Dataframes and MLlib

NOTE: I have created an updated version of my Python Spark Dataframes tutorial that is based on Spark 2.1 and uses an easier, updated Spark ML API. I would encourage readers to check that out over this older post. A couple of months ago, I got my first experience with Apache Spark. While I am just starting to use it to solve meaningful problems, in my experience when working with a new tool or technology, just getting one’s feet wet can be crucial to getting a learning snowball rolling.

Continue reading

Favorite Podcasts for Data Scientists

One of my favorite learning methods is via podcasts. They allow me to multitask–exercising, driving, or doing chores–while listening to experts on a particular topic. Some of the podcasts I listen to are purely for entertainment (think Serial or StartUp) but many others are for educational purposes. As I’ve been trying to build up my data science awareness in a variety of areas, I’ve been putting together a list of podcasts specific to data science.

Continue reading

Best R Tutorial Sites

There’s no doubt that the ability to analyze data and do predictive modeling by programming in R is a very valuable skill, whether you are looking to learn it for a college statistics class or for one of the many great jobs that use R. If you are trying to get started on your own, however, you may find it a little tricky. While there are tons of sites in the Codecademy model to get started with certain languages like JavaScript, PHP, CSS, or HTML, there are fewer options for getting started with R.

Continue reading

Reproducible Research Coursera Review

The fifth course in the Johns Hopkins Data Science Specialization on Coursera is Reproducible Research. This is the third and final course in the sequence taught by Roger Peng. Among the first five courses in the specialization, Reproducible Research is the one (aside from The Data Scientist’s Toolbox) where I spent the least time learning new R code. Instead, the emphasis of this course was more philosophical in nature: writing up your research findings so that they can be shared with others and considered reproducible, though not necessarily replicable.

Continue reading