Knowing how to set up and conduct a hypothesis test is a critical skill for any aspiring data scientist. Making sense of alpha, beta, power, and type I and type II errors can feel confusing at first. My goal in this article is to help you build intuition and provide some visual references.
First, let’s envision setting up a standard A/B experiment where the A group is the control and B is the experimental group. Our null hypothesis is that the two groups are equal and the change applied to group B had no effect…
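To make the setup concrete, here is a minimal sketch of one common way to test that null hypothesis: a two-sided two-proportion z-test. The teaser does not say which test the article uses, so the choice of test, the function name `two_proportion_z_test`, the conversion counts, and the alpha of 0.05 are all assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (hypothetical helper)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null hypothesis both groups share one rate, so pool them.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts: 200 of 2000 convert in control (A), 250 of 2000 in experiment (B).
alpha = 0.05  # accepted type I error rate, chosen before looking at the data
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
reject_null = p_value < alpha
print(f"z={z:.2f}, p={p_value:.4f}, reject null: {reject_null}")
```

Rejecting the null when it is actually true would be a type I error, and alpha is the rate of that error we agree to tolerate.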
It’s not enough to be good at statistical tests, machine learning, or coding. These technical skills are, of course, essential to being good at data science. But it’s possible to know all the technical things and still be considered a terrible data scientist. You also need the soft skills and business knowledge to work effectively with others across functions, communicate results, and really understand the problems you are trying to solve. Having some business acumen is going to make you a much more effective data scientist.
Do you understand how your work connects to the larger whole of…
In honor of Women’s History Month, I wanted to do an exploratory data analysis (EDA) project related to gender equality. With Equal Pay Day upon us in the United States, I immediately thought about the pink tax. After all, nothing boils the blood quite like the combination of getting paid less and not having your dollar go as far, all simply for being a woman.
The pink tax is the tendency for products marketed to women to be more expensive than equivalent products for men. Have you ever noticed that the pink razors cost a few cents extra? Or that women’s…
SQL (Structured Query Language) is an important skill for many data scientists. The language is very flexible and reads a lot like regular English. It allows for easy access to even the most complex table structures in a database. After all, what good is data if you can’t access it? Many jobs on the market call for SQL knowledge, so it’s definitely a smart idea to learn at least some basics. …
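As a small illustration of how closely a query can read like English, here is a sketch that runs one against an in-memory SQLite database from Python. The `customers` table, its columns, and the rows are invented for this example and do not come from the article.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, total_spent REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ana", "Austin", 120.0), ("Ben", "Boston", 45.5), ("Cy", "Austin", 80.0)],
)

# Read it aloud: select the name and total spent from customers
# where the city is Austin, ordered by total spent, highest first.
query = """
    SELECT name, total_spent
    FROM customers
    WHERE city = 'Austin'
    ORDER BY total_spent DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```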
Recently I was using K-Means in a project and decided to see what other options were out there for clustering algorithms. I always find it enjoyable to sink my teeth into expanding my data science skill set. I decided to write this article to share what I discovered on my quest to broaden my clustering knowledge to include Gaussian Mixture Models.
When you hear of this technique, you may think of the Gaussian distribution (also called the normal distribution). That’s exactly what this clustering technique is based on. …
An AUC ROC (Area Under the Curve, Receiver Operating Characteristic) plot can be used to visualize the trade-off between a model’s sensitivity and specificity. Sensitivity refers to the ability to correctly identify entries that fall into the positive class. Specificity refers to the ability to correctly identify entries that fall into the negative class. Put another way, an AUC ROC plot can help you identify how well your model is able to distinguish between classes.
In real-world problems, there is often overlap between classes, which means catching all true negatives and all true positives can be a trade-off. …
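The definitions of sensitivity and specificity above can be sketched in a few lines of plain Python. The labels, scores, and thresholds here are made up, and `sensitivity_specificity` is a hypothetical helper name, not something from the article. The classes overlap slightly, so moving the threshold trades one metric for the other.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold count as positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of actual positives identified
    specificity = tn / (tn + fp)  # share of actual negatives identified
    return sensitivity, specificity


# Made-up data: one positive scores below one negative, so the classes overlap.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.35, 0.5, 0.65):
    sens, spec = sensitivity_specificity(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

With this toy data, the lowest threshold catches every positive but lets in more false positives, and the highest threshold does the reverse. An ROC curve traces that same trade-off across all thresholds.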
When I start a new classification project, I always take some time to sit down with myself, the data, and my business case to ask an important question: what does it mean to have a “successful” model? In this article I attempt to help you think through some different scoring metrics and which might be right for your modeling project, but this is by no means an exhaustive list.
Oftentimes, the accuracy of the model is thought of as the most basic or standard scoring metric. Accuracy is based on how many predictions the model got correct…
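As a quick sketch of that idea, accuracy can be computed as the number of correct predictions divided by the total number of predictions. The function name and the labels below are hypothetical and only meant to show the arithmetic.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: the model gets 4 of 5 predictions right.
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]
score = accuracy(y_true, y_pred)
print(score)
```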
Understanding some basics of how Python works under the hood can help you code with more confidence. I want to share with you three things about Python that you may find useful, especially if you are new to the language.
You may have heard people say, “everything in Python is an object.” Well, if you are new to Python, this can feel really confusing. Everyone says it, but what does it really mean? …
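One way to poke at that claim yourself is the short sketch below. The names `describe` and `greet` and the sample values are made up for illustration. It checks that numbers, strings, lists, functions, classes, and even `None` all have a type and all pass an `isinstance(..., object)` check.

```python
def describe(thing):
    """Return the name of the type of any value."""
    return type(thing).__name__


def greet():
    return "hi"


examples = [42, "text", [1, 2], greet, int, None]
for item in examples:
    # Every value has a type and is an instance of object.
    print(describe(item), isinstance(item, object))

# Because functions are objects, they can be bound to another name
# and can carry attributes like any other object.
alias = greet
greet.note = "functions can hold attributes"
print(alias(), greet.note)

# Even a plain integer has methods.
print((5).bit_length())
```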
When I first started with data science, I was amazed at all the beautiful plots that could be made so easily with packages like Seaborn or Plotly Express. But there came a point where I was working on a project and realized the perfect EDA plot would show the percentage of entries in my data that were in the different target classes, split out by a categorical feature. After some scouring through documentation, galleries, and Stack Overflow pages, I realized that there was no canned plot that did what I wanted. In this article, I’m going to…
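The numbers behind a plot like that can be sketched with the standard library alone. This is not the article's plotting approach, which the teaser cuts off before describing. The helper `class_percentages`, the `plan` feature, the `churned` target, and the rows are all invented for illustration.

```python
from collections import Counter, defaultdict


def class_percentages(rows, feature, target):
    """For each category of `feature`, return the percentage of rows in each target class."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1

    result = {}
    for category, class_counts in counts.items():
        total = sum(class_counts.values())
        result[category] = {
            cls: 100 * n / total for cls, n in class_counts.items()
        }
    return result


# Made-up data: a categorical feature ("plan") and a target class ("churned").
rows = [
    {"plan": "basic", "churned": "yes"},
    {"plan": "basic", "churned": "no"},
    {"plan": "basic", "churned": "no"},
    {"plan": "basic", "churned": "no"},
    {"plan": "premium", "churned": "yes"},
    {"plan": "premium", "churned": "no"},
]
percentages = class_percentages(rows, "plan", "churned")
for category, by_class in percentages.items():
    print(category, by_class)
```

Each inner dictionary sums to 100, so the values could feed one bar per category, stacked or grouped by target class, in whatever plotting package you prefer.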
Data scientist with a background in biology and health tech interested in using data for projects that improve lives. GitHub @HeyThatsViv