A central portal to navigate through my Data Science writings

Top Data Science Stories Leihua Ye, PhD, Top Writer, Machine Learning, Experimentation, Causal Inference, R, Python, SQL

Greetings! Welcome to my Data Science Blog.

First off, Happy Chinese New Year, and may the Year of the Ox be a year of happiness and prosperity.

My name is Leihua Ye. I wear multiple hats. I’m a Ph.D. researcher at the University of California, Santa Barbara by day and a Top Writer in Artificial Intelligence, Education, and Technology by night.

I’ve been on the platform for over a year and have published 50+ original posts across various niches under the Data Science umbrella, including Statistics, Experimentation & Causal Inference, Machine Learning, Programming (R, Python, and SQL), and Research Design.

This portal post serves you to…


How not to fail your online controlled experimentation

Experimentation and Causal Inference 8 Common Pitfalls of Running A/B Tests How to fail your online controlled experimentation

Online experimentation has become the industry standard for product innovation and decision-making. With well-designed A/B tests, tech companies can iterate on their products faster and provide better user experiences. Among FAANG, Netflix is the company most open about its experimental approach. In a series of posts, Netflix has covered how to improve experimentation efficiency, reduce variance, run quasi-experiments, tackle key challenges, and more.

Indeed, online controlled experiments offer a high level of internal validity because they hold all other external factors constant and allow only one factor (the treatment condition) to vary. Unlike other statistical tools (e.g., …


How to reduce the effects of confounding in observational data

Causal Inference using Observational Data An Ultimate Guide to Matching and Propensity Score Matching How to reduce the effects of confounding in observational data

Introduction

Randomized Controlled Trials (a.k.a. A/B tests) are the Gold Standard for identifying the causal relationship between an intervention and an outcome. An RCT’s high validity originates from its tight grip over the Data Generating Process (DGP) via randomization, which renders the experimental groups largely comparable. Thus, we can attribute any differences in the final metrics between the experimental groups to the intervention.

The downside is that RCTs are not always feasible in real-world scenarios for practical reasons. Companies may not have the experimentation infrastructure to facilitate large-scale tests. Or, high user interference may invalidate any results from individual-level randomization.
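To make the randomization argument concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the `simulate_ab_test` function, the "activity" trait, the effect size, and the noise levels are hypothetical and do not come from the original post. It shows that a coin-flip assignment leaves the two groups balanced on a pre-existing trait, so the difference in mean outcomes lands close to the true effect.

```python
import random
import statistics


def simulate_ab_test(n_users=10_000, true_effect=2.0, seed=42):
    """Simulate a randomized experiment.

    Returns (estimated effect, gap in the pre-existing trait between groups).
    All numbers here are made up for illustration.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_users):
        activity = rng.gauss(10, 2)        # pre-existing user trait (hypothetical)
        in_treatment = rng.random() < 0.5  # coin flip, independent of the trait
        lift = true_effect if in_treatment else 0.0
        outcome = activity + lift + rng.gauss(0, 1)
        (treated if in_treatment else control).append((activity, outcome))

    mean = statistics.fmean
    effect = mean(o for _, o in treated) - mean(o for _, o in control)
    balance_gap = mean(a for a, _ in treated) - mean(a for a, _ in control)
    return effect, balance_gap


effect, balance_gap = simulate_ab_test()
print(f"estimated effect: {effect:.2f}, trait gap between groups: {balance_gap:.2f}")
```

Because assignment ignores the trait, the trait gap should hover around zero, which is what lets us read the outcome difference as the effect of the intervention.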

Under…


Hi Howard, I tried your code and got the same results as yours: there is a 9.95% chance of getting 1, 18.4% of getting 2, 22.41% of getting 3, and 49.24% of getting 4.

For a short array like [1,2,3,4], the empirical probabilities of getting these numbers are reasonably accurate. Think about this: the chance of getting 4 out of the sum of the array is 4/(1+2+3+4) = 0.4.

If you have a much longer array, the empirical distribution will move closer to the theoretical distribution.
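For reference, here is a minimal sketch of the arithmetic, assuming each value is drawn with a weight equal to the value itself. It uses the standard library's `random.choices` rather than the code discussed in this thread, and the helper name `empirical_shares` is hypothetical. In this sketch, the empirical shares get closer to the theoretical ones as the number of draws grows.

```python
import random
from collections import Counter

values = [1, 2, 3, 4]

# Theoretical probability of each value: its weight divided by the total weight,
# e.g. 4 / (1 + 2 + 3 + 4) = 0.4 for the last element.
theoretical = [v / sum(values) for v in values]


def empirical_shares(values, n_draws, seed=0):
    """Draw n_draws weighted samples and return the observed share of each value."""
    rng = random.Random(seed)
    draws = rng.choices(values, weights=values, k=n_draws)
    counts = Counter(draws)
    return [counts[v] / n_draws for v in values]


print("theoretical:", theoretical)
for n in (100, 100_000):
    print(n, "draws:", empirical_shares(values, n))
```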

Hope it helps!


Hi Howard,

Thank you so much for running the code and catching the typo. The last line of code "return sequence[i]" had the wrong spacing, which is why the code returned only the first value.

The correct spacing should be in parallel to the for loop. I've fixed it, and the code should now return values according to their weights.

p.s. Honestly, nothing is better than when someone actually takes the time to run the code and spot a mistake in my writing. Thanks again!
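To illustrate the kind of indentation issue described here, below is a hypothetical reconstruction, not the code from the original post. The function name `weighted_pick`, its parameters, and the cumulative-weight approach are assumptions. The point is only where `return sequence[i]` sits: aligned with the `for` statement, it runs after the loop has found the right index; indented inside the loop body without a condition, it would exit on the first iteration and always return the first value.

```python
import random


def weighted_pick(sequence, weights, rng=random):
    """Pick one element of `sequence` with probability proportional to its weight."""
    threshold = rng.random() * sum(weights)  # a point in [0, total weight)
    cumulative = 0.0
    for i, weight in enumerate(weights):
        cumulative += weight
        if cumulative > threshold:
            break  # i now points at the selected element
    # Aligned with the `for` statement, so it runs once the loop has stopped.
    return sequence[i]


print(weighted_pick([1, 2, 3, 4], [1, 2, 3, 4]))
```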

Leihua


An advanced read for Data Scientists and Software Engineers

SQL Data Science Interviews Programming 2021

Structured Query Language, SQL, is the go-to programming language for retrieving and managing data. Pulling data effectively from a relational database is a must-have skill for any Data professional. For the past few months, I’ve been in close contact with Data Science Leaders, and one suggestion that comes up frequently is to write more and better SQL queries.

To track which users have been active, we use SQL.

To calculate a business metric, we use SQL.

To perform anything related to data retrieval and management, we use SQL.
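As a small illustration of the first use case, here is a sketch that runs a SQL query through Python's built-in `sqlite3` module. The `events` table, its columns, the sample rows, and the cutoff date are all hypothetical; they are not taken from the posts mentioned here.

```python
import sqlite3

# In-memory database with a made-up activity log.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2021-02-01"), (1, "2021-02-02"), (2, "2021-02-01"), (3, "2021-01-15")],
)

# Count distinct users with at least one event on or after the cutoff date.
active_users = conn.execute(
    """
    SELECT COUNT(DISTINCT user_id)
    FROM events
    WHERE event_date >= '2021-02-01'
    """
).fetchone()[0]

print("active users:", active_users)
```

Storing dates as ISO-formatted text keeps the string comparison in the `WHERE` clause consistent with chronological order, which is why the sketch uses that format.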

In two previous posts, I’ve introduced several fundamental SQL skills asked…


An essential data type for Data Scientists and Software Engineers in 2021

Data Science Interview How to Solve Python Coding Questions using Stack An essential Python data type for Data Scientists

Python is a versatile scripting language with wide applications in Artificial Intelligence, Machine Learning, Deep Learning, and Software Engineering. Its popularity owes much to the variety of data types it supports.

A dictionary is the natural choice if we have to store key and value pairs, as in today’s Question 5. Strings and lists are a pair of twin sisters that work together to solve string manipulation questions. A set holds a unique position because it does not allow duplicates, a feature that lets us identify the repetitive and non-repetitive items. Well, tuple is the least frequently asked…
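Here is a minimal sketch of the set and dictionary ideas above. The helper `split_by_repetition` and the sample input are hypothetical and are not one of the interview questions from the post. A dictionary-like `Counter` maps each item to its count, and set operations separate the repeated items from the ones that appear only once.

```python
from collections import Counter


def split_by_repetition(items):
    """Return (repeated, unique): items seen more than once vs. exactly once."""
    counts = Counter(items)  # dict-like mapping: item -> number of occurrences
    repeated = {item for item, count in counts.items() if count > 1}
    unique = set(items) - repeated  # set difference drops the repeated items
    return repeated, unique


repeated, unique = split_by_repetition(list("mississippi"))
print("repeated:", sorted(repeated))
print("unique:", sorted(unique))
```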


An essential coding skill for Data Scientists and Software Engineers in 2021

Python coding interviews come in different shapes and forms, and each type has its unique characteristics and approaches. For example, String Manipulation questions expect candidates to have a solid grasp of element retrieval and access. Data Type Switching questions test your understanding of the tradeoffs and unique traits of each type.

However, the math question is different. There is no consistent way of testing. Instead, you have to spot the data pattern and code it up in Python, which sounds daunting at first but is totally doable after practice.

In this post, I elaborate and live-code 5 real interview questions…


Train your Python coding muscles using different weights

Python String Manipulation for Data Scientists in 2021 Train your Python coding muscles using different weights Data Science

Array and string manipulation are among the most heavily tested topics in Data Science and Software Engineering interviews. These questions are among the best tests of a candidate’s ability to think programmatically and code fluently. To perform well, we have to be familiar with the basic operations of arrays/strings, matrices and their row/column structures, and Python syntax.

In two similar blog posts, I’ve touched upon the basics and live-coded several real interview questions.

In today’s post, let’s try something different. As suggested in a post by Emma Ding (Data Scientist at Airbnb) and Rob Wang (Data Scientist at Robinhood), we…


Winning in 2021: a must-read for data scientists/engineers, Part 2

Crack Data Science Interviews: Essential Statistics Concepts Winning in 2021: a must-read for data scientists/engineers

Introduction

Data Science Interviews cover a wide range of topics, and interviewers frequently ask us to explain the most fundamental concepts. They are more likely to ask why you would choose L1 over L2 than to ask you to build a Machine Learning algorithm from scratch.

My Data Science professional network has told me repeatedly that they do not expect job candidates to know every algorithm. Instead, they expect a high level of familiarity with the fundamentals. It makes total sense. You can quickly pick up a new algorithm once you have a solid foundation.

Statistics and Machine Learning are inseparable twins, and these…

Leihua Ye, Ph.D. Researcher

PhD @ University of California. Top Writer | Machine Learning | Data Science | Experimentation & Causal Inference www.linkedin.com/in/leihuaye
