A central portal to navigate through my Data Science writings

Top Data Science Stories Leihua Ye, PhD, Top Writer, Machine Learning, Experimentation, Causal Inference, R, Python, SQL
Photo by Jivko Iordanov on Unsplash

Greetings! Welcome to my Data Science Blog.

First off, Happy Chinese New Year, and may the Year of the Ox be a year of happiness and prosperity.

My name is Leihua Ye, and I wear multiple hats: I’m a Ph.D. researcher at the University of California, Santa Barbara by day and a Top Writer in Artificial Intelligence, Education, and Technology by night.

I’ve been on the platform for over a year and have written 50+ original posts across various niches under the Data Science umbrella, including Statistics, Experimentation & Causal Inference, Machine Learning, Programming (R, Python, and SQL), and Research Design.

This portal post serves you to…

Getting Started, Experimentation and Causal Inference

How not to fail your online controlled experimentation

Experimentation and Causal Inference: 8 Common Pitfalls of Running A/B Tests. How to fail your online controlled experimentation
Photo by Rolf Blicher Godfrey on Unsplash

Online experimentation has become the industry standard for product innovation and decision-making. With well-designed A/B tests, tech companies can iterate on their product lines faster and provide better user experiences. Among FAANG, Netflix is the company most open about its experimental approach. In a series of posts, Netflix has discussed how to improve experimentation efficiency, reduce variance, run quasi-experiments, address key challenges, and more.

Indeed, online controlled experiments offer a high level of internal validity because they control for all other external factors and allow only one factor (the treatment condition) to vary. Unlike other statistical tools (e.g., …

Experimentation and Causal Inference

A statistical approach to A/A tests

Experimentation and Causal Inference: A Statistical Approach to A/A Tests. What it is? Why do you need? How to do it?
Photo by Andy Salazar on Unsplash


A rigorous process of experimentation, aka A/B testing, has become widely adopted in the tech sector. As early adopters, FAANG companies have incorporated experimentation into their decision-making process.

For example, Microsoft Bing conducts A/B tests on 80% of its product changes. Google resorts to experimentation to identify top-performing candidates in the interview process. Netflix improves personalization algorithms using interleaving, a pairwise experimental design.

The increased adoption of experimentation originates from its high level of internal validity, which is further determined by two factors. First, data scientists are selective with the overall research design and model selection…

Experimentation and Causal Inference

Best practices that data scientists should follow pre-, during-, and after- experiments

Photo by niko photos on Unsplash


Randomized Controlled Trials (RCTs, aka A/B tests) are the gold standard for establishing causality. RCTs strictly control the randomization process and ensure that covariates are equally distributed across groups before rolling out the treatment. Thus, we can attribute the mean difference between the treatment and control groups to the intervention.
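To make the "mean difference" idea concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the function name `simulate_ab_test`, the normally distributed baseline metric, the 50/50 coin-flip assignment, and the true lift of 0.5 are all hypothetical and do not come from the original post.

```python
import random
from statistics import mean


def simulate_ab_test(n_users=10_000, true_lift=0.5, seed=42):
    """Randomly assign simulated users and return the difference in group means."""
    rng = random.Random(seed)
    treatment, control = [], []
    for _ in range(n_users):
        # Hypothetical baseline metric, drawn the same way for every user.
        baseline = rng.gauss(10.0, 2.0)
        # Coin-flip assignment: on average, both groups look alike before treatment.
        if rng.random() < 0.5:
            treatment.append(baseline + true_lift)
        else:
            control.append(baseline)
    # Under randomization, this difference estimates the treatment effect.
    return mean(treatment) - mean(control)


estimated_lift = simulate_ab_test()
print(f"Estimated lift: {estimated_lift:.3f}")
```

Because assignment is random, the two groups differ only by the treatment on average, so the printed estimate should land close to the simulated lift, with some sampling noise.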

A/B tests are effective and rely only on mild assumptions, the most important of which is the Stable Unit Treatment Value Assumption (SUTVA). It states that the treatment and control units don’t interact with each other; otherwise, the interference leads to biased estimates. …

Data Structure and Algorithm

An essential algorithm for Data Scientists

Data Structure and Algorithm: Why Should Every Data Scientist Master Dynamic Programming? An essential algorithm for Data Scientists
Photo by Birger Strahl on Unsplash


Data Science is no longer a purely analytical field in today’s job market; it requires extensive hands-on experience in programming and engineering. Data scientists are teaming up with the engineering team to build the infrastructure pipeline, in addition to their normal obligations like model development and data analysis. A deep understanding of programming speeds up the production timeline and reduces friction.

Python is widely used in the Data Science and Software Engineering communities. …

Data Structure and Algorithm

LeetCode your way to a top-paid data science position

Crack Data Science Interview: Master Data Type Dictionary in Python from Zero to Hero, Part 2. LeetCode your way to a data science position
Photo by Lewis Keegan on Unsplash


Python is a popular scripting programming language that offers various data structures, including array, set, stack, string, dictionary, heap, etc. They possess idiosyncratic characteristics and serve different goals. Therefore, we should choose the data type that best fits our needs.

Like JavaScript and others, Python also offers hash tables that store “the index value of the data element that is generated from a hash function.” Thus, it makes data access and retrieval much faster as the key values become the index, or identifier, of the values. …
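As a small illustration of key-based lookup, the sketch below counts word frequencies with a Python dictionary. The helper name `count_words` and the sample list are hypothetical and are not taken from the original post.

```python
def count_words(words):
    """Count how often each word appears, using a dict as a hash table."""
    counts = {}
    for word in words:
        # dict.get returns the current count, or 0 if the key is not stored yet.
        counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words(["sql", "python", "sql", "r"])
# Looking up a key hashes it directly instead of scanning every stored item.
print(counts["sql"])
```

Each lookup and update takes constant time on average, which is why a dictionary is a common first choice when a problem asks for counts or membership checks.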

Crack Data Science Interview

LeetCode your way to a top-paying data position

Crack Data Science Interviews: Five SQL Skills for Data Scientists. LeetCode your way through a top-paying data position
Photo by Zetong Li on Unsplash


Structured Query Language (SQL) is the go-to programming language that data practitioners use to retrieve data stored in a relational database. Writing effective queries is no longer considered a nice-to-have but an essential skill for data scientists. The trend is evident in the specific inclusion of SQL experience in DS job postings and the interview loop.

In addition to Programming (Python), Machine Learning, A/B Tests, and Statistics, Data Scientists are frequently tasked with defining and pulling data from multiple sources to construct the metrics of interest. …


Three solutions from Lyft, LinkedIn, and DoorDash

EXPERIMENTATION AND CAUSAL INFERENCE: How User Interference May Mess Up Your A/B Tests? Three solutions from Lyft, LinkedIn, and DoorDash
Photo by Thom Holmes on Unsplash


A rigorous process of A/B testing generates valuable insights about consumer behaviors directly related to the success of a product. More often than not, PMs adopt an iterative approach to product optimization: A/B test the variants → find a winner → ship the winner → run a new round of A/B tests → ...

When in doubt, A/B test it!

In the past year, the global pandemic has drastically increased consumers’ online presence, making it easier to track and analyze consumer behavioral data at scale. No wonder companies shift their focus to online user behavior and spend a ton of…

Hands-on Tutorials, Causal Inference using Observational Data

How to reduce the effects of confounding in observational data

Causal Inference using Observational Data: An Ultimate Guide to Matching and Propensity Score Matching. How to reduce the effects of confounding in observational data
Photo by Ralph Mayhew on Unsplash


Randomized Controlled Trials (RCTs, aka A/B tests) are the gold standard for identifying the causal relationship between an intervention and an outcome. An RCT’s high validity originates from its tight grip over the Data Generating Process (DGP) via a randomization process, rendering the experimental groups largely comparable. Thus, we can attribute any differences in the final metrics between the experimental groups to the intervention.

The downside is that RCTs are not always feasible in real-world scenarios for practical reasons. Companies may lack the experimentation infrastructure to facilitate large-scale tests, or high user interference may invalidate any results from individual-level randomization.


Hi Howard, I tried your code and got the same results as yours: there is a 9.95% chance of getting 1, 18.4% of getting 2, 22.41% of getting 3, and 49.24% of getting 4.

For a short array like [1,2,3,4], the empirical probabilities of getting these numbers are reasonably accurate. Think about this: the theoretical chance of getting 4 is its share of the array’s sum, 4/(1+2+3+4) = 0.4.

If you have a much longer array, the empirical distribution will move closer to the theoretical distribution.
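If you want to check the numbers yourself, here is a rough sketch that compares the theoretical share of each value with empirical frequencies from repeated draws. It is not your code: the function names `theoretical_probs` and `empirical_probs` are hypothetical, it assumes the array holds distinct positive numbers that double as their own sampling weights, and it uses the standard library’s `random.choices` for the weighted pick.

```python
import random
from collections import Counter


def theoretical_probs(values):
    """Each value's chance is its share of the array's sum."""
    total = sum(values)
    return {v: v / total for v in values}


def empirical_probs(values, n_draws, seed=0):
    """Draw n_draws weighted samples and return the observed frequencies."""
    rng = random.Random(seed)
    # Each value is used as its own weight (assumes distinct positive values).
    draws = rng.choices(values, weights=values, k=n_draws)
    counts = Counter(draws)
    return {v: counts[v] / n_draws for v in values}


values = [1, 2, 3, 4]
print(theoretical_probs(values))
print(empirical_probs(values, n_draws=100_000))
```

The first dictionary gives the theoretical probabilities, and the second gives the frequencies observed in one simulated run, so you can compare them side by side.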

Hope it helps!

Leihua Ye, Ph.D. Researcher

PhD @ University of California. Top Writer | Machine Learning | Data Science | Experimentation & Causal Inference www.linkedin.com/in/leihuaye
