Applied Data Science Tools (1): Wrangle Your Data — What I Have Learned From WorldQuant University

Ran
3 min read · May 4, 2023

Introduction

After completing the “Applied Data Science Lab” course from WorldQuant University, I would like to summarise the basic skills and code snippets that I learned. I strongly recommend that every developer take this course, regardless of their years of work or hackathon experience, because it helps you build a structured foundation in essential data science skills, free of charge. Based on the course materials, I am summarising what I learned. Due to IP protection, I will not reveal any of the course's code or materials directly. Instead, I will present the methods and tips for applied data science skills. (Your click on the like button always encourages me!)

The Workflow of Data Wrangling

a. Subset the Data

Usually, your database will contain many rows you are not interested in. For example, suppose you have the housing prices of all cities in China, but your company wants you to analyze the price of each property in Beijing (the capital of China). You can then create a boolean mask for your data frame, as in the examples below. Similarly, you can create as many masks as needed and combine them.

# mask for Beijing
mask_0 = df["city_name"] == "Beijing"
df_beijing = df[mask_0]  # rows for Beijing only

# mask for housing prices less than 100,000
mask_1 = df["housing_price"] < 100000
df_less_than_100000 = df[mask_1]  # rows priced below 100,000

# combine masks: housing prices less than 100,000 in Beijing
df_target = df[mask_0 & mask_1]

b. Remove the Outliers

Usually, some of your data rows contain extremely low or high values. You can remove them or keep them, depending on your analysis. Sometimes outliers are very important, for instance in risk (anomaly) detection. But when building a prediction or classification model, you can usually remove them, as outliers act as noise for your model.

# remove the extreme values
low, high = df["key"].quantile([0.1, 0.9])
mask_3 = df["key"].between(low, high)  # keep the middle 80% of values
df = df[mask_3]  # overwrite df with the outliers removed

c. Misc Manipulation

c1. Unexpected Data Format

Sometimes, you need to perform string operations on your columns because the data is stored in a format you didn't expect. For example, the database may store the latitude and longitude values (lat and lon) in the same cell; you should split them into separate columns.

c2. Missing Values

Your data columns might contain missing values. You can fill them with the column mean or median, drop the rows that contain missing values, or even drop an entire column when too many of its values are missing. Each choice involves a trade-off. Typically, you can build a custom wrangle function for your data, so you can inspect how each of these operations affects your model.
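The three options above can be sketched with pandas as follows (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "housing_price": [90000, np.nan, 85000, 120000],
    "area": [70.0, 55.0, np.nan, 88.0],
})

# Option 1: fill missing values in one column with its median
df["housing_price"] = df["housing_price"].fillna(df["housing_price"].median())

# Option 2: drop any rows that still contain missing values
df = df.dropna()

# Option 3: drop a column entirely when most of its values are missing
# df = df.drop(columns="mostly_empty_column")
```

Wrapping choices like these in a single wrangle function makes it easy to toggle each step and compare the effect on your model.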

c3. Drop Categorical Columns

Sometimes, a categorical column carries too much or too little information because its cardinality (the number of distinct values) is extremely high or low. For example, if your dataset has a column named city_type and nearly every cell contains the same value, such as "city", then this column is useless. Likewise, if your dataset has a column named "URL" holding the data source, and nearly every cell in this column is different, then the column is useless as well.
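A quick way to spot such columns is to count the distinct values in each categorical column with `nunique`. A minimal sketch, using invented columns that mirror the city_type and URL examples above:

```python
import pandas as pd

df = pd.DataFrame({
    "city_type": ["city"] * 5,                                   # one value everywhere
    "URL": [f"https://example.com/listing/{i}" for i in range(5)],  # all unique
    "district": ["Chaoyang", "Haidian", "Chaoyang", "Dongcheng", "Haidian"],
})

# count distinct values per categorical column
cardinality = df.select_dtypes("object").nunique()
print(cardinality)

# drop columns with a single value (no information)
# or with all-unique values (no generalizable information)
low_card = cardinality[cardinality == 1].index
high_card = cardinality[cardinality == len(df)].index
df = df.drop(columns=low_card.union(high_card))
```

Here only the district column survives, because its cardinality is neither trivially low nor equal to the number of rows.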

c4. Drop Leaky Columns

Sometimes, two columns carry the same or nearly the same information. For example, your data might store housing prices in several currencies. In that case, there is no point in keeping the prices in both CNY and USD, since they represent the same information when you want to predict or analyze the housing price; keeping both can also leak the target into your features.
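One way to spot such redundant columns is a correlation check: columns that are linear transforms of each other (like the same price in two currencies) correlate perfectly. A small sketch with hypothetical column names:

```python
import pandas as pd

# hypothetical example: the USD column is just the CNY column at a fixed rate
df = pd.DataFrame({"price_cny": [700_000, 1_400_000, 2_100_000]})
df["price_usd"] = df["price_cny"] / 7.0

# perfectly correlated columns carry duplicate information
corr = df.corr()
print(corr)

# keep one column and drop the redundant one
df = df.drop(columns="price_usd")
```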

# Example of some misc operations
# c1. split the "lat-lon" column into two numeric columns
df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
df.drop(columns="lat-lon", inplace=True)  # drop the original column

# c3. drop high- and low-cardinality columns
df.drop(columns=["city_type", "URL"], inplace=True)

# c4. drop leaky columns
df.drop(columns=["price", "price_aprox_local_currency"], inplace=True)

Conclusion

In this blog post, we’ve looked at some essential skills and handy code examples inspired by the Applied Data Science Lab course at WorldQuant University. We’ve seen how central data wrangling is to the whole data analysis and modeling process. The post walked through subsetting data, removing outliers, and tackling various other manipulations: unexpected data formats, missing values, categorical columns with too much or too little information, and leaky columns. By applying these tips, developers can sharpen their data preparation and boost their data science projects. After all, solid data wrangling is the key to successful data analysis and modeling!

Ran

AI Developer. Ph.D. Candidate in Electrical Engineering at Imperial College.