a major period of time on information preparation for downstream duties. Whether or not it includes information cleansing, dealing with lacking values, characteristic engineering, information preprocessing, or submit processing, this part requires loads of time.
So, I used to be engaged on this post-processing job the place I wanted to create a brand new column in a Pandas DataFrame by extracting values from an present column, primarily based on the information from two different columns.
I may have immediately requested an LLM to write down the code (which I often do) however this time I wished to do it myself. It was early within the morning and I had a contemporary thoughts so I used to be within the temper to deal with some complicated information operations.
Here’s what I needed to do. I had a DataFrame with predicted_categories, pred_category_id, and text_predicted_probs columns.
The values within the predicted_categories column are lists of 5 classes in “category_id” – “category_description” format.
['80814001 - Freze Uçları',
'13003106 - Freze',
'80805004 - Sanayi Makineleri',
'13003144 - Torna Makinesi',
'13003195 - Kumpas']
The text_predicted_probs column has the expected possibilities of those 5 classes so as.
[0.943, 0.018, 0.008, 0.006, 0.004]
Therefore, the primary worth within the text_predicted_probs is the likelihood of the primary class within the predicted_categories, and so forth.
The pred_category_id column reveals the expected class id from one other mannequin . What I would like is the expected likelihood of the class within the pred_category_id column.
I must get the order of the pred_category_id within the predicted_categories column after which take its worth from the test_predicted_probs column.
The drawing beneath demonstrates what I wish to obtain:
On this drawing, I wish to get the likelihood of class 13003106, which is the second merchandise within the listing and its corresponding likelihood worth is 0.018.
If we requested Gemini, or one other superior mannequin, we’ll in all probability get the reply in seconds. However, I wished to do it by myself first after which ask Gemini.
Let’s begin with studying the dataset right into a Pandas DataFrame.
import pandas as pd
outcomes = pd.read_csv("prediction_results.csv")
The values within the predicted_categories column are lists of strings with class ids and class names:
outcomes.loc[0, "predicted_categories"]
# output: "['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']"
It’s a listing however saved as a string so we first convert it to a listing object utilizing the literal_eval operate within the built-in ast module of Python:
ast.literal_eval(outcomes.loc[0, "predicted_categories"])
# output:
['80814001 - Freze Uçları',
'13003106 - Freze',
'80805004 - Sanayi Makineleri',
'13003144 - Torna Makinesi',
'13003195 - Kumpas']
To extract the class ids, we will break up every string on this listing on the “-” character after which choose the primary half after splitting. Since we have now a listing with 5 classes, we must always do that operation in a listing comprehension as follows:
[category.split("-")[0].strip()
for class in ast.literal_eval(outcomes.loc[0, "predicted_categories"])]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']
We’ve carried out it for a single worth (i.e. one row). To be able to do the identical operation to your entire predicted_categories column, we will use a listing comprehension. It is going to be a listing comprehension inside one other listing comprehension (i.e. nested listing comprehension):
outcomes.loc[:, "predicted_category_ids"] = [
[category.split("-")[0].strip() for class in ast.literal_eval(predicted_categories)]
for predicted_categories in outcomes["predicted_categories"]
]
We now have class ids extracted from the predicted_categories column:

The subsequent step is to verify the order of the classes within the predicted class id lists. We are going to then use this order to extract the expected likelihood of the class.
Python listing object has an index methodology, which returns the index (i.e. order) of the merchandise within the listing.
outcomes.loc[0, "predicted_category_ids"]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']
outcomes.loc[0, "predicted_category_ids"].index("13003106")
# output:
2
As soon as I discover the index of a predicted class id, I can use it to get the likelihood of this class id from the text_predicted_probs column:

What we have to do:
- Get the index of
pred_category_idwithin thepredicted_category_ids - Use this index to extract the related worth from
text_predicted_probs
These steps could be carried out in a single operation by zipping these three columns. Let’s take a look at it on the primary row:
for i, j, okay in zip(outcomes["pred_category_id"][:1], outcomes["predicted_category_ids"][:1], outcomes["text_predicted_probs"][:1]):
print(j.index(str(i))) # get the index of pred_category_id in predicted_category_ids
print(ast.literal_eval(okay)[j.index(str(i))]) # get the worth at this index in text_predicted_probs
# output:
0
0.943
We will verify the output within the screenshot above. The pred_category_id is 80814001, which is the primary merchandise (i.e. index = 0) within the predicted_category_ids and the primary likelihood worth is 0.943.
The lists within the text_predicted_probs column are saved as string as effectively so we apply the literal_eval operate to transform them to a listing object.
We will do that operation on your entire DataFrame utilizing a listing comprehension as comply with:
outcomes.loc[:, "pred_category_prob"] = [
float(ast.literal_eval(k)[j.index(str(i))]) if str(i) in j else 0
for i, j, okay in zip(outcomes["pred_category_id"], outcomes["predicted_category_ids"], outcomes["text_predicted_probs"])
]
We mainly transformed the for loop within the earlier code block to a listing comprehension. I’ve solely added a verify “if str(i) in j else 0” as a result of there are some rows with a pred_category_id that doesn’t exist in predicted_category_ids. In such circumstances, we set the likelihood worth to 0.
The duty is now accomplished as we will see beneath:

How did Gemini do that?
After working about an hour (together with breaks) I accomplished the duty. Then, I wished to ask Gemini to see how it might deal with the identical job.
I wrote a immediate explaining my aim and included a small pattern of the information.
Here’s what Gemini produced on its first try:
import pandas as pd
# 1. Create the pattern DataFrame
information = {
'predicted_categories': [
['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']
],
'pred_category_id': [80814001],
'text_predicted_probs': [
[0.943, 0.018, 0.008, 0.006, 0.004]
]
}
df = pd.DataFrame(information)
# 2. Outline a operate to extract the likelihood matching the class ID
def get_matching_prob(row):
# Convert ID to string for matching
target_id_str = str(row['pred_category_id'])
# Iterate by means of the classes to seek out the matching index
for index, class in enumerate(row['predicted_categories']):
if class.startswith(target_id_str):
# Return the corresponding likelihood from the identical index
return row['text_predicted_probs'][index]
# Return None (or 0) if no match is discovered to forestall the code from crashing
return None
# 3. Create the brand new column
df['pred_category_prob'] = df.apply(get_matching_prob, axis=1)
# Show the end result
print(df[['pred_category_id', 'pred_category_prob']])
It first created a pattern DataFrame utilizing the information I offered within the immediate. Then, it wrote a operate that does precisely what I defined.
It took me about an hour to write down the code myself. Gemini accomplished the very same job in seconds primarily based on a immediate that took me only some minutes to write down.
Nonetheless, there’s a minor problem in Gemini’s code. It makes use of the apply operate, which isn’t vectorized. In case of working with giant datasets, utilizing the apply operate (mainly a for loop) can improve the run time.
However the resolution is easy. I requested Gemini to write down a vectorized model of it and here’s what I received:
import pandas as pd
# Pattern DataFrame
information = {
'predicted_categories': [
['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri'],
['13003144 - Torna Makinesi', '13003195 - Kumpas']
],
'pred_category_id': [80814001, 13003195],
'text_predicted_probs': [
[0.943, 0.018, 0.008],
[0.6, 0.4]
]
}
df = pd.DataFrame(information)
# 1. Explode each listing columns concurrently to take care of alignment between class and likelihood
df_exploded = df.explode(['predicted_categories', 'text_predicted_probs'])
# 2. Extract the numeric ID from the class string utilizing vectorized regex
df_exploded['extracted_id'] = df_exploded['predicted_categories'].str.extract(r'^(d+)').astype(float)
# 3. Filter for rows the place the extracted ID matches the goal 'pred_category_id'
matched = df_exploded[df_exploded['extracted_id'] == df_exploded['pred_category_id']]
# 4. Dedup the index (security web in case an ID seems twice throughout the identical listing)
matched = matched[~matched.index.duplicated(keep='first')]
# 5. Map the extracted likelihood column again to the unique DataFrame utilizing the index
df['pred_category_prob'] = matched['text_predicted_probs']
df
The second resolution was completely effective and regarded easier than the code I wrote.
So, I spent about an hour on a job that an LLM may have accomplished in lower than 5 minutes. Nonetheless, if I didn’t understand how Pandas labored, I’d have accepted the primary resolution, which was not the optimum one. It’s a good instance of how LLMs can improve productiveness, however provided that you really know what you’re doing.
