Skip to main content

Finding a mismatch between two correlated columns in pandas dataframe

I have a large pandas dataframe which has two columns. One column is the unique code for an item and the second column contains the name of that item.

For example purposes lets say 'ID' has a unique code for an animal and 'Name' has the name of that animal.

df = pd.DataFrame({'ID': ['AA','BB', 'CC', 'DD', 'EE'], 
                   'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog']})


    ID      Name
0   AA      Ape
1   BB      Bull
2   CC      Chimp
3   DD      Dog
4   EE      Dog

When I run the code below I will get false because there is one more unique item in the ID column than in the Name column

len(df['Code'].unique()) == len(df['Name'].unique())

My question is how can I find out which animal is being represented by two codes. Ideally I would like to end up with a dataframe that has:

    ID  Name
3   DD  Dog
4   EE  Dog

But as long as I can identify the animal being represented by more than one ID, I am happy with whatever solution gets the desired result.

Thanks

EDIT:

The dataframe contains duplicate rows, so a more realistic example that resembles the actual problem is a dataframe that looks like the one below:

    ID      Name
0   AA      Ape
1   BB      Bull
2   CC      Chimp
3   DD      Dog
4   EE      Dog
5   CC      Chimp
6   AA      Ape
7   DD      Dog
8   FF      Frog

df = pd.DataFrame({'ID': ['AA','BB', 'CC', 'DD', 'EE', 'CC', 'AA', 'DD', 'FF'], 
                   'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog', 'Chimp', 'Ape', 'Dog', 'Frog']})

Desired output is still to find which ID has been matched to two different animal names

    ID  Name
3   DD  Dog
4   EE  Dog
Answer

For a dataframe like the second example (containing duplicate rows) with a slightly augmented example

import pandas as pd

df = pd.DataFrame({'ID': ['AA','BB', 'CC', 'DD', 'EE', 'CC', 'AA', 'DD', 'FF', 'AA'], 
                   'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog', 'Chimp', 'Ape', 'Dog', 'Frog','Bull']})
#    ID   Name
# 0  AA    Ape
# 1  BB   Bull
# 2  CC  Chimp
# 3  DD    Dog
# 4  EE    Dog
# 5  CC  Chimp
# 6  AA    Ape
# 7  DD    Dog
# 8  FF   Frog
# 9  AA   Bull

we can use the following. The result is sorted to make it easier to identify ID by Name.

df.groupby('Name').filter(lambda x: x['ID'].nunique() > 1).drop_duplicates(subset=['ID','Name']).sort_values('Name')

Output

   ID  Name
1  BB  Bull
9  AA  Bull
3  DD   Dog
4  EE   Dog
Other helpful answers

Use groupby.filter to filter which groups of names have ID count greater than or equal to 2.

>>> df.groupby('Name').filter(lambda s: s['ID'].count() >= 2)

    ID  Name
3   DD  Dog
4   EE  Dog

If the column ID always contains unique values, then you can simply use:

df[df.duplicated('Name', keep=False)]

   ID Name
3  DD  Dog
4  EE  Dog

Comments