What’s In A Name?

What Is In A Baby Name?

Becoming a first-time parent is a daunting task in an individual’s life, from the many baby books to all the gadgets (hot take: you don’t need all the gadgets) you need to purchase for the individual who will soon become your new roomie. With all the chaos that will come in those 9 short months, one of the most challenging tasks can be coming up with a name. Using the Social Security card application baby names data from 2010 – 2020, I took a data-driven approach to try to solve this problem.

We want to pick a name that is not the most popular or a passing trend, is unique enough for our family tree, and is true to our family’s culture.

The approach is as follows:

  • Run simple counts to find the most and least popular names overall.
  • Compute year-over-year differences in popularity.
  • Find names that spike suddenly and then drop off, as a proxy for trendy names.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

df_m = pd.read_csv("data_b.csv", sep='\t')

Once the data is imported and filtered to male names only, we take a quick look at our four columns of interest.

  • year
  • name
  • gender
  • count
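
The filtering step itself is not shown above. Here is a minimal sketch of what it might look like, assuming the raw data codes gender as "M"/"F" in the gender column. The frame `df_raw` and its rows are made up for illustration and are not the real file.

```python
import pandas as pd

# Hypothetical raw data with both genders (toy counts, not real figures).
df_raw = pd.DataFrame({
    "year":   [2010, 2010, 2020, 2020],
    "name":   ["NameA", "NameB", "NameA", "NameB"],
    "gender": ["M", "F", "M", "F"],
    "count":  [100, 200, 150, 180],
})

# Keep male rows only and renumber the index.
df_male = df_raw[df_raw["gender"] == "M"].reset_index(drop=True)

# Sanity check: the four columns of interest are all present.
expected_cols = ["year", "name", "gender", "count"]
missing = [c for c in expected_cols if c not in df_male.columns]
print(df_male)
print("Missing columns:", missing)
```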

For a quick look at the top 5 names, we run a simple aggregate of count by name using pandas.

# Grab top 5 names by total count; max is each name's largest single-year count
df_m_sum = df_m.groupby('name')['count'].agg(['sum', 'max']).reset_index()

df_m_sum.nlargest(5, 'sum')
name     sum     max
Noah     201245  19650
Liam     193376  20555
William  172238  17347
Jacob    172154  22139
Mason    167681  19518

Next, let’s examine the fastest-growing names from 2010 – 2020. We do this by creating two separate DataFrames, one per year, and then using the merge function in pandas to join them and calculate the growth column: (latest_year - first_year) / first_year. The result is a fraction; multiply it by 100 for a percentage.

# Fastest Growing Names (2010 - 2020)

df_2010 = df_m[df_m["year"] == 2010]
df_2020 = df_m[df_m["year"] == 2020]

# Join on name; the _x suffix is 2010, the _y suffix is 2020
df_yoy_all = pd.merge(df_2010, df_2020, on="name")

# Keep names with counts over 5000 in 2010; copy so the new column
# is added to an independent frame rather than a slice
df_yoy = df_yoy_all[df_yoy_all["count_x"] > 5000].copy()

# Create growth metric as a fraction of the 2010 count
# (2020 - 2010) / 2010
df_yoy["growth"] = (df_yoy["count_y"] - df_yoy["count_x"]) / df_yoy["count_x"]
df_yoy.nlargest(10, "growth")
year_x  name     gender_x  count_x  year_y  gender_y  count_y  growth
2010    Liam     M         10928    2020    M         19659    0.798957
2010    Henry    M         6399     2020    M         10705    0.672918
2010    Levi     M         6016     2020    M         9005     0.496842
2010    Sebasti  M         6361     2020    M         8927     0.403396
2010    Josiah   M         5206     2020    M         6077     0.167307
2010    Noah     M         16460    2020    M         18252    0.108870
2010    Wyatt    M         7374     2020    M         8135     0.103200
2010    Lucas    M         10379    2020    M         11281    0.086906
2010    Owen     M         8176     2020    M         8623     0.054672
2010    Jack     M         8519     2020    M         8876     0.041906
df_yoy.nsmallest(10, ['growth'])
year_x  name     gender_x  count_x  year_y  gender_y  count_y  growth
2010    Tyler    M         10450    2020    M         2771     -0.734833
2010    Gavin    M         9619     2020    M         2570     -0.732820
2010    Brandon  M         8547     2020    M         2287     -0.732421
2010    Justin   M         7848     2020    M         2277     -0.709862
2010    Kevin    M         7324     2020    M         2359     -0.677908
2010    Evan     M         9730     2020    M         3389     -0.651696
2010    Brayden  M         9113     2020    M         3253     -0.643037
2010    Zachary  M         7180     2020    M         2698     -0.624234
2010    Joshua   M         15448    2020    M         5924     -0.616520
2010    Jayden   M         17189    2020    M         7102     -0.586829

A quick look at the 10 fastest-growing and 10 fastest-shrinking names over the 10-year span tells us that Liam grew the most (about 80%) and Tyler shrank the most (about 73%). Both lists include only names with more than 5000 counts in 2010.
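
As a quick sanity check, the growth values for Liam and Tyler can be reproduced by hand from the 2010 and 2020 counts in the tables above. The helper name `growth` is mine, used only for this illustration.

```python
def growth(first: int, last: int) -> float:
    """Fractional change from the first year's count to the last year's count."""
    return (last - first) / first

liam = growth(10928, 19659)   # 2010 and 2020 counts for Liam
tyler = growth(10450, 2771)   # 2010 and 2020 counts for Tyler

print(f"Liam: {liam:.1%}, Tyler: {tyler:.1%}")
# prints: Liam: 79.9%, Tyler: -73.5%
```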

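# Loosen the filter to names with more than 1 count in 2010 so that
# specific names of interest can be looked up; copy to avoid writing to a slice
df_int = df_yoy_all[df_yoy_all["count_x"] > 1].copy()

df_int["growth"] = (df_int["count_y"] - df_int["count_x"]) / df_int["count_x"]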

A function that looks up any name of interest is a handy, reusable tool.

# Create functions to look up any name of interest

def find_name(search: str):
    """Return rows whose name contains the search string."""
    return df_int[df_int['name'].str.contains(search)]

def find_list(search: list):
    """Return rows for an exact list of names, fastest growing first."""
    return df_int[df_int['name'].isin(search)].sort_values("growth", ascending=False)

search = ['Allan', 'Paxton', 'Parker', 'Ethan', 'George', 'Dee', 'Hayden', 'Enzo']

find_list(search)
year_x  name    gender_x  count_x  year_y  gender_y  count_y  growth
2010    Enzo    M         602      2020    M         2201     2.656146
2010    Dee     M         5        2020    M         6        0.200000
2010    Paxton  M         1110     2020    M         1286     0.158559
2010    George  M         2373     2020    M         2746     0.157185
2010    Parker  M         4732     2020    M         3797     -0.197591
2010    Allan   M         403      2020    M         277      -0.312655
2010    Ethan   M         18006    2020    M         9464     -0.474397
2010    Hayden  M         4191     2020    M         2146     -0.487950
find_name("Hayden")
year_x  name    gender_x  count_x  year_y  gender_y  count_y  growth
2010    Hayden  M         4191     2020    M         2146     -0.48795

Plot Most Trendy Names

The overall growth gives us some insight, but let’s break that calculation out by year to get a better sense of the trend.

Let’s observe how the all-time most popular names have changed year by year instead of just looking at the 10-year growth. We can accomplish this by first creating a pivot DataFrame.

pivot_df = df_m.pivot_table(index="name", columns="year", values="count", aggfunc="sum").fillna(0)

# Now we calculate each name's percentage of that year's total count.
perc_df = pivot_df / pivot_df.sum() * 100

# Then add a new column with each name's percentages summed across all years.
perc_df["total"] = perc_df.sum(axis=1)

# Keep the 10 names with the largest total, then drop the helper column.
sort_df = perc_df.sort_values(by="total", ascending=False).drop("total", axis=1)[0:10]

# Flip the axes so that years are rows and names are columns.
transpose_df = sort_df.transpose()
transpose_df.head(5)

We sort the dataframe to check which are the top values and slice the data appropriately. Lastly, we drop the total column and flip the axes to make plotting the data easier.

name  Noah      Liam      William   Jacob     Mason     Ethan     Michael   James     Alexander  Elijah
2010  0.858554  0.570005  0.889746  1.154771  0.774524  0.939193  0.905550  0.724346  0.874046   0.725285
2011  0.889028  0.708355  0.914274  1.074023  1.028696  0.879594  0.885813  0.698658  0.827679   0.736711
2012  0.916126  0.886996  0.891535  1.007317  1.001406  0.933172  0.854119  0.709259  0.804302   0.732743
2013  0.966854  0.960396  0.881369  0.962090  0.937371  0.860090  0.821397  0.718762  0.789320   0.730778
2014  1.007213  0.962950  0.877551  0.880575  0.897154  0.820202  0.806594  0.752946  0.804352   0.722134
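
To make the percentage step concrete, here is a minimal sketch of the same pivot-then-divide pattern on a tiny made-up dataset. The names and counts are toy values, not taken from the real file.

```python
import pandas as pd

# Toy data: two names over two years (made-up counts).
toy = pd.DataFrame({
    "year":  [2010, 2010, 2011, 2011],
    "name":  ["NameA", "NameB", "NameA", "NameB"],
    "count": [30, 70, 50, 50],
})

# One row per name, one column per year.
pivot = toy.pivot_table(index="name", columns="year", values="count", aggfunc="sum").fillna(0)

# pivot.sum() gives one total per year; dividing aligns on the year columns,
# so each cell becomes that name's share of the year's total, in percent.
share = pivot / pivot.sum() * 100
print(share)
```

Each year column of `share` adds up to 100, which is what lets names be compared across years even when the total number of births changes.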
import plotly.express as px

plot = px.line(transpose_df, x=transpose_df.index, y=transpose_df.columns, title="Top 10 Trendy Names")
plot.show()

Figure 1.1 Trendy Baby Names Over Time.

Among these popular names, Liam is still the most ‘trendy’ by growth over the last 10 years.
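
The approach list mentions year-over-year differences, so here is a minimal sketch of one way to compute them, using the Noah and Liam shares from the 2010 – 2014 rows shown earlier. Using `diff` on the year-indexed frame is my assumption about how to do this step, not code from the original analysis.

```python
import pandas as pd

# Percent-of-year shares for two names, copied from the 2010-2014 rows above.
shares = pd.DataFrame(
    {
        "Noah": [0.858554, 0.889028, 0.916126, 0.966854, 1.007213],
        "Liam": [0.570005, 0.708355, 0.886996, 0.960396, 0.962950],
    },
    index=[2010, 2011, 2012, 2013, 2014],
)

# Change in share from one year to the next, in percentage points.
# The first year has no previous year, so it is dropped.
yoy = shares.diff().dropna()

# Year in which each name gained the most share.
biggest_jump_year = yoy.idxmax()
print(yoy.round(6))
print(biggest_jump_year)
```

In these five rows, Liam's largest one-year gain is in 2012 and Noah's is in 2013.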

I’m going to create another function to find the year in which a name of interest was most popular.

def when_most_births(name):
    """Print the peak year and the 2020 count for a name, then plot its counts by year."""
    if name in set(df_m["name"]):
        # Yearly totals for this name, largest first; keep only the top year
        highest = df_m[df_m["name"] == name].groupby("year")["count"].sum().sort_values(ascending=False)[:1]
        in_2020 = df_m[(df_m["name"] == name) & (df_m["year"] == 2020)]["count"].sum()

        print("Name {} was most popular in {} with {} kids given this name.\n".format(name, int(highest.index[0]), highest.iloc[0]))
        print('In 2020 there were {} babies in total who were given the name {}.\n'.format(in_2020, name))

        px.line(df_m[df_m["name"] == name], x="year", y="count", color="name", title=f"Baby Name {name} Over Time").show()
    else:
        print(f"Name {name} is not in the database.")

when_most_births("Enzo")

Name Enzo was most popular in 2020 with 2201 kids given this name.

In 2020 there were 2201 babies in total who were given the name Enzo.

Figure 1.2 Most Popular Over Time.

Borrowing an idea from a Kaggle notebook, we will create a metric that flags names that spike and then drop off.

  • Divide a name’s maximum yearly count by its total count.
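
To see why this ratio picks out spikes, here is a minimal sketch on two made-up names with the same total count. The names and counts are toy values chosen for illustration.

```python
import pandas as pd

# Two toy names with the same total (400) over four years.
toy = pd.DataFrame({
    "name":  ["Steady"] * 4 + ["Spiky"] * 4,
    "year":  [2010, 2011, 2012, 2013] * 2,
    "count": [100, 100, 100, 100, 10, 370, 10, 10],
})

# Total and largest single-year count per name, then their ratio.
stats = toy.groupby("name")["count"].agg(["sum", "max"])
stats["spike_fall"] = stats["max"] / stats["sum"]
print(stats)
```

A perfectly flat name scores 1 divided by the number of years (0.25 here), while a name whose counts are concentrated in one year scores close to 1 (0.925 here).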

Most Sudden Names

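# Total and largest single-year count for each name
df = df_m.groupby(['name', 'gender'])['count'].agg(['sum', 'max'])

df_ = df.reset_index()

# Share of a name's total count that falls in its peak year
df_['spike_fall'] = df_['max']/df_['sum']

popular = df_.sort_values(by='spike_fall',ascending=False)

# Keep only names with more than 5000 counts in total
popular_df = popular[popular["sum"] > 5000]
popular_df.head(5)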

Let’s use our when_most_births function to plot the names we want to examine for spikes and falls.

when_most_births("Jase")

Figure 1.3 Spike-Fall Over Time.

Name Jase was most popular in 2013 with 4552 kids given this name.

In 2020 there were 624 babies in total who were given the name Jase.

Examining The Spike-Fall Names

  • Jase is a great example of what the spike/fall metric captures: the name peaked in 2013 and has dropped in popularity since.
  • For some highly ranked spike/fall names, we do not see the fade because their peak year is the last one in the dataset.
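
One possible way to handle the second point is to keep only names whose peak year comes before the last year in the data. The sketch below shows the idea on made-up data; the filter is my suggestion and was not part of the analysis above.

```python
import pandas as pd

# Toy data: one name that peaked and faded, one still rising in the last year.
toy = pd.DataFrame({
    "name":  ["Faded"] * 3 + ["Rising"] * 3,
    "year":  [2018, 2019, 2020] * 2,
    "count": [50, 400, 60, 20, 40, 300],
})

last_year = toy["year"].max()

# idxmax gives the row label of each name's largest count;
# looking those rows up gives the peak year per name.
peak_rows = toy.loc[toy.groupby("name")["count"].idxmax()]
peak_year = peak_rows.set_index("name")["year"]

# Names whose peak is before the final year have had time to fade.
faded_candidates = peak_year[peak_year < last_year].index.tolist()
print(faded_candidates)
```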

As you might imagine, this is not the end of finding a baby name. Some open questions are:

  • How do I actually use this data to choose a name and not just use the analysis for avoiding names?
  • What if a trendy name is something we want?

Further analysis could include names for both genders to create a metric that finds the optimal gender-neutral name.
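
As a starting point, here is a rough sketch of what such a metric could look like. Everything in it is an assumption on my part: the `neutrality` score, the toy names, and the made-up counts. It scores a name 1 when it is split evenly between genders and 0 when it is used by only one.

```python
import pandas as pd

# Toy data with both genders (made-up counts).
toy = pd.DataFrame({
    "name":   ["Alpha", "Alpha", "Beta", "Beta"],
    "gender": ["M", "F", "M", "F"],
    "count":  [500, 500, 900, 100],
})

# One row per name, one column per gender.
by_gender = toy.pivot_table(index="name", columns="gender", values="count",
                            aggfunc="sum", fill_value=0)

# Share of each name's total that is male, between 0 and 1.
male_share = by_gender["M"] / (by_gender["M"] + by_gender["F"])

# Hypothetical score: 1 at a 50/50 split, falling to 0 at 100/0.
neutrality = 1 - (2 * male_share - 1).abs()
print(neutrality.sort_values(ascending=False))
```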

We solved the initial problem of avoiding specific names, but the question of which name to choose is still open. Luckily, we have 6 months remaining to decide.
