Do My Friends Care About Me?

A Brief Analysis of my Facebook Friends

This was a (not so) quick look at how I stacked up against my Facebook friends in a totally arbitrary measure.

I looked at the ratio of likes and reacts on a user’s profile picture (henceforth referred to as a singular entity: “realikes”) to that user’s friend count. That is, what percentage of a user’s friends realiked his or her profile picture?
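A minimal sketch of this measure, assuming a plain division with a guard for users whose friend count is missing or zero. The function name `realike_ratio` and the example numbers are hypothetical and do not come from the notebook.

```python
def realike_ratio(realikes, friends):
    """Fraction of a user's friends who realiked the profile picture.

    Returns None when the friend count is missing or zero, since the
    ratio is undefined in that case.
    """
    if not friends:
        return None
    return realikes / friends


# Hypothetical user with 50 realikes and 200 friends.
example = realike_ratio(50, 200)
print(example)
```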

Gathering Data

I’ll post a more detailed write-up of the technical aspects of scraping the Facebook profiles, as well as an IPython Notebook with the code.

The most challenging (and time-consuming) part was gathering the data. I ran into several dead ends before finally scraping the data from Facebook using Selenium. I’ll go through all of the failed methods and my successful method here.

I had already collected a list of Facebook profile URLs just using Chrome Dev Tools on the client side.

My first idea was to use the Facebook Graph API. However, that quickly proved to be impossible, mainly because this StackOverflow answer said so.

So on to the second solution! I figured I could use the Python requests and beautifulsoup libraries to crawl Facebook, and just pass in my Facebook cookies for auth. While the authentication totally worked (exciting!), I discovered that Facebook does basically all of the rendering client side, so the HTML that I received was just a bunch of links to async scripts.
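One rough way to spot this symptom in a fetched page is to count script tags against visible text. The sketch below is a stand-in under stated assumptions: it uses only Python's standard `html.parser` rather than requests or beautifulsoup, the class name `ScriptHeavinessParser` and the sample HTML are made up for illustration, and nothing here contacts Facebook.

```python
from html.parser import HTMLParser


class ScriptHeavinessParser(HTMLParser):
    """Counts <script> tags and visible (non-script, non-style) text characters."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def summarize(html):
    """Return (number of script tags, number of visible text characters)."""
    parser = ScriptHeavinessParser()
    parser.feed(html)
    parser.close()
    return parser.script_tags, parser.visible_chars


# Made-up page that mimics a client-rendered shell: scripts and an empty root.
sample_html = """
<html><head>
<script async src="/static/a.js"></script>
<script async src="/static/b.js"></script>
</head><body><div id="root"></div></body></html>
"""
script_count, text_chars = summarize(sample_html)
print(script_count, text_chars)
```

A page with several script tags and almost no visible text is a hint that the content only appears after JavaScript runs, which is the case where a real browser driven by Selenium becomes necessary.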

Ultimately, I resorted to Selenium, scraping first the links to profile pictures and later the actual realike counts.

Examining the Data

After a little bit of cleaning up in Excel, I was ready to take a deeper look at the data.

# Import our beloved libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Read in the data

data = pd.read_csv('./data.csv')
# View the first 5 entries

data[:5]
   # Likes on Pic  # Friends
0            77.0      321.0
1           137.0      527.0
2           176.0      563.0
3            54.0      898.0
4           103.0      283.0
# This includes "bad" rows, which don't have values for either or both of the columns
len(data)
# Drop the bad stuff
data = data.dropna()
len(data)
plt.plot(data['# Friends'], data['# Likes on Pic'], 'bo')
plt.title('Realikes to Friend Count')
plt.ylabel('Realikes')
plt.xlabel('# Friends')

There seems to be a general upward trend: users with more friends receive more likes from friends. This makes sense, since their pictures go out to a larger audience. I am more interested in the ratio between the realike count and friend count.
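To put a number on a trend like this, one option is a least-squares line through the points. The sketch below is not part of the original notebook; it uses only the five rows printed earlier, which are far too few to reproduce the trend in the full data set, so treat the fitted values as a demonstration of the call rather than a result.

```python
import numpy as np

# The five rows printed earlier; the full data set is not reproduced here.
friends = np.array([321.0, 527.0, 563.0, 898.0, 283.0])
realikes = np.array([77.0, 137.0, 176.0, 54.0, 103.0])

# A degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(friends, realikes, 1)

# A positive slope would mean that more friends go along with more realikes.
print(f"realikes ~ {slope:.3f} * friends + {intercept:.1f}")
```

On the full data, the same call would be `np.polyfit(data['# Friends'], data['# Likes on Pic'], 1)` after the `dropna()` step.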

Looking at the Realike Ratios

# Insert new column of the ratios.
data['Ratio'] = data['# Likes on Pic'] / data['# Friends']
# First look at the ratios! Just a brief summary.
data['Ratio'].describe()
count    278.000000
mean       0.158833
std        0.093243
min        0.000000
25%        0.083202
50%        0.159469
75%        0.220952
max        0.492114
Name: Ratio, dtype: float64

Ok, so this summary provides some interesting information.

Just from clicking around Facebook earlier, I had thought that the average ratio would be somewhere between 20% and 30%, but it ended up being much lower, at 15.9%.

Also interesting to note is that no one had over half of their friends like their profile picture, although the max of 49.2% came pretty close.

# Sort data by friend count
data = data.sort_values(by='# Friends')
%matplotlib inline

plt.plot(data['# Friends'], data.Ratio, 'bo')
plt.xlabel('# Friends')
plt.ylabel('Ratio')
plt.title('# Friends vs Realike Ratio')

One of the biggest things I was hoping to see was a correlation between friend count and ratio. However, that did not really manifest itself. As you can see, the points are relatively evenly distributed.
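A scatter plot can be backed up with correlation coefficients. The sketch below is an illustration that is not in the original notebook: it computes Pearson's r with pandas and a Spearman rank correlation by ranking both columns first. It runs on only the five rows printed earlier, so the numbers it prints say nothing about the full data set.

```python
import pandas as pd

# The five rows printed earlier; swap in the full DataFrame for a real answer.
sample = pd.DataFrame({
    "# Likes on Pic": [77.0, 137.0, 176.0, 54.0, 103.0],
    "# Friends": [321.0, 527.0, 563.0, 898.0, 283.0],
})
sample["Ratio"] = sample["# Likes on Pic"] / sample["# Friends"]

# Pearson measures linear association between the raw values.
pearson = sample["# Friends"].corr(sample["Ratio"])

# Spearman is Pearson applied to the ranks, so it captures monotonic trends.
spearman = sample["# Friends"].rank().corr(sample["Ratio"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Values near zero on the full data would match the impression of evenly scattered points.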

# Top 5 realike ratios.
data.sort_values(by='Ratio', ascending=False)[:5]
     # Likes on Pic  # Friends     Ratio
8             156.0      317.0  0.492114
11            155.0      376.0  0.412234
158           280.0      742.0  0.377358
4             103.0      283.0  0.363958
12            406.0     1118.0  0.363148
# Bottom 5 realike ratios.
data.sort_values(by='Ratio', ascending=True)[:5]
     # Likes on Pic  # Friends     Ratio
92              0.0      125.0  0.000000
40              0.0      117.0  0.000000
299             1.0      370.0  0.002703
57              2.0      539.0  0.003711
89              6.0      826.0  0.007264

Viewing the lowest 5 realike ratios reveals that those with the lowest friend counts do not have the lowest realike ratios.

Does Friend Count Matter?

I then took a look at the distributions for users with 1000 or more friends compared to users with fewer than 1000 friends.

It’s important to note that there are significantly more users with fewer than 1000 friends in my data set.

gt1000 = data.loc[data['# Friends'] >= 1000]
lt1000 = data.loc[data['# Friends'] < 1000]
len(gt1000)
gt1000['Ratio'].describe()
count    44.000000
mean      0.173164
std       0.084129
min       0.008739
25%       0.116832
50%       0.180046
75%       0.219244
max       0.363148
Name: Ratio, dtype: float64
lt1000['Ratio'].describe()
count    234.000000
mean       0.156139
std        0.094783
min        0.000000
25%        0.072971
50%        0.155958
75%        0.220952
max        0.492114
Name: Ratio, dtype: float64

The mean ratio for those with 1000 or more friends is a little bit larger than for those with fewer, as is the median.
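Whether that gap is more than noise can be roughly checked from the two summaries alone. The sketch below is an added illustration and not part of the original notebook: it computes a Welch-style t statistic from the counts, means, and standard deviations printed in the `describe()` outputs, and it assumes the users can be treated as independent samples, which may not hold for one person's friend list.

```python
import math

# Summary statistics copied from the two describe() outputs above.
n_hi, mean_hi, std_hi = 44, 0.173164, 0.084129    # 1000 or more friends
n_lo, mean_lo, std_lo = 234, 0.156139, 0.094783   # fewer than 1000 friends

# Difference in mean ratio between the two groups.
diff = mean_hi - mean_lo

# Standard error of that difference without assuming equal variances (Welch).
std_err = math.sqrt(std_hi**2 / n_hi + std_lo**2 / n_lo)

t_stat = diff / std_err
print(f"difference = {diff:.4f}, standard error = {std_err:.4f}, t = {t_stat:.2f}")
```

The statistic comes out at roughly 1.2, below the value of about 2 that is usually needed for significance at the 5% level, so the difference between the groups is consistent with chance.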

%matplotlib inline

plt.figure(figsize=(12, 4))

plt.subplot(121)
plt.plot(lt1000['# Friends'], lt1000['Ratio'], 'ro')
plt.plot(gt1000['# Friends'] - 1000, gt1000['Ratio'], 'bo')

plt.xlabel('# Friends')
plt.ylabel('Realike Ratio')

This plot might be a little bit confusing. The red dots represent users with fewer than 1000 friends. The blue dots represent users with 1000 or more friends, but they are shifted so as to align with the red dots (by subtracting 1000 from the friend count).

What does this reveal? Not much. There are more red dots in the upper right corner than blue dots, but the points are spread evenly enough that this is insignificant.

Looking at Friend Counts

As an aside, I took a look at the distribution of friend counts.

data['# Friends'].describe()
count     278.000000
mean      656.589928
std       395.612176
min        22.000000
25%       371.750000
50%       564.500000
75%       841.750000
max      2403.000000
Name: # Friends, dtype: float64

Conclusion? My friends, on average, have twice as many friends as I do. Sad.

plt.figure(figsize=(16, 8))

plt.subplot(121)
plt.boxplot(data['# Friends'])
plt.ylabel('# Friends')

plt.subplot(122)
plt.hist(data['# Friends'], bins=20)
plt.ylabel('Freq.')
plt.xlabel('# Friends')
plt.title('How Many Friends My Friends Have')

Figure: distribution of my friends’ friend counts.

As shown by the box plot and, perhaps more obviously, by the histogram, the distribution of friend counts is skewed right, accounting for the large difference between mean friend count and median friend count. Either way, I look to be pretty anti-social.
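The gap between mean and median can be turned into a quick skewness estimate. The sketch below is an added illustration rather than part of the notebook: it applies Pearson's second skewness coefficient, 3 * (mean - median) / std, to the values printed in the `describe()` output.

```python
# Values copied from the data['# Friends'].describe() output above.
mean_friends = 656.589928
median_friends = 564.5
std_friends = 395.612176

# Pearson's second skewness coefficient: positive means a longer right tail.
skew_estimate = 3 * (mean_friends - median_friends) / std_friends
print(f"skewness estimate = {skew_estimate:.2f}")
```

The estimate is about 0.7, a positive value in line with the right skew visible in the histogram. With the full data loaded, `data['# Friends'].skew()` gives pandas' moment-based skewness, which is a different estimator and will not match this number exactly.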

Conclusions?

I can’t really draw any statistically significant conclusions from these data. That said, there were some trends that were insightful or interesting to some extent.

Quality over Quantity

It’s apparent that having a higher friend count does not necessarily result in a higher realike ratio. For 4 of the top 5 ranked realike users, the friend count was below the third quartile. In fact, the user with the highest friend count (2403 friends) had one of the lowest ratios (0.8%).

That said, almost all of the bottom 5 were below the median friend count. This seems to imply that having too few friends is also not ideal.
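Both counts can be checked directly against the tables and quartiles printed earlier. The sketch below is an added check using values copied from this post; the variable names are my own.

```python
# Friend counts copied from the top-5 and bottom-5 tables above.
top5_friends = [317.0, 376.0, 742.0, 283.0, 1118.0]
bottom5_friends = [125.0, 117.0, 370.0, 539.0, 826.0]

# Quartiles copied from the data['# Friends'].describe() output.
median_friends = 564.5
third_quartile = 841.75

# Booleans sum as 0/1, so these count how many users fall below each cutoff.
top5_below_q3 = sum(f < third_quartile for f in top5_friends)
bottom5_below_median = sum(f < median_friends for f in bottom5_friends)

print(top5_below_q3, bottom5_below_median)  # prints "4 4"
```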

Given these very broad generalizations, one can make the logical assumption that there is a “Goldilocks zone” of friend count that yields the highest ratio. However, the data is probably much too scattered to actually generate a useful model.
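One way to look for such a zone without fitting a model is to bin users by friend count and compare the typical ratio per bin. The sketch below is an added illustration with loud caveats: it uses only the 14 distinct rows printed in this post (the first five plus the top and bottom five), which are heavily biased toward extreme ratios, and the bin edges are hypothetical choices rather than anything from the original analysis.

```python
import pandas as pd

# Only the rows printed in this post, so the output is illustrative at best.
rows = pd.DataFrame({
    "# Likes on Pic": [77, 137, 176, 54, 103, 156, 155, 280, 406, 0, 0, 1, 2, 6],
    "# Friends": [321, 527, 563, 898, 283, 317, 376, 742, 1118, 125, 117, 370, 539, 826],
})
rows["Ratio"] = rows["# Likes on Pic"] / rows["# Friends"]

# Hypothetical bin edges; they are an assumption, not taken from the post.
edges = [0, 300, 600, 900, 2500]
rows["Friend bin"] = pd.cut(rows["# Friends"], bins=edges)

# Count of users and median ratio within each friend-count bin.
by_bin = rows.groupby("Friend bin", observed=True)["Ratio"].agg(["count", "median"])
print(by_bin)
```

On the full data set, a bin whose median ratio clearly stands above its neighbours would be weak evidence for a sweet spot. With bins this small, any bump could easily be noise.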

An Individualized Analysis

This data set was composed of only my friends. This means the data set was tailored uniquely to my choice of friends, who are mostly high schoolers in the NOVA area.

Next Steps

There are quite a few directions in which I could take this little experiment, if I want to. For one thing, it would be great to acquire more data and, now that I think of it, profile pictures and friend counts are generally public, so I could expand the data set beyond my friends.

Looking at the contents of the profile picture would also be interesting. For example, in pictures that were especially popular, were there multiple people? What is the gender / age / etc. of the subject? However, this would be a much less trivial task.