Analysis and interpretation of the social interaction of users on Facebook using Machine Learning

Diego Amador
Dec 21, 2020 · 19 min read

Published by: Aimé López, Adalit Reyes, Diego Amador and Óscar Reyes

As part of the third module, "Data Mining," of the diploma course "Statistical Techniques and Data Mining" taught by Jacobo Gonzalez, we decided to share our final project, in which we carried out the entire KDD (Knowledge Discovery in Databases) process on a data set obtained from the Kaggle website, as shown below.

Goal

Different characteristics of Facebook users, such as their behavior, habits and trends, will be predicted and discovered from data collected on their interaction with the social network.

Through data mining, user activity on Facebook will be classified according to certain variables, taking the gender variable as the main characteristic, with two classes, Female (1) and Male (0). This allows us to perform statistical analysis, data grouping, variable conversion, and more.

Context

Social networks have become information gathering platforms. They know our behavior patterns and our psychological profile. The most popular platforms such as Facebook and Google have made technology seem fun, casual and superficial, but in reality, the amount of data they accumulate about us is sometimes alarming. Every day, Facebook users feed the platform with large volumes of information. With this type of data, Facebook knows what we look like, who our friends are, what we are doing, what we like and dislike, among other situations that may arise every day.
Facebook has become known as one of the most visited social networks nowadays. It is a huge source of data and has been questioned over what it does with that data; this data is available to companies, brands and others, allowing them to target a specific audience, which makes the platform a primary channel for advertising.

Hypothesis

Through the formulation of several hypotheses, we will seek to test some of the beliefs that are commonly held about the interaction of users on this social network:
- In an age range of 15 to 25 years, greater use is made of the social network Facebook
- The male gender is the one that generates more likes on the social network
- From the age range of 55 to 60 years onward, user interaction decreases
- People who have used Facebook longer have more friends.

State Of The Art

Previously, the Universidad Iberoamericana in Mexico conducted a study called "Facebook and daily life" to understand the influence the social network has on the daily lives of young university students. An instrument was built to measure attitudes, behaviors and uses of the social network, and it was applied to 381 young people from the Universidad Iberoamericana in Mexico City, 239 female and 142 male. The results show that there are differences between men and women, with the latter spending more time on the social network (Aspani, 2012).

On the other hand, a study from the University of Malaga focused on deepening the knowledge of a possible gender gap in the use of technological tools for group work, that is, analyzing whether the propensity to innovate and the use students make of technologies differ by gender. To this end, an online survey was carried out on a sample of 403 students from degree courses in the field of Economics. The results show that although participants of both genders use almost all the technologies proposed for group work (wiki, WhatsApp, Google Drive, Dropbox, Skype, email, Facebook, etc.) to a similar extent, statistically significant differences are observed in the greater use of the WhatsApp and email tools by women (Vallespin-Aran, 2020).

Another study, from the University of Alicante, contrasted information from different sources with the results of its own survey in order to present findings on the use of social networks by young people. The survey was conducted over the Internet and applied to a sample of social network users, with the aim of understanding the reason for their success, in other words, what leads young people to open a profile and maintain it, with special emphasis on possible differences in use between boys and girls. The results show that young people and adolescents are the main users, that worldwide the presence of women reaches 60%, and that the most intensive users are young people between 24 and 29 years old. The study concludes that the differences seem to depend on age rather than on the sex of the respondents (Espinar, 2009).

The study proposed here will help us better understand patterns of user behavior through their interaction with Facebook. This will be achieved by applying branches of Data Science such as Data Mining and Machine Learning, following the Knowledge Discovery in Databases (KDD) methodology, a process for exploiting data in order to predict and/or discover interesting patterns in our model and thus develop a theoretical perspective on our approach.

Our main resource will be the Python language and its various libraries, which will allow us to better visualize, manipulate and process the data, and thereby confirm or refute our hypotheses. In the future, we hope to apply these findings in areas such as social big data, psychology, communication and advertising.

Throughout this project we will show the use of various machine learning concepts to build models that allow us to generate interpretations. These techniques are used today in many fields, from filtering spam in your email to detecting forest fires or analyzing genetics. The question is: have we tapped their full potential?

NOW… LET’S START!!

1. Understanding the data

The variables available in our data set are the following:

Variables and meaning

We load the following libraries to work:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import missingno as msno
%matplotlib inline

We load the data and then check the number of values and the type of data stored in each column.
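A minimal loading sketch, assuming the Kaggle export is a CSV; the file name pseudo_facebook.csv is only a placeholder and may differ in your copy:

# Read the Kaggle data set into a DataFrame (file name assumed for illustration)
df = pd.read_csv('pseudo_facebook.csv')

With the frame in memory, df.info() summarizes each column: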

df.info()
Definition of variables and data type.

We plot a first correlation that could be interesting, between age and the number of friendships that users initiate:

sns.lmplot( x="age", y="friendships_initiated", data=df, fit_reg=False, hue='dob_year', legend=False)

Knowing the age ranges of the data set, age groups are created as follows

labels = ['11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100', '101-110', '111-120']
df['age_group'] = pd.cut(df.age, bins=np.arange(10, 121, 10), labels=labels, right=True)

And a graph is generated with distributions by age and gender

fig, ax = plt.subplots(figsize=(13,7))
color = ['deeppink', 'blue']
test = df.pivot_table('tenure', index='age_group', columns='gender', aggfunc='count')
# conversion into percentage
for col in test.columns:
    test[col] = test[col] / sum(test[col]) * 100
test.plot(kind='bar', color=color, ax=ax, alpha=0.7)
ax.set_xticklabels(test.index, rotation=360)
ax.set_xlabel("Age group", fontsize=14)
ax.set_ylabel("Percentage", fontsize=14)
ax.set_title('Distribution of users by age and gender', fontsize=14)
User distributions by age and gender

We group by gender

gender_no=df.groupby("gender")["age"].count()
fig,ax=plt.subplots(figsize=(13,7))
gender_no.plot.pie(ax=ax,autopct='%0.2f%%')
Gender distribution in the data

2. Data processing

2.1 Data cleaning

In this section we visualize the missing data and the nullity by column, and eliminate the outliers. Empty cells are converted to NaN and the columns with null values are located. A way must be found to fill the empty cells; this can be done with the mean or the most frequent value of each variable. In this particular case we found null values in the gender variable, so they were filled with the most common gender in the data set: male. In the case of tenure (days on Facebook) we fill the empty cells with the mean of the remaining data. We will not go too deep into this part; however, we invite readers to request our repository for further details. The most important thing to remember is that at the end of this process we renamed the resulting data frame out.
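A minimal sketch of the cleaning steps described above, assuming the raw frame is still called df and that outliers are removed with a 3-standard-deviation cutoff (the exact rule used in our notebook may differ):

# Fill missing gender with the most frequent value and tenure with its mean
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
df['tenure'] = df['tenure'].fillna(df['tenure'].mean())
# Keep only rows whose numeric values lie within 3 standard deviations of the mean
num_cols = df.select_dtypes(include=np.number).columns
out = df[(np.abs(stats.zscore(df[num_cols])) < 3).all(axis=1)].copy()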
We can look at the graphs of the data before and after removing the outliers from our data set.

Before deleting outliers
After deleting outliers

2.2 Selection of features

The method is the following: first we plot a heat map of the Pearson correlations and look at the correlation of the independent variables, or features, with the output (target) variable. We then select only the features whose correlation with the output variable is greater than 0.5 in absolute value.
Remember that the Pearson correlation coefficient takes values between -1 and 1:
- A value closer to 0 implies a weaker correlation (an exact 0 implies no correlation)
- A value closer to 1 implies a stronger positive correlation
- A value closer to -1 implies a stronger negative correlation
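For reference, the coefficient is the covariance of the two variables normalized by the product of their standard deviations:

$$ r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} $$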

plt.figure(figsize=(8,8))
cor = out.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
Pearson correlation result

This is how we identify the main features with which we will work our models. The likes_received, mobile_likes, mobile_likes_received, www_likes and www_likes_received variables exceed the threshold; mobile_likes has the highest correlation with the likes variable, at approximately 93.31%.
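A hedged sketch of the selection step itself, assuming likes is taken as the target column here (the same idea applies to any other target):

# Keep only the features whose absolute Pearson correlation with the target exceeds 0.5
cor_target = cor['likes'].abs()
relevant_features = cor_target[cor_target > 0.5].drop('likes')
print(relevant_features)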

3. Modeling

3.1 Simple linear regression analysis for likes and mobile_likes

Linear regression is a supervised learning algorithm used in Machine Learning and statistics. In its simplest version, we "draw a line" that indicates the trend of a continuous data set (if the data were discrete, we would use Logistic Regression).
In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable "y" and one or more explanatory variables named "X".
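In the simple, one-feature case the fitted line has the familiar form below, where the coefficients are estimated by least squares and ε is the error term:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$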

# Selecting likes (column 5) and mobile_likes (column 7) with boolean masks
likes = out.iloc[:,[False, False, False, False, False, True, False, False, False, False, False]].values
moblikes = out.iloc[:,[False, False, False, False, False, False, False, True, False, False, False]].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
A_train, A_test, b_train, b_test = train_test_split(likes, moblikes, test_size = 1/3, random_state = 123)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(A_train, b_train)
# Visualising the Training set results
plt.scatter(A_train, b_train, color = 'red')
plt.plot(A_train, model.predict(A_train), color = 'blue')
plt.title('Likes vs Mobile likes')
plt.xlabel('Likes')
plt.ylabel('Mobile likes')
plt.show()
Training set results
# Visualising the Test set results
plt.scatter(A_test, b_test, color = 'red')
plt.plot(A_train, model.predict(A_train), color = 'blue')
plt.title('Likes vs Mobile likes (Test set)')
plt.xlabel('Likes')
plt.ylabel('Mobile likes')
plt.show()
Test set results

The model's score on the test set (the coefficient of determination, R²) is 87%, which indicates that the model fits this relationship adequately:

# Predicting the Test set results
b_pred = model.predict(A_test)
print("Model performance: ", model.score(A_test, b_test))

3.2 Principal Component Analysis

The idea of Principal Component Analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of related variables, while retaining as much of the variance in the data as possible. PCA finds a set of new variables that are linear combinations of the original ones. The new variables are called principal components (PCs), and they are orthogonal: in a three-dimensional case the principal components are perpendicular to each other, so no component can be expressed in terms of another.
Intuitively, PCA "rotates" the axes to better align with the data. The first principal component captures most of the variance in the data, followed by the second, the third, and so on. As a result, the new data can be represented with fewer dimensions.

# DEFINING THE VARIABLES (note that gender is the target)
all_variables = ['age_group', 'gender', 'tenure', 'friend_count', 'friendships_initiated', 'likes', 'likes_received', 'mobile_likes', 'mobile_likes_received', 'www_likes', 'www_likes_received']
features = ['tenure', 'friend_count', 'friendships_initiated', 'likes', 'likes_received', 'mobile_likes', 'mobile_likes_received', 'www_likes', 'www_likes_received']
target = ['gender']
# Scaling with MinMax
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_sc = pd.DataFrame(scaler.fit_transform(out1[features]), columns=features)
df_sc.head()
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles
pca = PCA(n_components=3)
df_pca = pd.DataFrame(pca.fit_transform(out1[features]), columns=['PC1', 'PC2', 'PC3'])
df_pca.head()
# OBTAINING THE CUMULATIVE VARIANCE EXPLAINED BY THE COMPONENTS
explained_variance = pca.explained_variance_ratio_.cumsum()
explained_variance
# Plotting the dataframe
df_pca['gender'] = out1[target]
df_pca.columns = ['PC1', 'PC2', 'PC3', 'gender']
df_pca.head()
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_title('2 component PCA')
targets = ['male', 'female']
colors = ['blue', 'pink']
for target, color in zip(targets, colors):
    indicesToKeep = df_pca['gender'] == target
    ax.scatter(df_pca.loc[indicesToKeep, 'PC1'],
               df_pca.loc[indicesToKeep, 'PC2'],
               c = color, s = 50)
ax.legend(targets)
ax.grid()
Results obtained from the dataframe graph

3.3 KNN for gender

KNN is a non-parametric model that classifies an unlabeled observation by counting how many of its nearest neighbors belong to each category. If more neighbors belong to category A than to category B, then the new point should belong to category A. The classification of a given point is therefore based on the majority of its closest neighbors (hence the name).

# Encoding the unique values of the gender column: 0 = male, 1 = female
df_pca01['gender'].replace({'male': 0, 'female': 1}, inplace=True)
# DEFINING VARIABLES
X = df_pca01.iloc[:, [1,2]].values
y = df_pca01.iloc[:, 3].values
# TRAINING AND TEST SETS ARE DEFINED
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# SCALING THE FEATURES
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# APPLYING KNN TO THE TRAINING SET
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
# PREDICTION OF RESULTS ON THE TEST SET
y_pred = classifier.predict(X_test)
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('blue', 'pink')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('blue', 'pink'))(i), label = j)
plt.title('K-NN (Train)')
plt.xlabel('Gender')
plt.ylabel('Test')
plt.legend()
plt.show()
KNN Train results
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('blue', 'pink')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('blue', 'pink'))(i), label = j)
plt.title('K-NN (Test)')
plt.xlabel('Gender')
plt.ylabel('Estimated')
plt.legend()
plt.show()
KNN Test results

Validation

According to the accuracy value, 54% of the predictions match the true labels. On the main diagonal we have 0.0093% and 0.0031% of values correctly estimated in the TruePositive and TrueNegative positions, while on the inverse diagonal there is a 0.0058% chance that values that are correct are rejected and a 0.0045% chance that values that are false are not rejected.

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = pd.DataFrame(confusion_matrix(y_test, y_pred))
sns.heatmap(cm, annot=True, annot_kws={"size": 12}) # font size
Confusion matrix obtained
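A small sketch, not part of the original notebook, of how the proportions discussed above could be read off directly; it assumes a scikit-learn version (0.22 or later) where confusion_matrix accepts a normalize argument:

# Normalize over the whole matrix so each cell is a fraction of all test samples
cm_norm = pd.DataFrame(confusion_matrix(y_test, y_pred, normalize='all'))
sns.heatmap(cm_norm, annot=True, fmt='.4f')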

The KNN model is a method that looks for the observations closest to what you are trying to predict.

It has an accuracy of 54%, so we conclude that it is not a suitable model for this type of data:

from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))
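A possible follow-up, sketched here only as an assumption rather than something we ran, is to check whether the low accuracy comes from the arbitrary choice of k = 3 by searching over several values of n_neighbors:

from sklearn.model_selection import GridSearchCV
# Try odd values of k from 1 to 29 with 5-fold cross-validation on the training set
grid = GridSearchCV(KNeighborsClassifier(metric='minkowski', p=2),
                    param_grid={'n_neighbors': range(1, 30, 2)}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)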

3.4 Random forest for tenure and likes

Decision trees are predictive models formed by binary rules with which it is possible to split the observations according to their attributes and thus predict the value of the response variable. Random Forest models are formed by a set of individual decision trees, each trained on a slightly different sample of the training data generated by bootstrapping. The prediction for a new observation is obtained by aggregating the predictions of all the individual trees that make up the model.

# Initializing the variables: tenure (column 2) and likes (column 5) as features
X = out2.iloc[:, [2,5]].values
y = out2.iloc[:, 1].values
# Declaring the training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Scaling the features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying Random Forest to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
# Viewing the results of the training set
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('DarkBlue', 'LimeGreen')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('DarkBlue', 'LimeGreen'))(i), label = j)
plt.title('Random Forest (Train)')
plt.xlabel('Tenure')
plt.ylabel('Likes')
plt.legend()
plt.show()
Random Forest Train results
# Visualising the results of the test set
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('DarkBlue', 'LimeGreen')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('DarkBlue', 'LimeGreen'))(i), label = j)
plt.title('Random Forest (Test)')
plt.xlabel('Tenure')
plt.ylabel('Likes')
plt.legend()
plt.show()
Random Forest Test results

Results

The Random Forest model achieves an accuracy of 59%, which indicates that it is not an adequate model for making predictions on these data:

from sklearn import metrics
y_pred = classifier.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

Validation

We have an accuracy of 59%, so we conclude that it is not an acceptable model for making predictions on this data set.

Within the main diagonal we have the TruePositive position with a value of 0.00011% and the TrueNegative with 0.0036% for accepting real values; however, we are rejecting real values with a probability of 0.0056% and accepting false values with a probability of 0.0038%.

from sklearn.metrics import confusion_matrix
import seaborn as sn
cm = pd.DataFrame(confusion_matrix(y_test, y_pred))
sn.heatmap(cm, annot=True, annot_kws={"size": 12}) # font size
Confusion matrix obtained

3.5 Hierarchical Clustering for friend count and tenure

We will apply this concept to the variables friend_count and tenure to visualize how the data is grouped and if there is an obvious relationship.

First, we take a sample of the previously worked data frame, equivalent to 15% of the total data.

sample = df01.sample(frac=0.15, random_state=1)
sample = sample.dropna()
sample = sample[['tenure','friend_count']]
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Customer Dendograms")
dend = shc.dendrogram(shc.linkage(sample, method='ward'))
Dendrogram obtained

Based on the distances between the observations, we fit an agglomerative clustering model with five clusters:

from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
cluster.fit_predict(sample)

By visualizing the results obtained, a main conclusion can be drawn:

In orange at the top, it is shown that having been on Facebook longer does not necessarily mean having a greater number of friends, which could point to a selection/pruning of contacts over time, leading users to keep mostly a small number of friends.

plt.figure(figsize=(10, 7))
plt.scatter(sample['friend_count'], sample['tenure'], c=cluster.labels_, cmap='rainbow')
Clusters obtained

Results

Applying the Elbow Curve to the friend_count and tenure variables yields an approximate value of 4 clusters; comparing this with the 5 clusters obtained with hierarchical clustering, we conclude that the results are consistent.

ft = sample[['friend_count','tenure']]
from sklearn.cluster import KMeans
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
kmeans
score = [kmeans[i].fit(ft).score(ft) for i in range(len(kmeans))]
score
plt.plot(Nc,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

Validation

The Davies-Bouldin Index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. It is an internal evaluation scheme, in which how well the clustering has been done is validated using quantities and characteristics inherent to the data set. The lower the value of the DB index, the better the clustering. It also has a drawback: a good value reported by this method does not imply the best information retrieval. The DB index for a number k of clusters is defined below.
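Using the standard definition, where σᵢ is the average distance of the points in cluster i to its centroid cᵢ and d(cᵢ, cⱼ) is the distance between the centroids of clusters i and j:

$$ DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} $$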

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
# K-Means
kmeans = KMeans(n_clusters=5, random_state=1).fit(sample)
# we store the cluster labels
labels = kmeans.labels_
print(davies_bouldin_score(sample, labels))

3.6 K-Means for mobile_likes, mobile_likes_received, likes and gender

To process the learning data, the K-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the beginning points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. It halts creating and optimizing clusters when either:

The centroids have stabilized — there is no change in their values because the clustering has been successful.

The defined number of iterations has been achieved.

X = np.array(aux[['mobile_likes','mobile_likes_received','likes']])
y = np.array(aux['gender_cod'])
X.shape
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Graph of the selected variables
fig = plt.figure()
ax = Axes3D(fig)
colores = ['blue','red','green','blue','cyan','yellow','orange','black','pink','brown','purple']
asignar = []
for row in y:
    asignar.append(colores[row])
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=asignar, s=60)

Finding k-means clusters

kmeans = KMeans(n_clusters=3).fit(X)
centroids = kmeans.cluster_centers_
print(centroids)
df2 = df_sc.copy()
df3 = df_pca.copy()
df3['cl'] = aux['cl'] = df2 ['cl'] = kmeans.predict(X)
# Obtaining the cluster labels
labels = kmeans.predict(X)
# Finding the centers of each cluster
C = kmeans.cluster_centers_
colores = ['green','blue','yellow']
asignar = []
for row in labels:
    asignar.append(colores[row])
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=asignar, s=60)
ax.scatter(C[:, 0], C[:, 1], C[:, 2], marker='*', c=colores, s=1000)

Results

aux['gender_cod'].value_counts(normalize=True)

Where 1 stands for female and 0 for male.

Validation

Results obtained for validation

All three groups are young adults, as this was the dominant group within the data.

The first group is fairly balanced between female and male, so we can say they are adult individuals, since of the three groups it has the highest average age; they do not use Facebook on mobile as much, they receive a good amount of likes, but they give more likes than they receive.

The second group has a higher percentage of men, so we will say they are young men who give more likes than they receive.

The last group is made up largely of women, so we will say they are young women who have much more activity on Facebook than the previous groups; they receive a large number of likes, but give even more likes than they receive.

3.7 Naive Bayes for gender
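Gaussian Naive Bayes applies Bayes' theorem under the naive assumption that the features are conditionally independent given the class; here it is trained on the features ranked highest by SelectKBest in order to predict the encoded gender variable.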

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sb
%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import SelectKBest
var = ['tenure', 'friend_count', 'friendships_initiated', 'likes', 'likes_received', 'mobile_likes', 'mobile_likes_received', 'www_likes', 'www_likes_received']
X = datos[var].copy()
y = datos['gender_cod'].copy()
from sklearn.feature_selection import SelectKBest, chi2
# Keep the five best features according to SelectKBest
best = SelectKBest(k=5)
X_new = best.fit_transform(X, y)
X_new.shape
selected = best.get_support(indices=True)
used_features = X.columns[selected]
print(used_features)
X_train, X_test = train_test_split(datos, test_size=0.3, random_state=6)
y_train = X_train['gender_cod']
y_test = X_test['gender_cod']
# Train the Gaussian Naive Bayes classifier on the selected features
gnb = GaussianNB()
gnb.fit(X_train[used_features].values, y_train)
y_pred = gnb.predict(X_test[used_features])
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Results

Summary of the model results

Training and test accuracy for each model

Conclusions and answers to the hypothesis

Classifying gender from the variables available in the data turns out to be very imprecise; however, grouping users by their activity on Facebook is more feasible.
It can be deduced that people in the 21 to 30 age range tend to use Facebook the most, with a greater inclination toward the male gender than the female; after age 30 the female gender occupies this social network more, giving more likes from mobile devices such as cell phones.
Even though ages 21 to 30 show the most frequent use of the social network Facebook, it was also observed that use of this social network begins between 11 and 19 years of age, slightly more often in men than in women, but with a very minimal difference.

In an age range of 15 to 25 years, greater use is made of the social network Facebook
A: The initial hypothesis is rejected because Facebook use is greatest between the ages of 21 and 30. To test this hypothesis we grouped by age using a clean, standardized data set.

The male gender is the one that generates more likes in the social network
A: We reject the initial hypothesis; in the work around the Naive Bayes model we observe that the female gender generates a greater number of likes, as specified in the implementation of the model.

From the age range of 55 to 60 years the interaction of users decreases
A: The initial hypothesis is accepted, since most of the recorded interaction comes from the young population in the 11 to 30 age range.

People who have more time using Facebook have more friends.
A: The initial hypothesis is rejected: the results of the Hierarchical Clustering model and its dendrogram show that having been on Facebook longer does not necessarily mean having a greater number of friends, which could point to a selection/pruning of contacts over time, leading users to keep mostly a small number of friends.

Work for the future

We hope to scale the project by adding data from other social networks such as Twitter, WhatsApp, Telegram, Instagram and YouTube to our analysis, in order to obtain a large volume of data that would allow us to apply Big Data, and specifically the strategy known as Social Big Data. This can be applied in companies because it makes it possible to know the behavior and trends of their consumers according to their particular segmentation, thus gaining a greater competitive advantage and improving their service. In addition, the results of the study can be applied in areas such as psychology, marketing and communication.

RESOURCES

Bagnato, J. I. (2011). Aprende Machine Learning.
https://www.aprendemachinelearning.com/regresion-lineal-en-espanol-con-python/#:~:text=La%20regresi%C3%B3n%20lineal%20es%20un,Machine%20Learning%20y%20en%20estad%C3%ADstica.&text=En%20estad%C3%ADsticas%2C%20regresi%C3%B3n%20lineal%20es,explicativas%20nombradas%20con%20%E2%80%9CX%E2%80%9D.

Anonymous. (2013). GeeksforGeeks.
https://www.geeksforgeeks.org/dunn-index-and-db-index-cluster-validity-indices-set-1/

Anonymous. (2005). Redes sociales: El uso y abuso.
http://www.unilibre.edu.co/bogota/ul/noticias/noticias-universitarias/2349-redes-sociales-el-us-y-el-abuso#:~:text=Las%20redes%20sociales%20son%20servicios,permiten%20interactuar%20con%20otros%20internautas.

El algoritmo K-NN y su importancia. (2020, September 1). El algoritmo KNN.
https://www.merkleinc.com/es/es/blog/algoritmo-knn-modelado-datos

Normas APA. (2001, April 15). El estado del arte.
https://normasapa.net/que-es-el-estado-del-arte/

Aspani, S., Sada, M., & Shabot, R. (2012). Facebook y vida cotidiana. Alternativas en psicología, 16(27), 107–114.

Vallespin-Aran, M. L., Anaya-Sanchez, R., Aguilar-Illescas, R., & Molinillo-Jimenez, S. (2020). Diferencias de género en el uso de las herramientas colaborativas para la realización de los trabajos en grupo.

Espinar-Ruiz, E., & González-Río, M. J. (2009). Jóvenes en las redes sociales virtuales: un análisis exploratorio de las diferencias de género.
