dinoolivo edited this page Jul 14, 2015 · 1 revision

Consumer GRA (Gender Recognition Algorithm)

Introduction

Twitter does not expose the gender of its users. This information is nonetheless valuable for analytics purposes, e.g. for marketing research or for understanding whether a target event attracted more interest from male or from female users. Supporting business analytics is the motivation for this work: the development of a gender recognition algorithm (GRA) whose purpose is to classify the gender of Twitter users.

For details on how the algorithm works and on the results achieved, see the document /documents/GenderRecognitionAlgorithmGRA.pdf

This consumer adds gender information to the aggregate tweet counts (e.g. 10 tweets made by males, 2 retweets made by females, and so on).

There are two versions of this component:

  • Stream
  • Batch

Both modules are built on a core module whose aim is to classify the gender of a Twitter user from their profile information. The Gender Recognition Algorithm contains 3 sub-algorithms:

name/screenName recognition

This sub algorithm expects key/value pairs in the form of name/gender. In its default implementation the module loads a file in the confs/consumers/consumer-gra folder called names_gender.txt. This file contains the key/value pairs in the following format:

name,gender

with a comma as field separator. The file already ships with some names and their related gender.
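As an illustration of the file format above, here is a minimal Python sketch that loads name,gender lines into an in-memory lookup. It is not the Java default implementation shipped with GRA: the function names, the gender codes m and f, the lower-casing and the token-by-token lookup are all assumptions made for the example.

```python
def load_names_gender(lines):
    """Build a name -> gender lookup from 'name,gender' lines."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma so the gender is always the final field.
        name, sep, gender = line.rpartition(",")
        if not sep:
            continue  # no separator on this line: ignore it
        mapping[name.strip().lower()] = gender.strip().lower()
    return mapping


def guess_gender(mapping, display_name):
    """Return the gender of the first token found in the mapping, else None."""
    for token in display_name.lower().split():
        if token in mapping:
            return mapping[token]
    return None


# Hypothetical file content, for illustration only.
sample = ["john,m", "mary,f", ""]
names = load_names_gender(sample)
print(guess_gender(names, "Mary Smith"))  # prints "f"
print(guess_gender(names, "xyz"))         # prints "None"
```

A name that is missing from the lookup yields no answer, which is where the description and color classifiers would be expected to take over.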

The user can replace the default implementation by implementing the interface NamesGenderMap. Then, in GraConsumer.properties, the property namesGenderMapImplClass has to be set to the fully qualified name of the new implementation. If the new implementation needs some properties (for example a db connection url), these can be added to the file names_gender_mapping_impl.conf in the form of key/value pairs.

recognize gender from profile description and colors

These two sub-modules use a classifier internally. The classifier class must implement the MlModel interface, which provides an initialization method to train the classifier and a predict method that classifies the gender of the target user from a sparse vector of features. GRA core provides a Naive Bayes implementation of MlModel in the class NBModel. The user is free to replace it with a custom implementation based on a different classifier. To link the new implementation, edit the coloursModelImplClass and descrModelImplClass properties in the GraConsumer.properties file.
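The train/predict contract described above can be sketched in Python as follows. This is only an analogue for illustration: the real MlModel interface is Java and its method signatures are not shown on this page, so the class name NaiveBayesSketch, the method names, the dict-based sparse vectors and the Laplace smoothing are assumptions.

```python
import math
from collections import defaultdict


class NaiveBayesSketch:
    """Hypothetical analogue of a classifier with an init (train) and a predict method."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.num_features = 0
        self.log_prior = {}
        self.feature_counts = {}
        self.feature_totals = {}

    def init(self, labelled_vectors, num_features):
        """Train from (label, {feature_index: count}) pairs."""
        self.num_features = num_features
        label_counts = defaultdict(int)
        counts = defaultdict(lambda: defaultdict(float))
        for label, vector in labelled_vectors:
            label_counts[label] += 1
            for index, value in vector.items():
                counts[label][index] += value
        total = sum(label_counts.values())
        for label, n in label_counts.items():
            self.log_prior[label] = math.log(n / total)
            self.feature_counts[label] = dict(counts[label])
            self.feature_totals[label] = sum(counts[label].values())

    def predict(self, vector):
        """Return the label with the highest log-posterior for a sparse vector."""
        best_label, best_score = None, -math.inf
        for label, prior in self.log_prior.items():
            denom = self.feature_totals[label] + self.smoothing * self.num_features
            score = prior
            for index, value in vector.items():
                count = self.feature_counts[label].get(index, 0.0)
                score += value * math.log((count + self.smoothing) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set: label 0 uses features 1 and 8, label 1 uses features 3 and 5.
training = [
    (0, {1: 2, 8: 1}),
    (0, {1: 1, 8: 2}),
    (1, {3: 2, 5: 1}),
    (1, {3: 1, 5: 1}),
]
model = NaiveBayesSketch()
model.init(training, num_features=8)
print(model.predict({1: 1}))  # prints 0
print(model.predict({5: 1}))  # prints 1
```

Any classifier exposing the same two operations could be swapped in the same way, which is the point of keeping the model behind an interface.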

recognize gender from profile description

Create the training set and save it in LIBSVM format

Create a file containing training data with the following format:

<gender>FS<user profile description>

e.g.
m,the pen is on the table

where FS is the field separator (a comma in the example). Then run the Python script (located in $SDA_HOME/sda-tools/python_scripts/sda_gra_tools/gra_usr_descr.py) that converts the training set to LIBSVM format (the resulting file is used afterwards to feed the description gender classifier of GRA core):

$SPARK_HOME/bin/spark-submit --master local[*] gra_usr_descr.py --i <training data location> --algo tf 

The algo option can be tf (term frequency) or tf-idf (term frequency–inverse document frequency). To produce a file for training in GRA core, always use tf, even if you plan to apply the TF-IDF algorithm: the GRA description module computes the TF-IDF weights itself from the term frequencies. Feeding it a file generated with tf-idf could lead to erroneous predictions.
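The reason for this rule can be seen in a small Python sketch that derives IDF weights from raw term-frequency vectors and then applies them. The smoothed formula log((N + 1) / (df + 1)) is the one documented for Spark MLlib's IDF; whether GRA core uses exactly this formula is an assumption, and the function names are hypothetical.

```python
import math


def idf_weights(tf_vectors):
    """Compute a smoothed IDF weight per feature index from raw TF vectors."""
    num_docs = len(tf_vectors)
    doc_freq = {}
    for vector in tf_vectors:
        for index in vector:
            doc_freq[index] = doc_freq.get(index, 0) + 1
    return {i: math.log((num_docs + 1) / (df + 1)) for i, df in doc_freq.items()}


def apply_idf(vector, idf):
    """Scale every term count by the IDF weight of its feature index."""
    return {i: count * idf.get(i, 0.0) for i, count in vector.items()}


# Three toy documents as {feature_index: term_count}.
tf_vectors = [{10: 1, 20: 2}, {10: 1, 30: 1}, {10: 3}]
idf = idf_weights(tf_vectors)

once = apply_idf(tf_vectors[0], idf)  # correct: TF input weighted once
twice = apply_idf(once, idf)          # what happens if the input was already TF-IDF
print(once)
print(twice)
```

In the sketch, weighting an already weighted vector scales the values a second time, so the classifier would be trained on features that do not match the ones computed at prediction time.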

Below is an example of the output file in LIBSVM format:

0 14955:1 16284:1 61154:1 86485:1 108074:1 168298:1 224032:1 228823:1 238246:1
0 228:1 6293:1 31852:1 66186:1 103560:1 109452:1 116014:1 132917:1 177241:1 194778:1 200529:1 222879:1
0 50892:1 57911:1 140459:1 143926:1 198102:1 226265:1 246321:1 256253:1
1 84172:1 101480:1 168384:1 212544:1 252792:1
1 2091:1 33157:1 35412:1 39705:1 57535:1 70700:1 76150:1 92249:1 96011:1 104809:1 124240:1 127061:1 207234:1 249431:3
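To make the mapping from a training line to a LIBSVM line more concrete, here is a minimal pure-Python sketch of hashed term frequencies. It does not reproduce gra_usr_descr.py: the feature-space size, the label encoding (m as 0, f as 1), the tokenizer and the crc32 hash are assumptions, so the indices it produces will not match the sample output above.

```python
import re
import zlib

NUM_FEATURES = 1 << 18     # assumed feature-space size; not stated on this page
LABELS = {"m": 0, "f": 1}  # assumed label encoding


def to_libsvm(line, sep=","):
    """Convert '<gender>FS<description>' into a LIBSVM line of hashed term counts."""
    # Split only once: the description itself may contain the separator.
    gender, description = line.split(sep, 1)
    counts = {}
    for token in re.findall(r"\w+", description.lower()):
        # crc32 is a stand-in hash; LIBSVM indices are 1-based.
        index = zlib.crc32(token.encode("utf-8")) % NUM_FEATURES + 1
        counts[index] = counts.get(index, 0) + 1
    # LIBSVM expects feature indices in ascending order.
    pairs = [f"{i}:{c}" for i, c in sorted(counts.items())]
    return " ".join([str(LABELS[gender.strip().lower()])] + pairs)


print(to_libsvm("m,the pen is on the table"))
```

Each output line starts with the numeric label, followed by index:count pairs; a word that occurs twice in the description (here "the") gets a count of 2.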

recognize gender from profile color

Create the training set and save it in LIBSVM format

Create a file containing training data with the following format:

<gender>FS<profileBackgroundColor>FS<profileTextColor>FS<profileLinkColor>FS<profileSidebarFillColor>FS<profileSidebarBorderColor>

e.g.

m,9AE4E8,030202,0D0808,949B84,949B84

where FS is the field separator (a comma in the example). Then run the Python script (located in $SDA_HOME/sda-tools/python_scripts/sda_gra_tools/gra_usr_color.py) that converts the training set to LIBSVM format (the resulting file is used afterwards to feed the color gender classifier of GRA core):

$SPARK_HOME/bin/spark-submit --master local[*] gra_usr_color.py --i <training data location> --numcols 4 --nbits 9 --fdc

where:

  • numcols is the number of profile colors to consider (out of the 5 profile colors)
  • nbits is the number of bits to which each color is scaled (for example from 24 to 9 bits in total, i.e. 3 bits for each RGB channel)
  • fdc (filter default colors): set this option to filter Twitter's default color configuration out of the training set

Below is an example of the output file in LIBSVM format (4 colors and 9 bits mapping):

0 1:1 8:1 234:1 445:1
0 1:1 445:1 481:1 512:1
0 1:2 8:1 445:1
0 148:1 284:1 365:1 373:1
0 74:1 154:1 303:1 375:1
0 1:1 74:1 102:1 311:1
0 1:1 66:1 147:1 302:1
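The color reduction and the resulting feature vector can be sketched in Python as follows. This is not the code of gra_usr_color.py: keeping the top bits of each channel, taking the first numcols colors in file order, using 1-based indices equal to the reduced value plus one, and the label encoding (m as 0, f as 1) are assumptions that merely agree with the 1 to 512 index range of the sample output.

```python
LABELS = {"m": 0, "f": 1}  # assumed label encoding


def reduce_color(hex_color, nbits=9):
    """Scale a 24-bit 'RRGGBB' color down to nbits by keeping the top bits of each channel."""
    per_channel = nbits // 3
    value = int(hex_color, 16)
    reduced = 0
    for shift in (16, 8, 0):  # red, green, blue
        channel = (value >> shift) & 0xFF
        reduced = (reduced << per_channel) | (channel >> (8 - per_channel))
    return reduced


def colors_to_libsvm(line, numcols=4, nbits=9, sep=","):
    """Convert '<gender>FS<color>FS...' into a LIBSVM line of reduced-color counts."""
    fields = line.strip().split(sep)
    gender, colors = fields[0], fields[1:1 + numcols]
    counts = {}
    for color in colors:
        index = reduce_color(color, nbits) + 1  # 1-based feature index
        counts[index] = counts.get(index, 0) + 1
    pairs = [f"{i}:{c}" for i, c in sorted(counts.items())]
    return " ".join([str(LABELS[gender.lower()])] + pairs)


print(colors_to_libsvm("m,9AE4E8,030202,0D0808,949B84,949B84"))
# prints "0 1:2 293:1 320:1"
```

With 9 bits there are 2^9 = 512 possible reduced colors, which is why nbits determines the number of input features; two near-black colors such as 030202 and 0D0808 collapse onto the same feature and are counted twice.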

GRA properties configurations

| Property | Optional | Default | Description |
|---|---|---|---|
| coloursModelImplClass | YES | com.tilab.ca.sda.gra_core.ml.NBModel | Class that implements the classifier for predictions from profile colours (the default implementation uses a Naive Bayes classifier) |
| colorAlgoReductionNumBits | YES | 9 | Number of bits to which each profile color is scaled (from the original 24 bits). It determines the number of input features of the color classification algorithm |
| colorAlgoNumColorsToConsider | YES | 4 | Number of profile colors to consider (5 means all colors, 1 just the profile background color) |
| descrModelImplClass | YES | com.tilab.ca.sda.gra_core.ml.NBModel | Class that implements the classifier for predictions from the profile description (the default implementation uses a Naive Bayes classifier) |
| featureExtractionClassImpl | YES | com.tilab.ca.sda.gra_core.ml.FeaturesExtractionTFIDF | Class that implements the feature extraction module. Two implementations are available: FeaturesExtractionTF, which implements the term frequency algorithm, and FeaturesExtractionTFIDF (see https://en.wikipedia.org/wiki/Tf–idf for more information) |
| namesGenderMapImplClass | YES | com.tilab.ca.sda.gra_core.components.NamesGenderMapDefaultImpl | Class that maps keywords (person names, or keywords used to recognize pages, e.g. news) to gender (the default implementation is an in-memory name/gender hash map). Data for the default implementation are stored under the GRA configuration folder |
| trainingFilesPath | NO | - | Path where the GRA training files that feed the classifiers (colors and descr) are stored. Use a distributed filesystem path to avoid undesired behaviour |