This section describes the individual data generation scripts to build the nodes and edges of the artificial social network.
As mentioned in the root level README, a shell script generate_data.sh
is provided that sequentially runs the Python scripts from this directory, generating the data for the nodes and edges for the social network. This is the recommended way to generate the data. A single positional argument is provided to the shell script: The number of person profiles to generate, specified as an integer value as shown below.
# Generate data for 100K persons
bash generate_data.sh 100000
Running this command generates a series of files in the output
directory, following which we can proceed to ingesting the data into a graph database.
First, fake male and female profile information is generated for the number of people required to be in the network.
$ cd data
# Create a dataset of fake profiles for men and women with a 50-50 split by gender
$ python create_nodes_person.py -n 100000
The parquet file generated fake person metadata, and looks like the below.
id | name | gender | birthday | age | isMarried |
---|---|---|---|---|---|
1 | Kenneth Scott | male | 1984-04-14 | 39 | true |
2 | Stephanie Lozano | female | 1993-12-31 | 29 | true |
3 | Thomas Williams | male | 1979-02-09 | 44 | true |
Because the parquet format encodes the data types as inferred from the underlying arrow schema, we can be assured that the data, for example, age
, is correctly stored of the type date
. This reduces the verbosity of the code when compared to the CSV format, which would required us to clearly specify the separator and then re-parse the data to the correct type when using it downstream.
To generate a list of cities that people live in, we use the world cities dataset from Kaggle. This is an accurate and up-to-date database of the world's cities and towns, including lat/long and population information of ~44k cities all over the world.
To make this dataset simpler and more realistic, we only consider cities from the following three countries: US
, UK
and CA
.
$ python create_nodes_location.py
Wrote 7117 cities to parquet
Wrote 273 states to parquet
Wrote 3 countries to parquet
Three parquet files are generated accordingly for cities, states and the specified countries. Latitude, longitude and population are the additional metadata fields for each city, each stored with the appropriate data type within the file's schema.
id | city | state | country | lat | lng | population |
---|---|---|---|---|---|---|
1 | Airdrie | Alberta | Canada | 51.2917 | -114.0144 | 61581 |
2 | Beaumont | Alberta | Canada | 53.3572 | -113.4147 | 17396 |
id | state | country |
---|---|---|
1 | Alberta | Canada |
2 | British Columbia | Canada |
3 | Manitoba | Canada |
id | country |
---|---|
1 | Canada |
2 | United Kingdom |
3 | United States |
A static list of interests/hobbies that a person could have is included in raw/interests.parquet
. This is cleaned up and formatted as required by the data generator script.
$ python create_nodes_interests.py
This generates data as shown below.
id | interest |
---|---|
1 | Anime |
2 | Art & Painting |
3 | Biking |
Edges are generated between people in a similar way to the way we might imagine social networks. A Person
follows another Person
, with the direction of the edge signifying something meaningful. Rather than just generating a uniform distribution, to make the data more interesting, during generation, a small fraction of the profiles (~0.5%) is chosen to be highly connected. This resembles the role of "influencers" in real-world graphs, and in graph terminology, the nodes representing these persons can be called "hubs". The rest of the nodes are connected via these hubs in a random fashion.
python create_edges_follows.py
This generates data as shown below, where the from
column contains the ID of a person who is following someone, and the to
column contains the ID of the person being followed.
from | to |
---|---|
50 | 1 |
152 | 1 |
271 | 1 |
The "hub" nodes can be connected to anywhere from 0.5-5% of the number of persons in the graph.
Edges are generated between people and the cities they live in. This is done by randomly choosing a city for each person from the list of cities generated earlier.
$ python create_edges_location.py
The data generated contains the person ID in the from
column and the city ID in the to
column.
from | to |
---|---|
1 | 6015 |
2 | 6296 |
3 | 6657 |
Edges are generated between people and the interests they have. This is done by randomly choosing anywhere from 1-5 interests for each person from the list of interests generated earlier for the nodes.
python create_edges_interests.py
The data generated contains the person ID in the from
column and the interest ID in the to
column.
from | to |
---|---|
1 | 24 |
2 | 4 |
2 | 8 |
A person can have multiple interests, so the from
column can have multiple rows with the same ID.
Edges are generated between cities and the states they are in, as per the cities.parquet
file
python create_edges_city_state.py
The data generated contains the city ID in the from
column and the state ID in the to
column.
from | to |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
Edges are generated between states and the countries they are in, as per the states.parquet
file
python create_edges_state_country.py
The data generated contains the state ID in the from
column and the country ID in the to
column.
from | to |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
The following files are generated by the scripts in this directory.
In the ./output/nodes
directory, the following files are generated.
persons.parquet
interests.parquet
cities.parquet
states.parquet
countries.parquet
In the ./output/edges
directory, the following files are generated.
follows.parquet
lives_in.parquet
interests.parquet
city_in.parquet
state_in.parquet