Add hz_mobile and senegal_mobile repos
caesar0301 committed Dec 11, 2015
1 parent 191e069 commit 5e6afd4
Showing 5 changed files with 124 additions and 10 deletions.
39 changes: 29 additions & 10 deletions README.md
@@ -7,25 +7,44 @@ Utilities for OMNILab data warehouse.

OMNILab data warehouse is designed with three layers:

* Layer0: Original raw data from multiple sources.
* Layer0: Original raw data from multiple sources (NFS).

* Layer1: Independent wide tables for each source after simple ETLing.
* Layer1: Independent wide tables for each source after simple ETLing (HDFS).

* Layer2: A bunch of small tables (models) after combining multiple data at Layer1 to fit different applications.
* Layer2: A bunch of small tables (models) after combining multiple data at Layer1 to fit different applications (HDFS).

In most scenarios, data administrators hold access rights to Layer0 and Layer1, and data users have access to the
small tables at Layer2 to meet their requirements.

Data users can also contribute new data models to Layer2 when they develop a new type of table for an application. If
generating the new model involves other data sources, the user should contact an administrator to add those sources
to Layer0 or Layer1.


## Project structure

* `etlers`: ETL tools for each data repo.

* `porters`: automation scripts that periodically port a new repo using the ETL tools.

* `repos`: documentation for each repo.

* `global_config.sh`: global settings used by porters.

* `README.md`: this file.


## Layer2 Repos

* [WifiSyslogSession](https://github.com/OMNILab/OmniDataHouse/blob/master/porters/wifi_syslog_session.md)
* [WifiSyslogSession](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_syslog_session.md)


## Layer1 Repos

* [WifiSyslog](https://github.com/OMNILab/OmniDataHouse/blob/master/porters/wifi_syslog.md)
* [HzMobile](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/hz_mobile.md)

* [SenegalMobile](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/senegal_mobile.md)

* [WifiSyslog](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_syslog.md)

* [WifiTraffic](#)
57 changes: 57 additions & 0 deletions repos/hz_mobile.md
@@ -0,0 +1,57 @@
# HangZhou Data Description

This data set contains user web-browsing logs covering 19 days in August and
October 2012. The network topology covers the main areas of Hangzhou City and
Wenzhou City, Zhejiang Province.


## Data path

hdfs://user/omnilab/warehouse/HzMobile/hzclean


## Data Columns

This folder holds the data set after cleansing and formatting. Each record
contains 27 independent columns separated by '\t' that describe user web-browsing activity.

* ttime (double): timestamp at which the web request was issued, in seconds
* dtime (double): timestamp at which the request ended or this log was dumped, in seconds
* BS (long): signature of individual base stations (LAC*10^6 + CI)
* IMSI (string): user IMSI signature
* mobile_type (string): signature of mobile client type
* dest_ip (long): destination IP address
* dest_port (int): destination TCP port
* success (long): indicating if the web request succeeded
* failure_cause (string): cause of the web request failure
* response_time (long): time delay from request to the first byte of response
* host (string): host name of web request
* content_length (long): content-length field of HTTP header
* retransfer_count (long): the number of retransmissions
* packets (long): the number of network packets
* status_code (int): HTTP status code
* web_volume (long): number of bytes transferred for the web request
* content_type (string): content-type field of HTTP header
* user_agent (string): MD5 value of user-agent field of HTTP header
* is_mobile (int): whether the client is a mobile device
* e_gprs (int): E_GPRS mode indicator
* umts_tdd (int): UMTS/TDD mode indicator
* ICP (long): classification of Internet Content Providers, e.g., Netease
* SC (string): service classification, e.g., video, music.
* URI (string): Uniform resource identifier
* OS (string): operating system type
* LON (double): longitude of base station location
* LAT (double): latitude of base station location
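
As a rough illustration of how the column list above could be consumed, the Python sketch below splits one tab-separated record into a dict keyed by column name. The helper names (`parse_hz_record`, `split_bs`, `dest_ip_to_str`) are hypothetical and not part of this repo. The sketch also assumes that `CI` is always below 10^6 and that `dest_ip` is an IPv4 address stored as an integer; neither is confirmed by the description.

```python
import ipaddress

# Column names in the order listed above (27 in total).
HZ_COLUMNS = [
    "ttime", "dtime", "BS", "IMSI", "mobile_type", "dest_ip", "dest_port",
    "success", "failure_cause", "response_time", "host", "content_length",
    "retransfer_count", "packets", "status_code", "web_volume",
    "content_type", "user_agent", "is_mobile", "e_gprs", "umts_tdd",
    "ICP", "SC", "URI", "OS", "LON", "LAT",
]


def parse_hz_record(line):
    """Split one tab-separated log line into a dict keyed by column name.

    Values are kept as strings so that empty fields survive unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(HZ_COLUMNS):
        raise ValueError(
            f"expected {len(HZ_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(HZ_COLUMNS, fields))


def split_bs(bs):
    """Invert BS = LAC*10^6 + CI into (LAC, CI), assuming CI < 10^6."""
    return divmod(int(bs), 10**6)


def dest_ip_to_str(dest_ip):
    """Render dest_ip as dotted-quad text, assuming an integer-encoded IPv4 address."""
    return str(ipaddress.IPv4Address(int(dest_ip)))
```

Keeping values as strings avoids guessing how missing fields are encoded; callers can convert individual columns to the listed types once they know the field is present.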


## Data Stat

* Total logs: 852314304
* Total unique users:
* Total base stations:


## Data sample

1345084549.229 1345085752.000 22696030330 460022688112277 1862344734 80 1 2000 storage7.cdn.kugou.com 17221 0 12 206 16384 application/octet-stream 12 0 13 酷狗音乐网 /201208161032/602b72233338bcf732ed1a0d1ab9de0e/M01/11/DC/OtfxyE_xEgf9hd16AB-VzJZmo9g824.m4a 119.06104 29.615866
1345084666.528 1345085752.000 22696030330 460007472554744 com.sina.weibo 1862344789 80 1 1880 tp4.sinaimg.cn 3079 0 3 200 2550 image/jpeg zMu/gx+2nvFedIg8HudOww== 1 10 0 32 新浪微博 /2044794631/50/5613889201/0 IOS 119.06104 29.615866
38 changes: 38 additions & 0 deletions repos/senegal_mobile.md
@@ -0,0 +1,38 @@
# Senegal Mobile Data

This repo contains data sets for the second Data for Development (D4D) Challenge. The data were collected from Orange's
mobile phone users in Senegal and consist of three subsets:

* Dataset 1: One year of site-to-site traffic for 1666 sites on an hourly basis,

* Dataset 2: Fine-grained mobility data (site level) on a rolling 2-week basis with bandicoot behavioral indicators at
individual level for about 300,000 randomly sampled users meeting the two criteria mentioned before for each 2 week
period,

* Dataset 3: One year of coarse-grained (123 arrondissement level) mobility data with bandicoot behavioral indicators at
individual level for about 150,000 randomly sampled users meeting the two criteria mentioned before for a year


## Data path

hdfs://user/omnilab/warehouse/Senegal


## Data format

For details on data collection, preprocessing, and format, refer to [this
paper](http://arxiv.org/abs/1407.4885).


## Data sample

1,2013-01-07 13:10:00,461
1,2013-01-07 17:20:00,454
1,2013-01-07 17:30:00,454
1,2013-01-07 18:40:00,327
1,2013-01-07 20:30:00,323
1,2013-01-08 18:40:00,323
1,2013-01-08 19:30:00,323
1,2013-01-08 21:00:00,323
1,2013-01-09 11:00:00,323
1,2013-01-09 14:50:00,323
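
Rows shaped like the sample above could be parsed with a short Python sketch such as the following. The field meanings (`user_id`, `timestamp`, `site_id`) are an assumption based on the layout of the sample; they are not stated in this file, so check them against the referenced paper. The names `SiteRecord` and `parse_site_records` are hypothetical.

```python
import csv
from datetime import datetime
from typing import NamedTuple


class SiteRecord(NamedTuple):
    user_id: int         # assumed meaning of the first field
    timestamp: datetime  # second field, "YYYY-MM-DD HH:MM:SS"
    site_id: int         # assumed meaning of the third field


def parse_site_records(lines):
    """Yield one SiteRecord per non-empty comma-separated row."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        user, ts, site = row
        yield SiteRecord(
            int(user),
            datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
            int(site),
        )
```

`parse_site_records` accepts any iterable of text lines, so it can be fed an open file object directly.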
File renamed without changes.
File renamed without changes.
