From 5e6afd41a6660cdc72deaf4a65dd0f4245016c80 Mon Sep 17 00:00:00 2001
From: Xiaming Chen
Date: Fri, 11 Dec 2015 13:02:51 +0800
Subject: [PATCH] Add hz_mobile and senegal_mobile repos

---
 README.md                                 | 39 ++++++++++++----
 repos/hz_mobile.md                        | 57 +++++++++++++++++++++++
 repos/senegal_mobile.md                   | 38 +++++++++++++++
 {porters => repos}/wifi_syslog.md         |  0
 {porters => repos}/wifi_syslog_session.md |  0
 5 files changed, 124 insertions(+), 10 deletions(-)
 create mode 100644 repos/hz_mobile.md
 create mode 100644 repos/senegal_mobile.md
 rename {porters => repos}/wifi_syslog.md (100%)
 rename {porters => repos}/wifi_syslog_session.md (100%)

diff --git a/README.md b/README.md
index e4a3416..95b0e09 100644
--- a/README.md
+++ b/README.md
@@ -7,25 +7,44 @@ Utilities for OMNILab data warehouse.
 
 OMNILab data warehouse is designed with three layers:
 
-* Layer0: Original raw data from multiple sources.
+* Layer0: Original raw data from multiple sources (NFS).
 
-* Layer1: Independent wide tables for each source after simple ETLing.
+* Layer1: Independent wide tables for each source after simple ETLing (HDFS).
 
-* Layer2: A bunch of small tables (models) after combining multiple data at Layer1 to fit different applications.
+* Layer2: A bunch of small tables (models) after combining multiple data at Layer1 to fit different applications (HDFS).
 
-In most scenarios, data administrators hold the access right to Layer0 and Layer1, and data users have accesses to
-the small tables at Layer2 to meet their requirements.
+In most scenarios, data administrators hold the access rights to Layer0 and Layer1, and data users have access to the
+small tables at Layer2 to meet their requirements.
 
-The data users can also contribute new data models to Layer2 when they develop a new type of table from application. In this process, other
-data sources may be involved to generate the new model. At this time, the user should contact admin to add new data sources to Layer0 or
-Layer1.
+The data users can also contribute new data models to Layer2 when they develop a new type of table from an application. In
+this process, other data sources may be involved to generate the new model. In that case, the user should contact an admin
+to add new data sources to Layer0 or Layer1.
+
+
+## Project structure
+
+* `etlers`: ETL tools for each data repo.
+
+* `porters`: automated scripts that periodically port a new repo using the ETL tools.
+
+* `repos`: documentation for each repo.
+
+* `global_config.sh`: global settings used by porters.
+
+* `README.md`: this file.
 
 
 ## Layer2 Repos
 
-* [WifiSyslogSession](https://github.com/OMNILab/OmniDataHouse/blob/master/porters/wifi_syslog_session.md)
+* [WifiSyslogSession](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_syslog_session.md)
 
 
 ## Layer1 Repos
 
-* [WifiSyslog](https://github.com/OMNILab/OmniDataHouse/blob/master/porters/wifi_syslog.md)
\ No newline at end of file
+* [HzMobile](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/hz_mobile.md)
+
+* [SenegalMobile](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/senegal_mobile.md)
+
+* [WifiSyslog](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_syslog.md)
+
+* [WifiTraffic](#)
diff --git a/repos/hz_mobile.md b/repos/hz_mobile.md
new file mode 100644
index 0000000..5334d59
--- /dev/null
+++ b/repos/hz_mobile.md
@@ -0,0 +1,57 @@
+# Hangzhou Data Description
+
+This data set contains user web-browsing logs from 19 days across two months,
+August and October 2012. The network topology covers the main areas of
+Hangzhou City and Wenzhou City, Zhejiang Province.
+
+
+## Data path
+
+    hdfs://user/omnilab/warehouse/HzMobile/hzclean
+
+
+## Data Columns
+
+This folder holds the data set after cleansing and formatting. Each record
+contains 27 independent columns separated by '\t' that describe user web-browsing activity.
+
+* ttime (double): timestamp when the web request was issued, in seconds
+* dtime (double): timestamp when the request ended or this log was dumped, in seconds
+* BS (long): signature of individual base stations (LAC*10^6 + CI)
+* IMSI (string): user IMSI signature
+* mobile_type (string): signature of mobile client type
+* dest_ip (long): destination IP address
+* dest_port (int): destination TCP port
+* success (long): whether the web request succeeded
+* failure_cause (string): reason for web request failure
+* response_time (long): time delay from request to the first byte of response
+* host (string): host name of web request
+* content_length (long): content-length field of HTTP header
+* retransfer_count (long): the number of retransmissions
+* packets (long): the number of network packets
+* status_code (int): HTTP status code
+* web_volume (long): number of bytes transferred for the web request
+* content_type (string): content-type field of HTTP header
+* user_agent (string): MD5 value of user-agent field of HTTP header
+* is_mobile (int): whether the client is a mobile device
+* e_gprs (int): E_GPRS mode indicator
+* umts_tdd (int): UMTS/TDD mode indicator
+* ICP (long): classification of Internet Content Providers, e.g., Netease
+* SC (string): service classification, e.g., video, music.
+* URI (string): Uniform Resource Identifier
+* OS (string): operating system type
+* LON (double): longitude of base station location
+* LAT (double): latitude of base station location
+
+
+## Data Stat
+
+* Total logs: 852314304
+* Total unique users:
+* Total base stations:
+
+
+## Data sample
+
+    1345084549.229 1345085752.000 22696030330 460022688112277 1862344734 80 1 2000 storage7.cdn.kugou.com 17221 0 12 206 16384 application/octet-stream 12 0 13 酷狗音乐网 /201208161032/602b72233338bcf732ed1a0d1ab9de0e/M01/11/DC/OtfxyE_xEgf9hd16AB-VzJZmo9g824.m4a 119.06104 29.615866
+    1345084666.528 1345085752.000 22696030330 460007472554744 com.sina.weibo 1862344789 80 1 1880 tp4.sinaimg.cn 3079 0 3 200 2550 image/jpeg zMu/gx+2nvFedIg8HudOww== 1 10 0 32 新浪微博 /2044794631/50/5613889201/0 IOS 119.06104 29.615866
diff --git a/repos/senegal_mobile.md b/repos/senegal_mobile.md
new file mode 100644
index 0000000..a67e5a1
--- /dev/null
+++ b/repos/senegal_mobile.md
@@ -0,0 +1,38 @@
+# Senegal Mobile Data
+
+This repo contains data sets for the second Data for Development (D4D) Challenge. The data were collected from Orange's
+mobile phone users in Senegal and consist of three subsets:
+
+* Dataset 1: One year of site-to-site traffic for 1666 sites on an hourly basis,
+
+* Dataset 2: Fine-grained mobility data (site level) on a rolling 2-week basis with bandicoot behavioral indicators at
+  individual level for about 300,000 randomly sampled users meeting the two criteria mentioned before for each 2-week
+  period,
+
+* Dataset 3: One year of coarse-grained (123 arrondissement level) mobility data with bandicoot behavioral indicators at
+  individual level for about 150,000 randomly sampled users meeting the two criteria mentioned before for a year.
+
+
+## Data path
+
+    hdfs://user/omnilab/warehouse/Senegal
+
+
+## Data format
+
+For more details on data collection, preprocessing, and format, refer to [this
+paper](http://arxiv.org/abs/1407.4885).
+
+
+## Data sample
+
+    1,2013-01-07 13:10:00,461
+    1,2013-01-07 17:20:00,454
+    1,2013-01-07 17:30:00,454
+    1,2013-01-07 18:40:00,327
+    1,2013-01-07 20:30:00,323
+    1,2013-01-08 18:40:00,323
+    1,2013-01-08 19:30:00,323
+    1,2013-01-08 21:00:00,323
+    1,2013-01-09 11:00:00,323
+    1,2013-01-09 14:50:00,323
\ No newline at end of file
diff --git a/porters/wifi_syslog.md b/repos/wifi_syslog.md
similarity index 100%
rename from porters/wifi_syslog.md
rename to repos/wifi_syslog.md
diff --git a/porters/wifi_syslog_session.md b/repos/wifi_syslog_session.md
similarity index 100%
rename from porters/wifi_syslog_session.md
rename to repos/wifi_syslog_session.md
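
Review note on `repos/hz_mobile.md`: the `BS` column is documented as `LAC*10^6 + CI`, so a consumer of the table may want to split it back into its two parts. The following is a minimal Python sketch of that packing and unpacking, not code from this repository. The function names `encode_bs` and `decode_bs` are hypothetical, and the sketch assumes CI is always below 10^6, which the patch does not state explicitly.

```python
from typing import Tuple

# Multiplier taken from the column description: BS = LAC*10^6 + CI.
BS_FACTOR = 10 ** 6


def encode_bs(lac: int, ci: int) -> int:
    """Pack a location area code and cell ID into one BS signature.

    Assumption: 0 <= ci < 10^6, otherwise the packing is not reversible.
    """
    if not 0 <= ci < BS_FACTOR:
        raise ValueError("CI must be in [0, 10^6) for a reversible BS signature")
    return lac * BS_FACTOR + ci


def decode_bs(bs: int) -> Tuple[int, int]:
    """Split a BS signature back into (LAC, CI)."""
    return divmod(bs, BS_FACTOR)


if __name__ == "__main__":
    # BS value taken from the data sample in repos/hz_mobile.md.
    lac, ci = decode_bs(22696030330)
    print(lac, ci)  # prints "22696 30330"
```

Using `divmod` keeps the decode step a single integer operation, which matters little here but avoids any floating-point rounding on values of this size.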