Add TCP and UDP repos
caesar0301 committed Dec 27, 2015
1 parent 36af77c commit dcb00a3
Showing 6 changed files with 237 additions and 28 deletions.
34 changes: 18 additions & 16 deletions README.md
@@ -21,39 +21,41 @@ this process, other data sources may be involved to generate the new model. At t
to add new data sources to Layer0 or Layer1.


## Project structure
## Layer2 Repos

* `etlers`: source code of ETL tools.
* [WifiSyslogSession](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_syslog_session.md)

* `deploy`: folder to deploy binary ETL tools referred by `porters`.

* `porters`: automatic scripts to port a new repo periodically with ETL tools.
## Layer1 Repos

* `repos`: documentation for each repo.
* [HzMobile](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/hz_mobile.md)

* `global_config.sh`: global settings used by porters.
* [SenegalMobile](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/senegal_mobile.md)

* `workflow.sh`: global workflow to run periodically.
* [WifiSyslog](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_syslog.md)

* `README.md`: this file.
* [WifiTrafficHTTP](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_traffic_http.md)

* [WifiTrafficTCP](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_traffic_tcp.md)

## Layer2 Repos
* [WifiTrafficUDP](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_traffic_udp.md)

* [WifiSyslogSession](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_syslog_session.md)
* [WifiUsers](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_users.md)


## Layer1 Repos
## Project structure

* [HzMobile](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/hz_mobile.md)
* `etlers`: source code of ETL tools.

* [SenegalMobile](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/senegal_mobile.md)
* `deploy`: folder to deploy binary ETL tools referred by `porters`.

* [WifiSyslog](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_syslog.md)
* `porters`: automatic scripts to port a new repo periodically with ETL tools.

* [WifiTrafficHTTP](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_traffic_http.md)
* `repos`: documentation for each repo.

* [WifiUsers](https://github.com/OMNILab/OmniDataHouse/blob/master/repos/wifi_users.md)
* `global_config.sh`: global settings used by porters.

* `workflow.sh`: global workflow to run periodically.


## Instructions to add a new repo.
41 changes: 30 additions & 11 deletions porters/wifi_traffic_tcp.sh
@@ -20,26 +20,45 @@ if [ ! -d $WIFI_TRAFFIC_PATH ]; then
fi

year=`date -d "yesterday" "+%Y"`
month=`date -d "yesterday" "+%b"`
month2=`date -d "yesterday" "+%m"`
monthChr=`date -d "yesterday" "+%b"`
monthDig=`date -d "yesterday" "+%m"`
day=`date -d "yesterday" "+%d"`

INPUT_PATH=$WIFI_TRAFFIC_PATH/$year$month/tcp/*_$day_$month_$year.out
monthnames=(invalid Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)
if [ XXOO$1 != "XXOO" ]; then
year=`echo $1 | cut -d "-" -f1`
monthDig=`echo $1 | cut -d "-" -f2`
# Force base 10 so that "08" and "09" are not rejected as invalid octal
monthChr=${monthnames[10#${monthDig}]}
day=`echo $1 | cut -d "-" -f3`
fi

OUTPUT_TCP=$HDFS_WIFI_TRAFFIC/TCP/$year$month2$day
OUTPUT_TCP_NOCOMPLETE=$HDFS_WIFI_TRAFFIC/TCP_NOCOMPLETE/$year$month2$day
OUTPUT_UDP=$HDFS_WIFI_TRAFFIC/UDP/$year$month2$day
INPUT_PATH=$WIFI_TRAFFIC_PATH/${year}${monthDig}/tcp
OUTPUT_TCP=$HDFS_WIFI_TRAFFIC/TCP/$year$monthDig$day
OUTPUT_TCP_NOCOMPLETE=$HDFS_WIFI_TRAFFIC/TCP_NOCOMPLETE/$year$monthDig$day
OUTPUT_UDP=$HDFS_WIFI_TRAFFIC/UDP/$year$monthDig$day

# Decompress files WITHOUT further processing
for file in `ls $INPUT_PATH`; do
for ((i = 0; i < 24; i++)); do
hour=`printf "%02d" $i`
file=$INPUT_PATH/${hour}_00_${day}_${monthChr}_${year}.out
echo $file
rfname=${file%.*}

if ! hadoop fs -test -e $INPUT_TEMP/`basename $rfname`; then
gunzip -c $file | hadoop fs -put - $INPUT_TEMP/`basename $rfname`
if ! hadoop fs -test -e ${OUTPUT_TCP}/$hour; then
python wifi_traffic_tcp/unzip_tcp.py $file/log_tcp_complete.gz \
| hadoop fs -put - ${OUTPUT_TCP}/$hour
fi

if ! hadoop fs -test -e ${OUTPUT_TCP_NOCOMPLETE}/$hour; then
python wifi_traffic_tcp/unzip_tcp.py $file/log_tcp_nocomplete.gz \
| hadoop fs -put - ${OUTPUT_TCP_NOCOMPLETE}/$hour
fi
done

clean_trash
if ! hadoop fs -test -e ${OUTPUT_UDP}/$hour; then
python wifi_traffic_tcp/unzip_tcp.py $file/log_udp_complete.gz \
| hadoop fs -put - ${OUTPUT_UDP}/$hour
fi

done

exit 0;
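
The hourly input naming used by the porter above can be sketched in Python as follows. This is only an illustration under stated assumptions: `hourly_inputs`, `MONTH_ABBR` and the `/in` base directory are hypothetical names that do not appear in the repository, and the sketch assumes Python 3. It mirrors the `${year}${monthDig}/tcp/${hour}_00_${day}_${monthChr}_${year}.out` pattern from the script.

```python
from datetime import date

# English month abbreviations, listed explicitly so the result does not
# depend on the locale (hypothetical constant, mirrors `monthnames`).
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def hourly_inputs(day, base):
    """Return the 24 hourly input directories expected for one day.

    `base` plays the role of $WIFI_TRAFFIC_PATH; its real value is not
    shown in the diff, so callers must supply it.
    """
    month_dir = f"{day.year}{day.month:02d}"
    month_chr = MONTH_ABBR[day.month - 1]
    return [
        f"{base}/{month_dir}/tcp/"
        f"{hour:02d}_00_{day.day:02d}_{month_chr}_{day.year}.out"
        for hour in range(24)
    ]


if __name__ == "__main__":
    for path in hourly_inputs(date(2015, 8, 9), "/in")[:2]:
        print(path)
```

Because the month is handled as a decimal integer, values such as `08` and `09` need no special treatment here, whereas bash arithmetic contexts treat a leading zero as an octal prefix.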
2 changes: 1 addition & 1 deletion repos/wifi_traffic_http.md
@@ -1,4 +1,4 @@
# WifiSyslog
# WifiSyslogHTTP

This repo contains updated Wifi HTTP traffic logs.

148 changes: 148 additions & 0 deletions repos/wifi_traffic_tcp.md
@@ -0,0 +1,148 @@
# WifiSyslogTCP

This repo contains updated Wifi TCP traffic logs, which report every TCP connection tracked by [Tstat 2.3.1](http://tstat.polito.it/). A TCP connection is identified when the first SYN segment is observed, and is considered ended when either

* the FIN/ACK or RST segments are observed;
* no data packet has been observed (from either side) for a default timeout of 10s after
the three-way handshake or 5 min after the last data packet.

Tstat discards all connections for which the three-way handshake is not properly seen. A connection that is correctly closed is stored in log_tcp_complete; otherwise it is stored in log_tcp_nocomplete. For a detailed description, please refer to the [official documentation](https://web.archive.org/web/20130331032520/http://tstat.polito.it/measure.shtml).



## Data path

hdfs://user/omnilab/warehouse/WifiTraffic/TCP
hdfs://user/omnilab/warehouse/WifiTraffic/TCP_NOCOMPLETE


## Data format

There are 111 fields recorded and each line represents an individual TCP flow:

**Client info.**

* [1] Client IP addr
* [2] Client TCP port
* [3] Client packets
* [4] Client RST sent
* [5] Client ACK sent
* [6] Client PURE ACK sent
* [7] Client unique bytes
* [8] Client data packets
* [9] Client data bytes
* [10] Client rexmit packets
* [11] Client rexmit bytes
* [12] Client out sequence packets
* [13] Client SYN count
* [14] Client FIN count
* [15] Client RFC 1323 ws sent
* [16] Client RFC 1323 ts sent
* [17] Client window scale factor
* [18] Client SACK option set
* [19] Client SACK sent
* [20] Client MSS declared
* [21] Client max segment size observed
* [22] Client min segment size observed
* [23] Client max receiver windows announced
* [24] Client min receiver windows announced
* [25] Client segments window zero
* [26] Client max cwin (in-flight-size)
* [27] Client min cwin (in-flight-size)
* [28] Client initial cwin (in-flight-size)
* [29] Client average RTT
* [30] Client min RTT
* [31] Client max RTT
* [32] Client standard deviation RTT
* [33] Client valid RTT count
* [34] Client min TTL
* [35] Client max TTL
* [36] Client rexmit segments RTO
* [37] Client rexmit segments FR
* [38] Client packet reordering observed
* [39] Client network duplicated observed
* [40] Client unknown segments classified
* [41] Client rexmit segments flow control
* [42] Client unnecessary rexmit RTO
* [43] Client unnecessary rexmit FR
* [44] Client rexmit SYN different initial seqno

**Server info.**

* [45] Server IP addr
* [46] Server TCP port
* [47] Server packets
* [48] Server RST sent
* [49] Server ACK sent
* [50] Server PURE ACK sent
* [51] Server unique bytes
* [52] Server data packets
* [53] Server data bytes
* [54] Server rexmit packets
* [55] Server rexmit bytes
* [56] Server out sequence packets
* [57] Server SYN count
* [58] Server FIN count
* [59] Server RFC 1323 ws sent
* [60] Server RFC 1323 ts sent
* [61] Server window scale factor
* [62] Server SACK option set
* [63] Server SACK sent
* [64] Server MSS declared
* [65] Server max segment size observed
* [66] Server min segment size observed
* [67] Server max receiver windows announced
* [68] Server min receiver windows announced
* [69] Server segments window zero
* [70] Server max cwin (in-flight-size)
* [71] Server min cwin (in-flight-size)
* [72] Server initial cwin (in-flight-size)
* [73] Server average RTT
* [74] Server min RTT
* [75] Server max RTT
* [76] Server standard deviation RTT
* [77] Server valid RTT count
* [78] Server min TTL
* [79] Server max TTL
* [80] Server rexmit segments RTO
* [81] Server rexmit segments FR
* [82] Server packet reordering observed
* [83] Server network duplicated observed
* [84] Server unknown segments classified
* [85] Server rexmit segments flow control
* [86] Server unnecessary rexmit RTO
* [87] Server unnecessary rexmit FR
* [88] Server rexmit SYN different initial seqno

**Flow info.**

* [89] Flow duration
* [90] Flow first packet time offset
* [91] Flow last segment time offset
* [92] Client first payload time offset
* [93] Server first payload time offset
* [94] Client last payload time offset
* [95] Server last payload time offset
* [96] Client first PURE ACK time offset
* [97] Server first PURE ACK time offset
* [98] Flow first packet absolute time
* [99] Client has internal IP
* [100] Server has internal IP
* [101] Flow type bitmask
* [102] Flow P2P type
* [103] Flow P2P subtype
* [104] P2P ED2K data message number
* [105] P2P ED2K signaling message number
* [106] P2P ED2K C2S message number
* [107] P2P ED2K S2C message number
* [108] P2P ED2K chat message number
* [109] Flow HTTP type
* [110] Flow SSL client hello
* [111] Flow SSL server hello


## Data sample

10.187.72.40 55917 14 0 13 11 447 1 447 0 0 0 1 1 1 1 6 1 0 1386 447 447 55744 28000 0 447 447 447 2.804573 2.784000 2.822000 0.019218 3 60 60 0 0 0 0 0 0 0 0 0 117.144.242.26 80 9 0 9 2 13093 5 13093 0 0 0 1 1 1 0 7 1 0 1440 2772 2005 6912 5760 0 13093 2772 13093 20.041998 17.816000 22.270000 0.000000 2 51 51 0 0 0 0 0 0 0 0 0 58.683000 60.636000 119.319000 25.514000 37.947000 25.514000 38.077000 25.078000 28.336000 1451059201213.258057 1 0 1 0 0 0 0 0 0 0 2 - -
10.185.227.136 28245 6 1 5 2 1861 2 1861 0 0 0 1 0 1 0 2 1 0 1386 1386 475 66528 8192 0 1861 1386 1861 3.806873 3.780000 3.861000 0.046765 3 60 60 0 0 0 0 0 0 0 0 0 182.254.11.191 80 5 0 5 2 230 1 230 0 0 0 1 1 1 0 9 1 0 1440 230 230 11776 5760 0 230 230 230 3.024349 2.469000 3.580000 0.000000 2 51 51 0 0 0 0 0 0 0 0 0 16.345000 108.137000 124.482000 7.571000 12.025000 7.631000 12.025000 6.330000 11.351000 1451059201260.759033 1 0 1 0 0 0 0 0 0 0 1 - -
10.187.140.194 28875 5 0 4 2 287 1 287 0 0 0 1 1 1 0 0 1 0 1386 287 287 65535 65535 0 287 287 287 29.699015 27.832000 31.569000 0.000000 2 124 124 0 0 0 0 0 0 0 0 0 119.75.220.50 80 5 1 4 1 474 1 474 0 0 0 1 1 0 0 0 1 0 1200 474 474 65535 15544 0 474 474 474 22.925854 8.052000 37.802000 0.000000 2 44 48 0 0 0 0 0 0 0 0 0 134.026000 37.012000 171.038000 36.132000 67.956000 36.132000 67.956000 35.884000 67.701000 1451059201189.634033 1 0 1 0 0 0 0 0 0 0 1 - -
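
As a rough illustration of how the column numbers above map onto a log line, the following Python sketch splits one whitespace-separated record and extracts a handful of fields. The names `parse_tcp_flow`, `TCP_COLUMNS` and the dictionary keys are hypothetical and chosen only for this example; the choice of columns and their converters is an assumption based on the field list and the samples, not an official schema.

```python
# Hypothetical helper for illustration; it is not part of this repository.
TCP_FIELD_COUNT = 111

# name -> (1-based column number from the field list above, converter)
TCP_COLUMNS = {
    "client_ip": (1, str),
    "client_port": (2, int),
    "client_data_bytes": (9, int),
    "server_ip": (45, str),
    "server_port": (46, int),
    "server_data_bytes": (53, int),
    "flow_duration": (89, float),
    "first_packet_abs_time": (98, float),
}


def parse_tcp_flow(line):
    """Split one log line and pick out a few named columns."""
    fields = line.split()
    if len(fields) != TCP_FIELD_COUNT:
        raise ValueError(
            "expected %d fields, got %d" % (TCP_FIELD_COUNT, len(fields))
        )
    # Column numbers in the documentation are 1-based, list indices 0-based.
    return {
        name: convert(fields[column - 1])
        for name, (column, convert) in TCP_COLUMNS.items()
    }
```

Rejecting lines with an unexpected field count is a deliberate choice in this sketch: a truncated record would otherwise shift the server and flow columns silently.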
36 changes: 36 additions & 0 deletions repos/wifi_traffic_udp.md
@@ -0,0 +1,36 @@
# WifiSyslogUDP

This repo contains updated Wifi UDP traffic logs. A UDP flow pair is identified when the first UDP segment is observed for a UDP socket pair, and is considered ended when no packet has been observed (from either side) for 10s after the first packet or 3 min after the last data packet. For a detailed description, please refer to the [official documentation](https://web.archive.org/web/20130331032520/http://tstat.polito.it/measure.shtml).


## Data path

hdfs://user/omnilab/warehouse/WifiTraffic/UDP


## Data format

There are 16 fields recorded and each line represents an individual UDP flow:

* [1] Client IP address
* [2] Client UDP port
* [3] Client first packet in absolute time (epoch)
* [4] Client time between the first and the last packet from the client
* [5] Client number of bytes transmitted in the payload
* [6] Client total number of packets observed from the client
* [7] Client if IP address is internal
* [8] Client Protocol type
* [9] Server IP address
* [10] Server UDP port
* [11] Server first packet in absolute time (epoch)
* [12] Server time between the first and the last packet from the server
* [13] Server number of bytes transmitted in the payload
* [14] Server total number of packets observed from the server
* [15] Server if IP address is internal
* [16] Server Protocol type

## Data sample

10.186.218.86 13964 1451062801455.429932 0.000000 137 1 1 12 217.216.91.184 53411 0.000000 0.000000 0 0 0 0
10.186.218.86 13964 1451062801455.577881 0.000000 106 1 1 12 14.162.3.135 1066 0.000000 0.000000 0 0 0 0
87.241.155.52 49001 1451062801456.123047 0.000000 225 1 0 12 10.187.192.237 56525 0.000000 0.000000 0 0 1 0
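
Since the 16 fields are two identical groups of 8 (client, then server), a record can be read as a pair of endpoints. The Python sketch below shows one way to do that; `UdpEndpoint` and `parse_udp_flow` are hypothetical names used only here, and treating field 7/15 as a `1`/`0` flag is an assumption drawn from the samples above.

```python
from collections import namedtuple

# One side of a UDP flow; field names are illustrative, not an official schema.
UdpEndpoint = namedtuple(
    "UdpEndpoint",
    ["ip", "port", "first_packet_time", "duration",
     "payload_bytes", "packets", "is_internal", "protocol_type"],
)


def _parse_endpoint(fields):
    """Convert the 8 columns describing one endpoint."""
    return UdpEndpoint(
        ip=fields[0],
        port=int(fields[1]),
        first_packet_time=float(fields[2]),
        duration=float(fields[3]),
        payload_bytes=int(fields[4]),
        packets=int(fields[5]),
        is_internal=(fields[6] == "1"),  # assumed 1/0 flag
        protocol_type=fields[7],         # kept as text; meaning not documented here
    )


def parse_udp_flow(line):
    """Return (client, server) endpoints for one whitespace-separated record."""
    fields = line.split()
    if len(fields) != 16:
        raise ValueError("expected 16 fields, got %d" % len(fields))
    return _parse_endpoint(fields[:8]), _parse_endpoint(fields[8:])
```

For example, `client, server = parse_udp_flow(line)` gives access to `client.payload_bytes` and `server.packets` without counting columns by hand.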
4 changes: 4 additions & 0 deletions workflow.sh
@@ -25,3 +25,7 @@ $BASEDIR/porters/wifi_syslog_session.sh
## Run WifiTrafficHttp cleansing
chmod +x $BASEDIR/porters/wifi_traffic_http.sh
$BASEDIR/porters/wifi_traffic_http.sh

## Run WifiTrafficTcp cleansing
chmod +x $BASEDIR/porters/wifi_traffic_tcp.sh
$BASEDIR/porters/wifi_traffic_tcp.sh
