diff --git a/README.md b/README.md
index e52e2b1..eee355f 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ Concretly, escaped values are not handled correctly by a CSV parser due to inher
 Make sure the package is in the classpath, eg: by using the --packages option:
 
 ```bash
-spark-shell --packages "be.icteam:adobe-analytics-datafeed-datasource_2.12:0.0.1"
+spark-shell --packages "be.icteam:adobe-analytics-datafeed-datasource_2.12:$version"
 ```
 
 And you can read the feed as following:
@@ -23,21 +23,35 @@ val df = spark.read
   .load("./src/test/resources/randyzwitch")
 ```
 
-## Features
-
-* Correct handling of records which contain [special characters](https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-spec-chars.html?lang=en)
-* Capability to translate lookup columns with their actual value as specified in the [Lookup files](https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-contents.html?lang=en#lookup-files)
-
-## Options
+Here is what it looks like:
 
-* FileEncoding (default: ISO-8859-1)
-* MaxCharsPerColumn (default: -1)
-* EnableLookups (default: true)
+```scala
+df.show(3, false)
+
++------------------------------------------------------+----------------------------------+------------------+------------+---------------+------------------------+----------+------------------------+-----------+-----------+----------------+--------------------+-------------------+---------------+----------+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-----------+------------------+-------------------------------------------------------------------+--------------------------+------------------+---------+-----------+-------+----------+--------+--------------+-----------------+-----------------+----------------------+---------+-------------------+------------------+-------------+------------+------------+-------------+----------------------+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--
---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------+----------+----------+----------+----------+-------------+---------------+--------------------+--------------------+--------------------+---------------------------------------------------------------------+--------------------+--------------+---------------------------------------------------------------------+----------------------+--------------------------------------------------------------------+----------+----------+-------------+-----------------+-----------------------------------------------------------+------------+----------+----------+-----------+-----------+------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+---------------+----------------------------------------------------------------------+------------------+----------+-------------------+-------------------+---------+----------+-------------+-------+-------------------------------------------------------------------------------------------------------------------------+--------------+---------+--------------+-------------------------------------+-------------------+--------------------+-------------------------------------------------------------------+--------------------+
 +|post_event_list |post_product_list |browser |browser_type|connection_type|country |javascript|language |os |resolution |ref_type |accept_language |date_time |domain |evar1 |evar2|evar3|evar4|evar5|evar6|evar7|evar8|evar9|evar10|evar11|evar12|evar13|evar14|evar15|evar16|evar17|evar18|evar19|evar20|evar21|evar22|evar23|evar24|evar25|evar26|evar27|evar28|evar29|evar30|evar31|evar32|evar33|evar34|evar35|evar36|evar37|evar38|evar39|evar40|evar41|evar42|evar43|evar44|evar45|evar46|evar47|evar48|evar49|evar50|evar51|evar52|evar53|evar54|evar55|evar56|evar57|evar58|evar59|evar60|evar61|evar62|evar63|evar64|evar65|evar66|evar67|evar68|evar69|evar70|evar71|evar72|evar73|evar74|evar75|exclude_hit|first_hit_pagename|first_hit_page_url |first_hit_referrer |first_hit_time_gmt|geo_city |geo_country|geo_dma|geo_region|geo_zip |ip |last_hit_time_gmt|last_purchase_num|last_purchase_time_gmt|new_visit|post_browser_height|post_browser_width|post_campaign|post_channel|post_cookies|post_currency|post_cust_hit_time_gmt|post_evar1|post_evar2|post_evar3|post_evar4|post_evar5|post_evar6|post_evar7|post_evar8|post_evar9|post_evar10|post_evar11|post_evar12|post_evar13|post_evar14|post_evar15|post_evar16|post_evar17|post_evar18|post_evar19|post_evar20|post_evar21|post_evar22|post_evar23|post_evar24|post_evar25|post_evar26|post_evar27|post_evar28|post_evar29|post_evar30|post_evar31|post_evar32|post_evar33|post_evar34|post_evar35|post_evar36|post_evar37|post_evar38|post_evar39|post_evar40|post_evar41|post_evar42|post_evar43|post_evar44|post_evar45|post_evar46|post_evar47|post_evar48|post_evar49|post_evar50|post_evar51|post_evar52|post_evar53|post_evar54|post_evar55|post_evar56|post_evar57|post_evar58|post_evar59|post_evar60|post_evar61|post_evar62|post_evar63|post_evar64|post_evar65|post_evar66|post_evar67|post_evar68|post_evar69|post_evar70|post_evar71|post_evar72|post_evar73|post_evar74|post_evar75|post_hier1|post_hier2|post_hier3|post_hier4|post_hier5|post_keywords|post_page_event|post_page
_event_var1|post_page_event_var2|post_page_event_var3|post_pagename |post_pagename_no_url|post_page_type|post_page_url |post_persistent_cookie|post_prop1 |post_prop2|post_prop3|post_prop4 |post_prop5 |post_prop6 |post_prop7 |post_prop8|post_prop9|post_prop10|post_prop11|post_prop12 |post_prop13|post_prop14|post_prop15|post_prop16|post_prop17|post_prop18|post_prop19|post_prop20|post_prop21|post_prop22|post_prop23|post_prop24|post_prop25|post_prop26|post_prop27|post_prop28|post_prop29|post_prop30|post_prop31|post_prop32|post_prop33|post_prop34|post_prop35|post_prop36|post_prop37|post_prop38|post_prop39|post_prop40|post_prop41|post_prop42|post_prop43|post_prop44|post_prop45|post_prop46|post_prop47|post_prop48|post_prop49|post_prop50|post_prop51|post_prop52|post_prop53|post_prop54|post_prop55|post_prop56|post_prop57|post_prop58|post_prop59|post_prop60|post_prop61|post_prop62|post_prop63|post_prop64|post_prop65|post_prop66|post_prop67|post_prop68|post_prop69|post_prop70|post_prop71|post_prop72|post_prop73|post_prop74|post_prop75|post_purchaseid|post_referrer |post_search_engine|post_state|post_visid_high |post_visid_low |post_zip |prev_page |ref_domain |service|user_agent |visit_keywords|visit_num|visit_page_num|visit_referrer |visit_search_engine|visit_start_pagename|visit_start_page_url |visit_start_time_gmt| 
++------------------------------------------------------+----------------------------------+------------------+------------+---------------+------------------------+----------+------------------------+-----------+-----------+----------------+--------------------+-------------------+---------------+----------+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-----------+------------------+-------------------------------------------------------------------+--------------------------+------------------+---------+-----------+-------+----------+--------+--------------+-----------------+-----------------+----------------------+---------+-------------------+------------------+-------------+------------+------------+-------------+----------------------+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--
---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------+----------+----------+----------+----------+-------------+---------------+--------------------+--------------------+--------------------+---------------------------------------------------------------------+--------------------+--------------+---------------------------------------------------------------------+----------------------+--------------------------------------------------------------------+----------+----------+-------------+-----------------+-----------------------------------------------------------+------------+----------+----------+-----------+-----------+------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+---------------+----------------------------------------------------------------------+------------------+----------+-------------------+-------------------+---------+----------+-------------+-------+-------------------------------------------------------------------------------------------------------------------------+--------------+---------+--------------+-------------------------------------+-------------------+--------------------+-------------------------------------------------------------------+--------------------+
 +|[{Instance of eVar1, null}, {Instance of eVar2, null}]|[{null, , null, null, null, null}]|Safari 7.1 |Apple |LAN/Wifi |Commercial (mostly U.S.)|1.6 |English (United States) |OS X 10.9.5|1400 x 864 |Search Engines |en-us |2015-07-13 00:26:18|netvigator.com |logged-out|guest|null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |0 |null |http://randyzwitch.com/broken-macbook-pro-hinge-fixed-free/ |https://www.google.com.hk/|1436761578 |hong kong|hkg |0 |no region |0 |219.77.75.182 |0 |0 |0 |1 |687 |1347 |null |null |Y |USD |1436761578 |logged-out|guest |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |::empty:: |0 |null |null |null |http://randyzwitch.com/broken-macbook-pro-hinge-fixed-free |null |null |http://randyzwitch.com/broken-macbook-pro-hinge-fixed-free |Y |Broken MacBook Pro Hinge? Apple will fix for free! 
| randyzwitch.com|1173 |post |single-post |technology |apple,customer-service,genius-bar,macbook-pro |Randy Zwitch|1 |2012 |06 |25 |June 25, 2012 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |https://www.google.com.hk/ |557 |null |2791471528899189638|791228704714081521 |::hash::0|0 |google.com.hk|ss |Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/600.1.17 (KHTML, like Gecko) Version/7.1 Safari/537.85.10 |::empty:: |1 |1 |https://www.google.com.hk/ |557 |null |http://randyzwitch.com/broken-macbook-pro-hinge-fixed-free/ |1436761578 | +|[{Instance of eVar1, null}, {Instance of eVar2, null}]|[{null, , null, null, null, null}]|Google Chrome 43.0|Google |LAN/Wifi |Japan |1.6 |English (United States) |Windows 8.1|1280 x 800 |Search Engines |en-US,en;q=0.8 |2015-07-13 00:56:09|aist.go.jp |logged-out|guest|null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |0 |null |http://randyzwitch.com/rsitecatalyst-website-pathing-sankey-charts/|https://www.google.com/ |1436426719 |tsukuba |jpn |0 |08 |305-0005|150.29.149.177|1436754129 |0 |0 |1 |777 |1293 |null |null |Y |USD |1436763369 |logged-out|guest |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null 
|null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |::empty:: |0 |null |null |null |http://randyzwitch.com/rsitecatalyst-website-pathing-sankey-charts |null |null |http://randyzwitch.com/rsitecatalyst-website-pathing-sankey-charts |Y |Visualizing Website Pathing With Sankey Charts |3047 |post |single-post |digital-analytics|adobe-analytics,data-visualization,omniture,r,rsitecatalyst|Randy Zwitch|1 |2014 |09 |10 |September 10, 2014|7 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |https://www.google.com/ |57 |null |3037297388874966800|6917530475045353754|::hash::0|0 |google.com |ss |Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36 |::empty:: |4 |1 |https://www.google.com/ |57 |null |http://randyzwitch.com/rsitecatalyst-website-pathing-sankey-charts/|1436763369 | +|[{Instance of eVar1, null}, {Instance of eVar2, null}]|[{null, , null, null, null, null}]|Google Chrome 43.0|Google |LAN/Wifi |Network (mostly U.S.) 
|1.6 |English (United States) |OS X 10.10 |1280 x 800 |Search Engines |en-US,en;q=0.8 |2015-07-13 00:48:36|comcast.net |logged-out|guest|null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |0 |null |http://randyzwitch.com/hive-five-hard-won-lessons/ |https://www.google.com/ |1435962984 |san jose |usa |807 |ca |95126 |50.136.222.167|1436200856 |0 |0 |1 |777 |1197 |null |null |Y |USD |1436762916 |logged-out|guest |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |::empty:: |0 |null |null |null |http://randyzwitch.com/hive-five-hard-won-lessons |null |null |http://randyzwitch.com/hive-five-hard-won-lessons |Y |Five Hard-Won Lessons Using Hive | randyzwitch.com |2680 |post |single-post |data-science |big-data,hadoop,hive,python,r |Randy Zwitch|1 |2014 |06 |12 |June 12, 2014 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |https://www.google.com/ |57 
|null |3083707027358817578|6917535643501355093|::hash::0|0 |google.com |ss |Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36|::empty:: |4 |1 |https://www.google.com/ |57 |null |http://randyzwitch.com/hive-five-hard-won-lessons/ |1436762916 |
+```
 
-We also support the Generic file source options:
-* [Path Glob Filter](https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-glob-filter) for manifest files (so should end with *.txt)
-* [Modification Time Path Filters](https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#modification-time-path-filters) for manifest files
+## Features
+* Correct handling of records which contain [special characters](https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-spec-chars.html?lang=en)
+* Lookup values are replaced with their actual value in the [Lookup files](https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-contents.html?lang=en#lookup-files)
+  * [Dynamic lookups](https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/dynamic-lookups.html?lang=en) are supported as well
+* [Events](https://experienceleague.adobe.com/docs/analytics/implementation/vars/page-vars/events/events-overview.html?lang=en) are parsed as array of (key, value)
+* [Products](https://experienceleague.adobe.com/docs/analytics/implementation/vars/page-vars/products.html?lang=en) are parsed as product with name, category, quantity, price, events and evars.
+* Capability to filter found manifest files through:
+  * [Path Glob Filter](https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-glob-filter)
+  * [Modification Time Path Filters](https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#modification-time-path-filters)
+
+All available options are here:
+[DatafeedOptions.scala](./src/main/scala/be/icteam/adobe/analytics/datafeed/DatafeedOptions.scala)
+
+Example:
+
 ```scala
 val df = spark.read
   .format("be.icteam.adobe.analytics.datafeed")
diff --git a/src/test/scala/be/icteam/adobe/analytics/datafeed/DefaultSourceTest.scala b/src/test/scala/be/icteam/adobe/analytics/datafeed/DefaultSourceTest.scala
index ddd5948..9374de5 100644
--- a/src/test/scala/be/icteam/adobe/analytics/datafeed/DefaultSourceTest.scala
+++ b/src/test/scala/be/icteam/adobe/analytics/datafeed/DefaultSourceTest.scala
@@ -75,20 +75,12 @@ class DefaultSourceTest extends AnyFunSuite {
     val spark = TestUtil.getSparkSession()
     val df = spark.read
-      .format("datafeed")
+      .format(DatafeedOptions.SOURCE_NAME)
       //.option(DatafeedOptions.ENABLE_LOOKUPS, "false")
       .load(feedPath)
-      //.select(col("mobile_attributes")).filter(col("mobile_attributes").isNotNull)
-      //.select(col("mobile_id")).filter(col("mobile_id").isNotNull)
-      //.select(col("mobile_attributes").getField("Manufacturer").as("XM"))
-      //.filter(col("XM").isNotNull)
-
-    //df.printSchema()
-    //val os = df.take(1)(0).getString(0)
-    //assert(os == "1550374905")
-    //df.show(10, false)
     //df.printSchema()
+    df.show(10, false)
     spark.stop()
   }