Skip to content

Hakulani/reatimeanalytic

Repository files navigation

NIDA DADS6005 Realtime Analytics QUIZ2 Ksqldb

Video Presentation

Team member :

  1. 6420422002 Tanakorn Withuratธนากร วิธุรัติ

  2. 6420422008 Wannapa Sripen วรรณนภา ศรีเพ็ญ

  3. 6420422011 Juntanee Pattanasukkul จันทร์ทนีย์ พัฒนสุขกุล

  4. 6420422017 Witsarut Wongsim วิศรุต วงศ์ซิ้ม

  5. 6420422021 Suchawalee Jeeratanyasakul สุชาวลี จีระธัญญาสกุล

image

Assignment :

  • Use Python to write the data source line-by-line from the CSV to SQL database (DB) (source part). The writing duration (d) must be random and between 0 < d < 2 seconds.
  • Connect the SQL DB to a Kafka cluster via topic1.
  • Clean the data in topic1 using ksqlDB and then submit it to topic2.
  • Analyze the data in topic2 using ksqlDB and then submit it to topic3. The source part (NoSQL DB) will incrementally consume the data from topic3
  • Visualize the real-time data from topic3 using tools such as Plotly, etc..
  • Conduct analytics based on your questions
  • Set up and answer three questions easy, e.g., simple statistics medium, e.g., join, windowed, etc. hard, e.g., analytical insight

Objective:

The objective of this project is to design and implement a data-streaming and real-time analytics system that can be used to analyze college students' food and cooking preferences data from Kaggle. The project involves data cleansing and analytics to be done in real-time.

Datasets:

https://www.kaggle.com/datasets/borapajo/food-choices?select=food_coded.csv The project uses the Food Choices dataset available on Kaggle. The dataset contains 106 records and 61 columns.

System Design:

design

The system design for the project is as follows:

  1. Use Python to write the data source line-by-line from the CSV to SQL database (DB) (source part). The writing duration (d) must be random and between 0 < d < 2 seconds.

  2. Connect the SQL DB to a Kafka cluster via topic1.

  3. Clean the data in topic1 using ksqlDB and then submit it to topic2.

  4. Analyze the data in topic2 using ksqlDB and then submit it to topic3.

  5. The source part (NoSQL DB) will incrementally consume the data from topic3

  6. Visualize the real-time data from topic3 using tools such as Plotly, etc.

  7. Conduct analytics based on your questions

  8. Set up and answer three questions:

  • Easy, e.g., simple statistics
  • Medium, e.g., join, windowed, etc.
  • Hard, e.g., analytical insight

Download Connector

Get the connectors To get started, download the connectors for Postgres, MongoDB, and Elasticsearch to a fresh directory. The easiest way to do this is by using confluent-hub. https://docs.confluent.io/platform/current/connect/confluent-hub/client.html

Create a directory for your components: mkdir confluent-hub-components

First, acquire the Postgres Debezium connector:

confluent-hub install --component-dir confluent-hub-components --no-prompt debezium/debezium-connector-postgresql:1.1.0

Likewise for the MongoDB Debezium connector:

confluent-hub install --component-dir confluent-hub-components --no-prompt debezium/debezium-connector-mongodb:1.1.0

And finally, the Elasticsearch connector:

confluent-hub install --component-dir confluent-hub-components --no-prompt confluentinc/kafka-connect-elasticsearch:10.0.1

ref : Debezium Connector for MongoDB https://debezium.io/documentation/reference/1.1/connectors/mongodb.html

Steps:

  1. download zip file "docker-main.zip" and unzip cmd in folder

  2. Create docker file from the ksqlDB tutorial page. https://docs.ksqldb.io/en/latest/tutorials/etl/

  3. After docker build is complete, run the file create_table.py to create a table inside Postgres, read csv file, and insert into Postgres.

    python create_table.py

    You can check Postgres with the following command:

    docker exec -it postgres /bin/bash

    psql -U postgres-user customers

    Inside Ksql, run the command to check:

    customers=# select * from foodcoded;

  4. Run the file import_foodcoded.py to import data from the food_coded.csv file.

    python import_foodcoded.py

  5. Start Ksql to create the connector.

     docker exec -it ksqldb-cli ksql http://ksqldb-server:8088
    
     SET 'auto.offset.reset' = 'earliest';
     
    

    To create a source connector:

    
     CREATE SOURCE CONNECTOR customers_reader WITH (
     'connector.class' = 'io.debezium.connector.postgresql.PostgresConnector',
     'database.hostname' = 'postgres',
     'database.port' = '5432',
     'database.user' = 'postgres-user',
     'database.password' = 'postgres-pw',
     'database.dbname' = 'customers',
     'database.server.name' = 'localhost',
     'table.whitelist' = 'public.foodcoded',
     'transforms' = 'unwrap',
     'transforms.unwrap.type' = 'io.debezium.transforms.ExtractNewRecordState',
     'transforms.unwrap.drop.tombstones' = 'false',
     'transforms.unwrap.delete.handling.mode' = 'rewrite'
    );   
    
    

    Create a Stream topic foodcoded for input data:

    CREATE STREAM foodcoded WITH (kafka_topic = 'localhost.public.foodcoded', value_format = 'avro');

    Create a Stream topic food_clean for cleaning data ksql command in food_clean.sql.

    Create a Stream topic for analyze data ksql command in food_analyze.sql.

    To create a sink connector:

    CREATE SINK CONNECTOR `mongodb_foodcoded_sink` WITH (
    "connector.class"='com.mongodb.kafka.connect.MongoSinkConnector',
    "key.converter"='org.apache.kafka.connect.storage.StringConverter',
    "value.converter"='io.confluent.connect.avro.AvroConverter',
    "value.converter.schema.registry.url"='http://schema-registry:8081',
    "key.converter.schemas.enable"='false',
    "value.converter.schemas.enable"='true',
    "tasks.max"='1',
    "connection.uri"='mongodb://root:rootpassword@mongodb:27017/admin?readPreference=primary&appname=ksqldbConnect&ssl=false',
    "database"='foodcoded_db',
    "collection"='foodcoded_analyze',
    "topics"='food_analyze');
    
    
  6. Data visualization from mongoDB with python Using Dash plotly

5.1 app version

python app.py
Dash is running on http://127.0.0.1:8050/ 

จาก Histrogram เพศชายมีน้ำหนักมากกว่าเพศหญิง MicrosoftTeams-image (1)

python app2.py
Dash is running on http://127.0.0.1:8050/ 

5.2 jupyter notebook version

จาก Pivot table ทั้งเพศชายและหญิง คนที่กิน comford food กินเพราะว่าช่วยบรรเทาการเครียด และมีความเบื่ออาหาร

01

เพศหญิงส่วนใหญ่ให้ความสำคัญเกี่ยวกับการนับแคลลอรี่อาหารต่อวัน

02

วิธีการไดเอทแต่ละชั้นปี ทุก Grade Level ใช้วิธี ทานผักผลไม้และอาหารเพื่อสุขภาพ แต่วิธีการที่รองลงมาแตกต่างกันดังนี้ Senior : more protein Junior : current diet Sophomore & Freshman : home cooked / organic

03

คนส่วนมากคิดว่าอาหารที่มีแคลลอรรี่มากที่สุดคือ waffle รองลงมาคือ tortilla ในทางตรงกันข้ามคนส่วนใหญ่กลับคิดว่า scone มีแคลลอรี่น้อยที่สุด

04

น้ำหนักและGPA ไม่ได้มีความสัมพันธ์เชิงเส้นต่อกัน

05

About

reatimeanalytic

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages