Video Presentation
- 6420422002 Tanakorn Withurat (ธนากร วิธุรัติ)
- 6420422008 Wannapa Sripen (วรรณนภา ศรีเพ็ญ)
- 6420422011 Juntanee Pattanasukkul (จันทร์ทนีย์ พัฒนสุขกุล)
- 6420422017 Witsarut Wongsim (วิศรุต วงศ์ซิ้ม)
- 6420422021 Suchawalee Jeeratanyasakul (สุชาวลี จีระธัญญาสกุล)
The objective of this project is to design and implement a data-streaming and real-time analytics system for analyzing college students' food and cooking preferences data from Kaggle. Both the data cleansing and the analytics are performed in real time.
The project uses the Food Choices dataset available on Kaggle (https://www.kaggle.com/datasets/borapajo/food-choices?select=food_coded.csv). The dataset contains 106 records and 61 columns.
The system design for the project is as follows:
- Use Python to write the data source line-by-line from the CSV to a SQL database (DB) (source part). The writing duration (d) must be random, with 0 < d < 2 seconds.
- Connect the SQL DB to a Kafka cluster via topic1.
- Clean the data in topic1 using ksqlDB and then submit it to topic2.
- Analyze the data in topic2 using ksqlDB and then submit it to topic3.
- The sink part (NoSQL DB) will incrementally consume the data from topic3.
- Visualize the real-time data from topic3 using tools such as Plotly.
- Conduct analytics based on the project questions.
- Set up and answer three questions:
  - Easy, e.g., simple statistics
  - Medium, e.g., join, windowed, etc.
  - Hard, e.g., analytical insight
Get the connectors
To get started, download the connectors for Postgres, MongoDB, and Elasticsearch to a fresh directory. The easiest way to do this is by using confluent-hub: https://docs.confluent.io/platform/current/connect/confluent-hub/client.html
Create a directory for your components:
mkdir confluent-hub-components
First, acquire the Postgres Debezium connector:
confluent-hub install --component-dir confluent-hub-components --no-prompt debezium/debezium-connector-postgresql:1.1.0
Likewise for the MongoDB Debezium connector:
confluent-hub install --component-dir confluent-hub-components --no-prompt debezium/debezium-connector-mongodb:1.1.0
And finally, the Elasticsearch connector:
confluent-hub install --component-dir confluent-hub-components --no-prompt confluentinc/kafka-connect-elasticsearch:10.0.1
Ref: Debezium Connector for MongoDB, https://debezium.io/documentation/reference/1.1/connectors/mongodb.html
- Download the zip file "docker-main.zip" and unzip it in the working folder.
- Create the Docker file from the ksqlDB tutorial page: https://docs.ksqldb.io/en/latest/tutorials/etl/
- After the Docker build is complete, run the file create_table.py to create a table inside Postgres, read the csv file, and insert it into Postgres.
python create_table.py
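The actual create_table.py lives in the repository; the lines below are only a minimal sketch of what such a script might look like, assuming psycopg2 is installed, the Postgres instance is published on localhost:5432 with the credentials used in the connector configuration later on, and every column is stored as TEXT derived from the CSV header.
# create_table_sketch.py -- illustrative only, not the project's actual create_table.py
# Assumes: psycopg2 is installed, Postgres is reachable on localhost:5432,
# and food_coded.csv sits next to this script.
import csv
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="customers", user="postgres-user", password="postgres-pw",
)

with open("food_coded.csv", newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))  # the 61 column names

# Store every column as TEXT for simplicity; types can be refined later.
columns_sql = ", ".join('"{}" TEXT'.format(col) for col in header)

with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS foodcoded (id SERIAL PRIMARY KEY, {});".format(columns_sql)
    )

conn.close()
print("Table foodcoded is ready.")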
You can check Postgres with the following command:
docker exec -it postgres /bin/bash
psql -U postgres-user customers
Inside psql, run this query to check:
customers=# select * from foodcoded;
- Run the file import_foodcoded.py to import the data from the food_coded.csv file.
python import_foodcoded.py
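Again, the real import_foodcoded.py is in the repository; the sketch below only illustrates the idea, assuming psycopg2 and the foodcoded table created earlier: each CSV row is inserted individually, separated by a random delay d with 0 < d < 2 seconds, as required by the source part of the design.
# import_foodcoded_sketch.py -- illustrative only, not the project's actual script
# Assumes: psycopg2 is installed, the foodcoded table already exists,
# and food_coded.csv sits next to this script.
import csv
import random
import time
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="customers", user="postgres-user", password="postgres-pw",
)
conn.autocommit = True  # commit each row immediately so Debezium streams it right away

with open("food_coded.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = ", ".join('"{}"'.format(c) for c in header)
    placeholders = ", ".join(["%s"] * len(header))
    insert_sql = "INSERT INTO foodcoded ({}) VALUES ({})".format(cols, placeholders)

    with conn.cursor() as cur:
        for row in reader:
            cur.execute(insert_sql, row)
            # Random writing duration d, 0 < d < 2 seconds, per the design requirement.
            time.sleep(random.uniform(0.001, 2.0))

conn.close()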
- Start the ksql CLI to create the connectors.
docker exec -it ksqldb-cli ksql http://ksqldb-server:8088
Inside the ksql prompt, read topics from the beginning:
SET 'auto.offset.reset' = 'earliest';
To create a source connector:
CREATE SOURCE CONNECTOR customers_reader WITH (
    'connector.class' = 'io.debezium.connector.postgresql.PostgresConnector',
    'database.hostname' = 'postgres',
    'database.port' = '5432',
    'database.user' = 'postgres-user',
    'database.password' = 'postgres-pw',
    'database.dbname' = 'customers',
    'database.server.name' = 'localhost',
    'table.whitelist' = 'public.foodcoded',
    'transforms' = 'unwrap',
    'transforms.unwrap.type' = 'io.debezium.transforms.ExtractNewRecordState',
    'transforms.unwrap.drop.tombstones' = 'false',
    'transforms.unwrap.delete.handling.mode' = 'rewrite'
);
Create a stream foodcoded for the input data:
CREATE STREAM foodcoded WITH (kafka_topic = 'localhost.public.foodcoded', value_format = 'avro');
Create a stream food_clean for the cleaned data; the ksql command is in food_clean.sql.
Create a stream food_analyze for the analyzed data; the ksql command is in food_analyze.sql.
To create a sink connector:
CREATE SINK CONNECTOR `mongodb_foodcoded_sink` WITH (
    "connector.class"='com.mongodb.kafka.connect.MongoSinkConnector',
    "key.converter"='org.apache.kafka.connect.storage.StringConverter',
    "value.converter"='io.confluent.connect.avro.AvroConverter',
    "value.converter.schema.registry.url"='http://schema-registry:8081',
    "key.converter.schemas.enable"='false',
    "value.converter.schemas.enable"='true',
    "tasks.max"='1',
    "connection.uri"='mongodb://root:rootpassword@mongodb:27017/admin?readPreference=primary&appname=ksqldbConnect&ssl=false',
    "database"='foodcoded_db',
    "collection"='foodcoded_analyze',
    "topics"='food_analyze'
);
- Visualize the data from MongoDB with Python, using Dash (Plotly).
5.1 app version
python app.py
Dash is running on http://127.0.0.1:8050/
From the histogram, males weigh more than females.
python app2.py
Dash is running on http://127.0.0.1:8050/
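app.py and app2.py are in the repository; the following is only a minimal sketch of the approach, assuming pymongo, pandas, plotly, and dash are installed, MongoDB's port 27017 is published to the host, and the documents written by the sink connector carry GENDER and WEIGHT fields (the real field names depend on food_analyze.sql). A dcc.Interval component re-reads MongoDB every few seconds so the chart follows the stream.
# app_sketch.py -- illustrative only, not the project's actual app.py
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from pymongo import MongoClient
from dash import Dash, dcc, html
from dash.dependencies import Input, Output

# Assumed connection string; credentials mirror the sink connector configuration above.
client = MongoClient("mongodb://root:rootpassword@localhost:27017/admin")
collection = client["foodcoded_db"]["foodcoded_analyze"]

app = Dash(__name__)
app.layout = html.Div([
    html.H3("Weight by gender (live from MongoDB)"),
    dcc.Graph(id="weight-hist"),
    dcc.Interval(id="refresh", interval=5000),  # re-query MongoDB every 5 seconds
])

@app.callback(Output("weight-hist", "figure"), Input("refresh", "n_intervals"))
def update_histogram(_):
    # Pull the latest documents written by the sink connector.
    df = pd.DataFrame(list(collection.find({}, {"_id": 0})))
    if df.empty:
        return go.Figure()  # nothing consumed from topic3 yet
    # Field names are assumptions; adjust them to the output of food_analyze.sql.
    df["WEIGHT"] = pd.to_numeric(df["WEIGHT"], errors="coerce")
    return px.histogram(df, x="WEIGHT", color="GENDER", barmode="overlay")

if __name__ == "__main__":
    app.run_server(debug=True)  # prints "Dash is running on http://127.0.0.1:8050/"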
5.2 Jupyter notebook version
From the pivot table, for both males and females, those who eat comfort food do so because it helps relieve stress and out of boredom.
Most females pay attention to counting their daily food calories.
Regarding dieting methods by grade level, every grade level mainly eats fruit, vegetables, and healthy food, but the second most common method differs: Senior: more protein; Junior: current diet; Sophomore & Freshman: home cooked / organic.
Most people think the food with the most calories is the waffle, followed by the tortilla; in contrast, most people think the scone has the fewest calories.
Weight and GPA have no linear relationship with each other.
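The notebook itself is in the repository; below is a minimal sketch of the kind of analysis behind the findings above, assuming pandas and pymongo, and field names resembling the original food_coded.csv columns such as Gender, comfort_food_reasons_coded, weight, and GPA (adjust them to whatever the cleaned and analyzed documents actually contain).
# notebook_sketch.py -- illustrative only, not the project's actual notebook
import pandas as pd
from pymongo import MongoClient

# Assumed connection string and field names; adjust to the analyzed schema.
client = MongoClient("mongodb://root:rootpassword@localhost:27017/admin")
docs = client["foodcoded_db"]["foodcoded_analyze"].find({}, {"_id": 0})
df = pd.DataFrame(list(docs))

# The raw CSV stores some numeric columns as text, so coerce them first.
df["weight"] = pd.to_numeric(df["weight"], errors="coerce")
df["GPA"] = pd.to_numeric(df["GPA"], errors="coerce")

# Pivot table: counts of comfort-food reasons split by gender.
reasons_by_gender = pd.pivot_table(
    df,
    index="comfort_food_reasons_coded",
    columns="Gender",
    values="GPA",
    aggfunc="count",
)
print(reasons_by_gender)

# A correlation close to 0 supports "no linear relationship" between weight and GPA.
print("weight vs GPA correlation:", df["weight"].corr(df["GPA"]))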