Revamp everything!

New lab, new bot :)
ivanhigueram · Mar 19, 2024 · c31042b · c31042b
1 parent a63e820
commit c31042b
Showing 16 changed files with 722 additions and 229 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,30 @@
+# HR Slack Bot
+
+This is a Slack app that uses the Python `slack-sdk` to extract emails from a fixed list of channels,
+parse them into a structured schema, and push the data to a Google Spreadsheet. Like other Slack
+bots, it is driven by *slash* commands.
+
+## What's happening under the hood? 
+
+When an email body is posted in one of the observed Slack channels, the bot uses the Slack SDK
+to retrieve the messages as a JSON string and pushes them to a SQLite database stored locally (ofc).
+We use Slack's rounded timestamp (`ts`) as the primary key for each message, so we only add
+the messages that are new to the `messages` table.
+
+Emails are parsed using `gpt-3-1102` with the temperature set to zero, to keep the model from
+making things up, and the output follows a data schema (see `src/schemas.py`). The input is taken from
+the `text` column of the `messages` table in the SQL database (the column name follows the `slack-sdk`
+standard). Once processed, the parsed data is stored in the `parsed_messages` table with the `ts`
+identifier and the `channel_id`, and later pushed to a Google Spreadsheet.
+
+## Language model configuration
+ - The prompt is designed in `src/extractor.py`, but changes to the prompt are made in other
+ parts of the script. The prompt is assembled in this order:
+ ```
+ main_prompt -> examples -> response_schema 
+ ```
+ - Examples help the model return more accurate responses for data it hasn't seen. Better
+ examples can be defined in `data/examples.json`.
+ Each example should include a text prompt and the outcome we want.
+
+
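The README's deduplication step (only new messages are added, keyed on Slack's `ts`) can be sketched with plain `sqlite3`. This is a minimal illustration under stated assumptions, not the code in `src/retrieve_messages.py`: the helper `store_messages`, the two-column table layout, and the sample payloads are all hypothetical.

```python
import json
import sqlite3


def store_messages(conn, messages):
    """Insert Slack messages, skipping any whose `ts` is already stored.

    Hypothetical helper: the table layout is an assumption made for this
    sketch, not the schema the bot actually uses.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages (ts TEXT PRIMARY KEY, payload TEXT)"
    )
    before = conn.total_changes
    # INSERT OR IGNORE silently drops rows that would violate the primary key on `ts`.
    conn.executemany(
        "INSERT OR IGNORE INTO messages (ts, payload) VALUES (?, ?)",
        [(m["ts"], json.dumps(m)) for m in messages],
    )
    conn.commit()
    # Number of rows that were actually inserted by this call.
    return conn.total_changes - before


# Made-up payloads: the second batch repeats one `ts` from the first.
conn = sqlite3.connect(":memory:")
first_batch = [
    {"ts": "1710800000.000100", "text": "first email"},
    {"ts": "1710800001.000200", "text": "second email"},
]
second_batch = [
    {"ts": "1710800001.000200", "text": "second email"},
    {"ts": "1710800002.000300", "text": "third email"},
]
added_first = store_messages(conn, first_batch)
added_second = store_messages(conn, second_batch)
stored = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
```

Letting the primary key reject duplicates keeps a reload idempotent: running it twice over the same channel history leaves the table unchanged.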
diff --git a/botpy.py b/botpy.py
diff --git a/command.py b/command.py
diff --git a/data/examples.json b/data/examples.json
@@ -0,0 +1,28 @@
+[{
+    "text": "Hello Prof. Burke, my name is Iván Higuera-Mendieta. I am major in Economics at the Universidad de los Andes in Bogotá, Colombia. I am currently an intern at the Macro Stability Research Unit at the Central Bank of Colombia, and I am looking forward to leearn more about the research assitantship opening at the Sustain Lab at Stanford University. I am very intrested on the applications of Machine Learning in environmental economics and I am looking forward to learn more about the research that is being done at the Sustain Lab. I am looking forward to hear from you soon. Best, Iván Higuera-Mendieta.",
+    "name": "Ivan Higuera-Mendieta",
+    "undergraduate_institution": "Universidad de los Andes",
+    "graduate_institution": null,
+    "program_major": "Economics",
+    "advisor": null,
+    "current_workplace": "Central Bank of Colombia",
+    "current_role": "Intern",
+    "current_project_name": "Macro Stability Research Unit",
+    "email": null,
+    "quality_assessment": "10",
+    "overall_summary": "Iván is a major in Economics at the Universidad de los Andes in Bogotá, Colombia. He is currently an intern at the Macro Stability Research Unit at the Central Bank of Colombia, and he is looking forward to learn more about the research assitantship opening at the Sustain Lab at Stanford University. He is very intrested on the applications of Machine Learning in environmental economics and he is looking forward to learn more about the research that is being done at the Sustain Lab."
+},
+{
+    "text": "Dear Prof. Burke, Hope this email finds you well. My name is Emilio Leguízamo and I am reaching out because I will be applying to the E-IPER PhD program at Stanford this fall, and I am very interested in joining your lab as a graduate student. I hold a BA and MA degree in Economics from Universidad de los Andes in Colombia and have research experience working as an RA at los Andes and in the Inter-American Bank, where I currently work.  I’m working under the supervision of Allen Blackman, mainly on research projects that estimate the effects of deforestation and pollution on health, productivity, and livelihoods in Latin America. I'm interested in joining the E-IPER program to further study how environmental changes that affect access to natural resources and ecosystem services impact poverty and inequality. To achieve this, I aim to use satellite and administrative data to investigate how factors such as income, gender, race, and location are related to environmental issues. My goal is to produce research papers and data products that can guide policy design for equitable natural resource management. I am enthusiastic about joining E-IPER's interdisciplinary program, as I believe it will provide me with a better understanding of the environmental factors that influence human decisions and welfare while maintaining a social science approach. Your work at the ECHOlab and SustainLab aligns very well with some of the questions I intend to address during my PhD, particularly those related to air pollution and the impact of climate change on health, livelihoods, and cognitive ability. I am eager to be a part of these labs and learn more about the cutting-edge methodologies you use to process satellite data and create products that measure economic development at a fine scale. I believe these data products are invaluable tools for uncovering the unequal effects of climate change that I hope to address in my research. As the application deadline approaches, I wanted to take the opportunity to introduce myself and inquire whether you will be accepting graduate students for the fall of 2024. Thank you for your time. I look forward to your response and any potential discussions about research ideas or opportunities. I'd be happy to share my CV and any additional information if you'd like to take a closer look at the projects I’m working on. All the best,",
+    "name": "Emilio Leguízamo",
+    "undergraduate_institution": "Universidad de los Andes",
+    "graduate_institution": "Universidad de los Andes",
+    "program_major": "Economics",
+    "advisor": "Allen Blackman",
+    "current_workplace": "Inter-American Bank",
+    "current_role": "Research Assistant",
+    "current_project_name": "Deforestation and Pollution",
+    "email": null,
+    "quality_assessment": "10",
+    "overall_summary": "Emilio is a graduate student at Universidad de los Andes in Colombia. He is currently working as a research assistant at the Inter-American Bank, where he is working on research projects that estimate the effects of deforestation and pollution on health, productivity, and livelihoods in Latin America. He is interested in joining the E-IPER program to further study how environmental changes that affect access to natural resources and ecosystem services impact poverty and inequality. To achieve this, he aims to use satellite and administrative data to investigate how factors"
+}]
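The `main_prompt -> examples -> response_schema` order from the README can be sketched as a small prompt builder over records shaped like the ones in `data/examples.json` (a `text` field plus the expected structured fields). This is an illustration only, not the code in `src/extractor.py`: the function `build_prompt`, the section labels, the placement of the new email at the end, and the placeholder example record are assumptions.

```python
import json


def build_prompt(main_prompt, examples, response_schema, email_text):
    """Assemble a few-shot prompt: main prompt, then examples, then schema.

    Hypothetical helper. Every key of an example other than "text" is
    treated as the expected structured outcome for that example.
    """
    parts = [main_prompt]
    for example in examples:
        expected = {k: v for k, v in example.items() if k != "text"}
        parts.append(
            "Email:\n"
            + example["text"]
            + "\nExpected output:\n"
            + json.dumps(expected, ensure_ascii=False)
        )
    parts.append("Response schema:\n" + json.dumps(response_schema))
    # Assumption: the email to parse goes last, after the schema.
    parts.append("Email:\n" + email_text + "\nExpected output:")
    return "\n\n".join(parts)


# Placeholder record with the same shape as data/examples.json (made up).
examples = [
    {
        "text": "Hello Prof. Burke, my name is Jane Doe.",
        "name": "Jane Doe",
        "email": None,
    }
]
response_schema = {"name": "string or null", "email": "string or null"}
prompt = build_prompt(
    "Extract the candidate fields from the email.",
    examples,
    response_schema,
    "Dear Prof. Burke, I am writing about the open position.",
)
```

Serializing the expected fields with `json.dumps` turns Python `None` into JSON `null`, which matches how missing fields are written in `data/examples.json`.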
diff --git a/db_credentials.yaml b/db_credentials.yaml
diff --git a/echolab_candidates_bot.py b/echolab_candidates_bot.py
@@ -0,0 +1,136 @@
+import logging
+import os
+import sqlite3
+
+import pandas as pd
+from slack_bolt import App
+from slack_bolt.adapter.socket_mode import SocketModeHandler
+from slack_sdk import WebClient
+from tqdm import tqdm
+
+from src.retrieve_messages import parsing_messages, retrieve_messages
+from src.utils import send_messages_to_google_spreadsheet
+
+logging.basicConfig(level=logging.DEBUG)
+
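+# Set up the Slack client; the Web API client and the Bolt app share one token
+slack_token = os.getenv("SLACK_API_TOKEN")
+slack_bot_token = os.getenv("SLACK_BOT_TOKEN")  # passed to Socket Mode as app_token below
+slack_secret = os.getenv("SLACK_SIGNING_SECRET")  # read but not used in this file
+client = WebClient(token=slack_token)
+bolt_app = App(token=slack_token)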
+
+
+@bolt_app.event("app_mention")
+def event_test(say):
+    say(
+        {
+            "blocks": [
+                {
+                    "type": "section",
+                    "text": {
+                        "type": "mrkdwn",
+                        "text": "I am research procrastination at its finest, but also an HR assistant. After a posting, please reload me so we can add the candidates to the database.",
+                    },
+                },
+                {
+                    "type": "section",
+                    "text": {"type": "mrkdwn", "text": "Access the candidate database"},
+                    "accessory": {
+                        "type": "button",
+                        "text": {"type": "plain_text", "text": "🆒", "emoji": True},
+                        "value": "click_me_123",
+                        "url": "https://docs.google.com/spreadsheets/d/1rfNE7-SP3sYEWDg3tWw27lS1L8njRJ4QlXj1SqgQCf0/edit?usp=sharing",
+                        "action_id": "button-action",
+                    },
+                },
+            ]
+        }
+    )
+
+
+@bolt_app.command("/summary")
+def summary_command(say, ack):
+    ack("Querying database... 👨🏽‍💻")
+
+    conn = sqlite3.connect("data/slackbot_messages.db")
+
+    query = """
+        WITH table_group AS (
+        SELECT pm.name, pm.undergraduate_institution, pm.graduate_institution, pm.program_major, pm.advisor, pm.current_workplace, pm.current_project_name, pm.email, pm.quality_assessment, pm.overall_summary, c.channel_name, pm.ts, m.file_1, m.file_2, m.file_3, m.file_4, m.file_5
+        FROM parsed_messages pm
+        LEFT JOIN channels c ON pm.channel_id = c.channel_id
+        LEFT JOIN messages m ON pm.ts = m.ts
+        ) select channel_name, count(name) as count from table_group group by channel_name order by 2;
+        """
+
+    df = pd.read_sql_query(query, conn)
+    conn.close()
+    # Send message to Slack (`DataFrame.to_markdown` requires the `tabulate` package)
+    say(
+        {
+            "blocks": [
+                {
+                    "type": "rich_text",
+                    "elements": [
+                        {
+                            "type": "rich_text_section",
+                            "elements": [
+                                {
+                                    "type": "text",
+                                    "text": "Number of candidates by type:\n\n",
+                                }
+                            ],
+                        },
+                        {
+                            "type": "rich_text_preformatted",
+                            "elements": [
+                                {
+                                    "type": "text",
+                                    "text": f"{df.to_markdown(index=False)}",
+                                }
+                            ],
+                        },
+                    ],
+                }
+            ]
+        }
+    )
+
+
+@bolt_app.command("/reload")
+def reload_command(say, ack):
+    """Reload database to include new candidates in channel"""
+    ack("Loading database... 👨🏽‍💻")
+    channel_ids = ["C06PSDC08AX", "C06Q5A168DP", "C06PRB2EX61"]
+    poster_ids = ["U06N7CSQQKZ", "WBA9HFDCL"]
+    save_data = "./data/downloads"
+    db_path = "data/slackbot_messages.db"
+    conn = sqlite3.connect(db_path)
+    # Retrieve messages from the channels into the same database that `conn` reads
+    for channel_id in tqdm(channel_ids, desc="Retrieving messages from channels"):
+        retrieve_messages(
+            client,
+            channel_id,
+            filter_users=poster_ids,
+            save_data=save_data,
+            db_path=db_path,
+            messages_table="messages",
+        )
+
+    # Parse messages
+    parsed_messages = parsing_messages(conn)
+
+    # Send parsed messages to Google Spreadsheet
+    send_messages_to_google_spreadsheet(
+        parsed_messages, credentials="creds.json", conn=conn
+    )
+    conn.close()
+    # Send message to Slack
+    say(text="Database reloaded successfully! 🚀")
+
+
+if __name__ == "__main__":
+    SocketModeHandler(
+        bolt_app, app_token=slack_bot_token, web_client=client, trace_enabled=True
+    ).start()
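To see what the `/summary` aggregation returns without a Slack workspace, the same kind of count can be run against an in-memory database. This is a reduced sketch: the tables keep only the columns the count needs, and the channel names and rows are made up. The join to `messages` from the original query is omitted because, with `ts` as the primary key of `messages`, it cannot change the counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical versions of the bot's tables with made-up rows.
conn.executescript(
    """
    CREATE TABLE channels (channel_id TEXT PRIMARY KEY, channel_name TEXT);
    CREATE TABLE parsed_messages (ts TEXT PRIMARY KEY, channel_id TEXT, name TEXT);
    INSERT INTO channels VALUES ('C1', 'predocs'), ('C2', 'postdocs');
    INSERT INTO parsed_messages VALUES
        ('1.0', 'C1', 'Candidate A'),
        ('2.0', 'C1', 'Candidate B'),
        ('3.0', 'C2', 'Candidate C');
    """
)

# Count parsed candidates per channel, smallest group first.
query = """
    SELECT c.channel_name, COUNT(pm.name) AS count
    FROM parsed_messages pm
    LEFT JOIN channels c ON pm.channel_id = c.channel_id
    GROUP BY c.channel_name
    ORDER BY 2;
"""
rows = conn.execute(query).fetchall()
conn.close()
```

`COUNT(pm.name)` skips rows where the parsed name is `NULL`, so candidates whose name could not be extracted are not counted.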
diff --git a/event.py b/event.py