compliant-real-estate-chatbot
This repository contains a chatbot specialized in the real estate domain, built by fine-tuning the Llama 3 8B model with an emphasis on incorporating behavior that mitigates discriminatory practices such as steering and redlining, which have historically plagued the real estate industry in the United States.
In the United States, real estate transactions are regulated under the federal Fair Housing Act, which prohibits discrimination in connection with the sale, rental, or financing of a dwelling. Additionally, a number of states and localities have enacted separate fair housing requirements that mirror or expand upon federal law. Under these laws, discrimination is prohibited on the basis of race, color, national origin, sex (including sexual orientation and gender identity), religion, familial status, and disability.
More details can be found here.
We use pip to install the required packages. This project was tested with Python 3.10 and PyTorch 2.4.1. After installing those, install the remaining dependencies by running the following command:

```bash
pip install -r requirements.txt
```
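As a quick, optional sanity check (this snippet is not part of the repository), you can verify that the tested versions are in place before proceeding:

```python
import sys

import torch

# The project was tested with Python 3.10 and PyTorch 2.4.1; a GPU is recommended for inference.
print("python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
```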
Our model is hosted on the Hugging Face model hub. Request access, and then you can easily load the model and experiment with it using the following code:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zillow/realestateLM_llama3-8b"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model in bfloat16 to reduce memory usage.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = model.to(device)
model.eval()

messages = [
    {'role': 'system', 'content': 'You are a helpful real estate chatbot. Your primary goal is to provide accurate, compliant, and useful information to users.'},
    {'role': 'user', 'content': 'how do zoning laws impact the feasibility of integrating smart grid technology in new residential developments?'}
]

# Apply the Llama 3 chat template, generate, and strip the prompt tokens from the output.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
input_t = torch.LongTensor([input_ids]).to(device)
output = model.generate(input_t, max_new_tokens=1024)[:, input_t.shape[1]:]
resp = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(resp)
```
You can also use the evaluation/chat.py script to load the model and chat with it through Gradio:

```bash
python evaluation/chat.py \
    --model_name_or_path zillow/realestateLM_llama3-8b \
    --max_new_tokens 1024 \
    --temperature 0.7
```
This will open a Gradio interface where you can chat with the model.
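For reference, a chat wrapper like evaluation/chat.py can be approximated in a few lines of Gradio. The sketch below is a simplified illustration, not the script's actual code; the system prompt and generation settings are assumptions:

```python
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zillow/realestateLM_llama3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

def respond(message, history):
    # Rebuild the conversation in the format expected by the chat template.
    messages = [{"role": "system", "content": "You are a helpful real estate chatbot."}]
    for user_msg, bot_msg in history:
        messages += [{"role": "user", "content": user_msg},
                     {"role": "assistant", "content": bot_msg}]
    messages.append({"role": "user", "content": message})
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=1024, do_sample=True, temperature=0.7)
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)

gr.ChatInterface(respond).launch()
```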
All of our synthetically generated datasets are hosted on Hugging Face.
To generate the data from scratch, you first need to set your OpenAI API key. You can do this by exporting it as follows:

```bash
export OPENAI_API_KEY=your-api-key
```
The scripts for data generation are located in the data_generation directory. To generate the general instruction-following and dialog splits, run the following commands respectively:
```bash
python data_generation/diverse_QA_datagen.py \
    --n_iters 20000 \
    --llm_name gpt-4o \
    --save_batch_size 10 \
    --output_dir data_generation/data/ \
    --topics_file data_generation/data/real_estate_topics.txt \
    --n_subtopics 50 \
    --output_file_name general_instructions.json
```
```bash
python data_generation/diverse_conversation_datagen.py \
    --n_iters 5000 \
    --llm_name gpt-4o \
    --save_batch_size 10 \
    --output_dir data_generation/data/ \
    --topics_file data_generation/data/conversation_topics.txt \
    --output_file_name dialogs.json
```
Samples of the generated data can be found in the data_generation/data/ directory.
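If you want a quick look at what was generated, you can load an output file and print a record. This snippet assumes only that the file holds a JSON array:

```python
import json
from pathlib import Path

# Path produced by the commands above.
path = Path("data_generation/data/general_instructions.json")
records = json.loads(path.read_text())
print(f"{len(records)} records")
print(json.dumps(records[0], indent=2))  # inspect the first example's fields
```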
To generate the safety split of the dataset, first request access to the fair housing dataset and download it to data_generation/data/fairhousing.json. Filter out the non-compliant examples from the dataset and store their queries in a separate text file named 'non-compliant-queries.txt' (the first section of the data_preparation.ipynb notebook does this). Then run the following command to generate the responses given our defined safe behavior:
```bash
python data_generation/response_generator.py \
    --query_file data_generation/data/non-compliant-queries.txt \
    --llm_name gpt-4o \
    --system_prompt 'non-compliant-response' \
    --save_path data_generation/data/safety.json
```
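Conceptually, response_generator.py pairs each non-compliant query with a compliant answer from the teacher model. Below is a minimal sketch of that loop using the standard OpenAI Python client; the system prompt text is an illustrative stand-in for the repo's actual 'non-compliant-response' prompt:

```python
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative stand-in for the repo's actual safety system prompt.
SYSTEM_PROMPT = (
    "You are a fair-housing-compliant real estate assistant. Decline to steer "
    "users based on protected attributes and redirect them to objective criteria."
)

queries = Path("data_generation/data/non-compliant-queries.txt").read_text().splitlines()
results = []
for query in queries:
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": query}],
    )
    results.append({"query": query, "response": completion.choices[0].message.content})

Path("data_generation/data/safety.json").write_text(json.dumps(results, indent=2))
```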
You can then follow the rest of the data_preparation.ipynb notebook to postprocess the generated data, including conversion to the LLM chat format, pruning the dataset using a Sentence-BERT transformer, and splitting the data into train, validation, and test sets.
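The pruning step removes near-duplicate examples so the training set stays diverse. Here is a minimal sketch of that idea with the sentence-transformers library; the embedding model and similarity threshold are assumptions, and the actual values are in the notebook:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model and threshold; see data_preparation.ipynb for the real ones.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["What is an FHA loan?", "Explain FHA loans.", "How do property taxes work?"]
embeddings = encoder.encode(texts, convert_to_tensor=True, normalize_embeddings=True)

kept = []
for i, text in enumerate(texts):
    # Keep an example only if it is not too similar to anything already kept.
    if all(util.cos_sim(embeddings[i], embeddings[j]).item() < 0.9 for j in kept):
        kept.append(i)

print([texts[i] for i in kept])
```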
To fine-tune the model, you can run the training/run_trainer.sh script. To reproduce the results from the paper, you should use 25% of the safety data for training (details can be found in the data_preparation.ipynb notebook). The rest of the configs are set in the script. Make sure you postprocess the data and create the validation splits before training.
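The benchmark scripts below reference LoRA adapters, so the trainer fine-tunes Llama 3 with parameter-efficient adapters. A minimal PEFT-style sketch follows; the base model name and hyperparameters here are placeholders, and the real configuration lives in training/run_trainer.sh:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model and LoRA hyperparameters; see training/run_trainer.sh.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```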
We use Amazon Bedrock to generate responses for our baseline models. The scripts to run the baseline models on the test set can be found in the evaluation directory. To run the baseline models, you need to set up your Bedrock credentials first. Sample scripts for generating baseline responses and the fine-tuned model responses can be found in the generate_test_responses.sh script.
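For reference, a single Bedrock call through boto3 looks roughly like the following. The region, model ID, and request body are illustrative (they follow the Anthropic-on-Bedrock message format), not the repo's actual configuration:

```python
import json

import boto3

# Credentials come from the usual AWS environment/config; the region is an assumption.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "What is redlining?"}],
})
response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative baseline model
    body=body,
)
print(json.loads(response["body"].read())["content"][0]["text"])
```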
After generating responses, you can run the G-Eval evaluation script for all of them. Here's a sample command to run our G-Eval metrics on the generated responses of a single model:
```bash
DEEPEVAL_RESULTS_FOLDER="data/geval_results/" \
OPENAI_API_KEY="your api key here" \
python geval_evaluation.py --test_data_path path_to_generated_responses_by_some_model.json \
    --evaluation_metrics "[helpfulness_with_ref, helpfulness_without_ref, safety_with_ref, safety_without_ref]"
```
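Under the hood, these metrics follow deepeval's G-Eval pattern, in which an LLM judge scores responses against a natural-language rubric. Below is a simplified illustration; the criteria text is an assumption, and the repo's actual rubrics live in geval_evaluation.py:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Illustrative rubric; the actual criteria are defined in geval_evaluation.py.
safety = GEval(
    name="safety_without_ref",
    criteria="Does the response avoid steering, redlining, or other fair-housing violations?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

case = LLMTestCase(
    input="Which neighborhoods have the best demographics for my family?",
    actual_output="I can't recommend neighborhoods based on demographics, but I can compare schools, commute times, and prices.",
)
safety.measure(case)
print(safety.score, safety.reason)
```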
Our head-to-head comparison results can be seen in the figure below, which illustrates the win rate of each model on the left versus the model along the top, with a one percent difference threshold for ties. After running the evaluation, you can use the result_analysis.ipynb notebook to analyze the results and generate the tables and figures in the paper.
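For clarity, the tie rule can be stated in a couple of lines. Treating the one percent threshold as an absolute score difference is our assumption here; the exact logic is in result_analysis.ipynb:

```python
def head_to_head(score_a: float, score_b: float, tie_threshold: float = 0.01) -> str:
    """Declare a winner unless the scores differ by less than the threshold (assumed absolute)."""
    if abs(score_a - score_b) < tie_threshold:
        return "tie"
    return "model_a" if score_a > score_b else "model_b"

print(head_to_head(0.82, 0.815))  # -> 'tie'
```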
You can request access to our benchmark datasets (see the contact us section). Afterwards, you'll be able to generate responses the same way as for the test set using generate_benchmark_responses_baselines.py and generate_benchmark_responses_lora.py. After you have generated responses with different models, you can run the judge LLM on a pair of model responses using evaluation_head_to_head_mtbench_fullconv.py.
Here's a sample command to run the judge LLM evaluation for helpfulness and safety respectively:
```bash
OPENAI_API_KEY="your api key here" \
python evaluation_head_to_head_mtbench_fullconv.py --model1_response_file path_to_model1_responses.json \
    --model2_response_file path_to_model2_responses.json \
    --result_dir data/benchmark_results --evaluator_prompt_file prompts/gpt4-evaluator_mtbench.txt
```

```bash
OPENAI_API_KEY="your api key here" \
python evaluation_head_to_head_mtbench_fullconv.py --model1_response_file path_to_model1_responses.json \
    --model2_response_file path_to_model2_responses.json \
    --result_dir data/benchmark_results --evaluator_prompt_file prompts/gpt4-evaluator_mtbench-safety.txt
```
The fine-tuned model (provided upon request) is highly experimental and an ongoing work in progress, and we will be iteratively improving its accuracy. This model is not designed to ensure compliance and should not be used as such. We recognize that users will interpret fair housing and fair lending requirements based on their own understanding and risk appetite, and they are responsible for ensuring compliance with all applicable laws and regulations when integrating the model into different use cases.
If you are interested in obtaining the benchmarks data and/or trained model, kindly contact us at [email protected]. In your message, provide a brief paragraph outlining your intended use case and how you plan to utilize both the model and dataset.
If you use this code in academic work, please consider citing our paper (https://arxiv.org/abs/2410.10860):
```bibtex
@misc{madani2024recipebuildingcompliantreal,
  title={A Recipe For Building a Compliant Real Estate Chatbot},
  author={Navid Madani and Anusha Bagalkotkar and Supriya Anand and Gabriel Arnson and Rohini Srihari and Kenneth Joseph},
  year={2024},
  eprint={2410.10860},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.10860},
}
```