John Snow Labs Releases LangTest 2.5.0: Spark & Delta Live Tables Support, Image & Performance Robustness Tests, Customizable LLM Templates, Enhanced VQA & Chat Models

@chakravarthik27 chakravarthik27 released this 24 Dec 15:12
· 12 commits to main since this release
c457432

📒 Highlights

We are thrilled to announce the latest release, packed with exciting updates and enhancements to empower your AI model evaluation and development workflows!

  • 🔗 Spark DataFrames and Delta Live Tables Support
    We've expanded our capabilities with support for Spark DataFrames and Delta Live Tables from Databricks, allowing seamless integration and efficient data processing for your projects.

  • 🧪 Performance Degradation Analysis in Robustness Testing
    Introducing Performance Degradation Analysis in robustness tests! Gain insights into how your models handle edge cases and ensure consistent performance under challenging scenarios.

  • 🖼 Enhanced Image Robustness Testing
    We've added new test types for Image Robustness to evaluate your vision models rigorously. Models can now be tested against diverse image perturbations to assess their ability to adapt.

  • 🛠 Customizable Templates for LLMs
    Personalize your workflows effortlessly with customizable templates for large language models (LLMs) from Hugging Face. Tailor prompts and configurations to meet your specific needs.

  • 💬 Improved LLM and VQA Model Functionality
    Enhancements to chat and completion functionality make interactions with LLMs and Vision Question Answering (VQA) models more robust and user-friendly.

  • ✔ Improved Unit Tests and Type Annotations
    We've bolstered unit tests and type annotations across the board, ensuring better code quality, reliability, and maintainability.

  • 🌐 Website Updates
    The website has been updated with new content highlighting Databricks integration, including tutorials for Spark DataFrames and Delta Live Tables.

🔥 Key Enhancements

🔗 Spark DataFrames and Delta Live Tables Support

We've expanded our capabilities with support for Spark DataFrames and Delta Live Tables from Databricks, enabling seamless integration and efficient data processing for your projects.

Key Features

  • Seamless Integration: Easily incorporate Spark DataFrames and Delta Live Tables into your workflows.
  • Enhanced Efficiency: Optimize data processing with Databricks' powerful tools.

How it works:

from pyspark.sql import DataFrame

# Load the dataset into a Spark DataFrame.
# `spark` is the SparkSession that Databricks notebooks provide by default.
df: DataFrame = spark.read.json("<FILE_PATH>")

df.printSchema()

Tests Config:

prompt_template = (
    "You are an AI bot specializing in providing accurate and concise answers to questions. "
    "You will be presented with a medical question and multiple-choice answer options. "
    "Your task is to choose the correct answer.\n"
    "Question: {question}\n"
    "Options: {options}\n"
    "Answer: "
)
from langtest.types import HarnessConfig

test_config: HarnessConfig = {
    "evaluation": {
        "metric": "llm_eval",
        "model": "gpt-4o", # for evaluation
        "hub": "openai",
    },
    "tests": {
        "defaults": {
            "min_pass_rate": 1.0,
            "user_prompt": prompt_template,
        },
        "robustness": {
            "add_typo": {"min_pass_rate": 0.8},
            "add_ocr_typo": {"min_pass_rate": 0.8},
            "add_speech_to_text_typo":{"min_pass_rate": 0.8},
            "add_slangs": {"min_pass_rate": 0.8},
            "uppercase": {"min_pass_rate": 0.8},
        },
    },
}
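
To make the role of the {question} and {options} placeholders concrete, here is a minimal sketch of how a template like the one above could be filled for one sample, before and after an uppercase-style perturbation. It uses plain str.format; the shortened template, the sample question, the options, and the helper name render_prompt are all hypothetical, and LangTest's internal rendering may differ.

```python
# Illustrative only: fill {question} and {options} with plain str.format.
# PROMPT_TEMPLATE is a shortened stand-in for the template defined above.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Options: {options}\n"
    "Answer: "
)


def render_prompt(question: str, options: str) -> str:
    """Hypothetical helper: substitute one sample into the template."""
    return PROMPT_TEMPLATE.format(question=question, options=options)


# Hypothetical sample, not taken from any dataset.
question = "Which vitamin deficiency causes scurvy?"
options = "A. Vitamin A\nB. Vitamin C\nC. Vitamin D"

original_prompt = render_prompt(question, options)
# An uppercase-style perturbation changes only the question text.
perturbed_prompt = render_prompt(question.upper(), options)

print(original_prompt)
print(perturbed_prompt)
```

The model is queried with both prompts, and the two answers are then compared by the evaluation step.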

Dataset Config:

input_data = {
    "data_source": df,
    "source": "spark",
    "spark_session": spark,  # make sure the Spark session is already running
}

Model Config:

model_config = {
    "model": {
        "endpoint": "databricks-meta-llama-3-1-70b-instruct",
    },
    "hub": "databricks",
    "type": "chat",
}

Harness Setup:

from langtest import Harness

harness = Harness(
    task="question-answering",
    model=model_config,
    data=input_data,
    config=test_config,
)
harness.generate().run().report()
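
As a rough illustration of how a min_pass_rate threshold can be read in a report, the sketch below computes a pass rate per test type from per-sample outcomes and compares it with the threshold. The outcome lists and the pass_rate helper are hypothetical, and this is not LangTest's own reporting code.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of test cases that passed (0.0 for an empty list)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical per-sample outcomes for two robustness test types.
outcomes_by_test: Dict[str, List[bool]] = {
    "add_typo": [True, True, True, True, False],    # 4 of 5 pass
    "uppercase": [True, False, True, False, True],  # 3 of 5 pass
}
min_pass_rate = 0.8  # same threshold as the robustness tests in the config above

summary = {
    name: {
        "pass_rate": pass_rate(results),
        "pass": pass_rate(results) >= min_pass_rate,
    }
    for name, results in outcomes_by_test.items()
}

for name, row in summary.items():
    print(name, row)
```

A test type passes only when its observed pass rate reaches the configured minimum, so raising min_pass_rate makes the check stricter.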

To Review and Store in DLT

testcases = harness.testcases()
testcases

# Write the test cases to a Delta table.
testcases_dlt_df = spark.createDataFrame(testcases)
testcases_dlt_df.write.format("delta").save("<FILE_PATH>")

generated_results = harness.generated_results()
generated_results

# Write the results to a Delta table.
results_dlt_df = spark.createDataFrame(generated_results)

# Choose a write mode based on your requirements: append the results
# to an existing table or overwrite the table.
results_dlt_df.write.format("delta").save("<FILE_PATH>")

🧪 Performance Degradation Analysis in Robustness Testing

Introducing Performance Degradation Analysis in robustness tests! Gain insights into how your models handle edge cases and ensure consistent performance under challenging scenarios.

Key Features

  • Edge Case Insights: Understand model behavior in extreme conditions.
  • Performance Consistency: Ensure reliability across diverse inputs.

How it works:

from langtest.types import HarnessConfig
from langtest import Harness
test_config = HarnessConfig({
    "tests": {
        "defaults": {
            "min_pass_rate": 0.6,
        },
        "robustness": {
            "uppercase": {
                "min_pass_rate": 0.7,
            },
            "lowercase": {
                "min_pass_rate": 0.7,
            },
            "add_slangs": {
                "min_pass_rate": 0.7,
            },
            "add_ocr_typo": {
                "min_pass_rate": 0.7,
            },
            "titlecase": {
                "min_pass_rate": 0.7,
            }
        },
        "accuracy": {
            "degradation_analysis": {
                "min_score": 0.7,
            }
        }
    }
})

# data config
data = {
    "data_source": "BoolQ",
    "split": "dev-tiny",
}

Setup Harness:

harness = Harness(
    task="question-answering", 
    model={
        "model": "llama3.1:latest", 
        "hub": "ollama",
        "type": "chat",
    },
    config=test_config,
    data=data
)

harness.generate().run()

Harness Report

harness.report()
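
One way to read a degradation analysis is as a comparison between a score on the original inputs and the same score on perturbed inputs. The sketch below assumes exact-match accuracy and uses hypothetical labels and predictions; how LangTest computes and reports degradation may differ.

```python
from typing import List


def accuracy(predictions: List[str], labels: List[str]) -> float:
    """Exact-match accuracy after trimming and lowercasing both sides."""
    matches = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, labels)
    )
    return matches / len(labels)


# Hypothetical BoolQ-style labels and model answers.
labels = ["true", "false", "true", "true"]
preds_original = ["true", "false", "true", "false"]   # 3 of 4 correct
preds_perturbed = ["true", "true", "false", "false"]  # 1 of 4 correct

acc_original = accuracy(preds_original, labels)
acc_perturbed = accuracy(preds_perturbed, labels)
degradation = acc_original - acc_perturbed  # positive means the score dropped

print(f"original={acc_original:.2f} perturbed={acc_perturbed:.2f} drop={degradation:.2f}")
```

Under this reading, a small drop suggests the model is stable under the perturbation, while a large drop points to inputs worth inspecting.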

🖼 Enhanced Image Robustness Testing

We've added new test types for Image Robustness to evaluate your vision models rigorously. Challenge your models with diverse image perturbations and assess their ability to adapt.

Key Features

  • Diverse Perturbations: Evaluate performance with new image robustness tests.
  • Vision Model Assessment: Test adaptability under varied visual conditions.

Perturbations:

  • image_translate: Shifts the image horizontally or vertically to evaluate model robustness against translations.
  • image_shear: Applies a shearing transformation to test how the model handles distortions in perspective.
  • image_black_spots: Introduces random black spots to simulate damaged or obscured image regions.
  • image_layered_mask: Adds layers of masking to obscure parts of the image, testing recognition under occlusion.
  • image_text_overlay: Places text on the image to evaluate the model's resilience to textual interference.
  • image_watermark: Adds a watermark to test how the model performs with watermarked images.
  • image_random_text_overlay: Randomly places text at varying positions and sizes, testing model robustness to overlays.
  • image_random_line_overlay: Draws random lines over the image to check the model's tolerance for line obstructions.
  • image_random_polygon_overlay: Adds random polygons to the image, simulating graphical interference or shapes.
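
To give a feel for what two of the perturbations listed above do to pixel data, here is a minimal pure-Python sketch of a translation and of random black spots on a tiny grayscale image. The functions translate and black_spots are hypothetical illustrations, not LangTest's implementations, which operate on real images and may use different parameters.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values (0-255)


def translate(image: Image, dx: int, dy: int, fill: int = 0) -> Image:
    """Shift the image by (dx, dy) pixels, filling uncovered pixels with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_x, new_y = x + dx, y + dy
            if 0 <= new_x < width and 0 <= new_y < height:
                shifted[new_y][new_x] = image[y][x]
    return shifted


def black_spots(image: Image, count: int, seed: int = 0) -> Image:
    """Set up to `count` randomly chosen pixels to 0 (positions may repeat)."""
    rng = random.Random(seed)
    spotted = [row[:] for row in image]  # copy so the input stays unchanged
    for _ in range(count):
        y = rng.randrange(len(image))
        x = rng.randrange(len(image[0]))
        spotted[y][x] = 0
    return spotted


image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
shifted = translate(image, dx=1, dy=0)
spotted = black_spots(image, count=2)

print(shifted)
print(spotted)
```

A robust vision model should give the same answer for the original image and for lightly perturbed versions such as these.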

How it Works:

from langtest.types import HarnessConfig
from langtest import Harness
test_config = HarnessConfig(
{
    "evaluation": {
        "metric": "llm_eval",
        "model": "gpt-4o-mini",
        "hub": "openai"

    },
    "tests": {
        "defaults": {
            "min_pass_rate": 0.5,
            "user_prompt": "{question}?\n {options}\n",
        },
        "robustness": {
            "image_random_line_overlay": {
                "min_pass_rate": 0.5,
            },
            "image_random_polygon_overlay": {
                "min_pass_rate": 0.5,
            },
            "image_random_text_overlay": {
                "min_pass_rate": 0.5,
                "parameters": {
                    "color": [123, 144, 123],
                    "opacity": 0.8
                }
            },
            "image_watermark": {
                "min_pass_rate": 0.5,
            },
        }
    }
}
)

Setup Harness:

from langtest import Harness

harness = Harness(
    task="visualqa",
    model={
        "model": "gpt-4o-mini",
        "hub": "openai"
    },
    data={"data_source": 'MMMU/MMMU',
          # "subset": "Clinical_Medicine",
          "subset": "Art",
          "split": "dev",
          "source": "huggingface"
    },
    config=test_config
)

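harness.generate().run()

from IPython.display import display, HTML

# Show a random sample of five results; escape=False keeps embedded HTML
# (such as image tags) unescaped so it renders in the notebook.
res_df = harness.generated_results()
html = res_df.sample(5).to_html(escape=False)

display(HTML(html))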

Harness Report

harness.report()

🛠 Customizable Templates for LLMs

Personalize your workflows effortlessly with customizable templates for large language models (LLMs) from Hugging Face. Tailor prompts and configurations to meet your specific needs.

Key Features

  • Workflow Personalization: Customize LLM templates to suit your tasks.
  • Enhanced Usability: Simplify configurations with pre-built templates.

How it Works:

from langtest.types import HarnessConfig
from langtest import Harness

import os 

os.environ["HUGGINGFACE_API_KEY"] = "<YOUR HUGGINGFACE API>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
# Only Jinja templates are supported
meta_template = """
{{- bos_token }}\n

{%- if messages[0]['role'] == 'system' %} 
    {%- set system_message = messages[0]['content']|trim %} 
    {%- set messages = messages[1:] %} 
{%- else %} 
    {%- set system_message = "You are a helpful assistant. Provide a short answer based on the given context and question in plain text." %} 
{%- endif %}

{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\\n" }}
{{- system_message }}
{{- "<|eot_id|>" }}

{%- for message in messages %} 
    {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n' + message['content'] | trim + '<|eot_id|>' }} 
{%- endfor %} 
{{- '<|start_header_id|>assistant<|end_header_id|>\\n' }}

"""

# few shot prompt config
prompt_config =  {
    "NQ-open": {
        "prompt_type": "chat",
        "instructions": "Write a short answer based on the given context and question in plain text.",
        "user_prompt": "You are a helpful assistant. Provide a short answer based on the given context and question.\n {question}",
        "examples": [{
            "user": {
                "question": "What is the capital of France?",
                "context": "France is a country in Europe."
            },
            "ai": {
                "answer": "Paris"
            }
        }]
    }
}
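
To show how a few-shot entry like the one above relates to the chat messages that a Jinja chat template iterates over, here is a minimal sketch that expands the config into a message list. The build_messages helper and the new question are hypothetical, and this is only one plausible expansion; LangTest's actual prompt construction may differ.

```python
from typing import Any, Dict, List

# Copy of the "NQ-open" entry from the prompt config above.
nq_open_config: Dict[str, Any] = {
    "prompt_type": "chat",
    "instructions": "Write a short answer based on the given context and question in plain text.",
    "user_prompt": (
        "You are a helpful assistant. Provide a short answer based on the "
        "given context and question.\n {question}"
    ),
    "examples": [
        {
            "user": {
                "question": "What is the capital of France?",
                "context": "France is a country in Europe.",
            },
            "ai": {"answer": "Paris"},
        }
    ],
}


def build_messages(config: Dict[str, Any], new_question: str) -> List[Dict[str, str]]:
    """Hypothetical expansion: system instructions, one user/assistant
    pair per example, then the new question as the final user turn."""
    messages = [{"role": "system", "content": config["instructions"]}]
    for example in config["examples"]:
        # str.format ignores keys (such as "context") that the prompt does not use.
        user_text = config["user_prompt"].format(**example["user"])
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": example["ai"]["answer"]})
    messages.append(
        {"role": "user", "content": config["user_prompt"].format(question=new_question)}
    )
    return messages


messages = build_messages(nq_open_config, "Who wrote Hamlet?")
for message in messages:
    print(message["role"], "->", message["content"])
```

A chat template such as meta_template would then wrap each of these messages in the model's role markers.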

Test Config:

from langtest.types import HarnessConfig


test_config: HarnessConfig = {
    "evaluation": {
        "metric": "llm_eval",
        "model": "gpt-4o",
        "hub": "openai",
    },
    "prompt_config": prompt_config,
    "model_parameters": {
        "chat_template": meta_template,
        "max_tokens": 50,
        "task": "text-generation",
        "device": 0, # Use GPU 0
    },
    "tests": {
        "defaults": {
            "min_pass_rate": 0.6,
        },
        "robustness": {
            "uppercase": {
                "min_pass_rate": 0.7,
            },
            "add_slangs": {
                "min_pass_rate": 0.7,
            },
            "add_ocr_typo": {
                "min_pass_rate": 0.7,
            },
        },
    }
}

Harness Setup:

harness = Harness(
    task="question-answering",
    model={
        "model": "meta-llama/Llama-3.2-3B-Instruct", 
        "hub": "huggingface",
        "type": "chat",
    },
    data={"data_source": "NQ-open",
          "split": "test-tiny"},
    config=test_config,
)
harness.generate().run().report()

harness.generated_results()

💬 Improved LLM and VQA Model Functionality

We have enhanced the chat and completion functionality, making interactions with LLMs and Vision Question Answering (VQA) models more robust and intuitive. These improvements enable smoother conversational experiences with LLMs and deliver better performance for VQA tasks. The updates focus on creating a more user-friendly and efficient interaction framework, ensuring high-quality results for diverse applications.

✔ Improved Unit Tests and Type Annotations

We have strengthened unit tests and implemented clearer type annotations throughout the codebase to ensure improved quality, reliability, and maintainability. These updates enhance testing coverage and robustness, making the code more resilient and dependable. Additionally, the use of precise type annotations supports better readability and easier maintenance, contributing to a more efficient development process.

🌐 Website Updates

The website has been updated to feature new content emphasizing Databricks integration. It now includes tutorials that showcase working with Spark DataFrames and Delta Live Tables, providing users with practical insights and step-by-step guidance. These additions aim to enhance the learning experience by offering comprehensive resources tailored to Databricks users. The updated content highlights key features and capabilities, ensuring a more engaging and informative experience.

📒 New Notebooks

  • LangTest-Databricks Integration
  • Degradation Analysis Test
  • Custom Chat Template Config

What's Changed

Full Changelog: 2.4.0...2.5.0