Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FSTORE-1008] enable interacting with java client to hopsworks #200

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,7 @@ dmypy.json
.vscode
*.iml
target/

**target/
# Mac
.DS_Store

Expand Down
41 changes: 41 additions & 0 deletions java/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
# Online Feature Vector Retrieval Using Java Application

## Introduction
In this tutorial you will learn how to fetch feature vectors from online feature store for near real-time model serving
using external java application.

## Clone tutorials repository
This section requires maven; java 1.8 and git.

```bash
git clone https://github.com/logicalclocks/hopsworks-tutorials
cd ./hopsworks-tutorials/java
mvn clean package
```

## Create Feature Group and Feature View
This tutorial comes with pyspark program with a code to create feature group and feature view:
- `./setup/create_fg_fv.py`

Feature group data is generated using `dbldatagen` library.

You can execute this pyspark program directly on Hopsworks cluster. Follow the documentation how to set up and run
[spark jobs](https://docs.hopsworks.ai/hopsworks-api/3.3/generated/api/jobs/)

## Execute java application:
Now you will create [connection](https://docs.hopsworks.ai/hopsworks-api/3.3/generated/api/connection/) with
your Hopsworks cluster. For this you need to have Hopsworks cluster host address and [api key](https://docs.hopsworks.ai/3.3/user_guides/projects/api_key/create_api_key/)

Then define environment variables

```bash
HOPSWORKS_HOST=REPLACE_WITH_YOUR_HOPSWORKS_CLUSTER_HOST
HOPSWORKS_API_KEY=REPLACE_WITH_YOUR_HOPSWORKS_API_KEY
HOPSWORKS_PROJECT_NAME=REPLACE_WITH_YOUR_HOPSWORKS_PROJECT_NAME
export FEATURE_VIEW_NAME=products_fv
export FEATURE_VIEW_VERSION=1
```

```bash
java -jar ./target/hopsworks-java-tutorial-3.4.0-SNAPSHOT-jar-with-dependencies.jar $HOPSWORKS_HOST $HOPSWORKS_API_KEY $HOPSWORKS_PROJECT_NAME $FEATURE_VIEW_NAME $FEATURE_VIEW_VERSION
```
144 changes: 144 additions & 0 deletions java/pom.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>com.hopsworks.tutorials</groupId>
<artifactId>hopsworks-java-tutorial</artifactId>
<version>3.7.0-SNAPSHOT</version>

<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<guava.version>14.0.1</guava.version>
<httpclient.version>4.5.6</httpclient.version>
<httpcore.version>4.4.13</httpcore.version>
<surefire-plugin.version>2.22.0</surefire-plugin.version>
</properties>

<dependencies>
<dependency>
<groupId>com.logicalclocks</groupId>
<artifactId>hsfs</artifactId>
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>${httpclient.version}</version>
</dependency>

<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>${httpcore.version}</version>
</dependency>

<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>${guava.version}</version>
</dependency>

<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.33</version>
<scope>runtime</scope>
</dependency>

<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>32.1.2-jre</version>
</dependency>
</dependencies>

<build>
<plugins>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.hopsworks.tutorials.FeatureVectors</mainClass>
</manifest>
</archive>
</configuration>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.4.1</version>
<configuration>
<archive>
<manifest>
<mainClass>com.hopsworks.tutorials.FeatureVectors</mainClass>
</manifest>
</archive>
<!-- get all project dependencies -->
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<!-- bind to the packaging phase -->
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>${surefire-plugin.version}</version>
<configuration>
<systemProperties>
<property>
<name>hadoop.home.dir</name>
<value>${project.basedir}/src/test/resources/hadoop/</value>
</property>
</systemProperties>
<systemPropertiesFile>src/test/resources/system.properties</systemPropertiesFile>
</configuration>
</plugin>
</plugins>
<testResources>
<testResource>
<directory>src/test/resources</directory>
</testResource>
</testResources>
</build>
<repositories>
<repository>
<id>Hops</id>
<name>Hops Repo</name>
<url>https://archiva.hops.works/repository/Hops/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>

<!--
<distributionManagement>
<repository>
<id>Hops</id>
<name>Hops Repo</name>
<url>https://archiva.hops.works/repository/Hops/</url>
</repository>
</distributionManagement>
-->
</project>
139 changes: 139 additions & 0 deletions java/setup/create_fg_fv.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,139 @@
### <span style='color:#ff5f27'> 📝 Imports

import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType, StringType, BooleanType, TimestampType, ArrayType
from pyspark.sql.functions import pandas_udf

import pandas as pd
import numpy as np

import hopsworks
import random

### <span style='color:#ff5f27'> 👩🏻‍🔬 Generation Function
spark = SparkSession.builder.enableHiveSupport().getOrCreate()


def generate_data(n_rows):
df_spec = (
dg.DataGenerator(spark, name="Products_Data", rows=n_rows, partitions=4)
.withColumn(
"user_id",
IntegerType(),
minValue=0,
maxValue=300000,
random=True,
)
.withColumn(
"product_id",
IntegerType(),
minValue=0,
maxValue=1000,
random=True,
)
.withColumn(
"timestamp",
TimestampType(),
random=True,
)
.withColumn(
"col_float",
FloatType(),
expr="floor(rand() * 350) * (86400 + 3600)",
numColumns=110,
random=True,
)
.withColumn(
"col_str",
StringType(),
numColumns=8,
values=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
random=True,

)
.withColumn(
"col_int",
IntegerType(),
numColumns=6,
minValue=0,
maxValue=500,
random=True,
)
.withColumn(
"col_bool",
BooleanType(),
numColumns=4,
random=True,
)
)

df = df_spec.build()

return df


@pandas_udf(ArrayType(IntegerType()))
def generate_list_col(rows: pd.Series) -> pd.Series:
return pd.Series([np.random.randint(100, size=random.randint(10, 31)) for _ in range(len(rows))])

@pandas_udf(ArrayType(IntegerType()))
def generate_click_col(rows: pd.Series) -> pd.Series:
return pd.Series([np.random.randint(10, size=random.randint(0, 5)) for _ in range(len(rows))])

## <span style="color:#ff5f27;">🔮 Generate Data </span>
n_rows = 5_000

data_generated_products = generate_data(n_rows)

# Get the number of rows
num_rows = data_generated_products.count()

# Get the number of columns
num_columns = len(data_generated_products.columns)

print("Number of rows:", num_rows)
print("Number of columns:", num_columns)

for i in range(6):
data_generated_products = data_generated_products.withColumn(
f'col_list_{i}',
generate_list_col(data_generated_products.product_id + i)
)

data_generated_products = data_generated_products.withColumn("clicks", generate_click_col(data_generated_products.product_id + i))

clicks_df = data_generated_products.select("user_id", "product_id", "timestamp", "clicks")
products_df = data_generated_products.drop("user_id", "clicks")

# Get the number of rows
num_rows = data_generated_products.count()

# Get the number of columns
num_columns = len(data_generated_products.columns)

print("Number of rows:", num_rows)
print("Number of columns:", num_columns)

## <span style="color:#ff5f27;">🪄 Feature Group Creation</span>
project = hopsworks.login()
fs = project.get_feature_store()

products_fg = fs.get_or_create_feature_group(
name="products",
version=1,
description="Products Data",
primary_key=["product_id"],
event_time="timestamp",
stream=True,
online_enabled=True,
)
products_fg.insert(products_df)

products_fg = fs.get_feature_group(name="products", version=1)

fs.get_or_create_feature_view(
name='products_fv',
version=1,
query=products_fg.select_all(),
)
Loading