Reduce required third-party dependencies for Phileas core by removing phileas-metrics-service #122

Closed
robfromboulder opened this issue Aug 9, 2024 · 2 comments

@robfromboulder
Collaborator

Phileas is pretty svelte with the exception of PhileasMetricsService, which pulls in the io.micrometer packages and, with them, a large number of transitive dependencies.

These transitive dependencies have several negative impacts:

  • Single-jar executables using Phileas are pretty large (even for simple programs)
  • There's higher risk of class/version collision with large codebases like Trino
  • Some transitive dependencies have known security vulnerabilities

This issue was first identified with phileas-benchmark, which generates a single-jar executable using maven-assembly-plugin. Using the built-in jar-with-dependencies configuration results in a 270MB jar file.

A relatively easy workaround is to use a custom assembly configuration, which reduces the phileas-benchmark jar size to just 37MB, but requires explicitly listing every artifact that is needed:

<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.2.0 https://maven.apache.org/xsd/assembly-2.2.0.xsd">
    <id>cmd</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/</outputDirectory>
            <includes>
                <include>ai.philterd:phileas-benchmark:jar:</include>
                <include>ai.philterd:phileas-core:jar:</include>
                <include>ai.philterd:phileas-model:jar:</include>
                <include>ai.philterd:phileas-processors-unstructured:jar:</include>
                <include>ai.philterd:phileas-services-alerts:jar:</include>
                <include>ai.philterd:phileas-services-anonymization:jar:</include>
                <include>ai.philterd:phileas-services-disambiguation:jar:</include>
                <include>ai.philterd:phileas-services-metrics:jar:</include>
                <include>ai.philterd:phileas-services-policies:jar:</include>
                <include>com.googlecode.libphonenumber:libphonenumber:jar:</include>
                <include>io.micrometer:micrometer-core:jar:</include>
                <include>io.micrometer:micrometer-registry-cloudwatch:jar:</include>
                <include>io.micrometer:micrometer-registry-datadog:jar:</include>
                <include>io.micrometer:micrometer-registry-jmx:jar:</include>
                <include>io.micrometer:micrometer-registry-prometheus:jar:</include>
                <include>org.json:json:jar:</include>
            </includes>
            <useProjectArtifact>true</useProjectArtifact>
            <unpack>true</unpack>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>

👆 This works, but it would be a burden if every Phileas user had to maintain an include list like this. The resulting jar is also still larger than necessary when the MetricsService implementation isn't being activated.

The phileas-connector uses similar includes for building the Trino connector, which is built with similar tooling rather than maven-assembly-plugin.

Refactoring PhileasMetricsService as a dynamically-loaded implementation of MetricsService would keep the io.micrometer dependencies out of the Phileas core -- and open up the possibility of writing other MetricsService implementations (including allowing phileas-connector to publish metrics tables for Trino users, and a "blackhole" or in-memory implementation to use by default).
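One way the dynamic loading described above could look is with java.util.ServiceLoader. This is only a minimal sketch: the `MetricsService` interface below is a hypothetical stand-in (its method is an assumption, since the real signatures are not shown in this issue), and `MetricsServices` is an illustrative helper name, not an existing Phileas class.

```java
import java.util.ServiceLoader;

// Hypothetical stand-in for the Phileas MetricsService interface; the real
// method signatures are not shown in this issue and may differ.
interface MetricsService {
    void incrementProcessed(long count);
}

// "Blackhole" default: discards every metric and needs no third-party jars.
final class NoOpMetricsService implements MetricsService {
    @Override
    public void incrementProcessed(long count) {
        // Intentionally empty.
    }
}

// Illustrative helper (assumed name) that resolves an implementation at runtime.
final class MetricsServices {
    private MetricsServices() {}

    // Returns the first provider that ServiceLoader can discover, or the
    // no-op implementation when no provider is present at runtime.
    static MetricsService load() {
        return ServiceLoader.load(MetricsService.class)
                .findFirst()
                .orElseGet(NoOpMetricsService::new);
    }
}
```

With this shape, a micrometer-backed implementation could live in a separate artifact that registers itself through a META-INF/services entry, so the core would only pick it up when that artifact is actually on the classpath.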

@jzonthemtn
Member

@RobDickinson Thanks for typing this up. Agreed that Phileas should be more lightweight. I will take this one on since it might involve moving the metrics stuff out into its own GitHub repository to simplify the code and make it more loosely coupled.

jzonthemtn self-assigned this Aug 12, 2024
@jzonthemtn
Member

jzonthemtn commented Aug 25, 2024

Because Phileas is a library to do redaction, it has to be used from within another application. I don't think it is necessary for Phileas to have an integrated implementation of MetricsService when the application implementer can easily add their own metric collection and have more flexibility when doing so.

The phileas-metrics-service will be removed from Phileas and integrated with Philter so the functionality can still be used, but Phileas will now let users supply their own implementation of MetricsService.
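As a rough illustration of what an application-supplied implementation might look like, here is a small in-memory counter. The `MetricsService` interface and its `incrementFilterType(String)` method are assumptions made for this sketch; the actual Phileas interface may declare different methods and parameter types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the Phileas MetricsService interface; the method
// name and String parameter are assumptions for illustration only.
interface MetricsService {
    void incrementFilterType(String filterType);
}

// Application-side implementation that keeps counts in memory, with no
// dependency on io.micrometer or any other third-party library.
final class InMemoryMetricsService implements MetricsService {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    @Override
    public void incrementFilterType(String filterType) {
        // LongAdder keeps increments cheap under concurrent use.
        counts.computeIfAbsent(filterType, key -> new LongAdder()).increment();
    }

    // Returns the number of increments recorded for the given filter type.
    long count(String filterType) {
        LongAdder adder = counts.get(filterType);
        return adder == null ? 0L : adder.sum();
    }
}
```

An application could publish these counts to whatever monitoring system it already uses, which is the flexibility the comment above is pointing at.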

This, however, does not address the issue of the size of the jar file. The ONNX Runtime dependencies are responsible for a very large part of the 270 MB jar file.

Wrote #134 to take a better look at the size of the dependencies.
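For anyone wanting to check where the bytes in a single-jar build go, a small utility along these lines could total compressed sizes per package prefix. This is a sketch under assumptions: `JarSizeReport` and its method names are made up for illustration, and it relies only on the standard java.util.jar API.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarFile;

// Illustrative helper (assumed name) for seeing which packages dominate a jar.
final class JarSizeReport {
    private JarSizeReport() {}

    // Reduces an entry name such as "a/b/c/File.class" to its first `depth`
    // directory segments; entries with no directory map to "(root)".
    static String packagePrefix(String entryName, int depth) {
        String[] parts = entryName.split("/");
        int n = Math.min(depth, parts.length - 1); // last segment is the file name
        if (n <= 0) {
            return "(root)";
        }
        return String.join("/", Arrays.copyOfRange(parts, 0, n));
    }

    // Sums the compressed size of every non-directory entry by package prefix.
    static Map<String, Long> compressedBytesByPrefix(String jarPath, int depth)
            throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (JarFile jar = new JarFile(jarPath)) {
            jar.stream()
               .filter(entry -> !entry.isDirectory())
               .forEach(entry -> totals.merge(
                       packagePrefix(entry.getName(), depth),
                       Math.max(entry.getCompressedSize(), 0L), // -1 means unknown
                       Long::sum));
        }
        return totals;
    }
}
```

Calling `compressedBytesByPrefix` with the path of the assembled jar and a depth of 2 would return a map from prefixes to byte totals, which makes it easy to see whether the metrics packages or the ONNX Runtime packages account for most of the size.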

jzonthemtn added a commit that referenced this issue Aug 25, 2024
jzonthemtn added this to the 2.7.0 milestone Aug 25, 2024
jzonthemtn changed the title from "Reduce required third-party dependencies for Phileas core" to "Reduce required third-party dependencies for Phileas core by removing phileas-metrics-service" Aug 25, 2024
jzonthemtn added a commit that referenced this issue Aug 27, 2024
jzonthemtn added a commit that referenced this issue Aug 27, 2024
* #122 Removing metrics service.

* #122 Changing to NoOpMetricsService.