This library directly addresses a problem we faced with gRPC due to the nature of its connections, and is inspired by the custom solution we built to solve it for an application running at tremendously high scale in production.
This is a very common gRPC-level problem that almost every gRPC user is likely to face, and since not much is available in the community around it, the idea is to share the solution with everyone.
This is a custom implementation of gRPC connection pooling, which manages a pool of connections between client and server to serve incoming RPC requests.
In traditional gRPC usage, a single connection is established between a client and a server. This persistent, sticky connection lives through the entire lifetime of the client and server, and all RPCs are served over it, leveraging the power of multiplexing.
This does not seem to be of any concern until you zoom out and look at the potential problems that may arise in production at scale. The problem with the native way of using a gRPC connection is explained in the blog [here](TODO write medium article ARPIT).
This library is a custom implementation of gRPC connection pooling to overcome the problems that arise from a single sticky gRPC connection.
```
go get -u github.com/arpit006/go-grpc-conn-pool
```
It mimics most of the features of connection pooling, which are not offered by a traditional sticky, persistent, long-lived gRPC connection.
This offers a connection pool on top of the traditional single connection, so that multiple RPCs are served over multiple gRPC connections. A single gRPC connection means one connection between a client and one server instance; maintaining a pool ensures that each client instance is connected to multiple server instances.
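As an illustration of the idea (the library's internals may differ), a fixed-size pool can simply hold several independent connections; `ConnPool` and `NewConnPool` below are hypothetical names, not this library's API:

```go
import (
	"sync/atomic"

	"google.golang.org/grpc"
)

// ConnPool is a hypothetical fixed-size pool of independent gRPC connections.
type ConnPool struct {
	conns []*grpc.ClientConn
	next  atomic.Uint64 // round-robin cursor
}

func NewConnPool(target string, size int, opts ...grpc.DialOption) (*ConnPool, error) {
	p := &ConnPool{conns: make([]*grpc.ClientConn, 0, size)}
	for i := 0; i < size; i++ {
		// Each dial opens its own TCP/HTTP2 connection; with a load-balanced
		// target these typically terminate on different server instances.
		cc, err := grpc.Dial(target, opts...)
		if err != nil {
			return nil, err
		}
		p.conns = append(p.conns, cc)
	}
	return p, nil
}
```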
All the connections have state and lifetime management: each connection goes through state transitions and also supports a deadline, which helps classify the connection as healthy or unhealthy and mimics a properly managed connection's behaviour (a state-check sketch follows the table below).
Supported connection states and transitions:
From/To | CONNECTING | READY | TRANSIENT_FAILURE | IDLE | SHUTDOWN |
---|---|---|---|---|---|
CONNECTING | Incremental progress during connection establishment | All steps needed to establish a connection succeeded | Any failure in any of the steps needed to establish connection | No RPC activity on channel for IDLE_TIMEOUT | Shutdown triggered by application. |
READY | | Incremental successful communication on established channel. | Any failure encountered while expecting successful communication on established channel. | No RPC activity on channel for IDLE_TIMEOUT OR upon receiving a GOAWAY while there are no pending RPCs. | Shutdown triggered by application. |
TRANSIENT_FAILURE | Wait time required to implement (exponential) backoff is over. | | | | Shutdown triggered by application. |
IDLE | Any new RPC activity on the channel | | | | Shutdown triggered by application. |
SHUTDOWN | | | | | |
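These are grpc-go's own connectivity states. As an illustration, a health predicate can be built on `(*grpc.ClientConn).GetState()`; this is a sketch of the idea, not the library's internal check:

```go
import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// isUsable sketches how connection states map to healthy/unhealthy.
func isUsable(cc *grpc.ClientConn) bool {
	switch cc.GetState() {
	case connectivity.Ready:
		return true
	case connectivity.Idle:
		cc.Connect() // kick the channel out of IDLE; it moves to CONNECTING
		return true
	case connectivity.Connecting:
		return true // a pending dial may still succeed
	default: // TRANSIENT_FAILURE or SHUTDOWN
		return false
	}
}
```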
All the connections in the pool support a connection max lifetime, which means a connection is only considered healthy until its max lifetime (plus an added deviation value) has elapsed. After that, the connection is treated as unhealthy.
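For illustration, the healthy window could be computed along these lines (`deadlineFor` and `healthy` are hypothetical helper names, not the library's API):

```go
import (
	"math/rand"
	"time"
)

// deadlineFor derives a per-connection deadline: max lifetime plus a random
// deviation, so pooled connections do not all expire at the same moment.
func deadlineFor(created time.Time, maxLifetime, stdDeviation time.Duration) time.Time {
	jitter := time.Duration(rand.Int63n(int64(stdDeviation) + 1))
	return created.Add(maxLifetime + jitter)
}

// healthy reports whether the connection is still within its deadline.
func healthy(deadline time.Time) bool {
	return time.Now().Before(deadline)
}
```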
There is also support for a request timeout, which is applied at the per-RPC level.
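In gRPC, a per-RPC timeout is simply a context deadline, so the behaviour is equivalent to something like the following (the stub call is illustrative):

```go
import (
	"context"
	"time"
)

// Each RPC gets its own context deadline derived from the request timeout.
func callWithTimeout(requestTimeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel()
	// resp, err := client.Process(ctx, req) // fails with DeadlineExceeded after the timeout
	_ = ctx
}
```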
A gRPC connection is sticky and persistent by nature: once a connection is established, it exists for the entire lifetime of the client and server.
This means a traditional gRPC connection exists between 1 client and 1 server, which can lead to traffic skew on the server side, with all the traffic being sent to one server instance.
With connect and disconnect support, the client periodically connects to a different server, ensuring that traffic is not always sent to only one server instance.
A small duration value is added to the connection lifetime to ensure that not all connections exceed their deadline at the same time. This helps scatter the connections' max lifetimes.
Each connection carries a deadline, which ensures that the connection is not used after the deadline has expired. All connections that have exceeded their deadline are treated as unhealthy and are refreshed periodically. A refresh of a connection means a fresh gRPC dial, which switches the client-server connection to another server instance.
You can create a gRPC client the same way as in a native gRPC implementation, with all the configurable options, plus the additional benefits:
```go
import (
	"context"
	"log"
	"time"

	v2 "github.com/arpit006/go-grpc-conn-pool/pkg/grpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func initializeClientConnection() {
	clientConfig := v2.
		ClientConfigBuilder().
		WithName("grpc-test").
		WithTarget(":9003").
		WithPoolSize(3).
		WithConnMaxLifetime(2 * time.Minute).
		WithStdDeviation(10 * time.Second).
		Build()
	// conn is a connection pool object internally
	conn, err := v2.NewClient(clientConfig, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to create grpc client: %v", err)
	}
	// protos is the generated stub for your service; the request type is illustrative
	client := protos.NewServerClient(conn)
	response, err := client.Process(context.Background(), &protos.Request{})
	log.Println(response, err)
}
```
Native gRPC dial options can be passed through as well, for example a blocking dial:

```go
import (
	"context"
	"log"
	"time"

	v2 "github.com/arpit006/go-grpc-conn-pool/pkg/grpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func initializeClientConnection() {
	clientConfig := v2.
		ClientConfigBuilder().
		WithName("grpc-test").
		WithTarget(":9003").
		WithPoolSize(3).
		WithConnMaxLifetime(2 * time.Minute).
		WithStdDeviation(10 * time.Second).
		Build()
	// conn is a connection pool object internally
	conn, err := v2.NewClient(clientConfig, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock())
	if err != nil {
		log.Fatalf("failed to create grpc client: %v", err)
	}
	// protos is the generated stub for your service; the request type is illustrative
	client := protos.NewServerClient(conn)
	response, err := client.Process(context.Background(), &protos.Request{})
	log.Println(response, err)
}
```
Keepalive parameters work the same way:

```go
import (
	"context"
	"log"
	"time"

	v2 "github.com/arpit006/go-grpc-conn-pool/pkg/grpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func initializeClientConnection() {
	kacp := keepalive.ClientParameters{
		Time:                10 * time.Second, // send pings every 10 seconds if there is no activity
		Timeout:             time.Second,      // wait 1 second for ping ack before considering the connection dead
		PermitWithoutStream: true,             // send pings even without active streams
	}
	clientConfig := v2.
		ClientConfigBuilder().
		WithName("grpc-test").
		WithTarget(":9003").
		WithPoolSize(3).
		WithConnMaxLifetime(2 * time.Minute).
		WithStdDeviation(10 * time.Second).
		Build()
	// conn is a connection pool object internally
	conn, err := v2.NewClient(clientConfig, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithKeepaliveParams(kacp))
	if err != nil {
		log.Fatalf("failed to create grpc client: %v", err)
	}
	// protos is the generated stub for your service; the request type is illustrative
	client := protos.NewServerClient(conn)
	response, err := client.Process(context.Background(), &protos.Request{})
	log.Println(response, err)
}
```
- Name: The name of the client.
- Target: The server address along with the port number.
- Pool Size: The number of connections in the connection pool per client.
- Connection Max Lifetime: The max lifetime of a gRPC connection.
- Standard Deviation: The deviation value of the lifetime among all the connections in the pool.
- Request Timeout: The timeout value for an RPC request.
This library has been tested against a high-throughput system at scale, including inter-region calls. It is highly performant, with no added overhead on gRPC calls from the library itself.
It has been written with the consideration that the library should only provide additional performance benefits without compromising any of the native gRPC capabilities.
We have benchmarked the original library at 300K RPS throughput, including inter-region calls.
With a native gRPC connection, where only one connection exists between a client and a server, all the load from a particular client instance is forwarded to just one server instance. By increasing the number of connections using a pool at the client level, a single client instance now makes multiple connections to different server instances, and all the throughput generated by the client is distributed across multiple server instances.
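Continuing the hypothetical `ConnPool` sketch from above, a simple round-robin pick is enough to rotate RPCs across all pooled connections and the server instances behind them:

```go
// Pick returns the next connection in round-robin order, so successive
// RPCs spread across all pooled connections.
func (p *ConnPool) Pick() *grpc.ClientConn {
	n := p.next.Add(1) - 1
	return p.conns[n%uint64(len(p.conns))]
}
```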
When a connection is created, it is given the connection max lifetime along with a deviation value, which adds a random factor of deviation to the max lifetime. This scatters the connection lifetimes over a period of time rather than having them all equal a single value (the connection max lifetime).
All the connections in the pool have state management support along with a deadline. Whenever an RPC needs to be served, a connection is fetched from the pool and checked for health. The connection is treated as unhealthy if it
- is in an unhealthy state, or
- has exceeded its deadline,

in which case the RPC request is served by another active connection. An unhealthy connection is picked up for refresh by a background scheduled job that runs every 30 seconds, and is replaced with a new active connection.
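A sketch of that refresh loop, reusing the hypothetical `ConnPool` and `isUsable` helpers from the earlier sketches (a real implementation would also need synchronization between the picker and the refresher):

```go
import (
	"time"

	"google.golang.org/grpc"
)

// refreshLoop periodically replaces unhealthy connections with fresh dials.
func (p *ConnPool) refreshLoop(target string, opts ...grpc.DialOption) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		for i, cc := range p.conns {
			if isUsable(cc) { // state/deadline check sketched earlier
				continue
			}
			// A fresh dial may land on a different server instance.
			fresh, err := grpc.Dial(target, opts...)
			if err != nil {
				continue // keep the old conn until a replacement succeeds
			}
			p.conns[i] = fresh
			_ = cc.Close()
		}
	}
}
```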