
[Discussion] Extending functionality of Silk and refactorings #14

Open
sakno opened this issue Jun 12, 2021 · 9 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

sakno commented Jun 12, 2021

Hi @Insvald, I would like to join your project. I see several refactorings that could be applied:

  • Membership changes. My library offers IMemberServiceDiscovery, which allows replacing configuration-based member discovery. It was designed specifically to be compatible with third-party discovery mechanisms such as Kubernetes or Consul. I think it's better to provide extensibility here in Slik.
  • The JSON format for log entries may not be the best choice for performance reasons.
  • Moving to the Interpreter Framework to decouple command handling.
  • Various optimizations for log entry serialization/deserialization to increase throughput.
  • Moving to background log compaction instead of sequential log compaction.
  • Introducing distributed locks. The locks were previously added to the .NEXT library, but then I decided to drop them, so they were never released. Now your project is an appropriate place for them, IMO.
Insvald (Owner) commented Jun 13, 2021

Hi,

I am more than glad to welcome you. Feel free to suggest ideas and improve areas which require additional work.

  1. Somehow I missed this membership discovery. Yes, using this interface looks more appropriate.
  2. Good point. Protobuf?
  3. Yes, I have thought about this; it's worth starting to use it.
  4. OK, I think we should start with some benchmarks/tests to measure our progress.
  5. Tests/benchmarks here too.
  6. Agreed, an interesting functionality.

I will add these points to the roadmap, thanks for the input!

Insvald added the "enhancement" and "help wanted" labels on Jun 13, 2021
sakno (Author) commented Jun 13, 2021

About the JSON format for log entries: protobuf is redundant here because any IRaftLogEntry inherits from IDataTransferObject, where you can serialize and deserialize data using IAsyncBinaryWriter and IAsyncBinaryReader, respectively. Both interfaces have been upgraded to support fast synchronous scenarios when possible (using IBufferWriter&lt;byte&gt; and Span&lt;byte&gt;) along with the traditional async methods.
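For illustration, a minimal sketch of such a binary log entry. The type name and payload fields (Key/Value) are hypothetical, and the writer method names follow the .NEXT 3.x API surface, so they may differ slightly between versions:

```csharp
using System;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using DotNext.IO;
using DotNext.Net.Cluster.Consensus.Raft;
using DotNext.Text;

// Hypothetical cache command serialized in binary form instead of JSON
internal sealed class SetValueLogEntry : IRaftLogEntry
{
    public long Term { get; init; }
    public DateTimeOffset Timestamp { get; } = DateTimeOffset.UtcNow;
    public bool IsSnapshot => false;
    long? IDataTransferObject.Length => null;  // unknown length is allowed
    bool IDataTransferObject.IsReusable => true;

    public string Key { get; init; } = string.Empty;
    public ReadOnlyMemory<byte> Value { get; init; }

    // IAsyncBinaryWriter may complete synchronously when the underlying
    // target is an IBufferWriter<byte>, avoiding async overhead
    public async ValueTask WriteToAsync<TWriter>(TWriter writer, CancellationToken token)
        where TWriter : IAsyncBinaryWriter
    {
        var context = new EncodingContext(Encoding.UTF8, reuseEncoder: false);
        await writer.WriteStringAsync(Key.AsMemory(), context, LengthFormat.Plain, token);
        await writer.WriteAsync(Value, LengthFormat.Plain, token);
    }
}
```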

sakno (Author) commented Jun 13, 2021

Could you please add a develop branch? I don't think it's a good idea to make PRs directly to master.

sakno (Author) commented Jun 16, 2021

@Insvald, after a deep analysis of the existing code base, I found the root cause of the code complexity: routing. A Slik server acts as a proxy when the node that accepted the request is not the leader; it tries to redirect the request to the leader. From my point of view, this routing should be handled by the cache client, not by the server nodes. The main reason is redundant traffic:

  • The client must serialize the request
  • The proxy node must deserialize the request and serialize it again for routing
  • The leader node must deserialize the request again

The same story with the response.

I think the problem comes from the chosen architecture. There are two common approaches for such caches:

  • Data grid, where the cache grid itself is represented by the cluster. In this case, the client needs to communicate with it over the network, but for performance reasons the client must keep its own in-memory copy of the cache with an LRU policy and receive updates from the grid. All writes go to the grid, while reads can be served from the local copy (when there is no cache miss). However, the local copy must receive real-time updates: the grid has to distribute notifications about updates to all connected clients. Pros: the clients are stateless. Cons: additional complexity in keeping local caches up to date via notifications.
  • In-memory distributed cache, where the clients themselves are cache nodes, i.e., members of the Raft cluster. In this situation they can also read from the local copy (if eventual consistency is acceptable, which is normal for distributed caches). Writes must be routed to the leader node. Pros: all reads can be done locally without LRU caching. Cons: the clients are stateful (because the state is stored in a persistent WAL).

The first approach allows us to use gRPC or any other duplex protocol for communication between the clients and the grid. However, it should be wrapped in a client library responsible for caching the leader's location, retry logic, communication with the leader, receiving updates from the grid, and maintaining the LRU cache.

The second approach doesn't require any special protocol; you can use the Messaging infrastructure from DotNext.Net.Cluster for communication between nodes in the cluster.
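A rough sketch of the second approach's read/write split. All types and members here are hypothetical placeholders, not an actual Slik or .NEXT API; they only illustrate the idea that each client is a Raft member reading locally and writing through the leader:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction over "append this command to the replicated log"
public interface ILeaderChannel
{
    bool IsLeaderKnown { get; }
    Task AppendToLogAsync(string key, byte[] value, CancellationToken token);
}

public sealed class CacheNode
{
    private readonly ConcurrentDictionary<string, byte[]> replica = new();
    private readonly ILeaderChannel leader;

    public CacheNode(ILeaderChannel leader) => this.leader = leader;

    // Eventually consistent read from the local replica; no LRU needed
    // because the node holds the full state replicated through the WAL
    public bool TryRead(string key, out byte[]? value) =>
        replica.TryGetValue(key, out value);

    // Writes are routed to the leader and come back via log replication
    public Task WriteAsync(string key, byte[] value, CancellationToken token) =>
        leader.IsLeaderKnown
            ? leader.AppendToLogAsync(key, value, token)
            : Task.FromException(new InvalidOperationException("no leader elected"));

    // Applied by the state machine when a committed log entry is observed
    internal void Apply(string key, byte[] value) => replica[key] = value;
}
```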

At the moment, the implementation tries to behave like a grid while, at the same time, hiding the complexity from the client using a proxy node.

sakno (Author) commented Jun 16, 2021

One more thing: it is possible to combine both approaches. The .NEXT Raft library provides so-called standby nodes. These nodes never become leaders but participate in replication. As a result, the clients can be standby nodes and remain stateless: their persistent WAL can be stored in ramfs or another volatile storage that can be dropped in case of failure, for instance when a pod is restarted in Kubernetes or a container in Docker. At the same time, the cache nodes must be stateful and participate in leader election. However, all of this requires a review of the existing architecture.
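As a sketch, a stateless client node might be expressed through the cluster configuration consumed by DotNext.AspNetCore.Cluster. The standby flag reflects .NEXT's standby-node feature described above; the timeouts, port, and member addresses are purely illustrative:

```json
{
  "lowerElectionTimeout": 150,
  "upperElectionTimeout": 300,
  "standby": true,
  "members": [
    "https://cache-node-1:3262",
    "https://cache-node-2:3262",
    "https://client-node-1:3262"
  ]
}
```

With such a configuration, the client node replicates the log (and can serve local reads) but never participates in leader election, so losing its volatile WAL on restart is harmless.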

Insvald (Owner) commented Jun 17, 2021

@sakno The idea behind this project was definitely to use the second approach: the client is a node itself. gRPC was added as an optional interface.

I am looking at this project as a basis for lightweight orchestration, hopefully in-process, without any additional standalone services/nodes. In such a scenario, writes should be relatively rare events; I am mostly concerned with reads and ease of use for the consumer.

Nevertheless, any ideas are welcome, as at the moment I'm stuck with the containerd driver.

sakno (Author) commented Jun 17, 2021

With the second approach, we need to choose one of the following:

  • Completely drop support of gRPC
  • Implement transparent redirection of gRPC requests to the leader node

The latter is possible with the routing middleware shipped with the DotNext.AspNetCore.Cluster library, as described here. AFAIK, the gRPC client doesn't support transparent redirection via the 302 HTTP status, while a REST API can.
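The middleware-based redirection could look roughly like this. This is a sketch based on the .NEXT documentation for DotNext.AspNetCore.Cluster; the "/cache" path is hypothetical and the exact extension-method names may differ between library versions:

```csharp
using DotNext.Net.Cluster.Consensus.Raft.Http;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

public sealed class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Raft over HTTP is registered elsewhere via the host builder
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseConsensusProtocolHandler()  // handle Raft messages over HTTP
           // Requests to /cache that arrive at a follower are answered
           // with a 302 redirect pointing at the current leader
           .RedirectToLeader("/cache")
           .UseRouting();
    }
}
```

A plain REST client follows the 302 automatically; a gRPC client would not, which is exactly the limitation noted above.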

Jeevananthan-23 commented Mar 5, 2023

Hi @sakno / @Insvald, I'm really curious about this project; it is a good way to explore and understand the implementation of gRPC with Raft consensus. It's sad that this project is no longer active. If @sakno / @Insvald would still like to work on it, I'd appreciate help implementing a Lucene.NET-based search engine like yelp/nrtsearch, which is built on gRPC/protobuf, along with some other performance-oriented features.

sakno (Author) commented Mar 5, 2023

@Jeevananthan-23, I don't own this project. However, you can use the .NEXT repo to ask questions.
