Add hash index to speed up searching #101

LianYangCn · 2021-09-09T02:51:55Z

This patch adds a hash index that speeds up key lookup from O(n) to O(1).
Because the implementation is hash-based, it needs extra memory to store the index data:
each key-value entry takes 4 bytes.
For more details, please refer to "Demos/Linux".

_fdb_kv_load() already updates the cache, which I did not notice at first. This commit removes the duplicated hash-update code.

LianYangCn · 2021-09-09T04:41:43Z

Experts, please come have a look and pick this apart.

armink · 2021-09-10T09:23:57Z

Thanks for your PR. I have been a bit busy these past two days, so I will look at it a little later.

armink · 2021-09-12T09:03:23Z

src/fdb_kvdb.c

+    }
+
+    {
+#endif


To keep it simple: take name_crc modulo FDB_KV_CACHE_TABLE_SIZE and use the result as the hash index. How well would that work?
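A minimal sketch of the modulo idea, for illustration only. `TABLE_SIZE` stands in for `FDB_KV_CACHE_TABLE_SIZE` with an assumed value, and `crc32_simple` and `hash_slot` are hypothetical helpers, not FlashDB functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 16u  /* stand-in for FDB_KV_CACHE_TABLE_SIZE (assumed value) */

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); hypothetical helper. */
static uint32_t crc32_simple(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, otherwise zero */
            uint32_t mask = (uint32_t)0 - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Map a key name to a slot: name_crc modulo the table size. */
static size_t hash_slot(const char *name)
{
    assert(name != NULL);
    uint32_t name_crc = crc32_simple(name, strlen(name));
    return (size_t)(name_crc % TABLE_SIZE);
}
```

Two different names can land in the same slot, so a lookup built on this would still have to compare the stored name before trusting the slot.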

I am not sure I fully understand what you mean. Here is my thinking.

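By "name_crc modulo FDB_KV_CACHE_TABLE_SIZE as the hash index", do you mean using CRC32 directly as the string hash algorithm for the hash index? If so, that certainly works, and the results should be decent too. You can refer to this.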

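As for the string hash algorithm, whether it is CRC32, CRC32 + Robert Jenkins (the method used in this PR), or some other hash, I do not think it is a big issue, since it can be swapped at any time. Reducing collisions is definitely the first goal, though. Frankly, CRC32 + Robert Jenkins is a bit slow!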

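The original CACHE mechanism is kept here because the size of the hash index space and the string hash algorithm together determine the collision probability. When a collision does happen, there are two common remedies. One is to enlarge the hash index space and then re-hash. The other is to keep the hash index space fixed and fall back to a second, reliable method (one that does not depend on probability and cannot collide). This is where the original CACHE comes in: it can resolve the probabilistic collisions introduced by the hash index, which greatly improves reliability.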

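So why not re-hash? Re-hashing means the memory used by the hash index changes dynamically, and on embedded devices dynamically changing memory is hard to manage. That is why I decided against re-hashing to resolve collisions.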
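The fixed-size design described above can be sketched roughly as follows. This is not the PR's code: `toy_flash` is a toy array standing in for flash, `toy_hash` is FNV-1a swapped in for brevity instead of CRC32 + Robert Jenkins, and all names and sizes are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INDEX_SLOTS    8u           /* fixed at compile time, never resized */
#define FALLBACK_SLOTS 4u           /* small exact-match cache for collisions */
#define ADDR_EMPTY     0xFFFFFFFFu  /* marks an unused slot */

/* Toy stand-in for flash: the "address" of a KV is its position here. */
static const char *toy_flash[] = { "ssid", "passwd", "boot_count", "tz" };

static uint32_t hash_index[INDEX_SLOTS];  /* 4 bytes per entry */
static struct { uint32_t hash; uint32_t addr; } fallback[FALLBACK_SLOTS];

/* Toy string hash (FNV-1a); any string hash could be swapped in. */
static uint32_t toy_hash(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

static void index_reset(void)
{
    for (uint32_t i = 0; i < INDEX_SLOTS; i++) hash_index[i] = ADDR_EMPTY;
    for (uint32_t i = 0; i < FALLBACK_SLOTS; i++) fallback[i].addr = ADDR_EMPTY;
}

/* Record a KV. A collision goes to the fallback cache instead of re-hashing. */
static bool index_add(uint32_t addr)
{
    assert(addr < sizeof(toy_flash) / sizeof(toy_flash[0]));
    uint32_t h = toy_hash(toy_flash[addr]);
    uint32_t *slot = &hash_index[h % INDEX_SLOTS];
    if (*slot == ADDR_EMPTY) { *slot = addr; return true; }
    for (uint32_t i = 0; i < FALLBACK_SLOTS; i++) {
        if (fallback[i].addr == ADDR_EMPTY) {
            fallback[i].hash = h;
            fallback[i].addr = addr;
            return true;
        }
    }
    return false;  /* both structures are full: the caller must scan flash */
}

/* Find a key: check the index slot first, then the fallback cache. */
static uint32_t index_find(const char *name)
{
    uint32_t h = toy_hash(name);
    uint32_t addr = hash_index[h % INDEX_SLOTS];
    if (addr != ADDR_EMPTY && strcmp(toy_flash[addr], name) == 0) return addr;
    for (uint32_t i = 0; i < FALLBACK_SLOTS; i++) {
        addr = fallback[i].addr;
        if (addr != ADDR_EMPTY && fallback[i].hash == h &&
            strcmp(toy_flash[addr], name) == 0) return addr;
    }
    return ADDR_EMPTY;
}
```

Both tables are statically sized, so memory use is known at build time. The cost is that `index_add` can fail once the fallback is full, at which point a full scan is the only remaining option.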

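If hash index + CACHE can reach close to 100% accuracy (the double guarantee described above), then when updating or adding a key-value pair we can trust this mechanism directly: if a key is found in neither the hash index nor the CACHE, it is not in FlashDB. For an add operation this also makes writes much faster, because the current approach, when the key is not in the CACHE, still has to traverse FlashDB to determine whether the key exists in the database. In my tests on ESP32, once there are around 2000 records, adding one new record already takes more than 1 second. Every add has to search the previous 2000 or more records. This backtracking is brutal: wasteful and power-hungry.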

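Hash index and CACHE are two completely different indexing mechanisms. CACHE is simpler and more reliable, and suits small data sets (roughly fewer than 1000 records). Hash index is somewhat more complex and uses more memory (4 bytes per record; total reserved memory = estimated record count × 1.25 × 4, so 5000 records should reserve about 24.414 KB). Backed by CACHE, this becomes a proper database. Keep in mind that the first version of SQLite also used a hash index and only later switched to a B-tree.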
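The sizing rule above can be checked with a small helper. `hash_index_bytes` and `BYTES_PER_ENTRY` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BYTES_PER_ENTRY 4u  /* per the PR description */

/* reserved bytes = estimated entries x 1.25 x 4, in integer math */
static size_t hash_index_bytes(size_t estimated_entries)
{
    assert(estimated_entries <= SIZE_MAX / 5u);  /* guard against overflow */
    /* 25% headroom; the division rounds down for counts not divisible by 4 */
    size_t slots = estimated_entries + estimated_entries / 4u;
    return slots * BYTES_PER_ENTRY;
}
```

For 5000 records this gives 6250 slots and 25000 bytes, and 25000 / 1024 ≈ 24.414, which matches the figure quoted above.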

Using name_crc modulo is mainly about keeping the change small and preserving FlashDB's lightweight character. Of course, collision handling is weaker this way.

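FlashDB's original cache has an LRU character: the more frequently an entry is used, the longer it is kept. This also reduces resource usage, since not everything has to be cached.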
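To illustrate the LRU behaviour mentioned here, the following is a generic sketch of a tiny least-recently-used cache. It is not FlashDB's actual cache code; the struct layout, the logical clock `tick`, and the function names are all assumptions for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 2u          /* tiny on purpose, to make eviction visible */
#define ADDR_NONE   0xFFFFFFFFu

struct cache_entry {
    uint32_t name_crc;   /* hash of the key name */
    uint32_t addr;       /* where the KV lives */
    uint32_t last_used;  /* logical timestamp of the last hit or insert */
    bool     in_use;
};

static struct cache_entry cache[CACHE_SLOTS];
static uint32_t tick;    /* logical clock; wraparound is ignored in this sketch */

/* Return the cached address, refreshing the entry's timestamp on a hit. */
static uint32_t cache_get(uint32_t name_crc)
{
    for (uint32_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].in_use && cache[i].name_crc == name_crc) {
            cache[i].last_used = ++tick;
            return cache[i].addr;
        }
    }
    return ADDR_NONE;
}

/* Insert or update; when full, overwrite the least recently used entry. */
static void cache_put(uint32_t name_crc, uint32_t addr)
{
    assert(addr != ADDR_NONE);
    uint32_t victim = 0;
    bool found = false;
    /* First pass: reuse the slot that already holds this key, if any. */
    for (uint32_t i = 0; i < CACHE_SLOTS && !found; i++) {
        if (cache[i].in_use && cache[i].name_crc == name_crc) {
            victim = i;
            found = true;
        }
    }
    /* Second pass: otherwise take the first free slot, else the oldest one. */
    if (!found) {
        for (uint32_t i = 0; i < CACHE_SLOTS; i++) {
            if (!cache[i].in_use) { victim = i; break; }
            if (cache[i].last_used < cache[victim].last_used) victim = i;
        }
    }
    cache[victim] = (struct cache_entry){ name_crc, addr, ++tick, true };
}
```

With two slots, inserting keys 1 and 2, reading key 1, and then inserting key 3 evicts key 2, because key 1 was touched more recently.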

Also, let me summarize the current problems. Tell me whether this is right:

1. During initialization, if FlashDB holds many KVs, initialization takes too long.

2. When the number of KVs exceeds 1K, a query takes more than 500 ms, which is a poor experience.

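I see what you mean. My main consideration is that if CACHE is already stable, it should not be touched. CACHE's LRU is also a distinctive feature and a very good one, and combined with hash index it gives better performance.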

Here is what I am seeing in my current tests on ESP32:

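1. "During initialization, if FlashDB holds many KVs, initialization takes too long." This has been confirmed to be caused by the hash being too slow. It is already fixed, but the fix is not in this PR. In #99 you mentioned murmur; that works too, both are good algorithms.
2. "When the number of KVs exceeds 1K, a query takes more than 500 ms, which is a poor experience." This has not occurred since the hash index was introduced. O(1) really delivers.
3. Where the PR is still incomplete is that it does not handle collisions, so adding a new entry is still extremely slow. Right now it is more like a hash cache. Only once this missing leg is filled in will it be a real hash index.
4. As you said in #99, splitting cache + hash index out on its own is a good idea. I started out that way too, but it got a bit troublesome, so I took a shortcut, haha. More important, I think, is to make this a separate branch and bring this PR into the "official" mainline, so that continuing the work is worthwhile; otherwise it would all be wasted effort. If it were only for my own use, the code would be even simpler than this: I would size everything exactly, and there would not be so many if/else branches. For me, good enough is enough.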

LianYangCn · 2021-09-28T06:16:55Z

I am getting ready to put this into real use.

armink · 2021-09-28T06:38:13Z

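👍 I will go over the overall framework during the National Day holiday (October 1). Later on, the indexing algorithm will probably be split out into its own file.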

LianYangCn added 7 commits September 8, 2021 14:35

add hash index to speed up searching

7bbbe19

comment it

e3c7f55

Add hash index

d3a4923

Remove duplicated update code to speed up initialization

9de97c9

_fdb_kv_load() already updates the cache, which I did not notice at first. This commit removes the duplicated hash-update code.

Add test code for hash enhancement

12ea0a2

Add hash enhancement macro

5376a93

Fix wrong condition in delete detection

c2ddea0

When building the index, assume there are no duplicate keys, to speed up startup

57cfef8

armink reviewed Sep 12, 2021


LianYangCn requested a review from armink September 12, 2021 10:36

Replace the hash algorithm and add collision detection

9c26f73

armink force-pushed the master branch from b73fb75 to 37e0597 Compare April 30, 2023 03:28

armink force-pushed the master branch from 724cc81 to dab9d2b Compare May 19, 2023 13:58
