-
Notifications
You must be signed in to change notification settings - Fork 717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Range Locking - rebased on latest fb-mysql-8.0.20 #1165
base: fb-mysql-8.0.20
Are you sure you want to change the base?
Range Locking - rebased on latest fb-mysql-8.0.20 #1165
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few initial comments. I will continue to read through this.
c89e6b3
to
f8effb0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing a different behavior from InnoDB where the gap from the last key to supremum is not being acquired.
On InnoDB:
TXN 1:
create table t1 (pk int primary key) engine=innodb;
insert into t1 values (1), (10), (100);
begin;
select * from t1 for update;
TXN 2:
begin;
select * from t1;
insert into t1 values (150); # times out on lock
On MyRocks:
insert into t1 values (150); # succeeds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another difference with InnoDB:
TXN 1:
create table t1 (pk int primary key) engine=rocksdb;
insert into t1 values (1), (10), (100);
begin;
select * from t1;
TXN 2:
begin;
select * from t1;
insert into t1 values (90);
commit;
TXN 1:
select * from t1 lock in share mode; /* InnoDB returns 4 rows, but MyRocks only returns 3 rows: 90 is missing. */
Using select * from t1 for update
, MyRocks correctly returns 4 rows instead of 3.
(Cherry-picked on top of the pushed iterator patches) When it is ON, MyRocks will: - initialize RocksDB to use range-locking lock manager - for all DML operations (including SELECT .. FOR UPDATE) will lock the scanned range before reading/modifying rows. - In range locking mode, there is no snapshot checking (cannot do that for ranges). Instead, MyRocks will read and modify latest committed data, just like InnoDB does (in the code, grep for (start|end) _ignore_snapshot) - Queries that do not have a finite range to scan, like UPDATE t1 .... ORDER BY t1.key LIMIT n will use a "Locking iterator" which will read rows, lock the range, and re-read the rows. See class LockingIterator.
LockingIterator may attempt to lock the same range multiple times if another transaction is inserting records into the range it is scanning through. Make the code avoid this.
Do not decrement it in unlock_row, as there may be multiple unlock_row calls per range.
Range Locking code locks the gap after the last value now, which means for 2-row table "SELECT * FROM t FOR UPDATE" will place 3 locks.
Initializing it in ha_rocksdb::external_lock was not enough. MyRocks uses index read calls before external_lock() is called: myrocks::Rdb_iterator_base::seek() myrocks::ha_rocksdb::index_read_intern() myrocks::ha_rocksdb::index_read_intern() myrocks::ha_rocksdb::index_last() myrocks::ha_rocksdb::load_auto_incr_value_from_index() myrocks::ha_rocksdb::load_auto_incr_value() myrocks::ha_rocksdb::open() handler::ha_open()
This is a followup to: Address review input: lock the "gap" after the last row that was read Made the code work when iterator bounds are not available, in which case it locks till index start/end.
…l lock secondary keys
…meter .. not as template argument.
From ha_rocksdb::update_write_sk into ha_rocksdb::check_and_lock_sk
04b29be
to
8c6c1b8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following doesn't look right:
/* TXN 1 */
mysql> create table t1 (pk int not null auto_increment, a int, b int, primary key (pk), unique key (a)) engine=rocksdb;
Query OK, 0 rows affected (0.05 sec)
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into t1 (a, b) values (1, 1), (2, 2);
Query OK, 2 rows affected (0.01 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql> select * from information_schema.rocksdb_locks;
+------------------+----------------+--------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY | MODE |
+------------------+----------------+--------------------+------+
| 0 | 17 | 000001080180000002 | X |
| 0 | 17 | 000001080180000001 | X |
| 0 | 17 | 0000010780000002 | X |
| 0 | 17 | 0000010780000001 | X |
+------------------+----------------+--------------------+------+
4 rows in set (0.01 sec)
/* TXN 2 */
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into t1 (a, b) values (10, 10), (20, 20);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction: Timeout on index: test.t1.PRIMARY
/* TXN 1 */
mysql> select * from information_schema.rocksdb_locks;
+------------------+----------------+-------------------------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY | MODE |
+------------------+----------------+-------------------------------------+------+
| 0 | 17 | 0000010780000001-000001080180000002 | X |
+------------------+----------------+-------------------------------------+------+
1 row in set (0.01 sec)
The range lock seems to span multiple indexes.
Another example:
mysql> create table t6 (pk int not null auto_increment, a int, b int, c int, primary key (pk), unique key (a), key (b), unique key (c)) engine=rocksdb;
Query OK, 0 rows affected (0.03 sec)
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into t6 (a, b, c) values (1, 1, 1), (2, 2, 2);
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql> select * from information_schema.rocksdb_locks;
+------------------+----------------+---------------------------------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY | MODE |
+------------------+----------------+---------------------------------------------+------+
| 0 | 21 | 00000113018000000280000002 | X |
| 0 | 21 | 0000011080000001-00000113018000000180000001 | X |
+------------------+----------------+---------------------------------------------+------+
2 rows in set (0.00 sec)
mysql> select index_name, hex(index_number) from information_schema.rocksdb_ddl where table_name = 't6';
+------------+-------------------+
| index_name | hex(index_number) |
+------------+-------------------+
| PRIMARY | 110 |
| a | 111 |
| c | 112 |
| b | 113 |
+------------+-------------------+
4 rows in set (0.00 sec)
Use InnoDB-like concurrency mode: make the DML statements always read | ||
the latest data (instead of using transaction's snapshot). | ||
This "downgrades" the transaction isolation to READ-COMMITTED on the | ||
primary, but in return the actions can be replayed on the replica. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first half of the sentence makes sense, but the second half: "but in return the actions can be replayed on the replica" seems to involve replication and it's not clear how replication to secondaries is affected by this concurrency mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, it doesn't sound clear.
IIRC I had this in mind when writing this:
With statement-based replication and point locking, one can get the slave
out of sync. Using gap(or range) locking allows to avoid that. But when
one uses Gap (or range) locking, the locking reads must read the latest
committed data.
I guess this should either be stated in full, or not at all...
I can't reproduce this behavior... This is what I'm getting:
so far the same. But TXN 2 will succeed to get the lock:
..which looks correct to me |
For me, I get this:
which looks ok at first glance |
Double checking if I am using the right tree... it is https://github.com/spetrunia/mysql-5.6/commits/fb-mysql-8.0.20-range-locking-may
|
I ran my tests with the range locking change based on top of: b22013d. In your lock outputs, none of the locks look like they were escalated and merged together. What triggers that process? |
lock escalation is triggered by lock tree memory exceeding the limit. The current lock memory usage should be shown in |
Ah, my setup also inherited these settings from an existing test: -rocksdb_use_range_locking=1 --rocksdb_max_lock_memory=1024 |
- Truncate the table before starting the test. - Added n_insert_failed counter
If an attempt to insert a row or create a group fails (because we've found unsuitable data), do not count it.
It starts by doing a locking read of 5 rows in a random location. If it encounters a grouped row that's not a group leader, it should follow till the end of this group and then delete+create the next group.
Also: We are now using skewed distribution, in order for INSERTs to not become stuck, make them do deletes when they are not able to insert.
Should this one be closed because of #1185 ? |
This includes: