Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mysql-5.6][PR] Range Locking support, MyRocks part, updated #1430

Open
wants to merge 1 commit into
base: fb-mysql-8.0.23
Choose a base branch
from

Conversation

hermanlee
Copy link
Contributor

Summary:
(Range Locking pull request,filed against fb-mysql-8.0.23)

This adds a my.cnf parameter, rocksdb_use_range_locking.

When it is ON, MyRocks will:

  • initialize RocksDB to use range-locking lock manager
  • for all DML operations (including SELECT .. FOR UPDATE) will lock the scanned range before reading/modifying rows.
  • In range locking mode, there is no snapshot checking (cannot do that for ranges). Instead, MyRocks will read and modify latest committed data, just like InnoDB does (in the code, grep for (start|end)ignore snapshot)
  • Queries that do not have a finite range to scan, like UPDATE t1 .... ORDER BY t1.key LIMIT n will use a "Locking iterator" which will read rows, lock the range, and re-read the rows. See class LockingIterator.

Pull Request resolved: #1185
GitHub Author: Sergei Petrunia [email protected]

Summary:
(Range Locking pull request,filed against fb-mysql-8.0.23)

This adds a my.cnf parameter, rocksdb_use_range_locking.

When it is ON, MyRocks will:
- initialize RocksDB to use range-locking lock manager
- for all DML operations (including SELECT .. FOR UPDATE) will lock
   the scanned range before reading/modifying rows.
- In range locking mode, there is no snapshot checking (cannot do that
  for ranges). Instead, MyRocks will read and modify latest committed
  data, just like InnoDB does (in the code, grep for (start|end)_ignore_
  snapshot)
- Queries that do not have a finite range to scan, like
    UPDATE t1 .... ORDER BY t1.key LIMIT n
  will use a  "Locking iterator" which will read rows, lock the range,
  and re-read the rows. See class LockingIterator.

Pull Request resolved: facebook#1185
GitHub Author: Sergei Petrunia <[email protected]>
Copy link
Contributor Author

@hermanlee hermanlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding Manuel's comments.

@@ -266,6 +269,8 @@ class ha_rocksdb : public my_core::handler, public blob_buffer {
/* Type of locking to apply to rows */
Rdb_lock_type m_lock_rows;

bool m_use_range_locking;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me like this belongs more on the Rdb_transaction object than on ha_rocksdb (whereas m_lock_rows may be different per table in the same transaction).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me like this belongs more on the Rdb_transaction object than on ha_rocksdb (whereas m_lock_rows may be different per table in the same transaction).

My current best understanding is 50-50 re. the best home for this field. In its current form, its value is decided by ha_rocksdb::m_lock_rows too. If it's changed to be independent (as it already half-way is in the checks), then it can be stored in Rdb_transaction, but then some of the code paths currently avoid looking at the transaction object at all, so they would get extra get_or_create_tx calls.

Note that in a multi-statement transaction, the snapshot may have been
allocated by another statement.
*/
bool m_stmt_ignores_snapshot = false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this works well with TTL tables, since it relies on rocksdb tracking oldest snapshot timestamp for us so that old transactions won't see rows expiring while they are still in progress. We won't get this if we don't create rocksdb snapshots though.

rocksdb::Status lock_singlepoint_range(rocksdb::ColumnFamilyHandle *const cf,
const rocksdb::Slice &point) {
// Normally, one needs to "flip" the endpoint type for reverse-ordered CFs.
// But here we are locking just one point so this is not necessary.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this comment is necessarily true. For rev cfs, the locking iterator will always lock adjacent keys using inf_suffix=true. However, locking using inf_suffix=false here would mean that the locking iterator would not block modifications from happening.

For example, if the locking iterator iterates from 6->8 and then it locks the range (8, inf_suffix=true) to (6, inf_suffix=true). A second query could delete 8 since it would use the point lock (8, inf_suffix=false) which falls outside of the locked range.

The following shows this is possible:

CONNECTION 1:
mysql> show create table t;
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                                      |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t     | CREATE TABLE `t` (
  `i` int NOT NULL,
  `j` int NOT NULL,
  PRIMARY KEY (`i`) COMMENT 'rev:cf',
  KEY `j` (`j`) COMMENT 'cf'
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> begin; select * from t force index (PRIMARY) where i > 5 order by i asc limit 3 for update ;
Query OK, 0 rows affected (0.00 sec)
+---+---+
| i | j |
+---+---+
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
+---+---+
3 rows in set (2.60 sec)
mysql> select * From information_schema.rocksdb_locks;
+------------------+-----------------+---------------------------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID  | KEY                                   | MODE |
+------------------+-----------------+---------------------------------------+------+
|                3 | 140534119312896 | 0000010780000008:1-0000010780000006:1 | X    |
+------------------+-----------------+---------------------------------------+------+
1 row in set (0.00 sec)
CONNECTION 2:
mysql> delete from t where i = 8;
Query OK, 1 row affected (6.58 sec)
CONNECTION 1:
mysql> select * from t force index (PRIMARY) where i > 5 order by i asc limit 3 for update ;
+---+---+
| i | j |
+---+---+
| 6 | 6 |
| 7 | 7 |
| 9 | 9 |
+---+---+
3 rows in set (3.21 sec)

Incidentally, RangeLockManagerBase::TryLock also has the same behaviour of always setting inf_suffix=false and it's possible that also introduces similar bugs (This is called when we call get_for_update, I wonder if we should call get_for_update with nullptr as value instead of adding a lock_singlepoint_range function here).

@@ -5516,7 +5686,7 @@ class Rdb_snapshot_status : public Rdb_tx_list_walker {

/* Calculate the duration the snapshot has existed */
int64_t snapshot_timestamp = tx->m_snapshot_timestamp;
if (snapshot_timestamp != 0) {
if (snapshot_timestamp != 0 && tx->has_snapshot()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this information still seems useful even if there's no actual snapshot on the transaction. It might be useful to also indicate whether this transaction has an active snapshot too.
I guess some of this information is in rocksdb_trx, but dogpiles only has show engine status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this information still seems useful even if there's no actual snapshot on the transaction.

Are there any constraints on the output should look like in the case of no snapshot, if tooling parses it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline

@@ -6837,6 +7050,28 @@ static int rocksdb_init_internal(void *const p) {
tx_db_options.write_policy =
static_cast<rocksdb::TxnDBWritePolicy>(rocksdb_write_policy);

if (rocksdb_use_range_locking && rocksdb_use_range_lock_manager_as_point) {
// rdb_log_status_error(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it useful to keep this commented out?


iter.set_use_locking() // sets m_iter_should_use_locking=true
iter.seek() // this will re-create m_scan_it to be the right kind
// of iterator
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me whether this workflow is ever useful where we change locking mode between seeks. It seems like the code would be simpler if we just specified if we want a locking iterator or not at construction time, and we wouldn't have to track whether we need to recreate the iterator or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this would result in cleaner iterator code, but it would require postponing the creation of iterator from ha_rocksdb::index_init right to the point of attempting to take the range lock, and at that point the iterator is already in use. I will look some more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks possible. It might result in a separate iterator reorg PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reimplemented this without much refactoring, essentially merged reset and set_use_locking.

Rdb_transaction *tx = get_tx_from_thd(m_thd);
return rdb_tx_set_status_error(tx, s, *kd, m_tbl_def);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently have most of our error handling code for iterators in is_valid_iterator. Would be nice to keep it in one place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions are a bit different:

  • is_valid_iterator: works on a RDB-level iterator, has no THD or MyRocks transaction
  • iter_status_to_retval: works on a MyRocks-level iterator, has THD, transaction, key definition.

Almost all uses follow the pattern of calling is_valid_iterator first and then iter_status_to_retval. But there is also calculate_cardinality_table_scan, which works with an RDB-level iterator directly.

Unless we find a different way to address that last use, the best I can think of is add a call to is_valid_iterator to iter_status_to_retval, perhaps with a rename, so that it's more natural to use in iteration loops.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lock the range between [target, (current m_iter position)] and position
the iterator on the first record in it.

@param call_next false means current iterator position is achieved by
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have ha_rocksdb::index_next_with_direction_intern already with skip_next that does something similar so I wonder if we should stick to one naming convention.

DEBUG_SYNC(my_core::thd_get_current_thd(), "rocksdb.locking_iter_scan");

if (my_core::thd_killed(current_thd)) {
m_status = rocksdb::Status::Aborted();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little strange that we end up returning HA_ERR_ROCKSDB_STATUS_ABORTED instead of the usual HA_ERR_QUERY_INTERRUPTED here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a straightforward way to do the regular THD kill flag handling here, will need to revisit.

- lock the range [-inf; $first_key]
- return, the iterator is positioned on $first_key

The problem here is that we cannot have "-infinity" bound.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the comparator functions, I would assume that "-infinity" would be Endpoint(slice="", inf_suffix=false), though I agree that this function is basically unused for us. Might be worth clarifying what the comment means.

@laurynas-biveinis
Copy link
Contributor

Continued in #1449

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants