Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDB_PROBLEM: Unexpected problem - txn should abort #153

Open
LekoArts opened this issue Apr 6, 2022 · 28 comments
Open

MDB_PROBLEM: Unexpected problem - txn should abort #153

LekoArts opened this issue Apr 6, 2022 · 28 comments

Comments

@LekoArts
Copy link

LekoArts commented Apr 6, 2022

Hi Kris!
Every now and then a user reports this error. Or our CI or our testing itself runs into it:

MDB_PROBLEM: Unexpected problem - txn should abort

User error:

user-error

Our CI run (https://app.circleci.com/pipelines/github/gatsbyjs/gatsby/81406/workflows/46bdcdf9-85eb-444a-aab1-b92845954af3/jobs/964466) failed with it intermittently:

MDB_PROBLEM: Unexpected problem - txn should abort

  1 | query pageHomecircleciprojectstartersdefaultsrcpagesusingTypescriptTsx2907560070 {
  2 |   site {
> 3 |     buildTime(formatString: "YYYY-MM-DD hh:mm a z")
    |     ^
  4 |   }
  5 | }
  6 |

File path: /home/circleci/project/starters/default/src/pages/using-typescript.tsx
Url path: /using-typescript/
Plugin: none


  Error: MDB_PROBLEM: Unexpected problem - txn should abort
  
  - open.js:156 
    [default]/[lmdb]/open.js:156:21
  
  - write.js:762 LMDBStore.transactionSync
    [default]/[lmdb]/write.js:762:17
  
  - open.js:155 new LMDBStore
    [default]/[lmdb]/open.js:155:11
  
  - open.js:218 LMDBStore.openDB
    [default]/[lmdb]/open.js:218:6
  
  - cache-lmdb.ts:60 GatsbyCacheLmdb.getDb
    [default]/[gatsby]/src/utils/cache-lmdb.ts:60:44
  
  - cache-lmdb.ts:69 GatsbyCacheLmdb.get
    [default]/[gatsby]/src/utils/cache-lmdb.ts:69:17
  
  - date.ts:234 formatDate
    [default]/[gatsby]/src/schema/types/date.ts:234:53
  
  - date.ts:318 Object.resolve [as originalResolver]
    [default]/[gatsby]/src/schema/types/date.ts:318:20
  
  - task_queues.js:93 processTicksAndRejections
    internal/process/task_queues.js:93:5
  
  - task_queues.js:62 runNextTicks
    internal/process/task_queues.js:62:3
  
  - timers.js:434 processImmediate
    internal/timers.js:434:9
  
  - async Promise.all

I'm not too well versed with all the moving pieces yet so I'll ask my colleague to maybe shed some more light on this, but wanted to report it nevertheless so that a) I don't forget it and b) we can get behind this :)

Files from the stacktrace:

Thanks for your work!

@kriszyp
Copy link
Owner

kriszyp commented Apr 6, 2022

I don't think that I can tell what is occurring from this information alone, but I have added some additional details for the error messages when a MDB_PROBLEM occurs to better determine the cause (it is possible that is related #41, but not sure). I published this additional error information in the v2.3.0-beta3 release if you want to try that out.

@LekoArts
Copy link
Author

LekoArts commented Apr 7, 2022

Ok, thanks! I'll tell users to use this version until it's in a stable version then (when they can more reliably reproduce this).

@kriszyp
Copy link
Owner

kriszyp commented Apr 21, 2022

Have you ever seen this error again (with the more detailed error message)? Still would be interested in knowing what caused this.

@LekoArts
Copy link
Author

We had to pin lmdb (gatsbyjs/gatsby#35397) so our own tests didn't run this version yet. (cc @pieh)

@kriszyp
Copy link
Owner

kriszyp commented Apr 21, 2022

That issue (#156) has been fixed though, so you can use latest now, right?

@LekoArts
Copy link
Author

We need to update https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/schema/graphql-engine/lmdb-bundling-patch.ts first for the new binary loading -- but that's on us.

@kriszyp
Copy link
Owner

kriszyp commented Apr 21, 2022

Ok, no rush, was just curious.

@kriszyp
Copy link
Owner

kriszyp commented Apr 21, 2022

@LekoArts actually looking at the patch module you are using... would you like me to use more statically analyzable code like fs.readFileSync(require.resolve('../dict/dict.txt')) for reading the dictionary? Referencing binaries might be a bit better trickier though. But maybe could do something like listing them out:

try {
require.resolve('lmdb-win32-x64/node.napi.node');
require.resolve('lmdb-win32-x64/node.abi93.node');
require.resolve('lmdb-linux-x64/node.napi.node');
....
}catch(e){}

Or do you think that would just cause more errors/confusion with vercel (the require would be coming from a createRequire too)?

@pieh
Copy link

pieh commented Apr 21, 2022

Hey @kriszyp - about the hacky patch we are using, ideally we don't need patch like that, but I did some attempts at playing with lmdb code before in hope I could submit pull request and frankly didn't exactly like what I was doing there and that's when I went with the "hacky patch" route.

dict.txt part problem was probably the easier of the 2 (the problem there IIRC was that lmdb package setup is that entry to package could be in different "directory levels" - depending if it's commonjs or ESM):

https://github.com/DoctorEvidence/lmdb-js/blob/4d6371dc6faaa69f2a1a43d980ace0f328c89106/package.json#L22-L32
+
rollup outputting to dist

The binaries loading part also shared that, but I think there were more problems there which I just don't remember (it was a while ago).

Other than above, with [email protected], node-gyp-build-optional-packages was introduced (vs node-gyp-build) which I suspect won't work with @vercel/webpack-asset-relocator-loader:

  1. It looks for specific things (like require('node-gyp-build'))
  2. Given kriszyp/node-gyp-build@57be47b they would actually need to run this "modified" load.path function from your forked package and not just try regular node-gyp-build (at least until Add support for finding prebuilds in platform-specific packages prebuild/node-gyp-build#45 land) I think.

For [email protected] the node-gyp-build-optional-packages is what's not working anymore with our patch and this is part that we will need to adjust to be able to bump it - as @LekoArts mentioned - that's on us.

In general - non-web-frontend npm packages have varying levels of "support" for being bundled and it's hard to expect them to work out-of-the-box for that, so I just didn't want to waste your time on our niche use case for lmdb package. It's also a bit fragile in general so we do prefer to pin versions as it is safer for our users (upside is things shouldn't randomly break, downside is we will be slower on updates than using ^ version selector or similar)

@kriszyp
Copy link
Owner

kriszyp commented Apr 21, 2022

Ok, thanks for the explanation! And yeah hopefully the node-gyp-build PR merges soon. And again, no rush, was just checking.

@wardpeet
Copy link

wardpeet commented Apr 22, 2022

This is the error message I get with the 2.3.0 beta

Error message
There was an error in your GraphQL query:

MDB_PROBLEM: Unexpected problem - txn should abort:
mdb_page_unspill no parent

   1 | query DateQuery {
   2 |   dateTime: contentfulDate(contentful_id: {eq:
"38akBjGb3T1t4AjB87wQjo"}) {
   3 |     title
   4 |     date: dateTime
>  5 |     formatted: dateTime(formatString: "D.M.YYYY - hh:mm")
     |     ^
   6 |   }
   7 |   dateTimeTimezone: contentfulDate(contentful_id: {eq:
"6dZ8pK4tFWZDZPHgSC0tNS"}) {
   8 |     title
   9 |     date: dateTimeTimezone
  10 |     formatted: dateTimeTimezone(formatString: "D.M.YYYY -
 hh:mm (z)")
  11 |   }
  12 |   date: contentfulDate(contentful_id: {eq:
"5FuULz0jl0rKoKUKp2rshf"}) {
  13 |     title
  14 |     date
  15 |     formatted: date(formatString: "D.M.YYYY")

File path: /Users/ward/projects/gatsby/gatsby/e2e-tests/contentf
ul/src/pages/date.js
Url path: /date/
Plugin: none



  Error: MDB_PROBLEM: Unexpected problem - txn should abort: mdb
  _page_unspill no parent
  
  - open.js:177 
    [contentful]/[lmdb]/open.js:177:17
  
  - write.js:771 LMDBStore.transactionSync
    [contentful]/[lmdb]/write.js:771:17
  
  - open.js:176 new LMDBStore
    [contentful]/[lmdb]/open.js:176:11
  
  - open.js:240 LMDBStore.openDB
    [contentful]/[lmdb]/open.js:240:6
  
  - cache-lmdb.ts:60 GatsbyCacheLmdb.getDb
    [contentful]/[gatsby]/src/utils/cache-lmdb.ts:60:44
  
  - cache-lmdb.ts:69 GatsbyCacheLmdb.get
    [contentful]/[gatsby]/src/utils/cache-lmdb.ts:69:17
  
  - date.ts:234 formatDate
    [contentful]/[gatsby]/src/schema/types/date.ts:234:53
  
  - date.ts:318 resolve
    [contentful]/[gatsby]/src/schema/types/date.ts:318:20
  
  - task_queues:96 processTicksAndRejections
    node:internal/process/task_queues:96:5
  
  - task_queues:65 runNextTicks
    node:internal/process/task_queues:65:3
  
  - timers:437 processImmediate
    node:internal/timers:437:9
  
  - async Promise.all

This the source code we're hitting

https://github.com/gatsbyjs/gatsby/blob/9b25267009f318949705e2e1faf7af859b0b668a/packages/gatsby/src/schema/types/date.ts#L224-L259

const formatDate = async ({
  date,
  fromNow,
  difference,
  formatString,
  locale = `en`,
}: IFormatDateArgs): Promise<string | number> => {
  const normalizedDate = JSON.parse(JSON.stringify(date))
  if (formatString) {
    const cacheKey = `${normalizedDate}-${formatString}-${locale}`
    const cachedFormat = await getFormatDateCache().get(cacheKey)
    if (cachedFormat) {
      return cachedFormat as string
    }

    const result = moment
      .utc(normalizedDate, ISO_8601_FORMAT, true)
      .locale(locale)
      .format(formatString)

    await getFormatDateCache().set(cacheKey, result)

    return result
  } else if (fromNow) {
    return moment
      .utc(normalizedDate, ISO_8601_FORMAT, true)
      .locale(locale)
      .fromNow()
  } else if (difference) {
    return moment().diff(
      moment.utc(normalizedDate, ISO_8601_FORMAT, true).locale(locale),
      difference
    )
  }
  return normalizedDate
}

Let me know if this helps or not at all. We're basically checking if the element is in the cache if not we set it

@kriszyp
Copy link
Owner

kriszyp commented Apr 22, 2022

That helps, that definitely is not where I had been expecting an MDB_PROBLEM to occur, this seems to point to the unspilling operation. A little explanation of this: in LMDB, if a transaction gets too large and has more than about 500MB of "dirty" data (that hasn't been committed to disk), it starts "spilling" or writing it to disk. But if that data is also modified again in the same transaction it is "unspilled" back to memory, and this appears to be where the error is occurring.
It is not very clear how this code would trigger this, like you said, it looks just a basic get/put.
Is this part of a very large number of writes in the same event turn/transaction (writing more than 500MB in the operation)?
Are you using child transactions (childTransaction) at all?
Of course if there is a way I can run a test locally to reproduce, that would be helpful.

I am investigating if this assertion/error really is necessary and the state is detrimental. It may be possible to simply turn off of this check or make it a warning (which basically allows unspilling even if a page can't be found in the spilled list), but will do some checking of the effects. Of course the potential problem is if this allows for a bad state where invalid data is returned or corruption can take place.

Also FYI, I have not removed these more detailed messages, you can use the latest lmdb version (2.3.5 right now), and the extra error messages should still be there.

kriszyp added a commit that referenced this issue Apr 22, 2022
@wardpeet
Copy link

I'll see if I can get you a repro. We're not using childTransactions and I don't event think we're using any transactions ourselves.

@kriszyp
Copy link
Owner

kriszyp commented Apr 22, 2022

Well, everything in LMDB is in a transaction, it is just implicit/automatic if there is no explicit transaction, and usually one transaction per event turn, unless there is a lot of heavy writing where more gets batched into the current transaction.

@kriszyp
Copy link
Owner

kriszyp commented Apr 22, 2022

Ok, v2.3.6 should have the assertion switched to just being a warning (with a little more info), so the application can proceed and we can see if it really is a bad state with detrimental outcomes, or can safely be ignored.

@LekoArts
Copy link
Author

LekoArts commented May 9, 2022

gatsby@next now uses lmdb 2.3.10 so going forward we should get better error logs and will post them here 👍

@lauraturk
Copy link

@kriszyp
Copy link
Owner

kriszyp commented May 12, 2022

@lauraturk FYI, we believe this may be related or the same issue #164, which also has plenty of discussion of our efforts to track this down. Certainly if you are ever able to come up with a reproducible test case that I could run locally, we would be most grateful (I've never been able to reproduce the issue in this ticket, so most of the discussion has been theorizing about ways that it might possibly occur).

@lauraturk
Copy link

Hi @kriszyp , unfortunately, I am not able to create a reproducible test case for you to run locally. It's very random across our 5 gatsby sites. Per Gatsby support I upgraded to gatsby@next today, and only one site failed. Hopefully this log is helpful? https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/da690a16-3284-48d1-8e0f-4545d6169068/builds/cb4bfd92-c8a9-4d73-9ff6-b96b3d94ffe1/details#rawLogs

@KyleAMathews
Copy link
Contributor

KyleAMathews commented May 21, 2022 via email

@kriszyp
Copy link
Owner

kriszyp commented May 24, 2022

@lauraturk I would still love to see what logs you got for that error, if you can paste them in?

@lauraturk
Copy link

Thanks for your patience, here it is:

16:00:28 PM:
success Building HTML renderer - 28.962s

16:00:28 PM:
success Execute page configs - 0.126s

16:00:32 PM:
success Caching Webpack compilations - 0.001s

16:00:32 PM:
success Validating Rendering Engines - 4.071s

16:00:36 PM:
../dependencies/lmdb/libraries/liblmdb/mdb.c:7602: Assertion 'root > 1' failed in mdb_page_search()

16:00:36 PM:
Go here for troubleshooting tips: https://gatsby.dev/pqr-feedback

16:00:36 PM:
ERROR An error occurred during parallel query running.

16:00:36 PM:
16:00:36 PM:
    [www]/[gatsby-worker]/dist/index.js:117:45

16:00:36 PM:
  - index.js:117 ChildProcess.

16:00:36 PM:
  

16:00:36 PM:
  Error: Worker exited before finishing task

16:00:36 PM:
  - node:events:390 ChildProcess.emit

16:00:36 PM:
  - child_process:290 Process.ChildProcess._handle.onexit

16:00:36 PM:
    node:events:390:28

16:00:36 PM:
  

16:00:36 PM:
  

16:00:36 PM:
    node:internal/child_process:290:12

16:00:36 PM:
  

16:00:36 PM:
16:00:36 PM:
not finished run queries in workers - 3.943s

@kriszyp
Copy link
Owner

kriszyp commented May 25, 2022

Thanks for the logs, this is helpful. This is a db corruption error, which is basically impossible to occur in typical usage. There are only few things that I know that can lead to this:

  • Physical damage or manipulation of hard drive.
  • Copying a db file while it is use and using the copy.
  • Use noSync flag and a computer crash (I don't think gatsby doesn't do this)

It is clear from these errors that there are multiple child processes (rather than just threads). This is fine to do with LMDB, but is a helpful insight. From this comment #164 (comment) it sounds like that error might be specific cloud configuration where this occurs (GCP Kubernetes). I wonder if perhaps this configuration uses a remote/network drive/storage that isn't playing nicely memory maps (there are known problems with some remote file systems with LMDB )? I don't know enough about the gatsby infrastructure to know if this is the same platform where the error in this last message occurred?

@kriszyp
Copy link
Owner

kriszyp commented May 30, 2022

I discovered and fixed an issue with file locking (causing locks to break), that should be fixed v2.4.5, and might possibly address this issue and #164.

@LekoArts
Copy link
Author

LekoArts commented Jun 3, 2022

The CircleCI tests in gatsbyjs/gatsby#35724 seem to consistently fail on MDB_BAD_RSLOT: Invalid reuse of reader locktable slot: The reader lock pid 0, txn -1, doesn't match env pid 523 - at least when I re-trigger runs it fails very often: https://app.circleci.com/pipelines/github/gatsbyjs/gatsby/83182/workflows/29ded013-2e0a-4cb8-9837-b0d87062049c/jobs/990452

I can SSH into that pod and get you any info you need :)

kriszyp added a commit that referenced this issue Jun 4, 2022
@kriszyp
Copy link
Owner

kriszyp commented Jun 5, 2022

Sorry, this is kind of trial and error. I believe this latest error is probably because the fix to better ensure database file identification was incorrectly handling the case of initial creation of the database (was doing the stat check before it was created). Published a fix for this in v2.5.1, if you can try it out when you get a chance.

@LekoArts
Copy link
Author

I was OOO for a week, but I'll update the PR to 2.5.2 now and let you know how it goes :)

@kriszyp
Copy link
Owner

kriszyp commented Jun 13, 2022

BTW, I would also be interested in adding an integration test of gatsby using lmdb-js. If you have any pointers/suggestion of a good test that would efficiently test gatsby's usage of lmdb, would love to add that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants