Nick Ruest edited this page Jan 27, 2016 · 10 revisions

Time/Place

This meeting is a hybrid teleconference and IRC chat. Anyone is welcome to join. Here is the info:

Attendees

  • Mark Cooper
  • Jared Whiklo
  • Diego Pino
  • Adam Soroka
  • Nick Ruest
  • Melissa Anez 🌟
  • Aaron Coburn
  • Danny Lamb
  • Nigel Banks

Agenda

irc log

  1. Transactions
      • Isolation
      • Is snapshot isolation sufficient?
      • A. Soroka might join us to flesh things out more conversation-wise
  2. Sprint updates
  3. Ansible update
  4. Fedora 4 IG Terms of Reference
  5. ... (feel free to add agenda items)

Minutes

  1. Transactions. Has come up on the Fedora tech call over the past few weeks. Working through a CLAW use case for it. Jared and Nick brought it to the Fedora Tech call last week. Adam Soroka is joining here to talk through the details.
  2. Consistency:
**Diego**: When we talk about consistency, are we talking about data or state, for example when something fails during a transaction? Doesn't think we should leave data to Fedora. Wants our state to stay the same if something fails. For example, Solr uses ACID transactions. When you put something bad in Solr, the previous index stays consistent. You can add new data.

**Adam**: This isn't what we mean by consistency in ACID. Maybe it's robustness? When we talk about adding something bad, what does that mean?

**Jared**: The key is that it's bad in the sense that the transaction cannot complete. We have data in Fedora. Whatever it is, if you open a transaction and try to take actions, and it doesn't take - Diego wants it to roll back. 

**Adam**: [Atomicity](https://en.wikipedia.org/wiki/Atomicity_(database_systems)) is the notion that either something works entirely or fails entirely. That's not consistency. Fedora has atomicity and will continue to do so. But it does not have data consistency in the API. If you offer data, triples that make no sense, Fedora is fine with that, and that's how it should work. 
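To make the distinction concrete, here is a minimal sketch of atomicity without data validation. It uses a hypothetical in-memory store (`TinyRepo`, `put`, and `fail` are illustrative names, not part of the Fedora API): a batch of operations either lands entirely or not at all, and the store never checks whether the triples make sense.

```python
import copy


class TinyRepo:
    """Hypothetical in-memory store for illustration; not the Fedora API."""

    def __init__(self):
        self.resources = {}

    def apply_atomically(self, operations):
        """Apply every operation, or none of them if any one raises."""
        working = copy.deepcopy(self.resources)  # stage changes on a private copy
        for op in operations:
            op(working)  # an exception here leaves self.resources untouched
        self.resources = working  # publish all staged changes in one step


def put(path, triples):
    """Build an operation that stores triples at path, with no validation."""
    def op(state):
        state[path] = set(triples)
    return op


def fail(state):
    """Stand-in for any action that cannot complete."""
    raise RuntimeError("operation could not complete")


repo = TinyRepo()
# Nonsensical triples are accepted: atomicity says nothing about validity.
repo.apply_atomically([put("/a", {("a", "title", "???")})])

try:
    # The second operation fails, so the write to /b must not survive.
    repo.apply_atomically([put("/b", {("b", "title", "B")}), fail])
except RuntimeError:
    pass
```

After the failed batch, `repo.resources` still holds only `/a`, which is the rollback behaviour Diego and Jared describe above.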

**Jared**: How about if we specify what we want in the use case as part of atomicity, to ensure that it is enshrined in our use case for Fedora?

**Adam**: Consistency would mean "a new and valid" state of data is sought. Which isn't what we want because Fedora doesn't know or care if your data is "valid".

**Diego**: Does this mean that we could break LDP? 

**Adam**: You should not be able to break LDP. As an example, something without a container, or that doesn't know its container... you should not be able, via the API, to move a repo into that kind of state. If there's some key piece of data that Islandora needs, then Fedora won't enforce it. But we're looking for use cases; if Islandora comes back and says that behaviour is absolutely necessary, then that's a conversation we'll have to have.

**Aaron**: There is a lot of talk about the notion of validity. It might come in at a higher layer than Fedora; even if Fedora doesn't enforce consistency, that does not preclude some other layer from doing so.

**Nigel**: Is the LDP stuff implemented at another layer?

**Adam**: No, it's in the core API and almost certainly always will be. 

**Danny**: When we are talking about consistency, I don't think of invalid states (assumes the application layer will do that), I think of consistency across replicas as we expand and scale. Is there a word for that? 

**Adam**: Not in the Fedora conversation right now. Most are talking about network consistency. One way to distinguish is that when we talk about transactions, ACID, etc., we are not talking about the implementation. We have no idea if it's clustered or not. So the kind of consistency Danny is talking about is outside of the discussion because we're not aware of that part of the implementation, although it is a very important topic on its own.

**Aaron**: Would add to that: Danny's consistency is from the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), so there's an unfortunate double use of the term "consistency" in the sense of [ACID](https://en.wikipedia.org/wiki/ACID) transactions. So if your replicas are not 'consistent', the implication is that you may have transactions that appear inter-meshed instead of synchronized. It touches more on [isolation](https://en.wikipedia.org/wiki/Isolation_(database_systems)) than consistency. 

**Danny**: Agreed on all that, including punting the topic. We're talking about what an implementation must provide. For our use case, it looks like what we need is atomicity. Individual applications should provide the consistency that we're talking about today. So what about isolation?

**Adam**: In the Fedora context, the kind of isolation that is available is snapshot isolation. That's not the only or most common meaning for that term. Snapshot is fairly powerful, but we could also talk about serializability. My sense is that snapshot isolation is fine for what most people need. It's very hard to get into details without getting into implementation details.

**Aaron**: Consider what we mean by a transaction and what ACID guarantees might mean for an implementation. He has been thinking about how to implement the Fedora kernel on top of [HBase](https://hbase.apache.org/) (which would be a great implementation). If you were to run it on HBase, typically you would have one index where you store userland and server-managed properties, which amounts to a distributed table holding all the mainline data. Then a separate table would hold LDP membership and LDP containment. You would save your data from a PUT/POST command, then a MapReduce job running asynchronously in the background would update that containment. You could either block until the job is done, or just let it run and assume it will be done eventually. In the meantime (and hopefully it's fast) whatever that delta is would potentially have serializable results that don't make sense in the context of snapshot isolation.

**Adam**: To put it in other words, isolation has to do with what state you are seeing from a given point in a transaction. In a highly distributed architecture, there may not be a given state as things percolate across the network. And network latency between highly distributed nodes means there's always going to be asynchrony. Lots of asynchrony. These are the reasons we would like to NOT talk about isolation in the context of the API. We don't want to give guarantees. It should be up to the implementation what kind of isolation is used. The problem there is with concurrency. If your various requests across the repo aren't using the same resources, you're fine. If two tasks attack the same resource, things could go sideways. This puts the burden back on the upper layers of the stack. So we're looking at whether that is an acceptable burden.
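As a rough illustration of snapshot isolation and of the "same resource" concurrency problem, here is a small single-process sketch. It assumes a first-committer-wins rule, which is one common way to implement snapshot isolation and is not something the call settled on; `SnapshotStore`, `Transaction`, and `ConflictError` are hypothetical names, not Fedora classes.

```python
class ConflictError(Exception):
    """Raised when another transaction committed a write to the same key first."""


class SnapshotStore:
    """Hypothetical store: each key maps to (value, number of the commit that wrote it)."""

    def __init__(self):
        self.data = {}
        self.commit_counter = 0

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.start = store.commit_counter  # commits after this point are invisible to us
        self.snapshot = {k: v for k, (v, _) in store.data.items()}
        self.writes = {}

    def read(self, key):
        # Reads see our own pending writes first, then the frozen snapshot.
        if key in self.writes:
            return self.writes[key]
        return self.snapshot.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First committer wins: fail if any key we wrote changed after we began.
        for key in self.writes:
            _, version = self.store.data.get(key, (None, 0))
            if version > self.start:
                raise ConflictError(key)
        self.store.commit_counter += 1
        for key, value in self.writes.items():
            self.store.data[key] = (value, self.store.commit_counter)


store = SnapshotStore()
setup = store.begin()
setup.write("/x", 1)
setup.write("/y", 1)
setup.commit()

t1, t2, t3 = store.begin(), store.begin(), store.begin()
t1.write("/x", 2)
t1.commit()

seen_by_t2 = t2.read("/x")  # t2 still reads from the snapshot taken when it began
t2.write("/y", 5)
t2.commit()                 # touches a different resource than t1, so it succeeds

t3.write("/x", 3)           # touches the same resource t1 already committed
try:
    t3.commit()
    t3_conflicted = False
except ConflictError:
    t3_conflicted = True
```

Here `t2` keeps seeing the old value of `/x` and commits cleanly because it writes a different resource, while `t3` is rejected. If the repository API gives no such guarantee, a check of this kind is the burden that would fall to the upper layers of the stack.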

**Danny**: Has been agreeing all along, just saying it wrong :) 

**Adam**: And just to be clear, just because we don't want to do these things in the Fedora API does not mean that there's any reason why an implementation could not do so on its own.

**Adam**: So from the CLAW use case perspective, how does this stuff fit? Is this difficult to support in the middleware? 

**Danny**: The only time it may come into play is some sort of mass ingest of closely related objects. From experience, it's really all about how you set up your ingest; essentially providing your own isolation when you do that. 

**Nigel**: Not just ingest - think of CWRCwriter where multiple people are constantly changing data. 

**Adam**: Great use case to dig into. That is exactly where isolation comes into play. If the repo does not provide it, Andrew Woods suggests Fedora provides some kind of serialization via timestamp. If someone opens a resource, makes edits, and discovers that someone has edited some completely different part of the doc during that time, it's frustrating if that would make the original commit fail.
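The frustration described here comes down to the granularity of the timestamp check. The sketch below is one possible reading of "serialization via timestamp", using a logical clock and a hypothetical `Document` class (none of these names come from Fedora or CWRCwriter): a whole-document check rejects an edit because of an unrelated change, while a per-section check lets it through.

```python
from dataclasses import dataclass, field


class StaleEdit(Exception):
    """Raised when the edited content changed after the editor opened it."""


@dataclass
class Document:
    sections: dict = field(default_factory=dict)       # section id -> text
    section_stamp: dict = field(default_factory=dict)  # section id -> time of last edit
    clock: int = 0                                      # logical time of the latest edit anywhere

    def save(self, section, text, opened_at, per_section=False):
        """Reject the save if the checked scope changed after opened_at."""
        if per_section:
            last_change = self.section_stamp.get(section, 0)
        else:
            last_change = self.clock  # any edit to any section counts
        if last_change > opened_at:
            raise StaleEdit(section)
        self.clock += 1
        self.sections[section] = text
        self.section_stamp[section] = self.clock


doc = Document()
doc.save("intro", "intro v1", opened_at=doc.clock)
doc.save("notes", "notes v1", opened_at=doc.clock)

alice_opened = doc.clock  # Alice and Bob open the document at the same time
bob_opened = doc.clock

doc.save("notes", "notes by Bob", opened_at=bob_opened)  # Bob saves first

try:
    # Whole-document check: Alice is rejected although she edited another section.
    doc.save("intro", "intro by Alice", opened_at=alice_opened)
    whole_doc_rejected = False
except StaleEdit:
    whole_doc_rejected = True

# Per-section check: the same edit succeeds because "intro" itself did not change.
doc.save("intro", "intro by Alice", opened_at=alice_opened, per_section=True)
```

Finer-grained checks reduce spurious failures but require the layer doing the check to understand the internal structure of the resource, which is part of why this might sit above the repository.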

**Diego**: We could, as we do now (Fedora 3), use locking for those use cases. 

**Danny**: Sounds like this needs to go to the listserv for more opinions. 

**Adam**: You definitely do not want to miss any important use cases. 

**Danny**: One of our angles is to take advantage of Drupal, which has a relational database. If we filter a lot of this stuff through MySQL, some of this may get taken care of. 

**Diego**: Multi-sites could still be a problem. 

**Danny**: Where we are now, we sort of have a multi-master setup where you can do things in Fedora or Drupal and it's reflected in both. Which is complicated, because you can't lock things down with one point of entry. Eventually, this means more work needs to be done to synchronize everything to make it all jibe. It's going to be tough. 

**Aaron**: Is doing something similar right now, with Hydra on top for the editing, and something completely different for the front end. The easiest way to make it all work is to have a dead simple object model. Not so simple in Hydra right now, but a 'stupid simple' object model would make everything better. 

**Adam**: To back that up, consider the object you are dealing with outside of the context of its repo. It should still make sense. This fits in with the idea of not repeating metadata; you should do it once and pull it from there. Complex and fragile object graphs aren't ideal.

**_Consensus_** - We do not require consistency from the Fedora API.
  3. Isolation:
Will do some more research and seek out use cases from the community before adding to our Fedora use case.
  4. Sprint Updates: Specifically, how do we merge Danny's new indexer and Nigel's Ansible work?

    Merge Nigel's Ansible work first.

    Installation piece is small, inside parent pom.xml

    Installed Stomp; added a watch command to watch the Maven repository.

    Danny's could merge first, then Nigel will integrate changes for Ansible.

    Nigel has added a lot of Docker stuff, changing roles to change what is provisioned based on architecture, etc.

    Consensus: Danny will submit a PR to review, if it is small then Nigel will integrate into his changes.
