- Prof Arnab Bhattacharya, IIT Kanpur
Database is used to store interconnected data A DBMS is a system to manage databases Particularly:
- store data
- visualize data
- access (query) data
- update (manipulate) data
You can get all the above 🔝 with a normal file system as well Advantages of DBMS over file systems:
- Redundancy/Inconsistency is reduced
- in a filesystem, we store data in multiple places
- also, since we have multiple versions, we may have inconsistent versions
- Data isolation
- Data is stored in an internal format (which the user does not have to worry about) and the user gets an API to interact with it
- This may not be true with file systems where you get incompatible versions of xlsx etc
- Data integrity
- You cannot enforce constraints on the data (like the CGPA has to be b/w 1 and 10 for eg)
- Atomicity of operations
- Databases ensure that a block of updates are done in whole
- Concurrency
- Multiple transactions can be done concurrently
- Security
- Data security can be made secure in DBMS
- We can have various access levels in the database
A relation is defined as: n-ary tuple (a1, a2, …, an) where ai ∈ Di (each ai comes from a domain Di)
A relation is a cross product from (elements of) the sets of Di Example:
a1: name = {A, B, C} a2: street = {1st, 2nd} a3: city = {Kolkata, Delhi}
So, we can have relations: r = {(A, 1st, Kolkata), (A, 2nd, Delhi), (B, 2nd, Kolkata), }
It is not necessary that every combination of the domains is part of the relation Eg, here, 🔝, (B, 1st, Kolkata) is not a part of the relation
Also, remember that Relations are unordered The relations are depicted as a table
name | stree | city |
---|---|---|
A | 1st | Kolkata |
A | 2nd | Delhi |
B | 2nd | Kolkata |
Attribute comes from a domain and it has a name Example: Street (which is the name), (domain was 1st, 2nd, NULL)
NULL is a special value which is a part of every domain - used to depict absence of information
Attributes are generally atomic (cannot be broken down further)
The sets we defined earlier define a relation schema
Eg: Address schema - it says the address schema has 3 attributes (name, street, city)
If we have a schema R, a particular relation r(R) is denoted to be from this schema
A relation instance is one instantiation of this relation (eg: (A, 1st, Kolkata))
Tuple is one row, one instantiation of the relation (r(R))
So, tuple corresponds to a row, attribute corresponds to a column
So, we have defined: relation, schema, instance, tuple
Database: collection of interconnected relations It must handle all the problems the file system cannot handle
Consider a relation schema R A subset K is called a “superkey” of R (a special key of R) iff values of the values of K are enough to determine all the values of attributes in R So, if in a particular tuple, if you know that key, you can get to know all the other attributes in that tuple
So, a superkey identifies a unique tuple In fact, a super key should be able to distinguish a tuple for all possible relations (not just the ones that exist right now)
Eg of superkey - roll no etc
Any super-set of a superkey is a superkey as well, so if K is a superkey, (K, X) is also a super key (where X is a attribute)
- Candidate key = minimal superkey (no extra terms)
- A particular relation may have multiple candidate keys
- Primary key = One of the candidate keys is designated by the database designer as a primary key
- The rest become secondary keys
Every relation has to have a candidate key - in the worst case, you can take all the attribute of the relation (the entire tuple) and call it as a candidate key - (given you do not allow duplicates)
If there are 2 relations r1 (with primary key as pk1) and r2 (with primary key pk2) which are connected, If a relation has a primary key, in the other relation, it becomes a foreign key
Here, f2 is the foreign key for r2
Here, r2 🔝 is called the referencing relation and r1 is called referenced relation
Relational Algebra is a procedural language - used to define database queries
6 basic operators
- select: σ
- project: π
- union: ∪
- set difference: -
- cartesian product - x
- rename operator
RA is what forms the basis of SQL - it defines the theory of SQL
Apart from the 6 operators 🔝, RA also uses propositional calculus. They are connected by 3 basic operators:
- And: ^
- Or: v
- not: ¬
Syntax: <attribute> <comparator> <attribute/constant>
The comparators are the standard:
- =
- ≠
- <
- ≤
- >
- ≥
This defines the entire RA
In more details the basic operators:
σP(r) = {t | t ∈ r and P(t) } Here, P is the predicate on which the select is done, r is the relation instance Here, we select all tuples t which are in r and satisfy the predicate
This does not change the schema of r
Example: consider relation r
Now, we have this select operator:
Here, we are selecting projections A1 to Ak from the relation r Duplicates are removed Example:
We have projection: πA, C(r)
Consider 2 relations, r and s
r ∪ s = { t | t ∈ r OR t ∈ s}
r and s need to have the same attributes
Note: We remove duplicates
Just like set difference
r - s = { t | t ∈ r AND t ∉ s}
r x s = {tq | t ∈ r and q ∈ s}
Here, the schema is changed. If r and s are disjoint, we just need to add new attributes Else, we need to be more careful
It just renames the attribute
The schema has changed, but not the meaning
Operator | Schema Changed? |
---|---|
select | NO |
project | YES |
Union | NO |
Difference | NO |
Cartesian product | YES |
Rename | YES (meaning does not change) |
Next we will look at composing the operators
We can apply compositions of operations eg:
σA=C(rxs)
We have the relations:
- branch (bname, bcity)
- customer (cname, city)
- account (ano, bname, bal)
- loan (lno, bname, amt)
- depositor (_cname_, _ano_)
- borrower (_cname_, _lno_)
Here, attr is a primary key and attr is a foreign key
We can solve certain types of queries now:
Πloan(σloan>100(loan))
We select and then we use project to get just the required attribute out
Since the customer table does not have information about loan, we need to look at loan table and filter out “ABC” bname and then look at borrower table and get the cname
Πlno(σbname=”ABC”(loan))
Now, we take this lno and get the cname from borrower table
Or, we can take a cartesian product and get it:
σb.lno=l.lno(borrower x loan)
The b.lno = l.lno condition will give us all valid tuples (won’t give us spurious tuples) We need to put a condition for bname now and select
Πcname(σbname=”ABC”(σb.lno=l.lno(borrower x loan)))
If we had wanted the additional condition that they shouldn’t have an account, we would have had to take a set difference with depositor.
They are:
- Set intersection
- Join
- Division
- Assignment
They simplify the queries but do not add any new power to RA (They can be written using the basic operators from earlier)
r ∩ s = {t | t ∈ r and t ∈ s}
We can write r ∩ s as r - (r - s)
Think of it in terms of Venn diagram r - s is the part of r without s, i.e. pure r Now, subtracting this from the whole r gives us only the shared part with s - r ∩ s
Denoted by: r ∞θ s = σθ(r X s)
Joins have many types:
Θ contains equality: eg
r ∞θ s = σA=B(r X s)
This is what people mostly refer to when they talk about joins It has it’s own symbol - *
r * s = r ∞r.A = s.A s
Both r and s have to have a common attribute (both in it’s name and it’s semantic meaning)
Equality join does not change the schema (any join does not change schema from that of r X s) - except natural join
In that, the common attribute (A here 🔝) has only 1 copy
Examples
If r has schema (A, B) and s has schema (A, C) then r*s (the natural join) has schema (A, B, C)
r ÷ s = {t | t ∈ πr-s(r) and ∀ u ∈ s (tu ∈ r)}
Assume, r has R schema and s has S schema For division to apply, we must have the condition that schema S is a subset of schema R (S ⊂ R)
Also, schema of r ÷ s is R-S
÷ chooses tuples from R-S such that cartesian product of these with s(S) are all in r(R)
This is what the second part says, ∀ u ∈ s (tu ∈ r) “tu” is the cartesian product, which should be in r
The first part just specifies the schema
Example:
r = (A, B, C, D) s = (C, D)
Here, note that S ⊂ of R So, r ÷ s has schema (A, B)
Assuming r and s as:
We have:
r ÷ s
A | B |
---|---|
1 | 5 |
2 | 6 |
3 | 6 |
We don’t select (3, 5) in the result because when we take cartesian product with s, we get (3, 5, 2, 7) which isn’t there in r
How is this useful? We’ll see later. The intuition is that this lets you distill out the information of “s” from “r” and make the representation lean
s ← E(r)
This is similar to rename operator ρ What we do is we temporarily assign something to “s”
Example: Our earlier query now becomes: s ← borrower ∞ loan q ← σbname = “ABC”(s)
We continue from our banking example earlier 🔝 with this schema
We have the relations:
- branch (bname, bcity)
- customer (cname, city)
- account (ano, bname, bal)
- loan (lno, bname, amt)
- depositor (_cname_, _ano_)
- borrower (_cname_, _lno_)
Here, attr is a primary key and attr is a foreign key
Πcname(borrower) ∩ Πcname(depositor)
Note the projection must be done before the intersection must be done (since the schema of borrower and depositor is different)
Here, the operative word is “every”, which is best solved by the division operator
Πcname, bname(depositor ∞ account)
So, we effectively get the full list of names of folks with their bname Now, we can divide this with Πbname(σbcity=”DEF”(branch)) which selects all the branches in the city DEFWhen we take a ÷, we end up with only those customers which have an account in every branch on RHS (every branch of city “DEF”)
Here, we increase power over basic RA We have 3 operators:
- Generalized projection
- Aggregation
- Outer join
It takes the normal projection operator Π and it allows arithmetic operations on those projections (on the projected columns)
ΠF1, …, Fk(E)
Where F1 to Fk are functions
Example:
Consider r:
A | B | C |
---|---|---|
1 | 1 | 5 |
1 | 2 | 5 |
2 | 3 | 5 |
2 | 4 | 8 |
Now, we have: Π2B-A, C (r) as:
2B-A | C |
---|---|
1 | 5 |
3 | 5 |
4 | 5 |
6 | 8 |
We can use certain aggregation functions like AVG, MIN, MAX, SUM, COUNT etc on groups of tuples (entire relation or a subset of it)
So, we take E, (which is a relation or it’s subset) And we group it based on G1, …, Gk (t1t2 are in the sample group if they agree of all G1 - Gk) Finally, we apply the functions F1 to Fn
Example:
Consider r:
A | B | C |
---|---|---|
1 | 1 | 5 |
1 | 2 | 5 |
2 | 3 | 5 |
2 | 4 | 8 |
We have: Gsum(c)(r) as:
sum(c) |
---|
23 |
Since we don’t have any grouping, everything falls under same group
We can also have:
Extension of the normal join to retain more information How? We first do the join, and then adds tuples to the result (we will have to utilize “null” values here)
It retains every tuple from left relation (r here) even if it does not obey the join condition θ
Example:
This is the natural join. But, since this is a left join, we also have remaining values from A
Note the use of “null” here
Same thing, except we include all values from right relation (s here)
From the above example, we have:
This is a combination of both left and right outer joins It captures all relations from both left and right relations
From the above example, we have:
Now, to avoid confusion, the normal join (the natural join) is called inner join And outer join is called → left join/right join/full join
When only join used, it refers to natural outer join
- Aggregate operators ignore null
- Arithmetic expression may need to evaluate null (eg: 5 = null etc)
- 5 = null → null
- 5 > null → null
- null = null → True
null gives us “Three-Valued logic” - True, False and Unknown (we used only binary logic till now, ie True/False)
Logic table:
OR:
- u OR t = t
- u OR f = u
- u OR u = u
AND
- u AND t = u
- u AND f = f
- u AND u = t
NOT
- NOT u = u
Select operator treats unknown as false