Dee

makes Python relational

Author:Greg Gaughan
Copyright:Copyright (C) 2007-2010 Greg Gaughan
Licence:GPL (see Licence.txt for details)
Contact:feedback@quicksort.co.uk
Date:10/05/2010

Contents

Introduction

Inspired by Date and Darwen’s Databases, Types and the Relational Model (The Third Manifesto), we’re putting forward an implementation of a truly relational language using Python. We will address two problems:

  1. The impedance mismatch between programming languages and databases
  2. The weakness and syntactic awkwardness of SQL

Mind the Gap

Most of today’s programs handle data in one way or another and often this data is stored in some kind of relational database. To read and modify this data, a program must bridge the gap between its representation and the one used by the dialect of SQL that the database provides. This bridge typically comprises a database API that sends queries as text strings, often accompanied by some kind of table-to-object mapper that has to coerce data and relationships in both directions, usually with elaborate layers of abstraction in an effort to keep the two sides loosely coupled.

“Yet by obscuring the true data source these solutions end up throwing away the most compelling feature of relational databases; the ability for the data to be queried.”

—Microsoft, DLinq .NET Language-Integrated Query for Relational Data, May 2006

This approach not only adds complexity and increases the need for data transformations but, most importantly, it destroys the significant advantages provided by the relational model of data. The relational model is built upon predicate logic which brings the power of formal reasoning to data: it is the only sound foundation available.

Enough of the Shenanigans!

A number of approaches and frameworks have been proposed to span the gap between the two systems; most never question why there are two systems in the first place.

Microsoft’s forthcoming LINQ to SQL (formerly DLinq) is a major attempt to bring SQL closer into the program than before, but will still keep the database sub-language and all that it entails.

“It is no wonder that applications expected to bridge this gap are difficult to build and maintain. It would certainly simplify the equation to get rid of one side or the other. Yet relational databases provide critical infrastructure for long-term storage and query processing, and modern programming languages are indispensable for agile development and rich computation.”

—Microsoft, DLinq .NET Language-Integrated Query for Relational Data, May 2006

The solution to the problem is not to get rid of one side or the other, nor to have one side overlap the other, but to merge the two sides into one: supersede SQL (the COBOL of database languages) with a true relational programming language, one that is computationally complete, and then the gap disappears. Our solution uses one of the most effective, expressive and readable languages available, Python, and extends it with relations and a sound relational algebra.

A Bit of History

Since its inception in 1969 by E. F. Codd, the relational model has been the foundation for nearly all databases. It replaced earlier network and hierarchical ad-hoc approaches to data storage by being as simple as it needed to be, but no simpler. It was so powerful it allowed users to ask for what they wanted to find, rather than specify how they might find it.

Over the decades, SQL has become the de-facto language for relational databases, but SQL misses many of the benefits of relational technology. In recent years, partly due to SQL’s weaknesses and partly due to minimalistic and stagnant implementations, the database has become merely a storage engine fronted by layers of drivers, mappers, hierarchical markups and frameworks which make flexible querying both complex and distant from the application code.

Where We’re Coming From

Having implemented a comprehensive, standards-compliant SQL server, ThinkSQL, we did some further research into the history of SQL’s dominance in the marketplace and its quirky syntax. We found a far superior alternative in the form of D [1], a generic name for any relational language that conforms to The Third Manifesto. We’ve implemented such a language, Dee, as an extension to Python.

The relational algebra and most of the ideas underlying Dee come from Date and Darwen’s Databases, Types and the Relational Model (The Third Manifesto). An introduction into the ideas behind it can be found in Databases in Depth and many related links and reference materials are on The Third Manifesto website.

The current version of Dee is an initial release to gain feedback regarding the approach. We chose Python because its interpreted style, dynamic typing and built-in sets and dictionaries make it ideal for interacting with data; plus any language that allows you to do the following sorts of things has got to be good:

>>> x, y = 45, 90
>>> print x, y
45 90
>>> x, y = y, x         #swapping values without the usual temporary variable!
>>> print x, y
90 45

>>> 70 < x < 120
True

See Why Use Python? for more information on the advantages of the language. A guide to the Python language can be found in An Introduction to Python. We do assume you are familiar with Python in what follows.

Where We’re Going

The current release is an initial proposal, intended to encourage feedback. We have many ideas for future versions to make it more deployable. See the Future Work section below for more details.

Basics

To start using Dee from within the Python interpreter or from a Python program, first import the module. (For demonstrating we import everything but it’s recommended that you only import the features you need.)

>>> from Dee import *

Tuples

A Tuple is a set of attribute/value pairs. A Tuple can be represented by a Python dictionary, e.g.

>>> print {"StudentId":'S1', "Name":'Anne'}
{'StudentId': 'S1', 'Name': 'Anne'}

and the attributes and values can be extracted using the standard Python syntax, e.g.

>>> t1 = {"StudentId":'S1', "Name":'Anne'}
>>> t1["StudentId"]
'S1'

>>> "Name" in t1
True

>>> t1.keys()
['StudentId', 'Name']

A more powerful way is to use the Tuple class which allows a slightly simpler syntax for denoting attribute values. To specify a Tuple:

>>> t1 = Tuple(StudentId='S1', Name='Anne')

and then the attributes values can be extracted in the same way as the Python dictionary but also using the dot notation without the quotes, e.g.

>>> t1["Name"]
'Anne'

>>> t1.Name
'Anne'

The Tuple class also provides a number of useful methods, such as project and remove, for manipulating relational tuples.

Attribute values are dynamically typed in the usual Python way and they must be of the same type for every tuple in a given relation. Currently, the types can be anything that can be pickled.

Relations

A Relation comprises a heading and a body. The heading is a set of attribute name/type pairs. The body is a set of tuples. Each tuple in the body comprises a value for every attribute in the heading. To specify a relation literal, pass the heading as a list of attribute names followed by the body as a list of tuple literals, e.g.:

>>> print Relation(["StudentId", "Name"],
...               [{"StudentId":'S1', "Name":'Anne'},
...                {"StudentId":'S2', "Name":'Boris'},
...                {"StudentId":'S3', "Name":'Cindy'},
...                {"StudentId":'S4', "Name":'Devinder'},
...                {"StudentId":'S5', "Name":'Boris'},
...               ])
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+

Note:

  • there is no order to the heading attributes (they are a set)
  • nor is there any order to the tuples in the body (they are a set)
  • there is no duplication in the heading attribute names (they are a set)
  • nor is there any duplication in the tuples in the body (they are a set)

Also, we will try to use the term relation variable when we mean a variable that refers to a Relation, and just relation (or relation value) to mean the value of the relation. This is an important distinction. The value of a relation never changes, just like the value 5 never changes.

To assign a relation value to a relation variable, use the standard Python syntax, e.g.

>>> IS_CALLED = Relation(["StudentId", "Name"],
...                     [{"StudentId":'S1', "Name":'Anne'},
...                      {"StudentId":'S2', "Name":'Boris'},
...                      {"StudentId":'S3', "Name":'Cindy'},
...                      {"StudentId":'S4', "Name":'Devinder'},
...                      {"StudentId":'S5', "Name":'Boris'},
...                     ])

An alternative way to define a relation is to use the Tuple class to define the body:

>>> IS_CALLED = Relation(["StudentId", "Name"],
...                     [Tuple(StudentId='S1', Name='Anne'),
...                      Tuple(StudentId='S2', Name='Boris'),
...                      Tuple(StudentId='S3', Name='Cindy'),
...                      Tuple(StudentId='S4', Name='Devinder'),
...                      Tuple(StudentId='S5', Name='Boris'),
...                     ])

or alteratively, a more concise option is available which relies on the order of the body attributes matching the order of the heading:

>>> IS_CALLED = Relation(["StudentId", "Name"],
...                     [('S1', 'Anne'),
...                      ('S2', 'Boris'),
...                      ('S3', 'Cindy'),
...                      ('S4', 'Devinder'),
...                      ('S5', 'Boris'),
...                     ])

(Note that Python allows an additional comma after the last item in a list, which can simplify copy/paste operations. Also a Python tuple with a single value must have a comma after the value to distinguish it from a value in parentheses, e.g. (7,) rather than (7))

There are a number of ways to display a relation:

  1. Print it as a string (i.e. using its __str__ method), e.g.
>>> print IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+
  1. Print a literal representation (one of possibly many variations) (i.e. using its __repr__ method), e.g.
>>> print `IS_CALLED`           #or just: >>> IS_CALLED
Relation(('StudentId', 'Name'),
[Tuple(StudentId='S1', Name='Anne'), Tuple(StudentId='S2', Name='Boris'), Tuple(StudentId='S3', Name='Cindy'), Tuple(StudentId='S4', Name='Devinder'), Tuple(StudentId='S5', Name='Boris')],
{'PK':(Key, None)})

Note: this literal can itself be evaluated using Python‘s eval() function to retrieve the relation’s value, e.g.

>>> print eval(`IS_CALLED`)
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+
>>> r2=eval(`IS_CALLED`)
>>> print r2
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+
  1. Print it rendered as an HTML table, e.g.
>>> print IS_CALLED.renderHTML()
<table><thead><th><em>StudentId</em></th><th><em>Name</em></th></thead><tbody><tr><td>S1</td><td>Anne</td></tr><tr><td>S2</td><td>Boris</td></tr><tr><td>S3</td><td>Cindy</td></tr><tr><td>S4</td><td>Devinder</td></tr><tr><td>S5</td><td>Boris</td></tr></tbody></table>

Which in a browser becomes:

StudentId Name
S1 Anne
S2 Boris
S3 Cindy
S4 Devinder
S5 Boris

The heading of a relation can be retrieved via its heading method, which returns the attribute names as a Python set, e.g.

>>> print IS_CALLED.heading()
set(['StudentId', 'Name'])
The Interpretation of a Relation

Given a relation such as the one denoted by IS_CALLED above, we should take the meaning of it to be as follows:

  • The heading supplies the parameters for the predicate, e.g. StudentId and Name are the parameters for the IS_CALLED predicate.
  • The tuple Tuple(StudentId='S3', Name='Cindy') is an instantiation of that predicate. It is a proposition where the argument values ‘S3’ and ‘Cindy’ are substituted for the parameters. This states that student S3 is called Cindy.
  • Each tuple in the relation is a true instantiation.
  • Any tuple not in the relation is a false instantiation.

Function-based Relations

Instead of defining the value of a relation variable once when it is assigned, we can refer to a function to provide the relation. The function can then return different values at different times. One important kind of relation variable that refers to a function for its data is a virtual (or derived) relation variable. A virtual relation variable refers to a function that returns a relational expression. All other relational variables are base relation variables. To specify a virtual relation variable we first need to define a function to provide the data by returning a relational expression. For example (ignore the relational expression syntax for now, we’ll cover the details of that later):

>>> def vIS_CALLED_caps():
...     return IS_CALLED.extend(['NameCaps'], lambda t: {'NameCaps': t.Name.upper()}).remove(['Name'])

Then pass the heading as a list of attribute names followed by the body as a function reference, e.g.

>>> IS_CALLED_caps = Relation(["StudentId", "NameCaps"], vIS_CALLED_caps)
>>> print IS_CALLED_caps
+-----------+----------+
| StudentId | NameCaps |
+===========+==========+
| S1        | ANNE     |
| S2        | BORIS    |
| S3        | CINDY    |
| S4        | DEVINDER |
| S5        | BORIS    |
+-----------+----------+

Such virtual relation variables’ values will then vary as the underlying base relation variables vary. These virtual relation variables are called views in SQL.

Relation-Valued Attributes

An attribute value can itself be a relation. Such attributes are known as relation-valued attributes or RVAs. There are a number of relational operators (actually macros) that use such nested relations. For example, GROUP, which takes a relation and a set of attribute names together with a new attribute name and returns a relation with the set of attributes as a nested relation, 1 per unique value of the non-grouped attributes:

>>> print GROUP(IS_CALLED, ['StudentId'], 'StudentIds')
+----------+---------------+
| Name     | StudentIds    |
+==========+===============+
| Anne     | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S1        | |
|          | +-----------+ |
| Boris    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S2        | |
|          | | S5        | |
|          | +-----------+ |
| Cindy    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S3        | |
|          | +-----------+ |
| Devinder | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S4        | |
|          | +-----------+ |
+----------+---------------+

Predefined Relations

There are two interesting relations that are useful for defining some fundamental relational operators in Dee. We introduce them here.

DUM

This is the relation that has no attributes and no tuples. It plays the role of False. It is difficult to display:

>>> print DUM
+
|
+
+

>>> print DUM.renderHTML()
<table><thead></thead><tbody></tbody></table>

It is also called TABLE_DUM and FALSE.

DEE

This is the relation that has no attributes and a single tuple. It plays the role of True. It is difficult to display:

>>> print DEE
+
|
+
|
+

>>> print DEE.renderHTML()
<table><thead></thead><tbody><tr></tr></tbody></table>

It is also called TABLE_DEE and TRUE.

Relation Constraints

A Relation (function-based or not) can also take an extra parameter in its constructor to specify a set of constraints. This takes the form of a Python dictionary where each key gives the constraint name and each value is a pair of constraint-function, parameters. For example, to specify that the “StudentId” attribute is a candidate key for the above relation we could say:

>>> IS_CALLED = Relation(["StudentId", "Name"],
...                     [('S1', 'Anne'),
...                      ('S2', 'Boris'),
...                      ('S3', 'Cindy'),
...                      ('S4', 'Devinder'),
...                      ('S5', 'Boris'),
...                     ],
...                     {'PK':(Key, ["StudentId"])}
...                    )
>>> print IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+----------+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+

Here, Key is a pre-defined constraint type (actually a function wrapper that creates a function) that takes a list of attributes to enforce the constraint. A constraint function can return True or False and is called whenever the relation is assigned a new value. If no candidate key is specified for a relation, one is assumed comprising all the attributes in the relation (this is displayed in representations as {'PK':(Key, None)}). As another example:

>>> COURSE = Relation(["CourseId", "Title"],
...                  [('C1', 'Database'),
...                   ('C2', 'HCI'),
...                   ('C3', 'Op Systems'),
...                   ('C4', 'Programming'),
...                  ],
...                  {'PK':(Key, ["CourseId"])}
...                 )

Another pre-defined constraint (function wrapper) is ForeignKey. It takes a relation name and a mapping of foreign key attributes to candidate key attributes as parameters, e.g.:

>>> IS_ENROLLED_ON = Relation(["StudentId", "CourseId"],
...                         [('S1', 'C1'),
...                          ('S1', 'C2'),
...                          ('S2', 'C1'),
...                          ('S3', 'C3'),
...                          ('S4', 'C1'),
...                         ],
...                         {'FKS':(ForeignKey, ('IS_CALLED', {"StudentId":"StudentId"})),
...                          'FKC':(ForeignKey, ('COURSE', {"CourseId":"CourseId"}))}
...                        )

Here, two foreign keys are declared to ensure referential integrity between this relation and the relations referred to by IS_CALLED and COURSE.

Lambda

In a number of places we need to pass expressions, e.g. restrictions (where clauses). Python has a built-in way of defining such expressions with anonymous functions using the lambda keyword. So an example restriction for the above IS_CALLED relation could be:

>>> print IS_CALLED.where(lambda t: t.Name == 'Boris')
+-----------+-------+
| StudentId | Name  |
+===========+=======+
| S2        | Boris |
| S5        | Boris |
+-----------+-------+

In this example, the lambda expression is passed to the relation’s where function and the expression introduces a range variable, t, which will stand for each Tuple in the relation. The expression itself, the part after the colon, tests whether the Name attribute of each tuple is equal to ‘Boris’: if it is then the tuple is included in the result. Any Python expression can be passed this way. So here, complex boolean expressions including boolean operators and function calls can be built, e.g.

>>> print IS_CALLED.where(lambda t: t.Name.startswith('B') and t.StudentId.endswith('5'))
+-----------+-------+
| StudentId | Name  |
+===========+=======+
| S5        | Boris |
+-----------+-------+

>>> print IS_CALLED.where(lambda t: 'A' < t.Name[0] < 'D')
+-----------+-------+
| StudentId | Name  |
+===========+=======+
| S2        | Boris |
| S3        | Cindy |
| S5        | Boris |
+-----------+-------+

>>> print IS_CALLED.where(lambda t: t["Name"].startswith('B'))
+-----------+-------+
| StudentId | Name  |
+===========+=======+
| S2        | Boris |
| S5        | Boris |
+-----------+-------+

Of course, simple boolean expressions can also be used, e.g.

>>> print IS_CALLED.where(lambda t: True)
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+

>>> print IS_CALLED.where(lambda t: False)
+-----------+------+
| StudentId | Name |
+===========+======+
+-----------+------+

It’s perhaps worth noting that the where function is really just shorthand for a natural join. Take the first example:

>>> print IS_CALLED.where(lambda t: t.Name == 'Boris')
+-----------+-------+
| StudentId | Name  |
+===========+=======+
| S2        | Boris |
| S5        | Boris |
+-----------+-------+

This relational calculus based where clause can be rephrased using the relational algebra’s AND operator (in this case acting as the natural join):

>>> print IS_CALLED & Relation(["Name"], [('Boris',)])
+-----------+-------+
| StudentId | Name  |
+===========+=======+
| S2        | Boris |
| S5        | Boris |
+-----------+-------+

Many of the relational methods provided are in fact macros implemented using only a few fundamental relational operators, such as AND.

Another place lambda expressions can be used is when defining virtual relation variables. For example the earlier example:

>>> def vIS_CALLED_caps():
...     return IS_CALLED.extend(['NameCaps'], lambda t: {'NameCaps': t.Name.upper()}).remove(['Name'])
>>> IS_CALLED_caps = Relation(["StudentId", "NameCaps"], vIS_CALLED_caps)
>>> print IS_CALLED_caps
+-----------+----------+
| StudentId | NameCaps |
+===========+==========+
| S1        | ANNE     |
| S2        | BORIS    |
| S3        | CINDY    |
| S4        | DEVINDER |
| S5        | BORIS    |
+-----------+----------+

Could be re-coded using lambda in a more concise way as:

>>> IS_CALLED_caps = Relation(["StudentId", "NameCaps"],
...                            lambda: IS_CALLED.extend(["NameCaps"], lambda t: {
...                                                      "NameCaps": t.Name.upper()}).remove(["Name"]))
>>> print IS_CALLED_caps
+-----------+----------+
| StudentId | NameCaps |
+===========+==========+
| S1        | ANNE     |
| S2        | BORIS    |
| S3        | CINDY    |
| S4        | DEVINDER |
| S5        | BORIS    |
+-----------+----------+

Lambda expressions can also be used as general constraints. On relations, another pre-defined constraint is Constraint. This takes a function that must evaluate to True for the constraint to hold, e.g.:

>>> EXAM_MARK = Relation(["StudentId", "CourseId", "Mark"],
...                     [('S1', 'C1', 85),
...                      ('S1', 'C2', 49),
...                      ('S2', 'C1', 49),
...                      ('S3', 'C3', 66),
...                      ('S4', 'C1', 93),
...                     ],
...                     {'PK':(Key, ["StudentId", "CourseId"]),
...                      'MarkRange': (Constraint, lambda r: ALL(r, lambda t: 0 <= t.Mark <= 100))}
...                    )

Here, the ‘MarkRange’ Constraint uses the ALL relational operator (discussed below) to ensure that all Marks in this relation are between 0 and 100. Note the Constraint works at the relation level and its range variable is r in the example. Useful operators at this level are ALL, ANY, IS_EMPTY, and the relational comparison operators discussed below, because they all take relations and return a boolean result.

Relations to Tuples

Here are some conversion functions to map between relations and tuples:

fromTuple

This static method returns a relation from a tuple:

>>> r1 = Relation.fromTuple({'CourseId':'C1', 'Title':'Database'})
>>> print r1
+----------+----------+
| CourseId | Title    |
+==========+==========+
| C1       | Database |
+----------+----------+

It can also take an extra parameter to specify a set of constraints:

>>> r1 = Relation.fromTuple({'CourseId':'C1', 'Title':'Database'}, {'PK':(Key, ['CourseId'])})
>>> print r1
+----------+----------+
| CourseId | Title    |
+==========+----------+
| C1       | Database |
+----------+----------+

toTuple

This can apply only to a single-tuple relation and returns a tuple from that relation:

>>> t1 = r1.toTuple()
>>> print t1
Tuple(CourseId='C1', Title='Database')
>>> print t1.Title
Database

fromTupleList

This static method returns a relation from a list of tuples:

>>> r2 = Relation.fromTupleList([{'CourseId':'C1', 'Title':'Database'},
...                              {'CourseId':'C4', 'Title':'Programming'},
...                              {'CourseId':'C3', 'Title':'Op Systems'},
...                              {'CourseId':'C2', 'Title':'HCI'}])
>>> print r2
+----------+-------------+
| CourseId | Title       |
+==========+=============+
| C1       | Database    |
| C4       | Programming |
| C3       | Op Systems  |
| C2       | HCI         |
+----------+-------------+

It can also take an extra parameter to specify a set of constraints:

>>> r2 = Relation.fromTupleList([{'CourseId':'C1', 'Title':'Database'},
...                              {'CourseId':'C4', 'Title':'Programming'},
...                              {'CourseId':'C3', 'Title':'Op Systems'},
...                              {'CourseId':'C2', 'Title':'HCI'}],
...                             {'PK':(Key, ['CourseId'])})
>>> print r2
+----------+-------------+
| CourseId | Title       |
+==========+-------------+
| C1       | Database    |
| C4       | Programming |
| C3       | Op Systems  |
| C2       | HCI         |
+----------+-------------+

toTupleList

This returns a list of tuples from the relation. Since relations are sets they can have no order, so to iterate through all the tuples in a relation you must use this method to first extract a list of tuples from the relation.

>>> ts = r2.toTupleList()
>>> print ts
[Tuple(CourseId='C1', Title='Database'), Tuple(CourseId='C4', Title='Programming'), Tuple(CourseId='C3', Title='Op Systems'), Tuple(CourseId='C2', Title='HCI')]

This list can then be iterated over in the usual ways, e.g:

>>> for t in ts:
...     print t.Title
Database
Programming
Op Systems
HCI

>>> print [t.Title for t in ts if t.CourseId=='C4']
['Programming']

>>> for t in reversed(ts):
...     print t.Title
HCI
Op Systems
Programming
Database

>>> print len(ts)
4

>>> print ts[0]
Tuple(CourseId='C1', Title='Database')

>>> print ts[-1]
Tuple(CourseId='C2', Title='HCI')

This is also the way to access the tuples in a pre-defined order. The toTupleList method can take an extra parameter to define a sort order. The sort parameter is a pair (ascending, attribute-list) where ascending is a boolean flag to indicate whether to sort in ascending order or not, and the attribute-list specifies the attributes to sort on.

>>> tss = r2.toTupleList((True, ['Title']))
>>> print [t.Title for t in tss]
['Database', 'HCI', 'Op Systems', 'Programming']

>>> tss = r2.toTupleList((False, ['CourseId']))
>>> print [t.CourseId for t in tss]
['C4', 'C3', 'C2', 'C1']

The renderToHTML method, mentioned earlier, is built upon the toTupleList method and also allows this sort parameter, e.g:

>>> print r2.renderHTML(sort=(True, ['Title']))
<table><thead><th><em>CourseId</em></th><th>Title</th></thead><tbody><tr><td>C1</td><td>Database</td></tr><tr><td>C2</td><td>HCI</td></tr><tr><td>C3</td><td>Op Systems</td></tr><tr><td>C4</td><td>Programming</td></tr></tbody></table>

Which in a browser becomes:

CourseId Title
C1 Database
C2 HCI
C3 Op Systems
C4 Programming

Relational Comparisons

A number of boolean operators are available to compare the values of two relations. These are all implemented with the obvious overloaded Python comparisons.

Equality (==)

>>> print IS_CALLED == Relation(["StudentId", "Name"],
...                     [('S1', 'Anne'),
...                      ('S2', 'Boris'),
...                      ('S3', 'Cindy'),
...                      ('S4', 'Devinder'),
...                      ('S5', 'Boris'),
...                     ])
True

A useful shorthand for testing equality against an empty relation is to use the IS_EMPTY function:

>>> print IS_EMPTY(IS_CALLED.where(lambda t: t.StudentId=='S99'))
True

>>> print not IS_EMPTY(IS_CALLED)
True

Inequality (!=, not ... ==)

>>> print IS_CALLED != COURSE
True

>>> print not IS_CALLED == COURSE
True

Proper Subset (<)

>>> print IS_CALLED.where(lambda t: t.StudentId=='S3') < IS_CALLED
True
>>> print IS_CALLED.where(lambda t: t.StudentId.startswith('S')) < IS_CALLED
False

Subset (<=)

>>> print IS_CALLED.where(lambda t: t.StudentId=='S3') <= IS_CALLED
True
>>> print IS_CALLED.where(lambda t: t.StudentId.startswith('S')) <= IS_CALLED
True
>>> print IS_CALLED.where(lambda t: t.StudentId=='S3') <= IS_CALLED.where(lambda t: t.StudentId.startswith('S')) <= IS_CALLED
True

Proper Superset (>)

>>> print IS_CALLED > IS_CALLED.where(lambda t: t.StudentId=='S3')
True
>>> print IS_CALLED > IS_CALLED.where(lambda t: t.StudentId.startswith('S'))
False

Superset (>=)

>>> print IS_CALLED >= IS_CALLED.where(lambda t: t.StudentId=='S3')
True
>>> print IS_CALLED >= IS_CALLED.where(lambda t: t.StudentId.startswith('S'))
True

Membership (in)

This is effectively the same as the subset comparison:

>>> print IS_CALLED.where(lambda t: t.StudentId=='S3') in IS_CALLED
True

The membership operator can also be passed a tuple:

>>> print Tuple(StudentId='S3', Name='Cindy') in IS_CALLED
True

>>> print Tuple(StudentId='S3', Name='Bob') in IS_CALLED
False

>>> print Tuple(StudentId='S3', Name='Cindy') not in IS_CALLED
False

>>> print Tuple(StudentId='S3', Name='Bob') not in IS_CALLED
True

Relational Operators

We use a small core of relational operators to deliver a large number of operations. For example, we use & (relational AND) to provide natural join, intersection and Cartesian product, and we use it as the basis for implementing restriction and extension. A number of other operators are defined as macros on top of the core ones, e.g. GROUP, and this number can easily be increased. The ideas behind this approach can be found in Appendix A of the Third Manifesto.

One of the powerful uses of & is the natural join. This joins relations together on their commonly named attributes. To make the most of this, without having to rename attributes before each join, use the same name for the same attributes across relations, e.g. if a key on one relation is named “product_code” then use that same name in all other relations in case they need to be joined. Naming it “code” on the product relation and “product_code” on other relations would require the rename operator to be used before doing a natural join (not to mention making the two attributes appear to be different things).

The relational operators are defined as Python functions taking, and usually returning, relations. Many of the common ones are also defined as methods and operators on the Relation class.

Some basic operations on a relation now presented.

Projection (project, remove)

This is so called because a relation can be thought of as representing a point in n-dimensional space (where n is the number of attributes) and just selecting a few of them is akin to projecting that point onto the chosen axes. Note once again that since a relation body is a set of tuples, there are no duplicate tuples.

>>> print IS_CALLED.project(['Name'])
+----------+
| Name     |
+==========+
| Anne     |
| Boris    |
| Cindy    |
| Devinder |
+----------+

>>> print IS_CALLED(['Name'])
+----------+
| Name     |
+==========+
| Anne     |
| Boris    |
| Cindy    |
| Devinder |
+----------+

>>> print IS_CALLED.remove(['Name'])
+-----------+
| StudentId |
+===========+
| S1        |
| S2        |
| S3        |
| S4        |
| S5        |
+-----------+

>>> print IS_CALLED.remove(['Name', 'StudentId']) == IS_CALLED.project([]) == IS_CALLED([]) ==  DEE
True

Rename (rename)

This is crucial to our implementation since attributes with the same name are considered to represent the same thing. The mapping of old to new attribute name(s) is given as a Python dictionary (or indeed a Tuple would also do).

>>> print IS_CALLED.rename({'Name':'NewName'})
+-----------+----------+
| StudentId | NewName  |
+===========+----------+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+

>>> print IS_CALLED.rename({'StudentId':'NewId', 'Name':'NewName'})
+-------+----------+
| NewId | NewName  |
+=======+----------+
| S1    | Anne     |
| S2    | Boris    |
| S3    | Cindy    |
| S4    | Devinder |
| S5    | Boris    |
+-------+----------+

Restriction (where)

This is also known as relational selection, but that can be confusing because of the SELECT in SQL which is actually for projection.

>>> print IS_CALLED.where(lambda t: t.StudentId=='S4')
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S4        | Devinder |
+-----------+----------+

Natural Join, Times, Intersection (&)

If you think about it, these are all the same thing - it just depends on whether the relations have some, none, or all of their attributes in common. We implement them all using the AND relational operator using the Python &. Note that since a relation heading is a set of attributes, there are no duplicate attributes.

Natural Join - Some attributes in common
>>> print IS_CALLED & IS_ENROLLED_ON
+----------+-----------+----------+
| CourseId | StudentId | Name     |
+==========+===========+==========+
| C1       | S1        | Anne     |
| C2       | S1        | Anne     |
| C1       | S2        | Boris    |
| C3       | S3        | Cindy    |
| C1       | S4        | Devinder |
+----------+-----------+----------+
Times (Cartesian Join) - No attributes in common

Beware: this kind of join can be very large and is almost always meaningless.

>>> print IS_CALLED & COURSE
+----------+-----------+----------+-------------+
| CourseId | StudentId | Name     | Title       |
+==========+===========+==========+=============+
| C1       | S1        | Anne     | Database    |
| C1       | S2        | Boris    | Database    |
| C1       | S3        | Cindy    | Database    |
| C1       | S4        | Devinder | Database    |
| C1       | S5        | Boris    | Database    |
| C2       | S1        | Anne     | HCI         |
| C2       | S2        | Boris    | HCI         |
| C2       | S3        | Cindy    | HCI         |
| C2       | S4        | Devinder | HCI         |
| C2       | S5        | Boris    | HCI         |
| C3       | S1        | Anne     | Op Systems  |
| C3       | S2        | Boris    | Op Systems  |
| C3       | S3        | Cindy    | Op Systems  |
| C3       | S4        | Devinder | Op Systems  |
| C3       | S5        | Boris    | Op Systems  |
| C4       | S1        | Anne     | Programming |
| C4       | S2        | Boris    | Programming |
| C4       | S3        | Cindy    | Programming |
| C4       | S4        | Devinder | Programming |
| C4       | S5        | Boris    | Programming |
+----------+-----------+----------+-------------+
Intersection - All attributes in common
>>> print IS_CALLED.where(lambda t: t.Name[0] < 'C') & IS_CALLED.where(lambda t:t.Name[0] > 'A')
+-----------+-------+
| StudentId | Name  |
+===========+=======+
| S2        | Boris |
| S5        | Boris |
+-----------+-------+

Note that this is equivalent to:

>>> print IS_CALLED.where(lambda t: t.Name[0] < 'C' and t.Name[0] > 'A')
+-----------+-------+
| StudentId | Name  |
+===========+=======+
| S2        | Boris |
| S5        | Boris |
+-----------+-------+

Or, Union (|)

These are the same thing - it just depends on whether the relations have any of their attributes in common. We implement them all using the OR relational operator using the Python |. For pragmatic reasons, we only implement the Union operator, i.e. where the relations have all of their attributes in common. The more general Or would need to handle an infinite (impossible) number of alternatives.

>>> print IS_CALLED.where(lambda t: t.Name[0] > 'C') | IS_CALLED.where(lambda t:t.Name[0] < 'B')
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S4        | Devinder |
| S1        | Anne     |
+-----------+----------+

Note that this is equivalent to:

>>> print IS_CALLED.where(lambda t: t.Name[0] > 'C' or t.Name[0] < 'B')
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S4        | Devinder |
+-----------+----------+

Difference (-)

Note that, unlike intersection and union, this is not commutative: which relation is mentioned first does make a difference (excuse the pun). It is implemented using the MINUS relational operator.

>>> print IS_CALLED - IS_CALLED.where(lambda t:t.Name[0] < 'B')
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+

Extension (extend)

This is used to add new attributes to relations. First, the list of the names of the extra attributes is passed, followed by a lambda expression returning a dictionary containing the attribute values for each tuple. The values can refer to the range variable introduced by the lambda to access tuple values. It is implemented using the EXTEND relational operator which in turn is implemented using the AND relational operator (can you see how?).

>>> print IS_CALLED.extend(['Initial'], lambda t: {'Initial':t.Name[:1]})
+---------+-----------+----------+
| Initial | StudentId | Name     |
+=========+===========+==========+
| A       | S1        | Anne     |
| B       | S2        | Boris    |
| C       | S3        | Cindy    |
| D       | S4        | Devinder |
| B       | S5        | Boris    |
+---------+-----------+----------+

Note that this does not modify the original relation.

Aggregate Operators

These operators take relations and return scalar values according to some lambda expression (except in the case of COUNT which simply counts the number of tuples). If the relation has a single attribute then the expression defaults to it.

COUNT (len)

The number of tuples.

>>> print COUNT(IS_ENROLLED_ON)
5

SUM

The total.

>>> print SUM(EXAM_MARK, lambda t: t.Mark)
342

AVG

The average.

>>> print AVG(EXAM_MARK, lambda t: t.Mark)
68.4

MIN

The minimum.

>>> print MIN(EXAM_MARK, lambda t: t.Mark)
49

>>> print MIN(EXAM_MARK(['Mark']))
49

MAX

The maximum.

>>> print MAX(EXAM_MARK, lambda t: t.Mark)
93

ALL

The expression in this case must return a boolean value. If all of these are True then ALL returns True, and False otherwise.

>>> print ALL(EXAM_MARK, lambda t: 0 <= t.Mark <= 100)
True

>>> print ALL(EXAM_MARK, lambda t: t.Mark > 90)
False

ANY

The expression in this case must return a boolean value. If any of these are True then ANY returns True, and False otherwise.

>>> EXAM_MARK_HIGH_FLAG = EXAM_MARK.extend(['High'], lambda t: {'High': t.Mark > 90})
>>> print EXAM_MARK_HIGH_FLAG
+-------+-----------+----------+------+
| High  | StudentId | CourseId | Mark |
+=======+===========+==========+======+
| False | S1        | C1       | 85   |
| False | S1        | C2       | 49   |
| False | S2        | C1       | 49   |
| False | S3        | C3       | 66   |
| True  | S4        | C1       | 93   |
+-------+-----------+----------+------+

>>> print ANY(EXAM_MARK_HIGH_FLAG, lambda t: t.High)
True

IS_EMPTY

This is obviously just a shorthand for COUNT(r) == 0.

>>> print IS_EMPTY(IS_ENROLLED_ON)
False

>>> print IS_EMPTY(IS_ENROLLED_ON.where(lambda t: t.StudentId=='S99'))
True

>>> print IS_EMPTY(DUM)
True

>>> print IS_EMPTY(DEE)
False

Some Not-so-basic but Very Useful Operators

Generation (GENERATE)

This is a shorthand for converting a tuple into a relation, implemented by extending DEE.

>>> print GENERATE({'x':50, 'y':100})
+-----+----+
| y   | x  |
+=====+====+
| 100 | 50 |
+-----+----+

>>> print GENERATE(Tuple(x=50, y=100))
+-----+----+
| y   | x  |
+=====+====+
| 100 | 50 |
+-----+----+

Divide (DIVIDE_SIMPLE)

>>> print DIVIDE_SIMPLE(IS_CALLED, IS_ENROLLED_ON)
+----------+-------+
| CourseId | Name  |
+==========+=======+
| C2       | Anne  |
| C3       | Cindy |
+----------+-------+

Summarization (SUMMARIZE)

This takes two relations, the heading of the second must be a projection of the first, plus a tuple returning one or more scalar summary expressions. If we take the EXAM_MARK relation variable, which looks like:

>>> print EXAM_MARK
+-----------+----------+------+
| StudentId | CourseId | Mark |
+===========+==========+------+
| S1        | C1       | 85   |
| S1        | C2       | 49   |
| S2        | C1       | 49   |
| S3        | C3       | 66   |
| S4        | C1       | 93   |
+-----------+----------+------+

We can find the maximum Mark per marked Course like this:

>>> print SUMMARIZE(EXAM_MARK, EXAM_MARK(["CourseId"]), {'TopScore':(MAX, lambda t: t.Mark)})
+----------+----------+
| CourseId | TopScore |
+==========+==========+
| C1       | 93       |
| C2       | 49       |
| C3       | 66       |
+----------+----------+

And we can count the marks for every course like this (note that there is an entry for the course having no marks):

>>> print SUMMARIZE(EXAM_MARK, COURSE(["CourseId"]), {'Takers':(COUNT, None)})
+----------+--------+
| CourseId | Takers |
+==========+========+
| C1       | 3      |
| C2       | 1      |
| C3       | 1      |
| C4       | 0      |
+----------+--------+
Here are the summary types available:
  • COUNT
  • SUM
  • AVG
  • MIN
  • MAX

Grouping and Ungrouping (GROUP, UNGROUP)

This converts relations into relations with relation-valued attributes (RVAs), and vice versa.

>>> A = GROUP(IS_CALLED, ['StudentId'], 'StudentIds')
>>> print A
+----------+---------------+
| Name     | StudentIds    |
+==========+===============+
| Anne     | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S1        | |
|          | +-----------+ |
| Boris    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S2        | |
|          | | S5        | |
|          | +-----------+ |
| Cindy    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S3        | |
|          | +-----------+ |
| Devinder | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S4        | |
|          | +-----------+ |
+----------+---------------+

>>> B = UNGROUP(A, 'StudentIds')
>>> print B
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S2        | Boris    |
| S5        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
+-----------+----------+

Wrapping and Unwrapping (WRAP, UNWRAP)

>>> A = WRAP(EXAM_MARK, ['CourseId', 'Mark'], 'CourseMark')
>>> print A
+-------------------------------+-----------+
| CourseMark                    | StudentId |
+===============================+===========+
| Tuple(CourseId='C1', Mark=85) | S1        |
| Tuple(CourseId='C2', Mark=49) | S1        |
| Tuple(CourseId='C1', Mark=49) | S2        |
| Tuple(CourseId='C3', Mark=66) | S3        |
| Tuple(CourseId='C1', Mark=93) | S4        |
+-------------------------------+-----------+

>>> print UNWRAP(A, 'CourseMark')
+----------+-----------+------+
| CourseId | StudentId | Mark |
+==========+===========+======+
| C1       | S1        | 85   |
| C2       | S1        | 49   |
| C1       | S2        | 49   |
| C3       | S3        | 66   |
| C1       | S4        | 93   |
+----------+-----------+------+

Semijoin (SEMIJOIN, MATCHING)

This joins two relations and then projects back onto the first one. It is very useful for finding just those tuples in one relation that have a counterpart in another, hence the alias MATCHING. For example, “find all students who are enrolled on at least one course”:

>>> print SEMIJOIN(IS_CALLED, IS_ENROLLED_ON)
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
+-----------+----------+

Semidifference (SEMIMINUS, NOT_MATCHING)

For example, “find students who are enrolled on no course at all”:

>>> print SEMIMINUS(IS_CALLED, IS_ENROLLED_ON)
+-----------+-------+
| StudentId | Name  |
+===========+=======+
| S5        | Boris |
+-----------+-------+

Composition (COMPOSE)

This gives a handy shorthand for performing the natural join and then removing any attributes that were common to the relations.

>>> print COMPOSE(IS_ENROLLED_ON, COURSE)
+-----------+------------+
| StudentId | Title      |
+===========+============+
| S1        | Database   |
| S2        | Database   |
| S4        | Database   |
| S1        | HCI        |
| S3        | Op Systems |
+-----------+------------+

This operator is especially attractive if we think of a relation as a function: ‘calling’ the function with another relation as a parameter can be achieved by performing a natural join. For example, suppose we have a function-based relation, plus, with a heading of [‘x’, ‘y’, ‘z’]. Suppose the tuples of this relation are all the possible ones where x + y = z. One implementation of such a function-based relation could be:

>>> def plusfn(trx):
...     """Plus (could just as well be called minus)"""
...     if set(['x', 'y']).issubset(trx.attributes()) and not set(['z']).issubset(trx.attributes()):
...         return [Tuple(x=trx.x, y=trx.y, z=trx.x + trx.y)]
...     if set(['x', 'z']).issubset(trx.attributes()) and not set(['y']).issubset(trx.attributes()):
...         return [Tuple(x=trx.x, y=trx.z - trx.x, z=trx.z)]
...     if set(['y', 'z']).issubset(trx.attributes()) and not set(['x']).issubset(trx.attributes()):
...         return [Tuple(x=trx.z - trx.y, y=trx.y, z=trx.z)]
...
...     if set(['x', 'y', 'z']).issubset(trx.attributes()):
...         if trx.x + trx.y == trx.z:
...             return [Tuple()]    #DEE, i.e. True        #todo: should really return trx(x,y,z) - no matter?
...         else:
...             return []           #DUM, i.e. False
...
...     raise RelationException("Plus: Infinite rows") #no x,y or z
...     #Note: we could go further and return tuples given just one attribute
...     #      or indeed we could start yielding infinite combinations if no attributes are passed

>>> plus = Relation(["x", "y", "z"], plusfn)

And it could be used to test whether an expression is valid or not: a returned tuple means True (DEE), no tuple means False (DUM).

>>> print GENERATE({'x':3, 'y':4, 'z':7}) & plus        #i.e. is 3 + 4 = 7?
+---+---+---+
| y | x | z |
+===+===+===+
| 4 | 3 | 7 |
+---+---+---+

>>> print GENERATE({'x':4, 'y':4, 'z':7}) & plus        #i.e. is 4 + 4 = 7?
+---+---+---+
| y | x | z |
+===+===+===+
+---+---+---+

Also, it could be used to fill in the blank:

>>> print GENERATE({'x':3, 'y':4}) & plus               #i.e. what's 3 + 4?
+---+---+---+
| y | x | z |
+===+===+===+
| 4 | 3 | 7 |
+---+---+---+

>>> print GENERATE({'z':7, 'y':4}) & plus               #i.e. what's 7 - 4?
+---+---+---+
| y | x | z |
+===+===+===+
| 4 | 3 | 7 |
+---+---+---+

The advantages of using a natural join in this way over standard functions is that it takes and returns a relation. So multiple questions can be asked at once:

>>> print Relation(["x", "y"],
...                [(3, 4),
...                 (7, 8),
...                 (11, 23)]) & plus           #i.e. what are 3 + 4, 7 + 8 and 11 + 23?
+----+----+----+
| y  | x  | z  |
+====+====+====+
| 4  | 3  | 7  |
| 8  | 7  | 15 |
| 23 | 11 | 34 |
+----+----+----+

And if the function-based relation had more than one result for a given tuple, e.g. square root would return -2 and +2 for 4, then it could simply return multiple results.

This whole approach could be made syntactically simpler using composition. Using a function-based relation as a function (or indeed a mathematical operator) could be done with COMPOSE(Relation(["X", "Y"], [(3, 4)]), plus).

>>> print COMPOSE(GENERATE({'x':3, 'y':4}), plus)       #i.e. what's 3 + 4?
+---+
| z |
+===+
| 7 |
+---+

This is now looking more like a call to a plus function with the COMPOSE making sure that only the new information, the result, is returned. Of course it’s really still performing a natural join on two relations. We could go further and extract the result as a number.

>>> print COMPOSE(GENERATE({'x':3, 'y':4}), plus).toTuple().z           #i.e. what's 3 + 4?
7

In the future, a Python operator may be overridden to implement COMPOSE. For example if we override % to be relational composition then we could use something like: GENERATE({'x':3, 'y':4}) % plus.

Transitive Closure (TCLOSE)

This operator is used to traverse tree-like structures, such as a bill-of-materials. It takes as input a binary relation comprising parent and child pairs and returns a relation that has all hierarchies fully expanded. For example:

>>> MM = Relation(["MAJOR_P#",       "MINOR_P#"],
...              [{"MAJOR_P#":'P1',  "MINOR_P#":'P2'},
...               {"MAJOR_P#":'P1',  "MINOR_P#":'P3'},
...               {"MAJOR_P#":'P2',  "MINOR_P#":'P3'},
...               {"MAJOR_P#":'P2',  "MINOR_P#":'P4'},
...               {"MAJOR_P#":'P3',  "MINOR_P#":'P5'},
...               {"MAJOR_P#":'P4',  "MINOR_P#":'P6'},
...              ])
>>> print TCLOSE(MM)
+----------+----------+
| MAJOR_P# | MINOR_P# |
+==========+==========+
| P1       | P2       |
| P1       | P3       |
| P2       | P3       |
| P2       | P4       |
| P3       | P5       |
| P4       | P6       |
| P1       | P4       |
| P1       | P5       |
| P2       | P5       |
| P2       | P6       |
| P1       | P6       |
+----------+----------+

Quota (QUOTA)

This restricts a relation to tuples with a number of values after ranking them by the given attribute(s). For example, to find the three lowest marks:

>>> print QUOTA(EXAM_MARK, 3, ['Mark'])
+----------+-----------+------+
| CourseId | StudentId | Mark |
+==========+===========+======+
| C2       | S1        | 49   |
| C1       | S2        | 49   |
| C3       | S3        | 66   |
+----------+-----------+------+

This operator takes an optional parameter (asc) which, if set False, sorts the tuples in descending rank before the quota is taken, e.g. to find the top three marks:

>>> print QUOTA(EXAM_MARK, 3, ['Mark'], False)
+----------+-----------+------+
| CourseId | StudentId | Mark |
+==========+===========+======+
| C1       | S1        | 85   |
| C3       | S3        | 66   |
| C1       | S4        | 93   |
+----------+-----------+------+

Note: this could return a higher or lower number of tuples than the desired number if there are ties or not enough values.

Insert (insert, |=)

To modify a relation variable, like any other variable, you assign a value to it. Since the values of relations are often large, we introduce a shorthand, insert, for adding tuples to an existing relation variable. This shorthand really just performs a UNION (|) between the original value and the given relation (so the two relations must have the same heading).

>>> IS_CALLED.insert(Relation(["StudentId", "Name"], [('S100', 'New'), ('S101', 'New2')]))
>>> print IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+----------+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
| S100      | New      |
| S101      | New2     |
+-----------+----------+

An alternative is to use the inplace | assignment operator, |=:

>>> IS_CALLED |= Relation(["StudentId", "Name"], [('S102', 'New3'), ('S103', 'New4')])
>>> print IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+----------+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
| S100      | New      |
| S101      | New2     |
| S102      | New3     |
| S103      | New4     |
+-----------+----------+

There is also a version that takes a single tuple:

>>> IS_CALLED |= Tuple(StudentId='S104', Name='New5')
>>> print IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+----------+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
| S100      | New      |
| S101      | New2     |
| S102      | New3     |
| S103      | New4     |
| S104      | New5     |
+-----------+----------+

Delete (delete, -=)

We also introduce a shorthand, delete, for removing tuples from an existing relation variable. This shorthand really just performs a MINUS (-) between the original value and the given relation (so the two relations must have the same heading).

>>> IS_CALLED.delete(Relation(["StudentId", "Name"], [('S100', 'New'), ('S101', 'New2')]))
>>> print IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+----------+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
| S102      | New3     |
| S103      | New4     |
| S104      | New5     |
+-----------+----------+

An alternative is to use the inplace - assignment operator, -=:

>>> IS_CALLED -= Relation(["StudentId", "Name"], [('S102', 'New3'), ('S103', 'New4')])
>>> print IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+----------+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
| S104      | New5     |
+-----------+----------+

There is also a version that takes a single tuple:

>>> IS_CALLED -= Tuple(StudentId='S104', Name='New5')
>>> print IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+----------+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+

To delete all tuples, pass itself:

>>> EXAM_MARK_HIGH_FLAG.delete(EXAM_MARK_HIGH_FLAG)
>>> print EXAM_MARK_HIGH_FLAG
+------+-----------+----------+------+
| High | StudentId | CourseId | Mark |
+======+===========+==========+======+
+------+-----------+----------+------+

And to destroy the relation variable altogether, use the standard Python del() method:

>>> del(EXAM_MARK_HIGH_FLAG)
>>> print EXAM_MARK_HIGH_FLAG
Traceback (most recent call last):
        ...
NameError: name 'EXAM_MARK_HIGH_FLAG' is not defined

Update (update)

Another shorthand, update, is for updating selected tuples in an existing relation variable. This shorthand really just performs a delete followed by an insert, with some special copying between the deleted tuples and the inserted tuples. For example, if we want to update all exam marks below 70 to 0:

>>> print EXAM_MARK
+-----------+----------+------+
| StudentId | CourseId | Mark |
+===========+==========+------+
| S1        | C1       | 85   |
| S1        | C2       | 49   |
| S2        | C1       | 49   |
| S3        | C3       | 66   |
| S4        | C1       | 93   |
+-----------+----------+------+

>>> EXAM_MARK.update(lambda t: t.Mark < 70, ['Mark'], lambda u: {'Mark':0})
>>> print EXAM_MARK
+-----------+----------+------+
| StudentId | CourseId | Mark |
+===========+==========+------+
| S1        | C1       | 85   |
| S4        | C1       | 93   |
| S1        | C2       | 0    |
| S2        | C1       | 0    |
| S3        | C3       | 0    |
+-----------+----------+------+

The first boolean expression is used to determine which tuples to update - this is the same format used in Restriction (the where method). Then comes the list of attributes to be updated, followed by another lambda expression that returns a dictionary containing the new values for each attribute - this is the same format used in Extension (the extend method). The values here can refer to the pre-update tuple values by prefixing their names with ‘_OLD_’, for example to remove 10 from each Mark over 9:

>>> EXAM_MARK.update(lambda t: t.Mark >= 10, ['Mark'], lambda u: {'Mark':u._OLD_Mark - 10})
>>> print EXAM_MARK
+-----------+----------+------+
| StudentId | CourseId | Mark |
+===========+==========+------+
| S1        | C2       | 0    |
| S2        | C1       | 0    |
| S3        | C3       | 0    |
| S1        | C1       | 75   |
| S4        | C1       | 83   |
+-----------+----------+------+

User-definable Operators

One of the features of Dee is that adding new relational operators is easy and encouraged. Simply define a standard Python function taking and returning relations. For example, the transitive closure operator TCLOSE, used to explode a tree structure and discussed earlier, is implemented like this:

def TCLOSE(r):
    """Transitive closure (an example of a recursive relational operator)"""
    if len(r.heading()) != 2:
        raise RelationInvalidOperationException(r, "TCLOSE expects a binary relation, e.g. with a heading ['X', 'Y']")

    X, Y = r.heading()

    TTT = r | (COMPOSE( r, r.rename({Y:'_Z', X:Y}) ).rename({'_Z':Y}))
    if TTT == r:
        return TTT
    else:
        return TCLOSE(TTT)

Notice that this takes a single relation, asserts that it has two attributes and then recursively defines itself to return a relation. The simplicity of the definition owes much to the power of the basic relational operators, the macros defined on them, and the Python language itself.

Databases

We currently define a base Database class (in DeeDatabase.py) which provides a namespace for grouping relation variables (base and virtual) together. It also acts as a unit of persistence for both defining, storing and loading relation variables and for transactions. Each new Database should descend from this base class. The initial base relation variable definitions and data should be added and any virtual relation variable definitions and supporting functions.

Each database has its own catalog: a set of relations that describe the database (including itself):

  • relations
  • attributes
  • constraints
  • constraint_attributes

When a database is imported, the relation variables it holds are loaded from storage. When the program that imported the database is finished, the database (i.e. all its relation variables) are stored to the storage area (e.g. disk). The database is also stored upon a transaction commit - see below.

The database construction method (__init__ in Python) is called when a database is first created or loaded from storage. It is used to define initial values for relation variables, e.g.:

class darwen_Database(Database):
    def __init__(self, name):
        """Define initial relvars and their initial values here
           (Called once on database creation)"""
        Database.__init__(self, name)

        if 'IS_CALLED' not in self:
            print "Adding IS_CALLED..."
            self.IS_CALLED = Relation(["StudentId", "Name"],
                                 [('S1', 'Anne'),
                                  ('S2', 'Boris'),
                                  ('S3', 'Cindy'),
                                  ('S4', 'Devinder'),
                                  ('S5', 'Boris'),
                                 ]
                                )
        ...

Also the _vinit method should be implemented to add any functions to support virtual relation variables. This is called whenever the relation variables are refreshed, e.g. after a load or rollback:

def _vinit(self):
    """Define virtual relvars/relconsts
    (Called repeatedly, e.g. after database load from disk or commit)
    """
    Database._vinit(self)

    if 'C_ER' not in self:
        print "Defining C_ER..."
        self.C_ER = Relation(["CourseId", "Exam_Result"],
                               self.vC_ER,
                               {'pk':(Key, ["CourseId"])})

And then include any supporting functions:

def vC_ER(self):
    return self.COURSE.extend(['Exam_Result'], lambda t:{'Exam_Result':
                                                    (self.EXAM_MARK & GENERATE({'CourseId':t.CourseId})
                                                    )(['StudentId', 'Mark'])}
                            )(['CourseId', 'Exam_Result']) #fixed

And load or create the database (this will run whenever the module is imported):

#Load or create the database
darwen = Database.open(darwen_Database, "darwen")

Then to use this database, import it:

>>> from darwen import *
Defining C_ER...

and refer to its relational variables within their namespace:

>>> print darwen.relations
+-----------------------+
| relation_name         |
+=======================+
| IS_ENROLLED_ON        |
| C_ER                  |
| EXAM_MARK             |
| IS_CALLED             |
| relations             |
| COURSE                |
| constraint_attributes |
| attributes            |
| constraints           |
+-----------------------+

>>> print COMPOSE(darwen.attributes, GENERATE({'relation_name':'IS_CALLED'}))
+----------------+
| attribute_name |
+================+
| StudentId      |
| Name           |
+----------------+

>>> print darwen.IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+

>>> print darwen.C_ER
+----------+----------------------+
| CourseId | Exam_Result          |
+==========+----------------------+
| C1       | +-----------+------+ |
|          | | StudentId | Mark | |
|          | +===========+======+ |
|          | | S1        | 85   | |
|          | | S2        | 49   | |
|          | | S4        | 93   | |
|          | +-----------+------+ |
| C2       | +-----------+------+ |
|          | | StudentId | Mark | |
|          | +===========+======+ |
|          | | S1        | 49   | |
|          | +-----------+------+ |
| C3       | +-----------+------+ |
|          | | StudentId | Mark | |
|          | +===========+======+ |
|          | | S3        | 66   | |
|          | +-----------+------+ |
| C4       | +-----------+------+ |
|          | | StudentId | Mark | |
|          | +===========+======+ |
|          | +-----------+------+ |
+----------+----------------------+

Any modifications to the existing relation variables, or new ones added to the namespace, or old ones deleted from the namespace, will be persisted when the program finishes (or after a transaction commit).

Clusters

We currently define a base Cluster class (in DeeCluster.py) which provides a namespace for grouping databases together. Each new Cluster should descend from this base class and import and set its database definitions.

Each cluster has its own catalog: a set of relations that describe the cluster:

  • databases

The cluster construction method (__init__ in Python) is called when a cluster is first created or loaded from storage. It is used to define initial values for database variables, e.g.:

class demo_Cluster(Cluster):
    def __init__(self, name):
        """Define initial databases
           (Called once on cluster creation)"""
        Cluster.__init__(self, name)

        self.date = date
        self.darwen = darwen

And load or create the cluster (this will run whenever the module is imported):

#Create the cluster
demoCluster = demo_Cluster("demo")

Then to use this cluster, import it:

>>> from demoCluster import *

and refer to its database variables within their namespace:

>>> print demoCluster.databases
+---------------+
| database_name |
+===============+
| date          |
| darwen        |
+---------------+

>>> print demoCluster.darwen.relations
+-----------------------+
| relation_name         |
+=======================+
| IS_ENROLLED_ON        |
| C_ER                  |
| EXAM_MARK             |
| IS_CALLED             |
| relations             |
| COURSE                |
| constraint_attributes |
| attributes            |
| constraints           |
+-----------------------+

then within each database, the relation variables are available:

>>> print demoCluster.darwen.IS_CALLED
+-----------+----------+
| StudentId | Name     |
+===========+==========+
| S1        | Anne     |
| S2        | Boris    |
| S3        | Cindy    |
| S4        | Devinder |
| S5        | Boris    |
+-----------+----------+

Transactions

These are currently controlled at the database level. The database has begin, commit and rollback methods to facilitate transactions. For example:

>>> darwen.begin()
Defining C_ER...
>>> try:
...     darwen.EXAM_MARK -= darwen.EXAM_MARK.where(lambda t: t.Mark > 80)
...     print darwen.EXAM_MARK
...     raise Exception("Simulated error condition")
...     darwen.commit()
... except:
...     print "Exception - rolling back..."
...     darwen.rollback()
+-----------+----------+------+
| StudentId | CourseId | Mark |
+===========+==========+======+
| S1        | C2       | 49   |
| S2        | C1       | 49   |
| S3        | C3       | 66   |
+-----------+----------+------+
Exception - rolling back...

>>> print darwen.EXAM_MARK
+-----------+----------+------+
| StudentId | CourseId | Mark |
+===========+==========+======+
| S1        | C1       | 85   |
| S1        | C2       | 49   |
| S2        | C1       | 49   |
| S3        | C3       | 66   |
| S4        | C1       | 93   |
+-----------+----------+------+

Front-ends

As well as using the Python interpreter to interact with relations and databases, a stand-alone web-server that gives access to a cluster of sample databases from a browser is available. To start the server and a client run:

python DeeWebDemo.py

The default browser should launch and a text box will be presented to allow Dee expressions to be entered. The expression can be evaluated by pressing the ‘Evaluate’ button. The default database can be changed by selecting from the drop-down list and pressing the ‘Change database’ button.

Future Work

There are a number of areas that need more work. These include:

Constraint Checking

This should be optimised for insert/update operations where often just the new tuples need to be checked. Also more work could be done to simplify and improve the declaration of some constraints, e.g. ForeignKey, and the representation and storage of ad-hoc lambda functions.

Currently if a constraint fails during the insertion phase of an update, the deletions are not rolled back (unless this is manually controlled via a database transaction).

We would also like to use a multiple assignment mechanism, similar to Python‘s, to achieve parallel assignment to avoid any need for deferred constraints.

Constraint Inference

We need to implement routines to ensure relation constraints, such as candidate keys, are inferred in derived relations. We do some of this already, e.g. for the rename operator, but much more is required.

Catalog Maintenance

Adding and removing constraints from existing relations is not yet supported. Also database-level constraints cannot be created yet. Also actions such as deleting a relation variable that is referenced should be handled better.

Storage

With memory sizes measured in Gigabytes, the immediate intention is to keep databases in memory with a protected power supply to provide short-term durability. For longer term durability, the database can be saved to disk. This is currently done by storing the database relation variables in a single Python pickle file. If the debug_dump constant is True in DeeDatabase then a script is also written containing the latest relation definitions and data. This storage mechanism could be made more efficient (by storing incremental changes rather than the whole database) and more resilient (e.g. storing online to Amazon’s S3).

Concurrency

The current implementation is intended to be shared by a single application, e.g. a web-server, feeding multiple clients. Work is needed to ensure that multiple threads can share the data and serialise transactions.

Attribute Values

Some work is needed to tighten up the permitted values in Tuples. For example, whether None can ever be used and how Python objects are stored and compared.

API

We envisage several possible ways of accessing the data, e.g. RPC, SOAP.

Virtual Relation Variables

These need to allow inserts, deletes and updates that can then be passed down to the base relation(s) where possible. Also an improved method for storing their definitions is needed.

Security

Access rights and privileges need to be designed and implemented.

SQL Veneer

Obviously SQL will be around for a long time yet, so an interpreter could be built to parse SQL and generate the equivalent statements in Dee (where possible).

Optimisation

By working with in-memory data and using fully hashed relations we can perform very fast joins. We have some optimisation code for some relational operators, but this could be improved e.g. by introducing a ternary join operator AND(r, s, t).

As another example, the where method effectively scans the relation, filtering by the given expression. For equality checks on large relations it would be much faster to use & with the appropriate relation: then the hash join will be used instead of a scan. This optimisation can be made manually, e.g. replace r.where(lambda t: t.Name=='Find it') with r & GENERATE({'Name':'Find it'}), but it ought to be done by the system.

Surface More Abstractions

Perhaps we should introduce more Relation methods (such as intersect, union, join, times) to map onto the existing operators. These would simply be aliases but may make things appear more familiar. On the other hand, once it’s realised that these are all just AND and OR, it might be best to keep the number of different names for the same thing to a minimum.

Et Cetera

  • renderXML() plus XSLT
  • Excel/web-spreadsheet as a front-end for data entry
  • _scan to yield tuples for pipelining
  • Some syntactic improvements (if not a whole interpreted layer)
  • Add to sense-checking throughout
  • Fix, test and document generic DIVIDE
  • Overload / operator for Relations to be relational DIVIDE
  • Determine extend/update attribute lists from the lambda expression’s dictionary keys
  • Allow RVAs to be described in the catalog (i.e. recursive catalog)
  • Put into a Python package
  • Syntax highlighting

Colophon

This document was written in reStructuredText format. The HTML was generated using Sphinx with syntax highlighting by Pygments. The examples within it were verified by doctest.

[1]“Date and Darwen’s Database Dream”

Table Of Contents

This Page