database terminal

dbschemacmd

A Tool for Database Schema Normalization Working on Functional Dependencies

This page can be viewed online at www.elstel.org/database/dbschemacmd.html.de

preface

Database schemata consisting of a set of relations or tables are usually decomposed into normal form in order to avoid insert, update and deletion anomalies. As soon as a databse schema is not in the highest normal form all of these anomalies will appear together / in conjunction. In the following I will present you a tool that can facilitate normalizations done by hand or even fully automate this process at least as long as there appear no multivalued dependencies. If there should be no multivalued dependencies then a schema that is in BCNF is already in the highest normal form and can be deployed without caution.

Good UML- or EER-diagrams will map classes/entities into relations in BCNF or an even higher normal form; this means that they are already in normal form. As soon as there are more than one prime attributes and different keys and whenever parts of a key may pertain to another class or relation (weak entities) care needs to be taken to decompose the conceptual data model correctly into relations. The tool gives different possiblilties to achieve an appropriate normalization.

minimal cover, transitive closure, general mode of operation

dbschemacmd is an extended python command line. As soon as a word starts with a capital letter it will automatically interprete it as a set of attributes or dependencies:

> s=('ABCDEFG','B->BE,B->DG,BF->AD,C->G DG->BA,G->F') > s ABCDEFG B->BDEG BF->AD C->G DG->AB G->F > closure(C,s) # cl(C,s) CFG > closure(CD,s) # cl(CD,s) ABCDEFG

A database schema is simply a Python tuple consisting of a set of attributes (in the first place) and a set of dependencies (in the second place). Attribute and dependency sets can also be written without simple quotes whereby the comma has to be replaced by a slash; for intstance A->B/B->C. closure (or short 'cl') is a function which calculates the transitive closure for a set of attributes in a given database schema. The schema can also be interpreted as simple fact-rule system or as a simple set of dependencies of any kind.

dbschemacmd is a full featured Python interpreter. However variables, functions and constants like true or false need to start with lower capital letters. Simple strings or text need to be quoted by double quotes as long as they should not represent sets of attributes or dependencies.

Now let us arrive at the minimal cover:

> mincoverage_more(s) # mminc(s) »0« ABCDEFG B->DEG C->G DG->AB G->F »1« ABCDEFG B->ADEG C->G DG->B G->F > minc(s,0) # mminc(s)[0], mincoverage(s,scramble=0) ABCDEFG B->DEG C->G DG->AB G->F > minc(s,1) # mminc(s)[1] ABCDEFG B->ADEG C->G DG->B G->F

The function mincoverage_more (synonym: mminc) already calculates many different minimal covers and puts the result into Python dictionary i.e. into a key-value mapping. It just uses numbers as keys though also strings and other constants would be allowed in place here. mincoverage_more is an equivalent to calling mincoverage or minc several times with different scramble parameter. The scramble parameter picks one of n! permutations while just compound rules with more than one attribute at the left side are permutated. The fact that dbschemacmd works with compound rules means that rules with the same left side are always grouped into one rule.

It may be that mminc/mincoverage_more does not return all minimal covers. That is particularely true if different attributes on the right side may be left out in a different order. The following example should show this:

> bspHRTC=( 'HRTCSG', 'HR->C,C->T,HT->R,CS->G,HS->R' ); > [bspHRTC] CGHRST { C->T, CS->G, HR->C, HS->R, HT->R } > [minc(maxc(bspHRTC))] CGHRST { C->T, CS->G, HR->C, HS->T, HT->R } > [minc(maxc(bspHRTC),hints=HS->R)] CGHRST { C->T, CS->G, HR->C, HS->R, HT->R } > [maxc(bspHRTC)] CGHRST { C->T, CS->GT, CH->RT, HR->CT, HS->CGRT, HT->CR }

Here we have a so called 'maximum coverage'. That means a maximized non-reflexive right side and a complemented left side in cases where the reverse application of rules does not yield a new superset of an already existing left side. The maximum coverage is also used internally by the BCNF algorithm as shown later on.

If we have once maximized the right side and want to minimize it later on again then we may use the hints parameter of minc as well as mminc to tel which rules should preferredly survive.

Remark: To make the result appear in a singleton line we have written [expression] instead of expression which packs the result into a list containing just one element. Lists in Python look like the following: [1,2,3].

Keys and the Third Normal Form

Now let us progress to the detection of keys and the 3NF synthesis algorithm.

> sets(s) # keyBaseSets(s) {'re': 'AE', 'ua': '', 'mi': 'BDFG', 'li': 'C'} > k=keys(s); pp k; BCD { BC, CD } > verbosity=3; > a = syntesis3NF(k,s); # a = syn(s); one relation per dependency: [u'FG', u'ABDG', u'CG', u'BDEG'] union of relations with mutually dependent left side: [u'FG', u'ABDEG', u'CG'] no relation has a key; adding key CD (all keys: BC,CD) relations including key relation: [u'FG', u'ABDEG', u'CG', u'CD'] all subset relations removed: [u'FG', u'ABDEG', u'CG', u'CD'] > a FG FG { G->F } ABDEG ABDEFG { B->DEG, DG->AB } CG CFG { C->G } CD ABCDEFG { }

The function keys or written in its long form keysTreeAlg returns a tuple consisting of all prime attributes that happen to be found in some key and the set of keys itself. The command pp ("pretty print") which is also applicable as a function does simpy print k on the terminal. This has become necessary here as Python does switch off terminal output by default for assignments.

syntesis3NF or short syn transforms a schema into third normal form. You may speed things up by sepecifying the set of keys having been already calculated as a first parameter. The variable »verbosity« makes many functions output internal steps when calculating their result; among these are keysTreeAlg, syntesis3NF and splitBCNF; even more read, write and append happen to be more talkative like this.

> verbosity=2; read aaudbi2014.dbschema read bspHRTC read bspRDT read bsp035 ... > write my.dbschema, bspHRTC, bspRDT 2 records written > append my.dbschema, bsp035 1 records written > read my.dbschema, replace=false error: could not replace bspHRTC error: could not replace bspRDT error: could not replace bsp035 3 records read. > locals().keys() [u'bsp045', 'verbosity', u'bspHRTC', u'bsp054', u'bsp055', u'bsp056', u'bsp046', u'bspRDT', 'autorepeat', u'bsp044', u'bspRED', u'bsp035']

Variables can be written to file by write or append. Read reads such variables which can basically have an arbitraray content back. Stating the variables which should be written or read is optional. If you do not state such variables then all local variables except the system variables verbosity and autorepeat are written or read from file. You may specify by the replace parameter if already existing variables should be replaced when reading things in again.

das Programm

Before we go on with splitBCNF here you have the download area of the dbschemacmd program. It should work on any computer where you have installed Python. Under Linux and MacOS Python uses to be a preinstalled component of the system; under Windows you may either install Cygwin or try ActivePython though the program has been developed under Linux and has not yet been tested with Activestate Activepython.


donate
Download:
dbschemacmd v1.0.5 dbschemacmd: first released version
dbschemacmd v1.0 pre-release version (do not use)
Author:
Elmar Stellnberger estellnb@elstel.org
verify your download with software/SHA512SUMS.signed and our public key available at contact / impressum at the main page.
python dbschemacmd.py should work for any case if the dbschemacmd link should not have been inflated correctly.


rss-feed get informed about site updates via rss!  (right click: add with Akregator)


BCNF Boyce Codd Normalform

Now let us arrive at an example which still needs to be decomposed into BCNF even after the 3NF synthesis algorithm has successfully been applied.

> bspHRTC=( 'HRTCSG', 'HR->C,C->T,HT->R,CS->G,HS->R' ); > syn(bspHRTC) CHRT CHRT { C->T, HR->C, HT->R } HRS CGHRST { HS->R } CGS CGST { CS->G } > splitBCNF(syntesis3NF(bspHRTC)) CHR { CH->R, HR->C } CT { C->T } HRS { HS->R } CGS { CS->G }

The best results with the BCNF-support being still experimental can often be achieved when applying the 3NF syntesis algorithm first.

> verbosity=2; > splitBCNF(bspHRTC) # split(bspHRTC,heuristic=1) CGHRST: CS -> G : 1 3 C -> T : 2 2 HR -> C : 3 3 HT -> R : 3 3 splitting CGHRST along CS->G CHRST: C -> T : 2 2 HR -> C : 2 3 HT -> R : 3 3 splitting CHRST along C->T CHRS: HR -> C : 2 3 CH -> R : 3 3 splitting CHRS along HR->C HRS { HS->R } CHR { CH->R, HR->C } CT { C->T } CGS { CS->G }

It is the number of rules from the minimal coverage which would get lost by decomposition (first measured value) and the simplicity of rules (second measured value) which determine the rule being chosen next for decomposition by the BCNF-split algorithm. The result appears to be fine.

python (snake)
dbschemacmd is a complete Python interpreter.
> verbosity=0; pp splitBCNF(bspHRTC,heuristic=2) HRS { HS->R } HRT { HR->T, HT->R } CHR { CH->R, HR->C } CGS { CS->G } > autorepeat=true; pp splitBCNF() HST { HS->T } HRT { HR->T, HT->R } CHR { CH->R, HR->C } CGS { CS->G }

As there are many possiblities on how to decompose a database schema into BCNF we can also offer an alternative heuristic which assesses the number of rules left over in the maximum coverage as second measured value rather than the simplicity of rules (as few as possible should get lost). We wanna remark that the result of splitBCNF will also depend on the minimum coverage it is fed with initially as the algorithm tries to preserve as many rules as possible in their original state in spite of calculating the maximum coverage and minimal cover several times; at least as far as that concerns the right sides of the rules.

Further results with the same heuristic and the same initial minimal cover can be retrieved by calling splitBCNF without parameters. You may alternatively retrieve a valid Python iterator object by calling iterSplitBCNF. There you can query for the next result with myiter.next().

Internals

The internal representation of data structures can be viewed with s_repr (dbschemacmd) and repr (pure Python). dbschemacmd processes rules in the so called compound notation; not in canonical form. Compound notation means that rules with the same left side are grouped into a hash or dictionary. Compound notation has not only been chosen for the reason of improved speed but also because it has facilitated our implementation.

The author will be glad to receive feedback, proposals, bug reports and fixes.