Introduction Data Modeling and Relational Databases

Introduction Data Modeling and Relational Databases

John "Scooter" Morris

May 7, 2008

Overview

  • Limitations
  • Data Modeling
  • Data Access Methods
  • Types of Databases
  • SQL

Limitations

  • 1 Hour is not enough!
  • What am I not telling you about?
    • database normalization
    • object-based approaches to database design
    • object-relational mapping
    • .... too much more to mention ....
  • This is an introduction
    • enough to get started and to know what you don't know (I hope)
  • Ask questions!

Example Problem

A system to automate the tracking and documentation of plasmid construction

  • Terminology:
    • fragment: a length of double-stranded DNA
    • plasmid: a circular fragment
    • recipe: a series of manipulations of the DNA to produce a new plasmid with cDNA of interest inserted
    • ACL: access control list
  • Needs:
    • Data processing -- convert raw data into results
    • Visualization -- a way to visualize the results
    • Data storage -- store the results (and perhaps the raw data)

Example problem

Data Modeling

  • The FIRST Step
  • Structured way to understand the data semantics
  • Independent of underlying platform
  • Way to communicate with team members (including users)
  • Excellent (minimal?) documentation
  • Example: ER Diagrams

ER Diagrams

  • Entity (Entity Type)
    • A collection of entities that share common properties
      • e.g. Fragment, Recipe, Gene
  • Attribute
    • Property of an entity that is of interest
      • e.g. Name, File, Sequence
  • Relationship
    • An association between entities
      • e.g. Produces
  • Degree
    • Number of entities involved in the relationship
      • one-to-many, one-to-one, many-to-many

ER Diagrams

ER Diagrams

  • Questions?

  • Recommended Reading:
    • Chen, P.S. The entity-relationship model: toward a unified view of data. ACM Trans on Database Syst. pp 9-36 (March 1976)

Data Access Methods

  • How is the data accessed?
  • Why do we care?
    • Important for special-purpose databases
    • Some systems give you choices
  • Terminology:
    • Index: an access path into the data
    • Key: a field (or fields) used to access the data
    • Primary key: a field (combination) whose values uniquely identify the record

Data Access Methods - Linear

  • Simple record-oriented view
  • Access is through sequential reads
  • OK for small data stores -- very slow when the number of records gets large

Data Access Methods - Hash

  • Compute a function to access the data
    • e.g. add up the characters to produce an integer
  • Usually requires a separate index
  • The "goodness" of the hash function is important
    • A perfect hash function would result in a direct access to the data (i.e. a one-to-one relationship)
    • Perfect hash functions are almost never possible
    • This results in the possibility of multiple "hits" per hash value (or bucket)

Data Access Methods - Hash

  • Simple (and silly) example:
    • Hash on the first letter of the recipe name:

Data Access Methods - BTree

  • Good for sequenced or character data
  • In general, the index set is a tree whose leaves consist of pointers into a sequence set
  • Each node in the index set points to three lower nodes
  • Access is by value comparison:
  • For value V:
     if V <= left value: 
      → move to the left lower node
     if left value < V <= right value:
       → move to the middle lower node
     if V > right value:
       → move to the right lower node

Data Access Methods - BTree

  • Example: find pBR322.f2 assuming a Btree index on fragment name
    • pBR322.f2 > pBR322 and <= pHR5CV
      • we take the middle node, which contains pBR322.f2
      • If there are more layers, continue repeating the algorithm until you get to the sequence set

Data Access Methods - BTree

Data Access Methods

  • There are many other indexing techniques
  • Indexing can substantially improve access times
  • Deciding what field to index on depends on usage patterns
  • You can have multiple indices, but that substantially increases insert time and space requirements

Data Access Methods

  • Questions?

  • Recommended Reading:
    • Knuth, D. E. The Art of Computer Programming, Volume III: Sorting and Searching. Reading, Mass.: Addison-Wesley (1973)

Types of Databases

  • Flat-file (notes only)
  • Hierarchical (notes only)
  • Network (notes only)
  • Relational
  • Object (notes only)
  • Object-Relational (notes only)

Flat-File Databases

  • No database-enforced (or provided) linkage between records
  • Excellent for small or special-purpose databases
  • Might include support for single or multiple indexes
  • Major feature: ease of use (Filemaker, Access)
  • Major drawback: scalability & flexibility
  • e.g.:
    • ndbm
    • Berkeley DB (Sleepycat DB)
    • vi, grep, sed
    • FileMaker
    • Access

Hierarchical Databases

    • Relationship between Recipe and Fragment is one-to-many (master-detail)
    • Assume two recipes: r1.cr and r2.cr
      • r1.cr produces 2 plasmids and 1 fragment:
        • r1.p1, r1.p2, and r1.f1
      • r2.cr produces 2 fragments:
        • r2.f1, and r2.f2

Hierarchical Databases

Hierarchical Databases

  • Database provides explicit master-detail support
  • Ideal for many business applications
  • Restricted to a strict hierarchy
  • Queries down the hierarchy are very efficient
  • Any other queries are very expensive
  • e.g.
    • IMS
  • What about many-to-many relationships?

Networked Databases

    • Fragment and Gene have a many-to-many relationship
    • Not represented well by hierarchical databases

Networked Databases

Networked Databases

  • Based on set theory
  • Database provides explicit linkage support
  • Very significant design costs
  • Queries along the connection path are very efficient
  • Any other queries are very expensive
  • e.g.
    • CODASYL
  • What if I want to "discover" other relationships?

Relational Databases

  • Based on relational views (tables)
  • Associations are based on data values, not expressed linkages
  • All data is expressed in tables
  • Terminology:
    • Rows are called tuples
    • Columns (attributes) are of a common domain (type)

Relational Databases

  • Start by defining tables for our entities:

Relational Databases

  • Now define tables for relationships, adding attributes for the associations:

Relational Algebra

    • Selection
      • Selection of tuples based on Boolean criteria

Relational Algebra

    • Projection
      • Selection of attributes

Relational Algebra

    • Join (equijoin)
      • Matrix product of two relations based on equality of attributes with the same domain

Relational Algebra

  • Query: What recipes produce the AMP gene?
    • First, select the AMP gene from the GENE relation and join it to CONTAINS
      TEMP1 = (CONTAINS[FRAG,GENE] times GENE[ID,NAME]) 
                    where GENE.ID=CONTAINS.GENE and GENE.NAME='AMP' 

Relational Algebra

  • Query: What recipes produce the AMP gene?
    • Second, join the result to the PRODUCES relation and select the RCP attribute
      TEMP2 = (TEMP1 join PRODUCES)[RCP]*

Relational Algebra

  • Query: What recipes produce the AMP gene?
    • Finally, join the result to the RECIPES relation
      ANSWER = 
               (TEMP2 join RECIPES) where TEMP2.RCP=RECIPES.RCP

Structured Query Language (SQL)

  • ANSI standard syntax for relational algebra
  • Supported by all major commercial relational databases
  • Also supported by many open-source efforts
    • e.g. mysql, perl's DBI/DBD
  • Will only cover:
    • CREATE
    • INSERT
    • SELECT
    • UPDATE

SQL - CREATE

  • Creates database objects (databases, tables, indices)
    • SYNOPSIS:
      CREATE DATABASE database_name 

      CREATE TABLE table_name
      (
        column_name1 data_type,
        column_name2 data_type,
        ......
      [PRIMARY KEY (column_name),]
      [FOREIGN KEY (column_name) REFERENCES table_name(column_name),]
      )

      CREATE [UNIQUE] INDEX index_name 
            ON table_name (column_name)

SQL - CREATE

    • Examples:
      CREATE TABLE "GENE"
      ( 
      ID char(16), 
      NAME varchar(20), 
      PROTEIN varchar, 
      START# int,
      PRIMARY KEY (ID)
      );

      CREATE TABLE "PRODUCES"
      ( 
      RCP char(16), 
      FRAG char(16),
      DATE date, 
      FOREIGN KEY (RCP) REFERENCES RECIPE(RCP),
      FOREIGN KEY (FRAG) REFERENCES FRAG(ID)
      );

      CREATE UNIQUE INDEX on GENE (ID);

SQL - INSERT

  • Inserts data into a table row
    • SYNOPSIS:
      INSERT INTO "tablename" (first_column,...last_column) 
                 VALUES(first_value,...last_value);
    • Example:
      INSERT INTO GENE (ID, NAME, PROTEIN, START#) 
                  VALUES ('G1', 'AMP', 'MAKK...', -5);

SQL - SELECT

  • Selects data from relational tables
  • Key syntax for expressing relational algebra
    • SYNOPSIS
      SELECT [DISTINCT] column1[,column2] FROM table1[,table2]
          [WHERE "conditions"] 
          [GROUP BY "column-list"] 
          [HAVING "conditions] 
          [ORDER BY "column-list" [ASC | DESC] ]

SELECT - Selection

  • Selection
    • Selection of tuples based on Boolean criteria

SELECT - Projection

    • Projection
      • Selection of attributes

SELECT - Equijoin

  • Join (equijoin)
    • Matrix product of two relations based on equality of attributes with the same domain

SELECT -- Query Example

  • Query: What recipes produce the AMP gene?
    • First, select the AMP gene from the GENE relation and join it to CONTAINS
      CREATE TABLE TEMP1 AS 
           SELECT CONTAINS.FRAG,CONTAINS.GENE,GENE.NAME FROM CONTAINS, GENE 
                WHERE GENE.ID=CONTAINS.GENE AND GENE.NAME="AMP";

SELECT -- Query Example 2

  • Query: What recipes produce the AMP gene?
    • Second, join the result to the PRODUCES relation and select the RCP attribute
      CREATE TABLE TEMP2 AS 
           SELECT PRODUCES.RCP FROM TEMP1,PRODUCES WHERE TEMP1.FRAG=PRODUCES.FRAG;

SELECT -- Query Example 3

  • Query: What recipes produce the AMP gene?
    • Finally, join the result to the RECIPES table
      SELECT DISTINCT RECIPE.RCP, RECIPE.NAME, RECIPE.FILE, RECIPE.OWNER 
           FROM TEMP2, RECIPE WHERE TEMP2.RCP = RECIPE.RCP;
      • Note the DISTINCT keyword to remove duplicate rows

SQL - Query Example (shorthand)

  • Most modern relational databases have good query optimizers
    • Usually no need to create intermediate tables:
      SELECT DISTINCT RECIPE.RCP, RECIPE.NAME, RECIPE.FILE, RECIPE.OWNER 
           FROM GENE, CONTAINS, PRODUCES, RECIPE 
                WHERE GENE.NAME = 'AMP' AND GENE.ID = CONTAINS.GENE
                     AND CONTAINS.FRAG = PRODUCES.FRAG
                     AND PRODUCES.RCP = RECIPE.RCP;

SQL - UPDATE

  • Updates data in a database
    • SYNOPSIS:
      UPDATE tablename 
          SET columnname="newvalue"[,nextcolumn="newvalue2"...]
              WHERE columnname OPERATOR "value" 
                  [AND|OR column OPERATOR "value"];
      		
    • Example:
      UPDATE GENE SET NAME='AMP' WHERE ID='G1';

SQL - Other Useful Commands

    • ALTER - Alter a table after it has been created
      • Add or drop columns
      • Add or drop primary or foreign keys
    • DELETE - Delete a row from a table. Syntax is similar to SELECT.
    • DROP - Delete an entire table or database
    • SQL Functions - aggregation functions that operate on the results from a select
      • Include basic statistics (STDEV,AVG,SUM,VAR), and counting functions like COUNT(column)
      • Example:
        SELECT COUNT(*) FROM RECIPE,PRODUCES 
         	   WHERE RECIPE='scooter' AND RECIPE.RCP=PRODUCES.RCP;

SQL - References

Relational Databases

  • Foundation of most production databases
  • Based on relational calculus and relational algebra
  • Allows ad-hoc query capability across record types
  • Supports a standard query language (SQL)
  • Can support either hierarchical or network models
  • Attributes are limited to basic types
  • e.g.
    • mysql
    • Oracle
    • DB2

Object-Relational Databases

  • Essentially an extension of the relational database model
  • Preserves the tabular (relational) organization of the data
  • Allows developers to define more complex data types (User Defined Types, UDTs)
  • No support for encapsulation or inheritance
  • Some support for methods is provided (User Defined Functions, UDFs)
  • SQL object extensions already standardized (SQL3)
  • e.g.
    • postgres
    • Oracle

Object Databases

  • Provides persistent storage of objects
  • Most useful in conjunction with object-based applications
  • Primarily a programmer's tool, although vendors are providing SQL3 and ODBC interfaces
  • e.g.
    • Objectivity

Types of Databases

  • Questions?

  • Recommended Reading:
    • Date, C.J. An Introduction to Database Systems. Reading, Mass.: Addison-Wesley (1981)
    • Codd, E.F. A Relational Model of Data for Large Shared Data Banks. CACM 13, No. 6 (June 1970)

Uses of Databases

  • ...or why do I [should you] care about this stuff?
  • Three major computing issues in bioinformatics:
    • Data processing -- convert raw data into results
    • Visualization -- a way to visualize the results
    • Data storage -- store the results (and perhaps the raw data)

Questions?

Database Access with Python

  • SQL provides a way to interact with a relational database...
  • ... but how do I access my database programmatically?
  • Lots of ways, but we're going to discuss MySQLdb
  • MySQLdb:
    • Provides access to MySQL from python scripts
    • Simple (maybe too simple...)
    • Basic idea is to execute SQL commands and return the response as a python list
    • Installed on ockham

MySQLdb Example

#! /usr/bin/python

import sys
import MySQLdb

try:
	# Open a connection to the database
	conn = MySQLdb.connect (host="127.0.0.1", db="bmi280")
	cursor = conn.cursor()
	
	# Execute an SQL statement -- can be pretty much any SQL
	cursor.execute("SELECT NAME, PROTEIN from GENE")
	# fetchall returns a list of lists
	rows = cursor.fetchall()
	for row in rows:
		print "%s, %s"%(row[0], row[1])

	# Close the cursor and commit any changes to the database
	cursor.close()
	conn.commit()
	conn.close()

except MySQLdb.Error, e:
	# Handle any errors
	print "Error %d: %s" % (e.args[0], e.args[1])
        sys.exit (1)
AMP, MAKK...
TET, MYAK...
NGF, MYAK...

Larger MySQLdb Example

#! /usr/bin/python
import sys
import MySQLdb

try:
  conn = MySQLdb.connect (host="127.0.0.1", db="bmi280")

  # Get a cursor we can work with
  cursor = conn.cursor()

  # Use the execute method to pass SQL commands to the database
  cursor.execute("DROP TABLE IF EXISTS `GENE`")
  # Note that we use triple quotes when we need multiple lines
  cursor.execute("""
                CREATE TABLE `GENE` 
                (
                        `ID` char(16),
                        `NAME` varchar(20),
                        `PROTEIN` longtext,
                        `START#` int,
                        PRIMARY KEY (`ID`)
                )
  """)

  cursor.execute("""
                INSERT INTO `GENE` VALUES 
                  ('G1','AMP','MAKK...',-5),
                  ('G2','TET','MYAK...',-10),
                  ('G3','NGF','MYAK...',-1)
  """)

  print "Number of rows inserted: %d"%cursor.rowcount

  # OK, now lets try to get some data out
  cursor.execute("SELECT NAME, PROTEIN from GENE")
  while (1):
    row = cursor.fetchone ()
    if row == None:
      break
    print "%s, %s"%(row[0],row[1])

  print "Number of rows returned: %d"%cursor.rowcount
  # Another way to do the same thing
  cursor.execute("SELECT NAME, PROTEIN from GENE")
  rows = cursor.fetchall ()
  for row in rows:
    print "%s, %s"%(row[0],row[1])

    print "Number of rows returned: %d"%cursor.rowcount

    cursor.close()
    conn.commit ()
    conn.close()

except MySQLdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit (1)

MySQLdb Use

  • MySQLdb provides a good low-level interface
  • For most uses, probably want to wrap low-level SQL commands in Python objects
  • In the above example, a GENE might be an object
  • Might have methods to fetch (SELECT) or save (INSERT) a GENE
  • Provides some insulation from underlying SQL implementation