Introduction to Data Modeling and Data Access Methods

John "Scooter" Morris

April 2, 2014

Overview

  • Limitations
  • Data Modeling
  • ER Diagrams
  • ER Diagrams - Examples
  • Data Access Methods

Limitations (Data Modeling)

  • Data modeling is a large topic
  • We're going to focus on one data modeling technique (Entity-Relationship Diagrams)
  • What am I not telling you about?
    • Other data modeling techniques (see Data Modeling on Wikipedia for a more complete list)
    • Application modeling techniques like UML
    • User modeling techniques that attempt to document the user interaction
  • This is an introduction
    • enough to get started and to know what you don't know (I hope)
  • Ask questions!

Example Problem

A system to automate the tracking and documentation of plasmid construction

  • Terminology:
    • fragment: a length of double-stranded DNA
    • plasmid: a circular fragment
    • recipe: a series of manipulations of the DNA to produce a new plasmid with cDNA of interest inserted
  • Needs:
    • Data processing -- convert raw data into results
    • Visualization -- a way to visualize the results
    • Data storage -- store the results (and perhaps the raw data)

Implementation Approaches

  • Incremental implementation
    • Start coding right away with small parts of system
    • Add complexity as you go along
    • Pros:
      • get something done quickly
      • learn by doing
    • Cons:
      • will probably have to throw out a lot of code
      • early data model will constrain your implementation
      • changes to data model will require significant refactoring
      • over time, will become unmaintainable
    • Only recommended for quick-and-dirty throw-away code

Implementation Approaches

  • Detailed design
    • Produce detailed design
      • Data model
      • Class diagrams
      • UML, etc.
    • Code only after design complete
    • Pros:
      • probably get cleaner implementation
      • design documentation will serve to assist in maintenance
      • better able to scope project and estimate resources
    • Cons:
      • very time consuming
      • changes in research may happen too quickly to make this practical
      • users may get impatient
    • Only recommended for very limited, stable projects
    • Data model is key

Implementation Approaches

  • Hybrid approach
    • Produce data model design
    • Do incremental implementation
    • Pros:
      • the data model is the hardest part to change and has the largest impact on your code, so it gets designed up front
      • data model documentation is a useful basis for discussing the system with colleagues
      • get the benefits of incremental implementation
    • Cons:
      • still have to spend some up-front design time
      • will (undoubtedly) need to throw out some code or refactor
    • Recommended approach for most projects

Data Modeling

  • The FIRST Step
  • Structured way to understand the data semantics
  • Independent of underlying platform
  • Way to communicate with team members (including users)
  • Excellent (minimal?) documentation
  • Example: ER Diagrams

ER Diagrams: Notation

  • Entity (Entity Type)
    • A collection of entities that share common properties (a thing)
      • e.g. Fragment, Recipe, Gene
  • Attribute
    • Property of an entity that is of interest
      • e.g. Name, File, Sequence
  • Relationship
    • An association between entities
      • e.g. Produces
  • Degree (more commonly called cardinality)
    • Number of entity instances that can take part on each side of the relationship
      • one-to-one, one-to-many, many-to-many
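
The notation above can be mirrored loosely in code. The following is a minimal sketch, not part of the original slides: entities become Python dataclasses, attributes become fields, and the "Produces" relationship becomes a list. The one-to-many degree, the recipe name, and the sequence value are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Entity: a length of double-stranded DNA."""
    name: str       # attribute
    sequence: str   # attribute


@dataclass
class Recipe:
    """Entity: a series of manipulations that produces fragments."""
    name: str
    # Relationship "Produces", assumed here to be one-to-many:
    # one recipe yields many fragments.
    produces: list[Fragment] = field(default_factory=list)


# Hypothetical data, invented for illustration.
recipe = Recipe(name="clone-insert")
recipe.produces.append(Fragment(name="pBR322.f2", sequence="GATTACA"))
print(len(recipe.produces))
```

A many-to-many relationship would need a list on both sides (or a separate association object), which is one reason the degree is worth settling in the diagram before coding.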

ER Diagrams: Example

ER Diagrams: Extended Example

  • Extend the system....
    • Add the ability to extract the experimental details
    • Add more information about the gene: promoters, enhancers, RBS, introns, exons, CDS, etc.
    • Add information about the protein: structure, function, sequence, etc.

ER Diagrams: Extended Example - 1

One possible design:

ER Diagrams: Extended Example - 2

Drop "Feature" entity:

ER Diagrams: Extended Example - 3

Expand structure:

ER Diagrams: Other Examples

  1. The canonical: employee/employer/department system
  2. Another database "favorite": sales/parts/inventory
  3. More relevant: on-line laboratory information management system (LIMS)
  4. Modeling systems:
    • apoptosis signaling pathway
    • ascending pain pathway

ER Diagrams: Apoptosis Signaling Pathway

ER Diagrams: Ascending Pain Pathway

ER Diagrams

  • Questions?

  • Recommended Reading:
    • Chen, P.S. The entity-relationship model: toward a unified view of data. ACM Trans. on Database Syst. 1(1), pp 9-36 (March 1976)

Data Modeling Assignment

Put together an ER diagram for a database system for cellular pathways. Include information about the proteins, metabolites, functions, interactions, cellular locations, and evidence codes. Don't attempt to be complete -- focus on the major entities and their relationships.

Data Access Methods

  • How is the data accessed?
  • Why do we care?
    • Important for special-purpose databases
    • Some systems give you choices
  • Terminology:
    • Index: an access path into the data
    • Key: a field (or fields) used to access the data
    • Primary key: a field (or combination of fields) whose values uniquely identify the record

Data Access Methods - Linear

  • Simple record-oriented view
  • Access is through sequential reads
  • OK for small data stores -- very slow when the number of records gets large
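
As a rough illustration of linear access, the sketch below reads records one at a time until the key matches and reports how many reads it took. The record layout and the lengths are placeholders invented for this example; only the fragment names come from the slides.

```python
# Hypothetical records: (fragment name, length in base pairs).
# The lengths are illustrative placeholders.
records = [
    ("pBR322", 4361),
    ("pBR322.f2", 1200),
    ("pHR5CV", 5100),
]


def linear_find(records, name):
    """Read records in order until the key matches; O(n) comparisons."""
    for reads, record in enumerate(records, start=1):
        if record[0] == name:
            return record, reads
    return None, len(records)


record, reads = linear_find(records, "pHR5CV")
print(record, reads)
```

The number of reads grows in proportion to the number of records, which is why this approach only suits small data stores.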

Data Access Methods - Hash

  • Compute a function to access the data
    • e.g. add up the characters to produce an integer
  • Usually requires a separate index
  • The "goodness" of the hash function is important
    • A perfect hash function maps every key to its own bucket, giving direct access to the data (i.e. a one-to-one relationship)
    • Perfect hash functions are rarely achievable in practice
    • As a result, several keys can "hit" the same hash value (or bucket)
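
The "add up the characters" idea can be sketched as follows. This is an illustrative toy, not a recommended hash function: the function names, the bucket count of 8, and the keys "ab" and "ba" are all assumptions chosen to make a collision easy to see.

```python
def additive_hash(key, n_buckets):
    """Add up the character codes and fold the sum into a bucket number."""
    return sum(ord(ch) for ch in key) % n_buckets


def build_index(keys, n_buckets=8):
    """Separate index: bucket number -> list of keys that hash there."""
    index = {}
    for key in keys:
        index.setdefault(additive_hash(key, n_buckets), []).append(key)
    return index


# "ab" and "ba" contain the same characters, so they always collide.
index = build_index(["ab", "ba", "pBR322"])
print(index)
```

Because addition ignores character order, any two keys that are rearrangements of each other land in the same bucket, which shows why the "goodness" of the hash function matters.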

Data Access Methods - Hash

  • Simple (and silly) example:
    • Hash on the first letter of the recipe name:
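
A possible rendering of this first-letter hash is sketched below. The recipe names are hypothetical, invented only to show two names sharing a bucket.

```python
from collections import defaultdict


def first_letter_hash(name):
    """The 'silly' hash: the bucket is just the first letter, upper-cased."""
    return name[0].upper()


# Hypothetical recipe names, invented for illustration.
recipes = ["ligate-insert", "linearize-vector", "purify-fragment", "amplify-cdna"]

buckets = defaultdict(list)
for name in recipes:
    buckets[first_letter_hash(name)].append(name)

# A lookup hashes the name, then scans only that one bucket.
print(buckets["L"])
```

With at most 26 buckets and names that cluster on a few letters, most lookups still end in a sequential scan of a bucket, which is what makes the example silly.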

Data Access Methods - BTree

  • Good for sequenced or character data
  • In general, the index set is a tree whose leaves consist of pointers into a sequence set
  • In the simple form shown here, each node in the index set holds two values (left and right) and points to three lower nodes
  • Access is by value comparison
  • For value V:
      if V <= left value:
        → move to the left lower node
      if left value < V <= right value:
        → move to the middle lower node
      if V > right value:
        → move to the right lower node
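
The comparison rule above can be written as a small function. This is a sketch of a single step at one index node, assuming the node holds exactly two values with left < right; the function name and the integer test values are illustrative.

```python
def choose_child(value, left, right):
    """Return which of the three lower nodes to follow for `value`.

    `left` and `right` are the two values stored in the current node,
    with left < right.
    """
    if value <= left:
        return "left"
    if value <= right:      # here left < value <= right
        return "middle"
    return "right"          # value > right


print(choose_child(5, 10, 20))
print(choose_child(15, 10, 20))
print(choose_child(25, 10, 20))
```

Note that the boundaries are inclusive on the low side: a value equal to the left value goes left, and a value equal to the right value goes to the middle.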

Data Access Methods - BTree

  • Example: find pBR322.f2 assuming a BTree index on fragment name
    • pBR322.f2 > pBR322 and <= pHR5CV
      • we take the middle node, which contains pBR322.f2
      • If there are more layers, continue repeating the algorithm until you get to the sequence set
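
The lookup for pBR322.f2 might be sketched as below. The tree layout is an assumption: a single index node holding the values pBR322 and pHR5CV, with plain lists standing in for the sequence set. The names pAT153, pBR322.f1, and pUC19 are not from the slides and were added only to fill the leaves. Names are compared with Python's default string ordering, which may differ from the collation a real database would use.

```python
# Hypothetical two-level index. An index node is (left, right, children);
# a leaf is a plain list of names standing in for the sequence set.
tree = (
    "pBR322", "pHR5CV",
    [
        ["pAT153", "pBR322"],        # names <= "pBR322"
        ["pBR322.f1", "pBR322.f2"],  # "pBR322" < names <= "pHR5CV"
        ["pUC19"],                   # names > "pHR5CV"
    ],
)


def find(node, name):
    """Descend by value comparison until a leaf is reached."""
    while isinstance(node, tuple):
        left, right, children = node
        if name <= left:
            node = children[0]
        elif name <= right:
            node = children[1]
        else:
            node = children[2]
    return name in node


print(find(tree, "pBR322.f2"))
```

The `while` loop is the "continue repeating the algorithm" step: with more layers, the children would themselves be index nodes and the same comparison would run again at each level.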

Data Access Methods - BTree

Data Access Methods

  • There are many other indexing techniques
  • Indexing can substantially improve access times
  • Deciding what field to index on depends on usage patterns
  • You can have multiple indices, but that substantially increases insert time and space requirements
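
The trade-off with multiple indices can be seen in a toy record store. This is a sketch under assumed names (`FragmentTable`, the "name" and "recipe" fields, and the "digest" recipe are all hypothetical); it uses plain dictionaries as indices rather than any particular database's mechanism.

```python
class FragmentTable:
    """Toy record store with one dict-based index per indexed field."""

    def __init__(self, indexed_fields):
        self.records = []
        self.indexes = {f: {} for f in indexed_fields}

    def insert(self, record):
        position = len(self.records)
        self.records.append(record)
        # Every index must be updated on every insert: this is the extra
        # time and space that additional indices cost.
        for field_name, index in self.indexes.items():
            index.setdefault(record[field_name], []).append(position)

    def find(self, field_name, value):
        positions = self.indexes[field_name].get(value, [])
        return [self.records[p] for p in positions]


table = FragmentTable(indexed_fields=["name", "recipe"])
table.insert({"name": "pBR322.f2", "recipe": "digest"})
table.insert({"name": "pBR322.f1", "recipe": "digest"})
print(table.find("recipe", "digest"))
```

Lookups on either indexed field avoid a scan of all records, but each insert now does one extra update per index, so the set of indexed fields should follow the queries that are actually run.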

Data Access Methods

  • Questions?
  • Recommended Reading:
    • Knuth, D. E. The Art of Computer Programming, Volume III: Sorting and Searching. Reading, Mass.: Addison-Wesley (1973)