Sorting Concept

Sorted File

An approach to keep a level of organization to the tuples and permit some more efficient searching
Rows are sorted based on some attribute(s)
  • Access path is a binary search
  • An equality or range query based on that attribute has cost log2F to retrieve page containing first row
  • Successive rows are in same (or successive) page(s) and cache hits are likely
  • By storing all pages on the same track, seek time can be minimized
Example - Transcript sorted on StudId :
SELECT T.Course, T.Grade
FROM Transcript T
WHERE T.StudId = 123456
SELECT T.Course, T.Grade
FROM Transcript T
WHERE T.StudId BETWEEN 111111 AND 199999




Transcript Stored as a Sorted File

123456 CS305  S1996 4.0 
234567 CS305  S1999 4.0

515151 EE101  F1995 3.0
page 0
586609 EE101  S1998 3.0 
666666 MGT123 F1994 4.0
717171 CS315  S1997 4.0
765432 MAT123 S1996 2.0
page 1

878787 MGT123 S1996 3.0    
987654 CS305  F1995 2.0
page 2
keep space on the page for new insertions.



Maintaining Sorted Order

Problem: After the correct position for an insert has been determined, inserting the row requires (on average) F/2 reads and F/2 writes
(Partial) Solution 1: Leave empty space in each page: fillfactor or utilization
(Partial) Solution 2: Use overflow pages (chains) but this has disadvantages:
  • Successive pages no longer stored contiguously
  • Overflow chain not sorted, hence cost no longer logF





  • Overflow

    123456 CS305  S1996 4.0 
    234567 CS305  S1999 4.0
    
    515151 EE101  F1995 3.0
       3 page 0
    586609 EE101  S1998 3.0 
    666666 MGT123 F1994 4.0
    717171 CS315  S1997 4.0
    765432 MAT123 S1996 2.0
    -page 1
    
    878787 MGT123 S1996 3.0    
    987654 CS305  F1995 2.0
    
    -page 2
    111654 CS305 F1995 2.0
    111233 PSY 220 S2001 3.0 
     page3

    Eventually degrades to a linked list and has to be reorganized. A 7/24 database cannot afford to be taken offline for reorganization.