Data structure

From Citizendium
Jump to navigation Jump to search
This article is developing and not approved.
Main Article
Discussion
Related Articles  [?]
Bibliography  [?]
External Links  [?]
Citable Version  [?]
 
This editable Main Article is under development and subject to a disclaimer.

Data structures are used in computer programming to specify how information is arranged on storage media for processing. They exist both on a conceptual level, describing the interface through which data can be accessed, as well as on a concrete level, describing how they are implemented. As there is no fixed mapping between interface and implementation of data structures - actually many data structures can emulate each other – the selection of an implementation that fits the access pattern of the data can improve the performance of algorithms significantly.

Interface

On their interface level data structures are very close to abstract data types. Both concepts define functions, which manipulate the data, and the types these functions work on. Abstract data types continue from there to define the semantics of the functions using a formal description. Data structures are concerned with the actual arrangement and manipulation of the data on the storage medium.

Based solely on the access functions provided by the data structure interface one can roughly distinguish three main kinds:

  • Structures with set-like access where data items are entered into and retrieved from sets with not necessarily unique elements. Examples include lists, sets and some trees.
  • Structures with key-based access where data items are stored and retrieved based on distinguished keys. Examples include hash tables, arrays (with the index being the key) and some other trees.
  • Structures with restricted access where at any time only certain data items can be accessed. Examples include stacks and queues.

Using such an abstract view on the data structures to use in an algorithm helps to focus on the task at hand. But sometimes this is not possible because the actual representation of the data in the data structure has an important role in the problem to solve e.g. in parse trees and in many graph-based data structures for example genealogical trees.

Implementation

The implementation is the aspect data structures typically focus on. The right choice of implementation influences the runtime and storage consumption of an algorithm. It depends on several factors among others the prevalence of read or write operations, the access pattern, existing or required sorting of the data, peculiarities of the storage used (e.g. block size), and whether a bound for the amount of data to be stored can be given in advance. It is thus not surprising that for most of the 'basic' data structures mentioned in the interface section several implementations exists. Typically (imperative) programming languages provide a collection of these in libraries to facilitate reuse. The most common data structures and their implementation variants are:

Array

An array is a data structures with random access to all its elements using an integer number as index. It is typically implemented as a continuous range in memory. As it is one of the basic structures a program can use, support for arrays is often integrated into the programming language itself. An array is a static data structure as it has a maximum number of elements it can contain.

A variation of arrays is the dynamic array which is typically realized by copying the array contents into a new memory range if the old one overflows. Also the hash table an implementation of associative memory using an array and a hash function can be regarded as a variation of an array.

The Vlist, a mixture between an array and a list, can be seen as a variant implementation of an array since it tries to achieve near constant time access operations while it keeps in contrast to an array the individual elements immutable.

List

A list is a sequentially linked sequence of nodes, each storing an element and a pointer (link) to the next node. If only one direction of links exists, the structure is called a single-linked list. The first node is by convention called the head, its last node the tail. Accessing a random element thus on the average requires length_of_list/2 accesses. Double-linked lists provide links in both directions of the list so that the previous node can be reached from a node in addition to the next one.

Variants are the Xor linked list, which saving memory by storing both links of a double-linked list in the same memory space, and the Unrolled linked list, a structure where each node of the list contains more than one element to reduce the overhead of the links and make better use of block-oriented accesses to memory.

Tree

A tree expands on a list in that each node of a tree links to several dependent nodes (descendants) instead of only one as in a list. In computer science trees grow from top to bottom and thus the topmost node – the node which has no ancestor – is called the root. Having several descendants shortens the number of accesses needed to reach a random element from the root. In a balanced tree it is about if d is the number of descendants each node has and N the total number of nodes present in the tree. Trees can be either ordered if an order exists between the descendants of a node and the element in the node itself, or unordered.

As ordered trees are used as the primary means to organize search indices on storage, their implementation is highly performance-relevant for many programs. As a result a huge number of variant implementations exists. To list only the most known ones: Binary trees are trees with two descendants per node. Typically one descendant (which is a tree in itself) contains all elements smaller than the element stored in the node and the other descendant all elements which are greater. A special case of binary trees are AVL trees which attempt to stay near-balanced with less overhead than a completely balanced tree. B-trees store several elements per node and link to number_of_elements+1 descendants to make better use of the block-based access of hard drives. Tries are a variant where nodes use only a part of the search key for ordering, typically increasing the amount used farther down the tree. They are primarily used for string indexing.


Other

Other data structures mentioned in the interface sections such as sets, queues, stacks and graphs are often implemented using one of the implementations specified above. But some special variants exist for them as well: The bitmap can be used to represent sets with a limited, fixed number of elements. The most commonly known example is the FAT. A ring buffer, actually an array with two pointers, can be used to implement a queue of fixed maximum length.