Search Contact information
University of Cambridge Home Department of Engineering
University of Cambridge >  Engineering Department >  computing help >  Languages >  C++

Data Structures and C++

Introduction to the STL

Many books exist dealing with sorting and organising data. With the emergence of languages that support data abstraction, people tried to generalise data structures. The Smalltalk language introduced some useful ideas. These have been developed and brought into the mainstream with C++'s STL (Standard Template Library) The STL includes containers that hold data. Containers are grouped into 3 types These containers have many member functions in common. For example, s.size() gives the number of elements in the container s if s is a set, a list or even a string of characters. The STL also includes algorithms that operate on the containers. To be of maximum use these algorithms need to be able to operate on as many types of containers as possible, so each container has member functions that give algorithms access to the data. For example, s.begin() will always return an Iterator (like a pointer) to the first element of a container s. By hiding the implementation details of containers in this way, high-level generic algorithms can be written which will not only work with the standard containers but also with any new containers that a programmer might write.

The STL doesn't contain a tree container, but if you wrote one so that it had the standard hooks that algorithms require, then algorithms like copy etc. should work with it. Clearly a tree container would require some special routines that other containers wouldn't support, but C++ makes it easy to add functionality in, building on what you already have. Before we look at trees, let's see how to use some simpler containers.

Queues

#include <deque>                                                       
#include <algorithm>                                            
#include <iostream>                                                        
using namespace std;                                                    
                                                                       
int main()                                                                 
{                                                             
deque<int> Q;       //  An empty, elastic, container ready to hold ints.
// A deque is a double-ended queue. Queues (unlike lists) are
// data structures where you wouldn't expect to insert new items
// anywhere else but at the ends

Q.push_back(0);     //  Many containers have a push_back
                    //  routine. It's a way to add elements
                    //  to a container in a safe way - the
                    //  container will grow if it's not big enough.
Q.push_back(2);                                                          
Q.push_back(1);     

// the next line is a bit of magic you'll have to take on trust 
// for now - it "copies" (i.e. prints) the elements of Q to 
// the output stream. A simple "for" loop could be used instead 
// but it's tempting to avoid using "for" altogether.
                                                    
copy(Q.begin(), Q.end(), ostream_iterator<int>(cout, " "));  
                // Output: 0 2 1                                     
cout <<endl;

// Now let's sort!          
sort(Q.begin(), Q.end());
// the arguments we've given to sort mean that the whole queue
// will be sorted, and that the result will be copied back to
// the original queue. sort uses the default sorting criterium
// for integers, and can use the swap function of Q to switch
// elements around. 

copy(Q.begin(), Q.end(), ostream_iterator<int>(cout, " "));         
// Output: 0 1 2
cout <<endl;                         
}
To create a queue of strings rather than a queue of integers, we don't need to change much. This ease of coping with different data types will be useful when we come to create trees.
#include <deque>                                                       
#include <string> // added so that we can use strings
#include <algorithm>                                            
#include <iostream>                                                        
using namespace std;                                                    
                                                                       
int main()                                                                 
{                                                             
deque<string> Q;       // changed

Q.push_back("standard");           
Q.push_back("template");                                                        
Q.push_back("library");     

// the next line has been changed to support strings rather than integers
copy(Q.begin(), Q.end(), ostream_iterator<string>(cout, " "));  
                                                   
cout <<endl;

// Now let's sort!          
sort(Q.begin(), Q.end());
// sort uses the default sorting criterium for strings - alphabetical

// the next line has been changed to support strings rather than integers
copy(Q.begin(), Q.end(), ostream_iterator<string>(cout, " "));         
cout <<endl;                         
}
As well as being able to have queues of standard items like strings and integers, we can also invent our own types and have queues of those. The main problem is that when we sort we have to specify the sort criteria for our own types. It's sufficient to specify what the '<' operator should do for your own type. Here's an example
#include <deque>                                                       
#include <algorithm>                                            
#include <iostream>                                                        
using namespace std;                                                    

class simple {
public:
  int value;
  // redefine the '<' operator, so when 2 simple objects are
  // compared, the 'value' fields of the objects are compared.
  const bool operator<(const simple &rhs) const
        { return this->value < rhs.value; }
};
                                                                       
int main()                                                                 
{                                                             
 deque<simple> Q;       //  An empty, elastic, container.

 simple s;
 s.value=0;
 Q.push_back(s); 
 s.value=2;
 Q.push_back(s);   
 s.value=1;                                          
 Q.push_back(s);     

 for(int i=0;i<Q.size(); i++){
   cout << Q[i].value; cout << " ";
 }
 cout <<endl;
 // Output: 0 2 1

 // Now let's sort! sort uses the '<' operator that we created in the 'simple' class   
 sort(Q.begin(), Q.end());

 for(int i=0;i<Q.size(); i++){
   cout << Q[i].value; cout << " ";
 }
 cout <<endl; 
 // Output: 0 1 2
}

Trees

There are many types of tree structures (threaded trees, Splay trees, 2-3 trees, etc). The choice depends on many factors - how dynamic the data is, whether best-case or worst-case behaviour is the most important, whether you want to avoid recursion, etc. Trees are often a natural choice (e.g. when writing a chess program) but they are also effective in many situations where data that's received sequentially can be beneficially stored in a more organised fashion - e.g. parsing mathematical expressions, parsing sentences, reading words that later need to be searched, or retrieved in alphabetical order.

Binaries trees (where each node has at most 2 children) are commonly used - not because they most closely model reality, but because they're computationally and mathematically more easy to deal with.

Suppose that a program were to be given 10,000 words (non-unique) and had to list the unique words used. Suppose also that memory was at a premium, so that storing every non-unique word is to be avoided. Each time a word is read the program needs to check if the word's new and store it only if it is. If the words are stored in an array alphabetically, searching will be faster, but adding a word to the start of the array will be slow because subsequent words will have to be shifted.

Using a binary tree will speed things up. Each time a word is read, the program starts at the top of the tree. If the node is empty, the word is stored there, otherwise the left or right branch is taken according to whether the word comes before or after the word stored in the node. For example, suppose the first 5 words were "standard", "template", "library", "data" and "structures". "standard" would be stored in the top node. "template" would be stored in the right-child node of the top node. "library" would be stored in the left-child node of the top node.

              standard
               /   \
          library  template
"data" comes before "standard" so the left link will be taken. "data" comes before "library", so "data" will be stored in the left-child node of "library". The final tree looks like this
              standard
               /   \
          library  template
             /      /
           data   structures
If this tree remains balanced, then it could fit 10,000 words within 15 layers, so a word-search would require only 15 comparisons at most. Insertions (new nodes) are cheap too. Contrast this with the worst case array scenario, where 10,000 comparisons might need to be made before a search is complete!
However, if the words are supplied in alphabetical order, this tree-based method suffers - each new word will add a right-node, deepening the tree each time, making searching as slow as with arrays. Various techniques exist to combat this. The benefits of these techniques are so enormous that simple binary trees aren't used in big programs.

Tree Traversal

There are 2 basic approaches to brute-force tree searching - breadth-first and depth-first. With breadth-first you start at the top node then visit all its childen before visiting its children's childen - i.e. you explore the tree layer by layer. With depth-first searching you keep going down, backtracking when you can go no further. The purpose of the search and the shape of the tree affect the choice of approach. For example, if you're playing chess and you don't have much time to think, it's better to do a breadth-first search rather than looking at one move (which might be bad anyway) at great depth.

With either method there's a choice of ways to order operations on the node.

As an example consider the following tree, which represents a mathematical expression.
             -
           /   \
          *     *
         / \   / \
        2   + 3   -
           / \   / \
          x   5 y   4
If you do a depth-first search, processing each left-branch and right-branch of a node before printing the node's contents you'll get
      2 x 5 + * 3 y 4 - * -
(check to see if you agree with this!) which is the expression in Reverse Polish notation (used by older calculators and the postscript language) but if you deal with nodes differently (still depth-first search, but this time processing the left-branch, printing the contents of the node, then processing the right-branch) you'll get something much closer to the normal representation.

The depth-first algorithm can be described more formally as follows

  1. Form one element queue Q consisting of the root node
  2. Until the Q is empty or the goal has been reached, determine if the first element in Q is the goal.
    1. If it is, do nothing
    2. If is isn't, remove the first element from Q and add the first element's children, if any, to the front of the queue (so that children are searched before siblings)
  3. If the goal is reached, success; else failure
Try this algorithm by hand on the following tree
             A
           /   \
          B     C
         / \   / \
        D   E F   G
           / \   / \
          H   I J   K
The STL makes it relatively easy to translate this algorithm into code nearly line-by-line. If each vertex has a list of sub-vertices called child then
    
    list<Vertex > q;
    q.push_back( start ); 
creates and initialises a list, and
     while( !q.empty( ) ){
         Vertex v = q.front( ); // get the first item
         if (!v.goal_reached()) { 
            q.pop_front( );      // remove it from the list
            for( int i = 0; i < v.child.size( ); i++ )
               q.push_front(v.child[i]); // push each child to the front of the queue
         }
     }
modifies the list.

It's sometimes possible to sort this list of nodes to be visited so that the most promising nodes are visited first.

[a landscape]
Suppose you started at point A of this 2D landscape. The code for these methods follows the stages of the algorithm closely. We've seen earlier how sort can be used to sort a list of data structures like Vertex
     deque<Vertex> q;
     deque<Vertex> new_wave;
     while( !q.empty( ) ){
         Vertex v = q.front( ); // get the first item
         if (!v.goal_reached()) { 
            q.pop_front( );      // remove it from the list
            for( int i = 0; i < v.child.size( ); i++ )
               new_wave.push_front(v.child[i]); // push each child onto a queue
            // Now decide whether to sort this new queue before adding
            // it to the main queue, or add it to the main queue before
            // sorting the whole queue
            if (mode==hill_climbing) {
              sort(new_wave.begin(),new_wave.end());
              while(!new_wave.empty()){
                 q.push_front(new_wave.back( ));
                 new_wave.pop_back( );
              }
            }
            else {
              while(!new_wave.empty()){
                 q.push_front(new_wave.back( ));
                 new_wave.pop_back( );
              }
              sort(q.begin(),q.end());
            }
         }
     }
Branch and Bound methods break the set of feasible solutions into smaller subsets, and find for each subset a lower bound (as high as possible) on the function one's trying to maximise in the hope of isolating the optimal solution. These subsets often take the form of subtrees. If the subsets are well chosen, and if the bounds are easy to calculate, these methods can greatly reduce the number of nodes that need to be explicitly searched.

Many tree traversal strategies require the program to remember previous nodes in the path so that backtracking is possible. This (and recursion) can use up a lot of memory. With threaded trees each node knows its sucessor - with a depth-first search parents lead to their first child which in turn leads to siblings. The last sibling leads to the next node to be visited. Maintaining such a tree requires more work when nodes are added or removed, and makes traversal less flexible but with each node having a successor, it becomes more possible to treat the Graph as an STL container on which iterators can operate.

Graphs

A tree is a special case of the more general graph (or net). Graphs have many uses from network analysis to spreadsheet operation. Graphs have vertices (or nodes) and edges (which can be one-way or undirected). Edges can have a cost associated with them, and the cost may depend on the direction taken along the edge. Popular graph-related puzzles include More formally, a graph is a pair (V,E), where V is a finite set and E is a binary relation on V. V is called a vertex set whose elements are called vertices. E is a collection of edges, where an edge is a pair (u,v) with u,v in V. In a directed graph, edges are ordered pairs, connecting a source vertex to a target vertex. In an undirected graph edges are unordered pairs and connect the two vertices in both directions, hence in an undirected graph (u,v) and (v,u) are two ways of writing the same edge.

If some edge (u,v) is in graph , then vertex v is adjacent to vertex u. In a directed graph, edge (u,v) is an out-edge of vertex u and an in-edge of vertex v. In an undirected graph edge (u,v) is incident on vertices u and v. In a directed graph, the number of out-edges of a vertex is its out-degree and the number of in-edges is its in-degree. For an undirected graph, the number of edges incident to a vertex is its degree.

A path is a sequence of edges in a graph such that the target vertex of each edge is the source vertex of the next edge in the sequence. If there is a path starting at vertex u and ending at vertex v we say that v is reachable from u. A path is simple if none of the vertices in the sequence are repeated. A path is a cycle if the first and last vertex in the path are the same. A graph with no cycles is acyclic.

Graph Traversal

Graphs can be traversed much as trees can (depth-first, breadth-first, etc), but care must be taken not to get stuck in a loop - trees by definition don't have cycles, and in a tree there's always only one path from the root to a node whereas in a graph there may be many paths between any pair of nodes.

The following directed graph has 6 nodes. The Dijkstra-Moore algorithm can be used to find the shortest path from a given node to each of the others as long as costs aren't negative. A breadth-size search from the first node radiates out across the graph. If it reaches an unvisited node, it sets its distance from the first node to be the distance of the previous node from the first node, plus the length of the edge from the previous node. If a node's already been visited, it will already have been given a distance from the first node, but this might not be the shortest distance - the current path might be shorter. So a comparison is done, and the node's distance is set to the minimum.

For example, in the graph below, suppose that A was the source node. In the first stage, node B's distance would be set to 7 and node C's distance to 1. During the second stage, routes from nodes B and C would be explored. When node B is reached from node C, the distance is 1+5, so node B's distance is reset to 6 from its original 7. As the algorithm progresses, node B's distance will later be reset to 5, which is the final value for node B's distance. The algorithm terminates after 5 stages because by then all of the routes will have been traversed - routes can't have more than number_of_vertices - 1 edges.

[a weighted digraph]

Implementing Graph Algorithms

Graphs provide useful source material for practising C++ - inheritance provides a useful mechanism for integrating node, edge and graph structures, and the STL takes away much of the chore. See the References below for a list of books and online courses. Below is one way that graph algorithms could be implemented

A graph is an STL vector of vertices. Each vertex has an STL vector of edges. For flexibility the details of the search task are separate from the configuration of the graph.

enum ordering {preorder, inorder, postorder};
enum search_mode {breadth_first, depth_first, hill_climbing, best_first, Astar, shortest_paths, min_spanning_tree};
class Search_Description
{
public:
  ordering order;
  search_mode searching;
  string startnode;
  string goalnode;
  bool printnode;        // whether the visited node-name is printed
};
The vertices and edges can have associated information (cost, names, etc). The graph and search specifications can be read from a file or be specified interactively. The minimum-distance problem above can be set up using the following file
A B 7 anon 
A C 1 anon 
B D 4 anon
B F 1 anon
C B 5 anon
C E 2 anon
C F 7 anon
E B 2 anon
E D 5 anon
F E 3 anon 
SEARCHMODE shortest
STARTNODE  A
Most of the lines specify directed edges with weights (the "anon" means that the edges don't have names). Vertices are automatically created on demand. The output is
B is 5 from the source.
Shortest path is A to C to E to B
C is 1 from the source.
Shortest path is A to C
D is 8 from the source.
Shortest path is A to C to E to D
F is 6 from the source.
Shortest path is A to C to E to B to F
E is 3 from the source.
Shortest path is A to C to E
The hill climbing problem above can be set up with the following file
A B 1 anon
B C 1 anon
C D 1 anon
A E 1 anon
E F 1 anon
NODEVALUE A 0
NODEVALUE B 2
NODEVALUE C 0
NODEVALUE D 1
NODEVALUE E 1
NODEVALUE F 3
// For best-first change the next line to  SEARCHMODE bestfirst
SEARCHMODE hill
STARTNODE A
GOALNODE F
Graph member functions include Graphs can also be created programatically. Here's an example -
#include "graphs.cc"

int main( int argc, char *argv[ ] )
{
    Graph g;
    Search_Description s;
    vector<string> nodenames;

    nodenames.push_back("A");
    nodenames.push_back("B");
    nodenames.push_back("C");
    nodenames.push_back("D");
    nodenames.push_back("E");
    nodenames.push_back("F");

    for (int i=0;i<6;i++)
       for (int j=i+1;j<6;j++)
          g.addEdge(nodenames[i],nodenames[j],i+j,"anon");

    g.shortestpaths("B");
    return 0;
}

Exercises

References

Books

Code and documentation

See Also

© Cambridge University Engineering Dept
Information provided by Tim Love (tpl)
Last updated: February 2002