There are two basic activities that will introduce challenges to the usefulness of arrays and linked lists: searching and inserting data. We'll talk about searching first, as it will motivate the need to carefully insert data.
Let's say you have a list of numbers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19. A typical activity is based on the desire to search that list for a particular value that may or may not be in the list and return its location. If I asked you to find the location of '9', for example, you would probably scan across the list of values until you spotted it, and announce "It's right there!" You might even say "It's the sixth number on the list!" This is precisely the strategy of searching that is used in a linked list.
+---------+ +---------+ +---------+ +---------+ +---------+ head --> | 1, next---> | 3, next---> | 5, next---> | 7, next---> | 9, next---> ... +---------+ +---------+ +---------+ +---------+ +---------+ ^ ^ ^ ^ ^ | | | | | search 1 search 2 search 3 search 4 found on 5
The code is:
linkedsearch.cpp:
struct node { int data; node* next; }; node* find(node* head,int tofind) { node* temp=head; while (temp->data < tofind) { temp=temp->next; } if (temp->data == tofind) return temp; else return NULL; }
As you can see, we look through the list for the value and return either a pointer to it, or a null pointer to indicate that it wasn't found (when we find a larger value). Since, on average, our value will be in the middle of the list, it will take an average of n/2+1 comparisons (+1 for the final check of = or >) to get our result. This is O(n). To see the average search times, look at the following chart:
list size: 10 100 1,000 10,000 100,000 1,000,000 ave. comps.: 6 51 501 5,001 50,001 500,001
What O(n) tells us is this: once the number of elements is large enough, multiplying the number of elements by y multiplies your metric (in this case average comparisons) by y as well (when focused on significant digits). This introduces an issue for us: while this is good, it also kind of sucks. Consider writing a database application: you have your application and you test it with 1000 records and find that your search for records averages 500 ms. You then release your application to a customer who adds 10,000,000 records and finds that searches are taking 5,000,000 ms (or 5,000 s, or 1 hr, 23 min, 20 s). I can pretty much guarantee that you would not have that customer for long. There has to be a better way to search... and there is.
Look at the list of numbers again: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19. Since it's in order, I can approach the problem from a different perspective: look at the middle value (9 in this case) Is it the value we're looking for? Is it lower than what we're looking for? If it isn't the value, then we can focus our search on the top or bottom half of our list. 2 comparisons throw out half the list. Repeating the process will result in every 2 comparisons chopping the potential chunk of list in half. With arrays, we can find the middle of our list easily. Below is the logic to find 13:
Index bounds are 0 - 9, midpoint is index 4 +----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 7 || 9 || 11 || 13 || 15 || 17 || 19 | +----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 *4* 5 6 7 8 9 9<13 so index bounds are 5-9, midpoint is index 7 +----++----++----++----++----+ | 11 || 13 || 15 || 17 || 19 | +----++----++----++----++----+ 5 6 *7* 8 9 15>13 so index bounds are 5-6, midpoint is index 5 +----++----+ | 11 || 13 | +----++----+ *5* 6 11<13 so index bounds are 6-6, midpoint is index 6 +----+ | 13 | +----+ *6* found at index 6!
The resulting code is:
arraysearch.cpp:
int find(int* searcharray, int tofind, int size) { int min=0; while (min <= size) { if(searcharray[(min+size)/2]=tofind) { return (min+size)/2; } else if(searcharray[(min+size)/2]<tofind) { min=(min+size)/2+1; } else { size=(min+size)/2-1; } } return -1; }
This search has a worst case scenario of approximately 2*log_2(n) (there may be a + or – 1 in there, but I'm feeling a bit lazy). To see the worst case search times, look at the following chart:
list size: 10 100 1,000 10,000 100,000 1,000,000 ave. comps.: 7 13 20 27 33 40
This is a search that is O(log(n)). What it tells us is that, once the number of elements is large enough, multiplying the number of elements by y adds z to your metric. Here, multiplying by 10 adds about 6.6 comparisons. Repeating the above scenario, you have your application and you test it with 1000 records and find that your search for records averages 500 ms. You then release your application to a customer who adds 10,000,000 records and finds that searches are taking 1175 ms (or 1s 175ms. At that point, your customer may notice that it's "a little slower" than the demo you gave, "but still pretty good".
Based on what we've seen so far, this clearly indicates that arrays are far better than linked lists, and we should immediately stop using linked lists, right? Not so fast! There was a requirement on that array: it had to contain the elements in order! That doesn't just happen by magic. Every time you add a new element to your list of data, there is an overhead cost. If you are not willing to accept that overhead cost, your data will not be sorted, and an array cannot pull that neat trick. An array search would then be just as slow as a linked list search: go through all the data and hope you find it early.
There are two ways to deal with this situation. 1) sort the data before doing a search, or 2) make sure the data remains sorted as you add new values. Since doing a sort is an expensive operation ( O(n*log(n)) or worse, depending on algorithm ), we're going to opt for keeping our data sorted as we add values. This means we need to look at how expensive it will be to insert an item. To keep things simple, we'll just look at the cost of insertion, and remember that there is always going to be an overhead cost (finding where the data goes). The total cost on an insert will be [search cost] + [insert cost], which we will deal with after we look at the cost of an insert.
A linked list is pretty simple, just adjust the pointers (in this case to add 6):
this: +---------+ +---------+ head --> ... --> | 5, next---> | 7, next---> ... +---------+ +---------+ +---------+ | 6, next---> NULL +---------+ becomes: +---------+ +---------+ head --> ... --> | 5, next | | 7, next---> ... +-----|---+ +---------+ | ^ | +---------+ | +-> | 6, next---+ +---------+
Here's the code to insert data into a linked list:
linkedinsert.cpp:
node* insert(node* head, int toadd) { node* temp; if (head->data < toadd) { temp=new node; temp->data = toadd; temp->next = head; return temp; } else { while ((temp->next!=NULL)&&((temp->next)->data < toadd)) { temp=temp->next; } if (temp->data < toadd) { temp->next = new node; temp=temp->next; temp->data = toadd; } else { node* temp2=new node; temp2->data = toadd; temp2->next = temp->next; temp->next = temp2; temp=temp2; } return temp; } }
The cost of this operation is very low. 2 pointer assignments do the job. By contrast, to insert into an array you have to move elements out of the way, like this:
Insert 6 at index 3 +----++----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 7 || 9 || 11 || 13 || 15 || 17 || 19 || | +----++----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 4 5 6 7 8 9 10 move elements to the right to free space +----++----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 7 || 9 || 11 || 13 || 15 || 17 || 19 || 19 | +----++----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 4 5 6 7 8 9 10 +----++----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 7 || 9 || 11 || 13 || 15 || 17 || 17 || 19 | +----++----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 4 5 6 7 8 9 10 +----++----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 7 || 9 || 11 || 13 || 15 || 15 || 17 || 19 | +----++----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 4 5 6 7 8 9 10 +----++----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 7 || 9 || 11 || 13 || 13 || 15 || 17 || 19 | +----++----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 4 5 6 7 8 9 10 +----++----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 7 || 9 || 11 || 11 || 13 || 15 || 17 || 19 | +----++----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 4 5 6 7 8 9 10 +----++----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 7 || 9 || 9 || 11 || 13 || 15 || 17 || 19 | +----++----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 4 5 6 7 8 9 10 +----++----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 7 || 7 || 9 || 11 || 13 || 15 || 17 || 19 | +----++----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 4 5 6 7 8 9 10 and now add 6 at index 3 +----++----++----++----++----++----++----++----++----++----++----+ | 1 || 3 || 5 || 6 || 7 || 9 || 11 || 13 || 15 || 17 || 19 | +----++----++----++----++----++----++----++----++----++----++----+ 0 1 2 3 4 5 6 7 8 9 10The code is:
arrayinsert.cpp:
int insert(int* searcharray, int toadd, int size) { int min=0; int max=size; while (min <= max) { if(searcharray[(min+max)/2]=toadd) { min = (min+max)/2; for(int i=size;i>min;i--) { searcharray[i+1]=searcharray[i]; } searcharray[min]=toadd; return min; } else if(searcharray[(min+max)/2]<toadd) { min=(min+max)/2+1; } else { max=(min+max)/2-1; } } min=(min+max)/2; if (searcharray[min+1]<toadd) { min=min+1; } for(int i=size;i>min;i--) { searcharray[i+1]=searcharray[i]; } searcharray[min]=toadd; return min; }
Here, the big problem is that we have to move elements out of the way of the new one. The cost of the operation depends on the size of the array and the position in the array, but on average it is about n/2 moves of data. At this point, it is worth noting that these two operations are not easy to compare (as simple as it looked). The reason is that a pointer typically occupies about 4 bytes of data. The result is that assigning two pointers is not just 2 operations, it is 2 cheap operations. By contrast, moving the actual data of interest can be quite expensive. Imagine if each element of the array is a class containing 1 MB of data! If you only have to move 5 elements of the array, that is 5MB of data, compared to 8 B.
The total cost of search plus insert becomes: linked list = n/2 + 1 + 2, array = 2*log_2(n) + (n/2)*(cost of move). Both are O(n) for the total insert, with array being potentially much slower depending on the type of data involved. Based on these two methods for storing data, it looks like you should use arrays if you are mainly searching for data that is small, and you should use linked lists if you have a lot of inserts or the data is large.
In the next tutorial, we'll try to get the best of both worlds
