How to loop efficiently over Table data

This page compares the performance of plain Java loops with loops over the numeric data structures. It should give the reader a hint on how to find the quickest loop.

Pure Java arrays

In Java, the way to loop over data is simple: just write a loop over an array.

for(int i = 0; i< 10000; i++){
    if( javadata[i] == i) continue;
}

It's fast: 228 microseconds for this loop over 10000 entries.
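The page does not say how the microsecond figures were taken, so the following is only a minimal sketch of one way such a measurement could be made with System.nanoTime(). The class name LoopTiming and its helper methods are hypothetical and not part of the original test. The loop counts its matches so that its result is actually used, and the measurement is repeated a few times because the first run on a JVM typically includes class loading and JIT compilation.

```java
final class LoopTiming {

    /** Counts the entries with data[i] == i, so the loop has a result that is used. */
    static int countMatches(int[] data) {
        int matches = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == i) {
                matches++;
            }
        }
        return matches;
    }

    /** Runs the loop once and returns the elapsed wall-clock time in microseconds. */
    static long timeMicros(int[] data) {
        long start = System.nanoTime();
        int matches = countMatches(data);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        if (matches < 0) {
            // Never true; only here so the loop result is consumed.
            throw new IllegalStateException("negative match count");
        }
        return elapsedMicros;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the javadata array used above.
        int[] javadata = new int[10000];
        for (int i = 0; i < javadata.length; i++) {
            javadata[i] = i;
        }
        // Repeat the measurement: early runs are usually slower than later ones.
        for (int run = 0; run < 5; run++) {
            System.out.println("run " + run + ": " + timeMicros(javadata) + " microseconds");
        }
    }
}
```

Absolute numbers from such a harness depend heavily on the machine and the JVM, so only the ratios between the different loops are meaningful.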

Enter ia.numeric and ia.dataset

Usually we store our data in datasets so that we can export them to FITS etc. So what happens if we loop over the data of a TableDataset?

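// Imports, assuming the usual HCSS package names for ia.dataset and ia.numeric:
import herschel.ia.dataset.Column;
import herschel.ia.dataset.TableDataset;
import herschel.ia.numeric.Int1d;
// First create the dataset:
TableDataset t = new TableDataset();
Column c = new Column(Int1d.range(10000));
t.addColumn("c", c);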

Access TableDataset data in the loop

Now write the most straightforward loop:

for(int i = 0; i< 10000; i++){
    if( ((Int1d)t.getColumn("c").getData()).get(i) == i) continue;
}

20646 microseconds: about 90 times slower.

Extract the Java array before the loop

Obviously a loop over an int array is faster (we all expected that, didn't we?). Just get the int array out and loop over it.

int[] intdata = ((Int1d)t.getColumn("c").getData()).toArray();
for(int i = 0; i< 10000; i++){
    if( intdata[i] == i) continue;
}

19647 microseconds. What's going on?

Extract the ArrayData before the loop

Let's do something an experienced developer would never do: instead of the single method call of the last example, get the Int1d out of the TableDataset and perform 10000 additional method calls in the loop:

Int1d int1ddata = ((Int1d)t.getColumn("c").getData());
for(int i = 0; i< 10000; i++){
    if( int1ddata.get(i) == i) continue;
}

910 microseconds.

This is less than 5 times slower than the direct Java loop (about a factor of 4: 910 versus 228 microseconds).

Access the ArrayData within the loop

To complete the tests, let's do something crazy:

for(int i = 0; i< 10000; i++){
    if( ((Int1d)t.getColumn("c").getData()).get(i) == i) continue;
}

3689 microseconds. Not a good idea.

Extract the Java array within the loop

Want to make a guess?

for(int i = 0; i< 10000; i++){
     if( ((Int1d)t.getColumn("c").getData()).toArray()[i] == i ) continue;
}

1446034 microseconds (= 1.4 seconds). Definitely too crazy.

The reason why Java array access is slow is an internal trim() method in the ArrayData objects. To speed up append operations, an ArrayData (such as Int1d, Double1d etc.) manages an internal capacity that cannot be influenced from outside. As a result, the ArrayData is internally larger than its size implies. So a toArray() call is not simply a matter of handing out a pointer: it has to arraycopy the actual amount of data before the Java array can be returned.
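To make the capacity argument concrete, here is a minimal sketch of a growable int container. It is not the Int1d implementation: the class GrowableInts and all of its methods are hypothetical and only illustrate the general technique of keeping a backing buffer that is larger than the logical size. In such a design get(i) is a plain indexed read, while toArray() has to copy size elements into a new array on every call.

```java
import java.util.Arrays;

final class GrowableInts {

    // Backing buffer; its length is the capacity, which may exceed the logical size.
    private int[] buffer = new int[16];
    private int size = 0;

    /** Appends a value, doubling the capacity when the buffer is full. */
    void append(int value) {
        if (size == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        buffer[size++] = value;
    }

    /** Indexed read straight from the backing buffer; no copying involved. */
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return buffer[index];
    }

    int size() {
        return size;
    }

    int capacity() {
        return buffer.length;
    }

    /** Returns a trimmed copy: every call allocates and copies size elements. */
    int[] toArray() {
        return Arrays.copyOf(buffer, size);
    }

    public static void main(String[] args) {
        GrowableInts data = new GrowableInts();
        for (int i = 0; i < 10000; i++) {
            data.append(i);
        }
        System.out.println("size = " + data.size() + ", capacity = " + data.capacity());
        // Two calls hand out two different arrays, so each call paid for a full copy.
        System.out.println("same array twice? " + (data.toArray() == data.toArray()));
    }
}
```

With a container like this, calling toArray() inside a loop over n entries copies n elements in each of the n iterations, whereas calling get(i) on a reference obtained once before the loop costs one method call per iteration.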

Remark

I am not really sure whether the capacity is always used. At first I ran this test with an Int1d in the TableDataset that was created with 10000 appends. This definitely sets up a capacity. The range() method that I used in this example gave very similar performance results, so I have to assume that a capacity is always used.

Result

The quickest thing in real life, a loop over plain Java arrays, is the slowest for numeric calculations.

The best thing to do is to get the ArrayData out of the Dataset and access the data with the get methods. This does not call trim().

-> the best method is still 4 to 5 times slower than direct Java loops.

Herschel: PACS/HCSS