2012-06-18

Presentation

For those of you who won't be able to attend tomorrow, or just can't wait, my presentation (PDF) can be found at http://minus.com/mfKnqN5D7/.

2012-05-27

A small logging intermezzo

I have a lot of (mostly TRACE) debug statements in my ladder queue implementation, and I was wondering whether they should be removed, since some of them are executed very often. I set up a little experiment that executed a for loop 5 billion times. The results, on a stock JDK7, are the following:

  • Nothing: no impact (obviously)
  • Calling an empty private final function with a static string parameter: no impact
  • Calling an empty private final function with a dynamically composed string (e.g. "iteration" + i): large impact
  • Calling an empty private final function with a static string and an integer (in separate parameters): no impact
  • Calling logger.debug on an slf4j logger with an attached but disabled logback logger, with a static string: 8.0 × 10⁻⁹ seconds per invocation
  • Calling logger.debug on an slf4j logger with an attached but disabled logback logger, with a static string and an int: 2.14 × 10⁻⁸ seconds per invocation
  • Calling logger.debug on an slf4j logger with an attached but disabled logback logger, with a static string and a String: 1.8 × 10⁻⁸ seconds per invocation

The last items mean calling it like this: logger.debug("something {}", i);. In this table, "large impact" means "I let it run for 40 minutes, after which I gave up and spent the next 15 minutes trying to keep my computer from crashing due to thermal overload" (lots of "don't you die on me" and mouth-to-fan resuscitation were involved).

The overhead caused by using slf4j statically with a static string isn't too bad, but when using the {} syntax, logging with an object parameter seems to take about twice as long as logging only a simple string. Logging an integer seems to take even longer than that, due to auto-boxing.
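For reference, the benchmark loop was roughly of this shape. This is a reconstruction with invented names and a much smaller iteration count, using an empty method instead of a real slf4j logger so it runs standalone:

```java
// Hypothetical reconstruction of the micro-benchmark (names and iteration
// count are mine, not the original code): time a tight loop around an empty
// method, once with a static string and once with per-iteration concatenation.
public class LogOverheadBench {
    private static void log(String msg) { /* intentionally empty */ }

    /** Nanoseconds per call for the static-string variant. */
    public static double benchStatic(long iterations) {
        long t0 = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            log("static message");          // constant: the JIT can drop this entirely
        }
        return (System.nanoTime() - t0) / (double) iterations;
    }

    /** Nanoseconds per call for the concatenation variant. */
    public static double benchDynamic(long iterations) {
        long t0 = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            log("iteration " + i);          // builds a new String every pass
        }
        return (System.nanoTime() - t0) / (double) iterations;
    }

    public static void main(String[] args) {
        long n = 50_000_000L;               // far fewer than the 5 billion in the post
        System.out.println("static:  " + benchStatic(n) + " ns/call");
        System.out.println("dynamic: " + benchDynamic(n) + " ns/call");
    }
}
```

The concatenation variant is the one that hurts: the String is built whether or not anyone ever looks at it.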

I also tried adding an if (logger.isDebugEnabled()) check before the logger.debug() statement, but this did not have any measurable impact. So there's no need to do things like


if(logger.isDebugEnabled())
   logger.debug("entering whatever");

as long as you're not doing anything funny in the logging statement.

The difference is smaller than I previously thought it would be, but I'd still like to give the LadderQ the best possible chances. So I guess I'll use a small interface and decide at compile time to either send debugging info to slf4j or do nothing with it at all. This has no overhead and still lets me keep the debug statements.
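One common way to get that zero-overhead switch is a compile-time constant flag, which javac and the JIT fold away. This is a hypothetical sketch, not the actual interface I'll end up using:

```java
// Hypothetical sketch: a compile-time constant debug flag. Because DEBUG is
// a constant-false, the guarded code (including any string concatenation at
// the call site) is removed at compile/JIT time.
public final class Dbg {
    public static final boolean DEBUG = false; // flip and recompile to enable

    public static void trace(String msg) {
        if (DEBUG) {
            System.err.println(msg); // or forward to the slf4j logger instead
        }
    }
}
```

At call sites, guarding with the constant means even the argument expression never executes: if (Dbg.DEBUG) Dbg.trace("bucket " + k);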

<Insert an obligatory clenbuterol zero zero zero zero joke about the "0,0000001038 seconds per invocation">

2012-05-26

Small progress report

I'm currently trying to figure out the cause of the out-of-memory exceptions thrown when running with the PairingHeapQueue: does that queue simply demand a large amount of memory, or is there some sort of leak going on?

In other news, the Ladder Queue seems to withstand my torture testing quite well (after some minor adjustments, like the ability to deal with more than 5 rungs, which is very unlikely to come up in a real simulation), so it seems about ready to be plugged into the simulator.

2012-05-18

VM crashes and more...

When I tried to rerun the last experiment with the more accurate measurement method yesterday, the test did not finish. Some investigation showed that the Java VM was crashing for some reason (link to dump). I have a weak suspicion that the crash is somehow caused by a bug in the subprocess Python module: it may be doing strange things buffering input, causing the JVM to write to some illegal part of memory or something like that.


To remedy this, I once again changed the script to send its stdout to a file instead of PIPE, which reintroduced the "cutting off the output" bug I encountered earlier. I tried a lot of different things to solve this problem, like changing the subprocess.call to subprocess.Popen, changing shell=True to shell=False, etc. None of this had any effect. I'm currently running again with a simple System.out.flush() after the print statement in the Java code (I hope this flush is propagated from the Java program to ant to maven to python). If that doesn't work, my backup solution is to send the timing output on stderr, which should be a pretty much empty stream and shouldn't have the same problems.
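The stderr backup plan would look something like this on the Java side (a hedged sketch; class and method names are invented):

```java
// Hypothetical sketch of the stderr fallback: write the timing record to
// stderr (a nearly empty stream in this setup) and flush explicitly, so
// stdout buffering anywhere in the ant/maven/python chain can't swallow it.
public class TimingReport {
    /** Builds the same "name;events;nanos" record the benchmarks emit. */
    public static String record(String queueName, long events, long nanos) {
        return queueName + ";" + events + ";" + nanos;
    }

    public static void report(String queueName, long events, long nanos) {
        System.err.println(record(queueName, events, nanos));
        System.err.flush(); // belt and braces: don't rely on auto-flush
    }
}
```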

Ladder Queue work

Yesterday I also fixed the creation of a new rung from bottom. After this, my simulation-like test succeeded completely. Today I'm running something of a torture test on the ladderQ, treating it like an ordinary queue (in something that's definitely not a discrete event simulation). I'm inserting a large amount of completely randomly distributed [0-10000] events, removing a random amount of them, adding another large amount, and repeating that a few hundred times. The ladder queue will not perform well (in a time-optimal sense) in a test like this, but it should be able to handle it nonetheless.
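The torture test is roughly this shape. In this standalone sketch, java.util.PriorityQueue stands in for the ladder queue, and all names and sizes are made up:

```java
import java.util.PriorityQueue;
import java.util.Random;

// Sketch of the torture test, with java.util.PriorityQueue standing in for
// the ladder queue (the real test targets the LadderQ implementation).
// Each round inserts a batch of uniformly random timestamps in [0, 10000],
// removes a random number of them, and checks the removals come out in
// nondecreasing order.
public class TortureTest {
    public static void run(int rounds, int batch, long seed) {
        PriorityQueue<Integer> q = new PriorityQueue<>();
        Random rnd = new Random(seed);
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < batch; i++) {
                q.offer(rnd.nextInt(10001));          // uniform in [0, 10000]
            }
            int removals = rnd.nextInt(q.size() + 1); // remove a random amount
            int last = Integer.MIN_VALUE;
            for (int i = 0; i < removals; i++) {
                int ts = q.poll();
                if (ts < last) {
                    throw new AssertionError("out of order: " + ts + " after " + last);
                }
                last = ts;
            }
        }
    }

    public static void main(String[] args) {
        run(300, 10_000, 42L);
        System.out.println("torture test passed");
    }
}
```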

This exposed another bug in the pseudocode, and one I definitely did not see in advance. When a new element is inserted that's smaller than the start of the lowest rung, it should be inserted into bottom. To prevent degradation into a linked list, a check is made to see whether bottom is too full when inserting that element; if it is, bottom is moved to a newly created rung. The code for this goes like this (verbatim):

create_new_rung(NBot)
transfer bottom to it
bucket_k = <formula>
insert into tail of rung NRung, bucket_k
<increase number of event in that bucket>

There are a number of problems with this code. First of all, create_new_rung creates a new rung from an existing rung, while there is no guarantee that a single rung even exists at that time. Furthermore, it simply does not handle the creation of a new rung from bottom correctly! RStart and RCur for the new rung should obviously be adjusted to the values in bottom, but they are not. The instruction insert into tail of rung NRung, bucket_k is also very much incorrect, since there is no guarantee that the event will even fit in that rung: the rung was created before the new event was added. I'll handle this by replacing insert into tail of rung NRung, bucket_k with a recursive call to offer() for the new event.
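To illustrate the fix, here's a heavily simplified, hypothetical sketch (single rung, double timestamps, invented names and thresholds; the real ladder queue has multiple rungs, RCur bookkeeping, and so on) of the corrected bottom-overflow path:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified sketch of the corrected "bottom is too full" path: the new
// rung's start and bucket width are derived from bottom's own contents, and
// the triggering event is re-offered recursively instead of being forced
// into a pre-chosen bucket. All names and constants here are invented.
class MiniLadder {
    static final int THRESHOLD = 50;     // invented bottom-size limit
    static final int MAX_BUCKETS = 1024; // safety clamp for this sketch

    final List<Double> bottom = new ArrayList<>();
    List<List<Double>> rung;             // buckets of the single rung
    double rStart, bucketWidth;

    void offer(double ts) {
        if (rung != null) {              // simplification: one rung, ever
            insertIntoRung(ts);
            return;
        }
        if (bottom.size() >= THRESHOLD) {
            double min = Collections.min(bottom);
            double max = Collections.max(bottom);
            // RStart and BucketWidth come from bottom itself,
            // not from some pre-existing rung:
            rStart = min;
            bucketWidth = Math.max((max - min) / bottom.size(), 1e-9); // guard equal TS
            rung = new ArrayList<>();
            for (double t : bottom) insertIntoRung(t);
            bottom.clear();
            offer(ts); // recursive offer() replaces "insert into tail of rung, bucket_k"
            return;
        }
        bottom.add(ts);
    }

    void insertIntoRung(double ts) {
        int k = (int) Math.min((ts - rStart) / bucketWidth, MAX_BUCKETS - 1);
        if (k < 0) k = 0;                                   // below rStart (simplification)
        while (rung.size() <= k) rung.add(new ArrayList<>()); // grow buckets on demand
        rung.get(k).add(ts);
    }
}
```

The key point is the recursive offer(): the new event goes through the same dispatch logic as any other, so it cannot end up in a rung it doesn't fit in.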

2012-05-17

Better measurements

I tried out my own suggestion from the previous blog post: placing the time measurements in the model itself, with the start time taken just before calling bench() on the first entity and the stop time taken exactly when the last entity stops working. Compared to the old "experiment"-level way of timing things, I get about a 2.5 second decrease, which is obviously a good thing since those 2.5 seconds are completely irrelevant to my experiments.

Fix for latest bug

The issue I mentioned last week, with the simulation not ending when all of the entities were finished, is fixed. Originally I thought I could fix it by simply setting the simulationBound to some incredibly high number (Long.MAX_VALUE - 1 or something like that), so the simulation would spontaneously stop after the last event was processed, but for some reason that didn't work.

The solution I'm using now is slightly kludgy: each of the entities reports back to the model when it's finished, and the model does something like this:

 @Override
 public synchronized void done() {
  runningNodes--;
  
  if(runningNodes == 0) {
   ParallelEventSystem.mayFinish();
  }
 }
This seems to work pretty well. If I wanted to, I could even print the elapsed time in this function; it would probably be more accurate. I'll try that next.

2012-05-07

SimulationBound & Benchmarking

To compare the performance of data structures as the distribution of events changes, using the "simulationBound" (which stops the simulation when the simulation clock reaches a certain time) is not ideal. For example, changing the distribution from 10-1000 to 1-10 makes the simulation take about 100 times longer if the same simulationBound is kept. That would make it difficult to graph the time for a run as parameters change (which is something I'd like to do).

I just tried it out: on my benchmark machine, running with 10 nodes and the Java priority queue, about 824 000 events are processed per node per minute. That comes to 8 240 000 events per minute in total, or 137 333 events per second (running on 1 core).

A small problem when actually setting the number of events to process to 8 240 000 is that the simulation does not stop: the entities stop running but the simulation (process) keeps on running.
I guess I could use System.exit() or other dirtiness, but I'll ask Kurt at the next meeting what the preferred way of stopping the simulator is. One problem I can see with a simple System.exit() is that the exit would happen after a *single* Entity completes its work, while other Entities may still have work remaining.

2012-05-06

Creating a new rung because the first bucket of the last rung is too full

... seems to work!

!!! Woohoo !!!

No major problems, other than those outlined in the previous post, and a small omission in the pseudocode (in recurse_rungs, setting k = 0 after expanding a bucket to a rung).

Next up: fixing the creation of new rungs from an overfull bottom.

Current work on the Ladder Queue

Good morning everyone

(You guys asked for that one...)

So, after fixing the RCur problem and finally getting a successful run of my test suite, I realised that at no point did I actually have more than one rung in my ladderQ (it's a good thing multiple rungs are so hard to create; it means the ladderQ is reasonably efficient). So I forced the creation of a new rung by adding a lot of events that are really close together in time. As expected, this revealed a number of other problems. For example, there are two ways of creating a new rung:

  • The first bucket of the lowest rung is getting too full
  • The bottom structure is getting too full
In both of these scenarios I think I need to look at all the events that will be moved to the new rung to find their lowest and highest TS, so I can do a decent job of setting RCur, RStart and BucketWidth for that rung. The reason I didn't do this in the first place is that in the pseudocode, the function create_new_rung simply does something like this:

BucketWidth[NRung] = BucketWidth[NRung-1]
RStart[NRung] = RCur[NRung] = RCur[NRung-1]
 
which is a strange thing to do, since that way there's really no way for the bucket widths of two rungs to differ. And that's a pretty essential feature (they even use an array to indicate that this value differs per rung, so once again I don't know what they were thinking). It should be a reasonably easy problem to solve.

2012-05-01

First SleepModel results

After tinkering with the timebound of the SleepModel, the first results are in (with time uniformly distributed between 10 and 1000):


EncapsulatedSkipList;140000000;27289561472
EncapsulatedSkipList;140000000;27263209214
EncapsulatedSkipList;140000000;27333339798
EncapsulatedSkipList;140000000;29876444971
EncapsulatedSkipList;140000000;33150127870
FibonacciHeapQueue;140000000;25657102324
FibonacciHeapQueue;140000000;25408097658
FibonacciHeapQueue;140000000;25290353285
FibonacciHeapQueue;140000000;26417257884
FibonacciHeapQueue;140000000;24674895865
EventHeapQueue;140000000;22657880920
EventHeapQueue;140000000;24321408297
EventHeapQueue;140000000;23273707683
EventHeapQueue;140000000;24240075311
EventHeapQueue;140000000;23326448113
CalendarQueue;140000000;26320928253
CalendarQueue;140000000;25328098246
CalendarQueue;140000000;25755167657
CalendarQueue;140000000;25315079319
CalendarQueue;140000000;26312306105
PairingHeapQueue;140000000;32619338904
PairingHeapQueue;140000000;31721119742
PairingHeapQueue;140000000;32086773030
PairingHeapQueue;140000000;30888228364
PairingHeapQueue;140000000;31549981466
SkipList;140000000;26642718242
SkipList;140000000;29466671062
SkipList;140000000;29355682538
SkipList;140000000;27473396963
SkipList;140000000;29299904675
JavaPriorityQueue;140000000;24293036390
JavaPriorityQueue;140000000;24281461792
JavaPriorityQueue;140000000;24290784409
JavaPriorityQueue;140000000;24262189242
JavaPriorityQueue;140000000;24499839511
SplayTreeQueue;140000000;26288893989
SplayTreeQueue;140000000;26309865599
SplayTreeQueue;140000000;26307518832
SplayTreeQueue;140000000;26505388409
SplayTreeQueue;140000000;27006209760
TreeMapQueue;140000000;23357232492
TreeMapQueue;140000000;23308374541
TreeMapQueue;140000000;24277818530
TreeMapQueue;140000000;24300141711
TreeMapQueue;140000000;24278979156
HashMapQueue;140000000;25264645807
HashMapQueue;140000000;25271931607
HashMapQueue;140000000;25299499125
HashMapQueue;140000000;25314163271
HashMapQueue;140000000;25292758506

which corresponds to the following relative results (second column: time as a percentage of the JavaPriorityQueue time):

EventHeapQueue        96.0300322964%
TreeMapQueue          99.9466222301%
JavaPriorityQueue    100.0%
HashMapQueue         104.124914536%
FibonacciHeapQueue   104.599741327%
CalendarQueue        106.028554794%
SplayTreeQueue       108.312128402%
EncapsulatedSkipList 112.525554292%
SkipList             120.621484188%
PairingHeapQueue     130.589112348%

The EventHeapQueue and TreeMapQueue do a good job in both this synthetic benchmark and the actual simulation. The HashMapQueue and the JavaPriorityQueue are about the same in both (reasonably good).
The CalendarQueue, which did quite well in the practical benchmark, was a lot worse in the synthetic one. The PairingHeapQueue apparently just sucks all the time.

2012-04-29

Some of the problems during LadderQ implementation

When meeting with Kurt earlier this week, I was asked to put together some text detailing the sort of problems I've been solving during implementation. This post may be rather long and quite frankly uninteresting so I won't blame anyone if they choose not to read it. Those who enter here, abandon all hope.

TL;DR: Worked on LadderQueue, encountered difficulties and eventually overcame them...

Firstly, as I've mentioned before, the pseudocode is incredibly high-level at times, and low-level at others. For example, it contains lines like

transfer Top to rung 1 of Ladder
which is in itself non-trivial. Luckily this same operation is detailed elsewhere in the code, but the same still applies to other examples.

An example of the low-level stuff is the manual incrementing and decrementing of sizes every time something is added to or removed from an array. In a modern language this is something that's almost always handled by the container itself. Care must be taken when removing these variables, however: the NRung variable, which indicates the number of rungs currently in the ladder, should still be used in addition to the list containing the rungs. This is because while the number of rungs in use (as indicated by NRung) varies throughout the program, the number of rungs in the structure does not, for performance reasons. You could argue that the ArrayList (or whatever structure contains the rungs) will do the right thing and won't immediately resize downwards when a rung is removed, but that discounts the performance advantage of keeping the unused rungs in memory: the objects for each rung don't have to be created again. I do still keep a counter for the number of elements currently in the ladder part of the ladderQ (the rungs), because iterating over a list of lists of lists and adding up sizes isn't the fastest of things.

A pretty big problem I also have with the implementation is the fact that variables are almost never initialised. When talking about code like

while (TS < RCur[x] && x <= NRung)
    x++;

one can guess what the initial value for x should be (in this case 1, as the rungs are numbered from 1 onwards). This gets a lot trickier with the many arrays used throughout the program. For example, each rung numbered 2 and up is initially sized at 50 buckets (the first rung's size is determined at runtime, but it can still grow wildly), yet it's perfectly possible that at some point 300 buckets are needed in some rung (when the initial dequeueing into rung 1 is a bad estimate of the distribution in the first epoch). I solve this by simply checking whether the structure is large enough and growing it if it isn't, but it's not really ideal.
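The grow-if-needed check can be as small as this hypothetical helper (names are mine, not from the implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: make sure a rung has at least `index + 1` buckets
// before writing into bucket `index`, growing the backing list on demand.
public final class Buckets {
    public static <T> void ensureBucket(List<List<T>> rung, int index) {
        while (rung.size() <= index) {
            rung.add(new ArrayList<T>());
        }
    }
}
```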

The pseudo code even contains a few things that are blatantly wrong, like for example the previously mentioned fragment of code

while (TS < RCur[x] && x <= NRung)
    x++;
that will access an element that is out of range on the last iteration unless rewritten to
while (x <= NRung && TS < RCur[x])
    x++;

But by far the most problematic are some general omissions in the code. Take for example a scenario in which three events are inserted into the queue, all having the same timestamp: 15. These will initially be inserted into top. When the first element is polled (or peek'd, for that matter), top will be moved to the first rung, which will have to be created. In order to create a rung from top, the bucket width (the width in time of a bucket) is calculated as

Bucketwidth[1] = (MaxTS - MinTS) / NTop;
which will be (15 - 15) / 3 = 0, so the later division by the bucket width yields +INF. Now, I'll be the first to admit that a scenario like this won't give you a good epoch, because the code has no idea about the distribution of the incoming events, but that doesn't change the fact that something like this should be taken into account. A second problem with this same scenario is that events with equal timestamps will obviously end up in the same bucket. In the same dequeue operation in which they were moved from top to ladder, the bucket containing all the events will also be moved to the sorted bottom list. If that bucket is the only bucket, NRung should be set to zero when moving it, or the next poll() operation will try to poll from an empty rung.
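A defensive version of that computation could look like this hedged sketch (MIN_WIDTH is an invented constant, not something from the paper):

```java
// Sketch of a defensive bucket-width computation for the MaxTS == MinTS
// case: fall back to a (hypothetical) minimum width instead of producing 0
// and a division by zero later on.
public final class BucketMath {
    static final double MIN_WIDTH = 1e-9; // arbitrary floor, tune for the model

    public static double bucketWidth(double maxTS, double minTS, int nTop) {
        double w = (maxTS - minTS) / nTop;
        return w > 0 ? w : MIN_WIDTH;     // all timestamps equal => w == 0
    }
}
```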

As I already explained on Tuesday, the problem I'm currently having is that when dequeueing an element, RCur is not updated. This can cause serious problems when enqueueing after that and dequeueing again. I already explained the problem to Kurt and Silas so I won't go into much detail, but suffice it to say that it's a bit of a tricky problem.

A little something extra, which isn't exactly a problem in the code but more a problem I had: the peek() operation is not specified. Rather than the

x = poll();
offer(x);

I initially had, I now use the faster

x = poll();
bottom.add(0, x);

or equivalent, which puts the event back at the first position in bottom. I don't know how often the peek() operation is used in the simulator, but it should be well worth the minor adjustment anyhow. Checking the bottom list is the first thing done during a dequeue operation, and it's impossible to know what the first event is without going through most of the logic in poll(), so this implementation should be more or less optimal. The only remaining waste is that I actually remove the event from the first position in bottom and then add it again. If peek() proves to be a frequently used operation, I can always either duplicate the code or use a strange function with some conditional logic that's either marginally faster or isn't.
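The idea in miniature, with a java.util.PriorityQueue standing in for the rest of the ladder logic (all names here are hypothetical):

```java
import java.util.LinkedList;
import java.util.PriorityQueue;

// Sketch of peek() implemented on top of poll(): run the full dequeue logic
// once, then push the polled element back onto the head of the sorted
// bottom list, which poll() checks first.
class PeekAdapter {
    private final PriorityQueue<Integer> source = new PriorityQueue<>(); // stands in for the rungs
    private final LinkedList<Integer> bottom = new LinkedList<>();       // sorted bottom list

    void offer(int ts) {
        source.offer(ts);
    }

    Integer poll() {
        if (!bottom.isEmpty()) return bottom.removeFirst(); // bottom is checked first
        return source.poll();   // stands in for the full ladder dequeue logic
    }

    Integer peek() {
        Integer x = poll();
        if (x != null) bottom.addFirst(x); // cheaper than re-offering through the ladder
        return x;
    }
}
```

Repeated peeks after the first are nearly free, since the element sits at the head of bottom.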

2012-04-08

Latest results

I dramatically increased the time it takes to test one queue. After this change, however, I encountered a rather strange bug that causes part of the output of a simulation run to disappear. The bug is not critical, though.

The raw results I gathered:

TreeMapQueue;250000;44504685563
TreeMapQueue;250000;43510640683
TreeMapQueue;250000;43507232275
TreeMapQueue;250000;43493418310
TreeMapQueue;250000;45450831695
PairingHeapQueue;250000;126060167710
PairingHeapQueue;250000;125240863979
PairingHeapQueue;250000;145046297489
PairingHeapQueue;250000;137354103844
PairingHeapQueue;250000;144051341014
HashMapQueue;250000;52394058529
HashMapQueue;250000;51806601251
HashMapQueue;250000;51151331841
HashMapQueue;250000;53559766910
HashMapQueue;250000;52176779442
FibonacciHeapQueue;250000;54482560996
FibonacciHeapQueue;250000;54985676302
FibonacciHeapQueue;250000;54372282025
FibonacciHeapQueue;250000;53014851349
FibonacciHeapQueue;250000;52796625042
JavaPriorityQueue;250000;51551165898
JavaPriorityQueue;250000;51443365330
JavaPriorityQueue;250000;50457917413
JavaPriorityQueue;250000;52876494280
JavaPriorityQueue;250000;53129498732
SkipList;250000;58607951279
SkipList;250000;60014792829
SkipList;250000;58422165150
SkipList;250000;57934313472
SkipList;250000;60079309281
CalendarQueue;250000;42451213404
CalendarQueue;250000;44537189695
CalendarQueue;250000;42570999140
CalendarQueue;250000;40472274434
CalendarQueue;250000;42437191377
EventHeapQueue;250000;40552683898
EventHeapQueue;250000;39938462029
EventHeapQueue;250000;42589802688
EventHeapQueue;250000;42522441333
EventHeapQueue;250000;41810384506
EncapsulatedSkipList;250000;60108290120
EncapsulatedSkipList;250000;59988195877
EncapsulatedSkipList;250000;59373120898
EncapsulatedSkipList;250000;58392384034
EncapsulatedSkipList;250000;60109929789
SplayTreeQueue;250000;58798470148
SplayTreeQueue;250000;61035334794
SplayTreeQueue;250000;60905879898
SplayTreeQueue;250000;61560373083
SplayTreeQueue;250000;60811032674

which gives the following relative results (second column: time as a percentage of the time the same experiment takes with the Java priority queue):

EventHeapQueue        81.1046341585%
CalendarQueue         82.3477270873%
TreeMapQueue          84.4028256686%
JavaPriorityQueue    100.0%
HashMapQueue         101.213577876%
FibonacciHeapQueue   105.472458436%
SkipList             113.688895795%
EncapsulatedSkipList 116.366322336%
SplayTreeQueue       118.146464463%
PairingHeapQueue     266.442283994%
So the EventHeapQueue, CalendarQueue and TreeMapQueue are all much faster than the Java priority queue, while you should definitely stay away from the pairing heap queue and, to a lesser extent, the splay tree queue.

2012-04-01

Mounting with samba and new results

I just couldn't resist trying the network thing: when developing from home, I can skip a few steps by directly mounting my "main" directory over the network and compiling on my benchmark machine from the project on my development machine:
mount -t cifs -o username=ives,password=<passwd>,uid=ives,gid=ives //ix/users/Ives/stage /stage

This obviously doesn't work if I'm not in the same network (I don't feel like setting up a VPN).

Now, for the worse news: I ran the benchmark again earlier this week, with the following raw results:

JavaPriorityQueue;10000;6561024477
JavaPriorityQueue;10000;5404055641
JavaPriorityQueue;10000;5448148997
JavaPriorityQueue;10000;5518502627
JavaPriorityQueue;10000;5400359276
JavaPriorityQueue;10000;5430126603
JavaPriorityQueue;10000;5759644066
JavaPriorityQueue;10000;5401271246
JavaPriorityQueue;10000;5464453238
JavaPriorityQueue;10000;5407788605
PairingHeapQueue;10000;5446775232
PairingHeapQueue;10000;5423628469
PairingHeapQueue;10000;5418897748
PairingHeapQueue;10000;5421841291
PairingHeapQueue;10000;5409722011
PairingHeapQueue;10000;5729907979
PairingHeapQueue;10000;5416062530
PairingHeapQueue;10000;5477439257
PairingHeapQueue;10000;5433077898
PairingHeapQueue;10000;5412777948
TreeMapQueue;10000;5425671545
TreeMapQueue;10000;5426851101
TreeMapQueue;10000;5425298384
TreeMapQueue;10000;5399660301
TreeMapQueue;10000;5392418944
TreeMapQueue;10000;5391327447
TreeMapQueue;10000;5926422059
TreeMapQueue;10000;5393873311
TreeMapQueue;10000;5443363135
TreeMapQueue;10000;5744871062
SkipList;10000;5397133306
SkipList;10000;5675809273
SkipList;10000;5422119755
SkipList;10000;5392720363
SkipList;10000;5400829320
SkipList;10000;5422794354
SkipList;10000;5415324309
SkipList;10000;7051620337
SkipList;10000;5416224427
SkipList;10000;5420276478
SplayTreeQueue;10000;5395243251
SplayTreeQueue;10000;5412487676
SplayTreeQueue;10000;5545716916
SplayTreeQueue;10000;5442838378
SplayTreeQueue;10000;5430914414
SplayTreeQueue;10000;5610695557
SplayTreeQueue;10000;5464748603
SplayTreeQueue;10000;5408613093
SplayTreeQueue;10000;5501706374
SplayTreeQueue;10000;5405540312
HashMapQueue;10000;5396951336
HashMapQueue;10000;5384225368
HashMapQueue;10000;5393222775
HashMapQueue;10000;5398562791
HashMapQueue;10000;5368138462
HashMapQueue;10000;5419371366
HashMapQueue;10000;5769373695
HashMapQueue;10000;5489567527
HashMapQueue;10000;5438062242
HashMapQueue;10000;5873245080
EncapsulatedSkipList;10000;5621653702
EncapsulatedSkipList;10000;5439568704
EncapsulatedSkipList;10000;5412919542
EncapsulatedSkipList;10000;5461816898
EncapsulatedSkipList;10000;5462168806
EncapsulatedSkipList;10000;5402203002
EncapsulatedSkipList;10000;5425761816
EncapsulatedSkipList;10000;5393928593
EncapsulatedSkipList;10000;5490411424
EncapsulatedSkipList;10000;5548951710
FibonacciHeapQueue;10000;5666839845
FibonacciHeapQueue;10000;6225841498
FibonacciHeapQueue;10000;5429501483
FibonacciHeapQueue;10000;5423790030
FibonacciHeapQueue;10000;5815081783
FibonacciHeapQueue;10000;5437626501
FibonacciHeapQueue;10000;5462466700
FibonacciHeapQueue;10000;5400344311
FibonacciHeapQueue;10000;5480721465
FibonacciHeapQueue;10000;5426114305
CalendarQueue;10000;5380552101
CalendarQueue;10000;5488739278
CalendarQueue;10000;5993241530
CalendarQueue;10000;5388830581
CalendarQueue;10000;5406325242
CalendarQueue;10000;5446190891
CalendarQueue;10000;5401955885
CalendarQueue;10000;5407636115
CalendarQueue;10000;5406300099
CalendarQueue;10000;5422167002
EventHeapQueue;10000;5700755620
EventHeapQueue;10000;5400055933
EventHeapQueue;10000;4371602178
EventHeapQueue;10000;4549412219
EventHeapQueue;10000;6803269647
EventHeapQueue;10000;5401834240
EventHeapQueue;10000;5759731649
EventHeapQueue;10000;5439161377
EventHeapQueue;10000;5446489216
EventHeapQueue;10000;4398517379

This translates to the following relative results:

CalendarQueue         99.2563918127%
HashMapQueue          99.4717906758%
SkipList              99.4884038778%
PairingHeapQueue      99.5499291959%
TreeMapQueue          99.5874295653%
EventHeapQueue        99.8350335131%
SplayTreeQueue        99.9025243435%
JavaPriorityQueue    100.0%
EncapsulatedSkipList 100.250872379%
FibonacciHeapQueue   100.262799402%

So once again all results are within the margin of error... At least now the time the experiment takes correlates with the time limit I set. Also, I'm sure the actual queues are being used, so it's not like I screwed up majorly (at least not with that). Each experiment still doesn't take very long (a few seconds), so maybe I should increase the time a single experiment takes, as setup may still account for too large a share of the total.

The next step will be to see what's taking up time in a profiler, and to increase the time limit a single experiment can take.

2012-03-24

Small status update

The svn update seems to have gone well: I just mounted the directory I use on Windows into my virtualbox through vboxsf[sic], and everything compiled fine after changing two config files. The core tests also run once again, which is always nice.

This setup, with the directory on my windows machine but accessible from xubuntu, is working well for me (works with my backup strategy, gives me svn access from both windows and linux and doesn't require any synchronization).

Next up is synchronizing with my benchmark machine (I guess I could try something with a network share but that's probably more trouble than it's worth), compiling and then running the benchmark again.

2012-03-23

This week's progress

Earlier this week I encountered a "show stopping bug": when running with coroutines enabled, the simulator seemed to stop immediately without actually doing much of anything. I figured this out after running the benchmarks and getting pretty much exactly the same result (timing) for all of the queues, which was suspicious to say the least. I mailed Kurt, who reproduced and fixed the bug. I updated my svn repo today and will be running the benchmarks again tomorrow. Some of the things that make my life more interesting:
  • My windows machine has an svn repo but can't compile, nor benchmark
  • My xubuntu machine can compile but doesn't have an svn repo (it seems to have been lost in transit) and also can't be used for benchmarking because it's a virtual machine
  • My fedora machine can compile and benchmark, but is always controlled remotely which makes it inappropriate for development (although I'm normally a vim guy, I must admit I prefer eclipse for java development). Due to ridiculously unreliable WoL I can only access this machine during the weekend.

So you can imagine the juggling I do... I'll probably fix the Xubuntu machine for svn tomorrow, and might even spend some more time on getting WoL working on the fedora machine (although I'm reluctant to spend another afternoon on that problem).

2012-03-10

This week's progress

This week I finished the transition from javaflow to coroutines. There's still a test that hangs when I try to run it, but the actual benchmark seems to run okay.

Furthermore, I also continued working on the LadderQ.

2012-03-02

This week's progress

Another week, another pound of blood, sweat and tears...

So, since the last update (which was only four days ago, mind you), I installed a recent 64-bit version of Xubuntu in a VM. I also installed the coroutine JDK and eclipse. By way of proving that the coroutine JDK works, I ran eclipse on it and it seemed to work fine. Minutes ago, I installed maven and copied over my entire GES tree.

Now, for the bad news...

When compiling the core project with the jdk7 pom.xml and


ives@vbox64:~/stage/branch/students/ives/core$ echo $JAVA_HOME
/home/ives/Programmas/corovm-jdk
ives@vbox64:~/stage/branch/students/ives/core$ which javac
/home/ives/Programmas/corovm-jdk/bin/javac
ives@vbox64:~/stage/branch/students/ives/core$ which java
/home/ives/Programmas/corovm-jdk/bin/java

I still get the same error I got before (on Windows):



[Invoker 160863352] Invoking be.ac.ua.comp.gestest.systemtest.de.multicore.SimpleTest.testSingleCoreDelayedEntityCalls
2012-03-02 23:24:15,914 [Core0] ERROR org.apache.commons.javaflow.bytecode.StackRecorder  - Process.SuspendNow(..) fall through.
be.ac.ua.comp.core.de.EventSystemError: Process.SuspendNow(..) fall through.
 at be.ac.ua.comp.core.de.AbstractProcess.suspendNow(AbstractProcess.java:157)
 at be.ac.ua.comp.core.de.AbstractProcess.delay(AbstractProcess.java:317)
 at be.ac.ua.comp.core.de.AbstractProcess.delay(AbstractProcess.java:282)
 at be.ac.ua.comp.core.de.AbstractProcess.delayCurrent(AbstractProcess.java:375)
 at be.ac.ua.comp.gestest.systemtest.de.multicore.SimpleTest$2.run(SimpleTest.java:389)
 at be.ac.ua.comp.core.de.AbstractSchedulable.internalRun(AbstractSchedulable.java:87)
 at be.ac.ua.comp.core.de.AbstractProcess$1.run(AbstractProcess.java:59)
 at org.apache.commons.javaflow.bytecode.StackRecorder.execute(StackRecorder.java:92)
 at org.apache.commons.javaflow.Continuation.continueWith(Continuation.java:152)
 at org.apache.commons.javaflow.Continuation.continueWith(Continuation.java:125)
 at be.ac.ua.comp.core.de.continuation.JFContinuation.start(JFContinuation.java:28)
 at be.ac.ua.comp.core.de.AbstractProcess.start(AbstractProcess.java:56)
 at be.ac.ua.comp.core.de.multicore.ParallelEventCore.run(ParallelEventCore.java:602)
 at java.lang.Thread.run(Thread.java:734)

For some reason it seems like that test is still trying to use javaflow (and failing). I haven't mailed Kurt or Silas about this problem yet; I'll probably do so tomorrow. First I'm going to try to get some rest, maybe even some sleep.

2012-02-27

Intermediary Progress

So far the upgrade to jdk7+coroutines has been going a little slower than I would have liked. The coroutine-patched version of the openjdk is only available in precompiled binary format for linux, which kind of screws up my development-on-windows, benchmarking-on-fedora strategy.

Since I don't have access to the Fedora machine 4 days a week (I tried WoL stuff, but the combination closed laptop+fedora+WoL didn't work very well), I'll be continuing my development efforts on a virtual machine on my Windows host.

Unfortunately the only VM I still had lying around did not have enough free space for the DES and the coroutine JDK. After making some room and installing maven, I still got pretty weird errors. My coin dropped (only Flemish people will get that one) quickly and I realized that that VM was a 32-bit VM, while the prebuilt coroutine JDK was 64-bit software. A 64-bit Fedora image I tried was way too slow to be practical.

I'll be downloading a 64bit linux image tomorrow on campus to spare my home internet connection. By sheer luck the benchmark machine does run a 64-bit version so that's nice.

2012-02-24

This week's progress

I worked on the LadderQ this week. Over 400 lines of pretty complicated code later and I'm still working on it. I'm starting to understand why none of the simulators I reviewed used this structure...

Some of the problems I had during implementation:

  • Pseudo pseudo code. By this I mean that the code near the end of the paper wasn't low-level enough to call it pseudocode. E.g., four lines out of the pseudocode algorithm:
    transfer Top to rung 1 of Ladder
    bucket k = recurse rung();
    sort events from bucket k and copy to Bottom
    return first event from Bottom
  • Missing initialisations. Initialising variables is apparently optional.
  • Missing code (show me where MaxTS and MinTS are adjusted)
  • Missing considerations. E.g.

    Note that if MaxTs is equal to MinTs, it means all the events in Top have the same timestamp. In this article we consider the mean jump to be finite and positive. Thus, the likelihood of this occurring is extremely low.

    After which they ignore the issue completely. Apparently nobody considered the possibility that there might be only one event in the Event Queue at the moment the first event is pulled out (this event's timestamp == MaxTs == MinTs, because of which BucketWidth == 0, which causes a division by zero and a crash). This might not occur in normal situations, but if you're going to write an ADT, it had better be correct all of the time.

I suppose it could be worse.
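For what it's worth, that degenerate case is trivial to guard against when computing the bucket width. A minimal sketch (the names maxTs, minTs and eventCount are mine, not the paper's):

```java
// Sketch: guard the ladder bucket-width computation against the
// all-equal-timestamps case described above. Names are illustrative,
// not taken from the Tang et al. paper.
public final class BucketWidth {
    public static double compute(double maxTs, double minTs, int eventCount) {
        if (eventCount <= 0) {
            throw new IllegalArgumentException("empty rung");
        }
        double width = (maxTs - minTs) / eventCount;
        // If maxTs == minTs (e.g. only one event in the queue), fall back
        // to an arbitrary non-zero width so bucketing never divides by zero.
        return width > 0.0 ? width : 1.0;
    }
}
```

The fallback width of 1.0 is arbitrary; any positive constant keeps the structure correct, since a rung with a single distinct timestamp degenerates to one bucket anyway.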

On an unrelated note, I'll be changing to the new openJDK7+coroutines way of life from the current sun JDK6+javaflow either tomorrow or today on both my development and my benchmark machines. We're hoping this will influence the results in a positive way.

2012-02-17

This week's progress

The previous week I read up on calendar queue implementations, like for example http://pioneer.netserv.chula.ac.th/~achaodit/paper5.pdf. I also realised I hadn't yet implemented a ladder queue (for an ADT that is supposedly great, it should be noted that of the dozen or so event simulators I reviewed, not a single one uses a ladderQ) so I requested and read the paper by Tang et al. that describes the LadderQ in depth.

I've started implementation of the ladderQ, including a few tests to ensure the ladderQ is correct before plugging it into the event simulator. Implementation is rather elaborate and although the pseudocode near the end of the paper definitely helps, it's still incredibly far from an implementation in an existing language.

2012-02-07

SplayTreeQueue

Today I realised I hadn't yet tried a SplayTree, so that's what I spent this afternoon doing. I based the SplayTreeQueue on ftp://ftp.cs.cmu.edu/user/sleator/splaying/SplayTree.java. During development I got the "from running to running" error for the third time, this time it was fixed by replacing the contents of poll() from something like
E e = findMin();
remove(e);
return e;
to the contents of those functions. I still have no real idea what causes these errors...

I also optimised the removeMin operation a little (taking advantage of the fact that findMin already splays the tree on the leftmost element, which means remove() does not need to repeat this operation).
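To illustrate why that optimisation works: once the minimum has been splayed to the root, the root has no left child, so removing it is a single pointer swap. The following is a simplified stand-in (plain right rotations instead of real zig-zig/zig-zag splay steps, and my own names), not the actual SplayTreeQueue code:

```java
import java.util.NoSuchElementException;

// Toy sketch of the removeMin shortcut: move the minimum to the root
// (findMin's splay does this in the real tree), after which root.left
// is null and removal is O(1).
final class MiniSplayQueue {
    private static final class Node {
        final int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;

    void insert(int key) {
        Node n = new Node(key);
        if (root == null) { root = n; return; }
        Node cur = root;
        while (true) {
            if (key < cur.key) {
                if (cur.left == null) { cur.left = n; return; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = n; return; }
                cur = cur.right;
            }
        }
    }

    int removeMin() {
        if (root == null) throw new NoSuchElementException();
        // Stand-in for findMin's splay: rotate the minimum up to the root.
        while (root.left != null) {
            Node l = root.left;   // rotate right at root
            root.left = l.right;
            l.right = root;
            root = l;
        }
        int min = root.key;
        root = root.right;        // root has no left child: O(1) removal
        return min;
    }
}
```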

But unfortunately, no cigar. The SplayTreeQueue seems to be one of the slowest (definitely not faster than the JavaPriorityQueue) priority queues tested yet.

2012-02-05

More progress and first data

I added a TreeMapQueue, CalendarQueue, PairingHeapQueue, FibonacciHeapQueue and EventHeapQueue to the benchmark (some of these datastructures were already present, just not in the benchmark).

I also ran the benchmark on MARLIN (the dedicated pc) and got the following raw results (the second column is max simulation time and the third is wall clock time in nanoseconds):

CalendarQueue;1000;12463543564
CalendarQueue;1000;12431540564
CalendarQueue;1000;12314603248
CalendarQueue;1000;12342168941
CalendarQueue;1000;12379956154
EncapsulatedSkipList;1000;12233962662
EncapsulatedSkipList;1000;12435363549
EncapsulatedSkipList;1000;13524316579
EncapsulatedSkipList;1000;12451756549
EncapsulatedSkipList;1000;13430291485
SkipList;1000;12442345820
SkipList;1000;13471379407
SkipList;1000;13259910859
SkipList;1000;13423150185
SkipList;1000;13433629771
JavaPriorityQueue;1000;11427608446
JavaPriorityQueue;1000;11331713344
JavaPriorityQueue;1000;11486361044
JavaPriorityQueue;1000;11447420677
JavaPriorityQueue;1000;11489319111
PairingHeapQueue;1000;12627835823
PairingHeapQueue;1000;12461527869
PairingHeapQueue;1000;12635402319
PairingHeapQueue;1000;12766900475
PairingHeapQueue;1000;12403462163
TreeMapQueue;1000;11477720945
TreeMapQueue;1000;11412306791
TreeMapQueue;1000;11567066375
TreeMapQueue;1000;11426513260
TreeMapQueue;1000;12277413812
FibonacciHeapQueue;1000;12358072838
FibonacciHeapQueue;1000;11414073432
FibonacciHeapQueue;1000;11296198337
FibonacciHeapQueue;1000;12390780696
FibonacciHeapQueue;1000;12356444144
HashMapQueue;1000;12510711908
HashMapQueue;1000;12231090920
HashMapQueue;1000;12543661825
HashMapQueue;1000;12463697651
HashMapQueue;1000;12440836556
EventHeapQueue;1000;11464068161
EventHeapQueue;1000;11328706434
EventHeapQueue;1000;11411157250
EventHeapQueue;1000;11466460866
EventHeapQueue;1000;11439271687

Selecting the median of each queue and setting the Java Priority Queue to 100% gives (lower is better):

EventHeapQueue        99.9288137456%
JavaPriorityQueue    100.0%
TreeMapQueue         100.264690788%
FibonacciHeapQueue   107.940858405%
CalendarQueue        108.146249739%
EncapsulatedSkipList 108.773468717%
HashMapQueue         108.877781316%
PairingHeapQueue     110.311625468%
SkipList             117.259167491%
The EventHeap (binary heap) scores ever so slightly better than the java priority queue, but this difference is well within error margins and it's safe to assume they are pretty much equivalent. The TreeMapQueue is also a solid contender, which was to be expected because it's obviously closely related to a Heap. Although the calendar queue has access to exact firing times, which should give it an advantage over other datastructures, it fails to live up to expectations.
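The percentages above were obtained roughly like this (a sketch with my own method names): take the median of each queue's five wall-clock times and divide by the JavaPriorityQueue median.

```java
import java.util.Arrays;

// Sketch of the normalization used above: median of each queue's runs,
// expressed as a percentage of a baseline queue's median.
final class MedianNormalizer {
    static long median(long[] runs) {
        long[] sorted = runs.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];   // assumes an odd number of runs
    }

    static double percentOfBaseline(long[] runs, long[] baselineRuns) {
        return 100.0 * median(runs) / median(baselineRuns);
    }
}
```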

So, the score for today:
Oracle: 1
Ives: 0

2012-02-02

Progress...

Yesterday I added some minimal timing code to my experiment to get (theoretical) nanosecond timing. Previously I relied on the timing printed by mvn, but I don't like that that includes the forking of the java executable.
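The timing code amounts to little more than bracketing the run with System.nanoTime(). A sketch (method names and the exact CSV format are illustrative, not my actual code):

```java
// Sketch: time only the simulation itself, excluding JVM/maven startup,
// and emit one CSV line per run.
final class TimedRun {
    static long timeNanos(Runnable simulation) {
        long start = System.nanoTime();
        simulation.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long elapsed = timeNanos(() -> { /* runSimulation(); */ });
        System.out.println("JavaPriorityQueue;1000;" + elapsed);
    }
}
```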

I also changed the simulation script that tries out different queueing strategies by passing different XML files to the simulator. It currently outputs simple CSV data, containing the nanosecond timestamps gathered from standard output of the simulator.

I also implemented a CalendarQueue (from here on called CQ), based on the one used by Ptolemy II, in the simulator. Porting it over was a bit harder than expected: the Ptolemy II CQ wasn't generic ("templated" on the event), used all sorts of internal Ptolemy II classes like Time and, most of all, structures the CQ in a "smart" way that takes into account when the contained events fire.

2012-01-28

Returning...

After a relatively long period of exams and projects I'm about ready to resume working on the project.

I've suspected for a while that a big part of the reason I was getting a lot of variance in my benchmark results was that I'm running them on my i7 laptop with Intel Turbo Boost. I downloaded a small intel gadget that shows your current processor speed (which isn't the one reported through eg Process Explorer) and it showed my clock speed changing at least once a second. Although it should certainly be possible to benchmark on this machine, I'm afraid I'd either have to reboot after every test, run really long tests, run in a temperature controlled environment, etc.

I decided it might be a good idea to try benchmarking on another machine, so I spent most of yesterday installing Fedora 16 on a Core2Duo laptop (I won't be working directly from this machine but controlling it through SSH and SCP, which makes Linux preferable over Windows). I won't go into the details of yesterday, but just imagine the fun I had with a BIOS that seems to be broken for USB boot, resizing partitions, corrupt MBRs, a machine that refuses to boot even from CDROM, finding out "not enough free space" meant I crossed the limit of 4 primary partitions per MBR based drive, programs crashing while they're partitioning my hard drive, and lots more. However, All's Well That Ends Well and I hope to resume doing more useful work tomorrow or the day after.

2011-11-25

This week's progress

I had a few things planned for this week (actually today would be event queueing-day), but unfortunately due to my grandfather's sudden drop in health this morning and eventual death a few hours ago I haven't been able to do, well, any of them.

I'll try to meet with Kurt in the beginning of next week.

2011-11-18

This week's progress

Other than meeting with Silas on Wednesday, this week I
  • Finished and pretty much rewrote the Queue based on a ConcurrentSkipListMap. Originally this queue used keys that were generated from the events it received (not that quick a task because of the whole ProcessInterface inheritance check in SchedulableMulticoreComparator). These keys then implemented SchedulableInterface so they could be used without rewriting SchedulableMulticoreComparator.

    I later realised that it would be much better to use the events as keys themselves and map them to something like an Integer (putting null in as the corresponding value causes problems with the delete operation on the queue...)

  • Found a simple SkipList implementation on the internet and transformed it so it can be used (eg make it use a comparator).
  • Made the queue in use a field in the configuration that can be changed at runtime. An enumeration is constructed from this field and later passed to a Factory that constructs the Queue used in the core.
  • Made a little python script that runs the simulator with different queues.
I also tried timing a few runs with different queues but noticed no measurable differences. The differences between several runs using the same queue are waaaaaaaaaaay too big unless incredibly stupid things happen in the queue. Profiling reveals less than 3% of total time spent in the relevant add() method (in reasonable implementations of the queue), which is still pretty far in the margin of error because of differences in temperature in the room (laugh all you want, I'm sure of it), search indexers that suddenly start running, automatic changes in process schedulers in operating systems etcetera.
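The events-as-keys trick from the first bullet can be sketched like this (the type names are stand-ins for the real SchedulableInterface machinery; note the non-null dummy value, since mapping events to null breaks remove()):

```java
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch: the event itself is the map key, ordered by the queue's
// comparator; the value is a dummy non-null Integer because
// ConcurrentSkipListMap cannot distinguish "mapped to null" from "absent".
final class SkipListEventQueue<E> {
    private static final Integer PRESENT = 0;
    private final ConcurrentSkipListMap<E, Integer> map;

    SkipListEventQueue(Comparator<E> order) {
        this.map = new ConcurrentSkipListMap<>(order);
    }

    void add(E event)        { map.put(event, PRESENT); }
    boolean remove(E event)  { return map.remove(event) != null; }

    E pollFirst() {
        Map.Entry<E, Integer> entry = map.pollFirstEntry();
        return entry == null ? null : entry.getKey();
    }
}
```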

2011-11-11

This week's progress

This week I met with Kurt and Silas to talk about getting the GES up and running.

Surprisingly enough, after installing the latest JDK6, Eclipse Indigo, Maven and Subversion, actually getting the GES to run was quite easy. I even managed to get the relevant parts into Eclipse without many issues.

I did have some trouble getting VisualVM to run, as evidenced by the previous blogpost. I'm also looking forward to the conversion from svn to hg/git/bzr/anything but svn. I really don't like svn and it doesn't seem to like me either.

Furthermore, I also made an EventQueue interface that's easier to implement than an AbstractQueue and have started thinking about implementing a HashMapEventQueue. I'm still figuring out what values are known when (SchedulableMulticoreComparator is a major influence) for an event, and how to combine them into a unique and sortable key.

Application not visible in VisualVM

I spent a few hours trying to figure out why I couldn't see the applications I was starting (through maven) in VisualVM.

The VisualVM FAQ talks about how you can only see java applications that are started by the same user and running on the same VM, so I first uninstalled three of my JVM's (JDK 32bit, JRE 32 bit, JRE 64bit leaving only JDK64 bit) to make absolutely sure the same VM was being used.

After this, still no GES in VisualVM. I used Process Explorer to make sure maven wasn't doing any voodoo (e.g. switching users; it would be weird but not unheard of), but no, both VisualVM and the GES were running from the same JVM executable and were started by the same user. Moments later I realised what was going on:

  • I was starting VisualVM from explorer.exe
  • I was starting mvn from powershell or a cygwin bash
  • When starting a java application, the current process is forked and the JVM runs as a child process of the current process. So even though VisualVM and the GES were both running from the same executable by the same user, the JVM was not in fact the same one, just a similar one.

Starting the VisualVM from command line fixed the problem.

2011-11-04

This week's progress

This was a pretty calm week - I didn't have any classes all week and spent most of it in the Netherlands, so I only read a paper called "Revisiting Priority Queues for Image Analysis" by Cris Hendriks. I did plan a meeting with Kurt next Tuesday.

2011-10-28

Weekly Summary

This week I reviewed
  • Facsimile Simulation Library
  • Tortuga
  • SimPy
  • GTNetS
  • SimGrid
  • JiST
  • Ptolemy II
  • OMNeT++
  • GridSim
  • NS-3
for their event queueing implementation.

Facsimile Simulation Library

In retrospect reviewing Facsimile wasn't that useful. Although it's listed on the Wikipedia list of Discrete Event Simulators, only a few percent of the code seems implemented and the project appears to be dead. But I might as well post it now.

The relevant code is written in the Scala file /libfacsimile/src/main/scala/org/facsim/facsimile/engine/Simulation.scala. It uses a scala.collection.mutable.PriorityQueue which is implemented using a Heap.

Tortuga

The Tortuga DES provides three implementations of its EventManager interface.
  • A Calendar implementation in delegate-src/src/org/mitre/sim/event/calendar/CalendarImpl.java that's based on the original paper but has discouraging performance-related comments.
  • A HashSet implementation in delegate-src/src/org/mitre/sim/event/hashSet/HashSetImpl.java that uses Java's built in hashset.

    The important part (dequeueing) is something like this (GPLv2 licensed, © MITRE corporation):

        Event result = null;
        double smallestTime = Double.POSITIVE_INFINITY;
        for (Iterator i = db.iterator(); i.hasNext(); ) {
            Event nextEvent = (Event) i.next();
            if (nextEvent.getNextTime() < smallestTime) {
                smallestTime = nextEvent.getNextTime();
                result = nextEvent;
            }
        }
        db.remove(result);

    Suffice it to say that evaluating every single event at every single step of the simulation will not get you good performance.

  • A TreeMap based implementation in delegate-src/src/org/mitre/sim/event/treeMap/TreeMapImpl.java. As you might have guessed, this uses java.util.TreeMap<K,V>, which is a red-black tree internally, so performance should be in the same ballpark as a binary heap.
The default event manager is the TreeMap one. Based on the source code I wouldn't call this the most mature of simulators.
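For reference, using a TreeMap as an event queue boils down to something like this sketch (my own names; the real TreeMapImpl also has to deal with simultaneous events, which this ignores by letting equal timestamps overwrite each other):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: timestamps map to events; java.util.TreeMap is a red-black
// tree, so inserting and removing the first entry are both O(log n).
final class TreeMapQueueSketch {
    private final TreeMap<Double, String> events = new TreeMap<>();

    void schedule(double time, String event) { events.put(time, event); }

    String pollEarliest() {
        Map.Entry<Double, String> first = events.pollFirstEntry();
        return first == null ? null : first.getValue();
    }
}
```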

I also just realised that I apparently switch from Dutch to English and back every a few posts, so eh, sorry about that. I honestly didn't notice.

2011-10-27

SimPy

SimPy uses the standard binary heap that's built into Python for its FES. The actual code can be found in SimPy/Simulation.py.

2011-10-26

GTNetS

GTNetS, the Georgia Tech Network Simulator, uses an std::map scheduler for its Future Event Set (SRC/scheduler-map.{cc,h}, SC/scheduler.{cc,h}). To guarantee a deterministic ordering and to avoid problems with identical keys for different events, a tie in execution slice is broken by looking at a unique event ID to determine which of the events should be executed first.
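GTNetS itself is C++, but the tie-breaking idea translates directly to a comparator. A sketch in Java with illustrative field names:

```java
import java.util.Comparator;

// Sketch of GTNetS-style deterministic ordering: compare by timestamp
// first, then break ties with a unique event ID so no two distinct
// events ever compare equal.
final class Event {
    final double time;
    final long id;          // unique, monotonically assigned at creation

    Event(double time, long id) { this.time = time; this.id = id; }

    static final Comparator<Event> ORDER =
        Comparator.comparingDouble((Event e) -> e.time)
                  .thenComparingLong(e -> e.id);
}
```

This matters for ordered containers like std::map or a TreeMap, where two events comparing equal would silently collapse into one entry.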

SimGrid

This one was by far the hardest to study so far.

Anyway, SimGrid has the option of using NS3 and GTNets (experimentally), among others, but it also has its own simulation core.

This core, called SURF, uses a mathematical model that cannot be captured in an ADT (see among others src\surf\surf.c, in particular the function surf_solve). A generic heap (src\xbt\heap.c) is used here and there, but there are no "fancy" data structures like calendar queues to be found.

2011-10-25

JiST

JiST 1.0.6 implements both an array based heap and a Calendar queue. Both are written in Java and can be found in src/jist/runtime/Scheduler.java. Currently only the heap is actually used by the simulator; to use the Calendar queue one needs to recompile the source code after commenting out one line and uncommenting another in src/jist/runtime/Controller.java.

Ptolemy II

The Discrete Event director in Ptolemy II uses a Calendar Queue, implemented in Java like the rest of Ptolemy II. The actual calendar queue lives in ptolemy.actor.util.CalendarQueue, but it is wrapped in ptolemy.domains.de.kernel to conform to the correct interface.

OMNeT++

OMNeT++ uses a binary heap. The authors do note that another structure such as a skip list could perform better in some circumstances, but they seem fairly convinced that in the general case a binary heap is fastest.

The code is written in C/C++: a binary heap based on Handbook of Algorithms and Data Structures, Gaston H. Gonnet, pp. 273-274. It can be found in src/sim/cmsgheap.cc.

GridSim

(All of the following is based on GridSim 5.2 beta.)

GridSim uses the dated (anno 1995) SimJava2 as its DES, which in turn makes use of two different queues.

The first queue is the one future events end up in. It is implemented as a standard java.util.TreeSet.

The second queue is the one deferred events are placed in. It in turn is implemented as a java.util.LinkedList.

There is not much more to say here.

NS-3

(All of the following is based on the latest stable version, namely 3.12.)

Several queues are implemented, all in C++. As far as I can tell there is no fancy automatic switching between queue implementations or anything like that. The queue can be chosen with the --SchedulerType=MyTypeId command line switch; the default is the std::map queue.


  • Calendar Queue ( src/core/model/calendar-scheduler.{cc,h} ). An
    implementation of the original "Calendar Queues: A Fast O(1) Priority Queue
    Implementation for the Simulation Event Set Problem" by Randy Brown, with
    the original policy. Note: much slower than the author expected, probably
    due to bad resizing policies. Quite a bit slower than e.g. the std::map
    queue.

  • Heap queue ( src/core/model/heap-scheduler.{cc,h} ).
    A straightforward binary heap.

  • std::list queue ( src/core/model/list-scheduler.{cc,h} ).
    A trivial implementation.

  • std::map queue ( src/core/model/map-scheduler.{cc,h} ).
    Based on an idea in the Georgia Tech Network Simulator (GTNetS). This does
    require an absolute ordering on the keys (without duplicates).

  • NS-2 Calendar Queue (
    src/core/model/ns2-calendar-scheduler.{cc,h} ).
    The NS-2 Calendar Scheduler, which is somewhat vaguer than the NS-3 one.

  • Realtime Scheduler (
    src/core/model/realtime-simulator-impl.{cc,h} and
    src/core/model/wall-clock-synchronizer.{cc,h} ).
    See http://www.nsnam.org/docs/manual/html/realtime.html. Less relevant for
    our purposes afaik.

Evaluation of various Discrete Event Simulators

Coming up on this blog are some evaluations of several discrete event simulators and their respective choices of queueing mechanism.

2011-10-14

On this blog...

On this blog I'll describe my progress for my Research Internship 1 at CoMP concerning Evaluation and implementation of Event Queue ADTs. Some more information can be found on ESP.

Stay tuned.