You are currently browsing the daily archive for March 13, 2008.

Once you’ve fixed jcrawler so that it can follow dynamic URLs like those used in Seaside, you are ready to aim it at a Seaside application. In this post I’ll give you some pointers for getting started with jcrawler.

Out of the box, jcrawler is meant to be run from the jcrawler/dist directory, because its shell script uses relative paths. That isn’t very convenient, so here’s a bash script that lets you run jcrawler from anywhere:

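#!/bin/bash
# Set these two variables to match your installation.
export JAVA_HOME=<path to java>
export JCRAWLER_HOME=<path to jcrawler>
# Put the jcrawler classes and the HTML parser on the classpath, then start the crawler.
"$JAVA_HOME/bin/java" -cp "$JCRAWLER_HOME/:\
$JCRAWLER_HOME/lib/htmlparser.jar" com.jcrawler.Main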
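
Since jcrawler resolves its configuration relative to the directory it is started from (./conf/crawlerConfig.xml), a wrapper script may want to check for that file before launching. The following is only a sketch of such a pre-flight check; the function name have_crawler_config is made up for illustration and is not part of jcrawler.

```bash
#!/bin/bash
# Hypothetical pre-flight check; not part of jcrawler.
# Succeeds only if conf/crawlerConfig.xml exists under the given directory
# (default: the current directory).
have_crawler_config() {
  local dir="${1:-.}"
  [ -f "$dir/conf/crawlerConfig.xml" ]
}

if have_crawler_config; then
  echo "found ./conf/crawlerConfig.xml"
else
  echo "no ./conf/crawlerConfig.xml here; cd to a directory that has one" >&2
fi
```

Calling the check just before the java command keeps the "run from anywhere" script from starting the crawler in a directory that has no conf subdirectory.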

When you run jcrawler, it expects to find a crawlerConfig.xml file in a local conf directory (./conf/crawlerConfig.xml). The following is one that I have used for testing against an appliance running on our internal network. I’ve supplied four starting URLs, one for each of the GemStone Examples (randomError, rcTally, tally and serial):

<!--
    Interval (in milliseconds) to invoke a crawl thread.
    There is one HTTP hit per interval.
-->

<!--
    Interval (in milliseconds) to invoke a monitor thread.
    The monitor adds a new entry to monitor.log once per interval.
-->

<!-- HTTP connection timeout in milliseconds -->

<!-- Headers to be used by the http client crawler -->
	<header name="User-Agent">Mozilla</header>
	<header name="Cache-Control">no-cache</header>
	<header name="Accept-Language">en-us</header>

<!-- URLs to start crawling from -->

<!--
    URL patterns (regexps!!!) to allow or deny a set of URLs
    permission=true  - these patterns are allowed (anything else is denied)
    permission=false - these patterns are denied (anything else is allowed)
-->
    <url-patterns permission="true">


With an interval of 250, jcrawler runs at about 4 requests/second, which is fast enough to get started. If you don’t see fireworks at this rate, you can set the interval to zero and let jcrawler fire at will. To really hammer an application, you can launch multiple instances of jcrawler.
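
The arithmetic behind those numbers can be sketched in a few lines of bash. The helpers below are illustrative only (the function names are made up and are not part of jcrawler), and they assume that each instance issues exactly one request per interval, which a real run will only approximate.

```bash
#!/bin/bash
# Hypothetical helpers, not part of jcrawler.

# Approximate requests/second for one instance with the given interval (ms).
# An interval of 0 means "no throttle", so no rate can be computed.
requests_per_second() {
  local interval_ms="$1"
  if [ "$interval_ms" -le 0 ]; then
    echo "unthrottled"
  else
    echo $(( 1000 / interval_ms ))
  fi
}

# Approximate combined rate for several instances sharing one interval.
combined_rate() {
  local interval_ms="$1"
  local instances="$2"
  if [ "$interval_ms" -le 0 ]; then
    echo "unthrottled"
  else
    echo $(( instances * 1000 / interval_ms ))
  fi
}

requests_per_second 250   # one instance at 250 ms: prints 4
combined_rate 250 3       # three instances at 250 ms: prints 12
```

The division is integer division, so intervals that do not divide 1000 evenly are rounded down.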

Here’s a sample object log from one of my runs:

[Image: partial random error object log]

Notice that the report is dominated by entries labeled ‘Lock not acquired – retrying’. A glance at the full-width log shows that each retry is due to a ‘Session lock denied: 2075’. A session lock is denied (and the request is retried) when two requests for the same session arrive at the same time. This is not surprising, given that jcrawler uses a FIFO queue to store the URLs it scrapes from a page, and almost every URL on a single page will have the same session key. When you see errors like this in the object log, you at least know that jcrawler is firing simultaneous requests at Seaside.
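
To see why the links on one page collide, it helps to look at the session key they share. The sketch below assumes the session key travels in a `_s` query parameter; the URLs are made up for illustration, and the function names are hypothetical.

```bash
#!/bin/bash
# Print the value of the _s query parameter of a URL, or nothing if absent.
# Assumption: the session key is carried in a "_s" parameter.
session_key() {
  printf '%s\n' "$1" | sed -n 's/.*[?&]_s=\([^&]*\).*/\1/p'
}

# Count how many distinct session keys appear in a list of URLs on stdin.
distinct_sessions() {
  while IFS= read -r url; do
    session_key "$url"
  done | sort -u | wc -l | tr -d ' '
}

# Made-up URLs of the kind a single rendered page might contain.
printf '%s\n' \
  'http://example.com/seaside/examples/tally?_s=AbC123&_k=k1' \
  'http://example.com/seaside/examples/tally?_s=AbC123&_k=k2' \
  'http://example.com/seaside/examples/tally?_s=AbC123&_k=k3' \
  | distinct_sessions
```

For these three links the count is 1: every URL scraped from the page belongs to the same session, so requests issued back to back from the queue compete for the same session lock.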

As a final note, you should set deployment mode to true (using the Configuration Editor) before pointing jcrawler at your application. If you don’t, you are guaranteed to get some fireworks. While you’re in the Configuration Editor, take a look at the field for setting the Root Component. If the clear link is pressed, the root component for that application is wiped out. Until a new root component is set, you will get internal server malfunctions whenever the WADispatcher tries to launch a new instance of the application. With deployment mode off, jcrawler will eventually find its way into the Configuration Editor, and it will eventually hit the clear link.
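
The configuration file's url-patterns section describes patterns with permission=false as denied. As a rough way to preview what such a pattern would block, the sketch below filters a list of URLs with grep -E. Everything specific here is an assumption: the pattern /seaside/config is a hypothetical stand-in for wherever the Configuration Editor is served, the URLs are made up, and grep's POSIX extended regexps (matched anywhere in the line) only approximate whatever matching jcrawler itself performs.

```bash
#!/bin/bash
# Print the URLs from stdin that a deny pattern would block.
# grep -E only approximates jcrawler's own regexp matching.
denied_urls() {
  grep -E -- "$1" || true   # no match is not an error here
}

# Hypothetical deny pattern and made-up URLs.
deny='/seaside/config'
printf '%s\n' \
  'http://example.com/seaside/examples/tally' \
  'http://example.com/seaside/config?_s=AbC123' \
  | denied_urls "$deny"
```

Only the second URL is printed. A preview like this does not replace setting deployment mode to true; it just shows which scraped links a given pattern would catch.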

You are now ready to launch jcrawler at the WATally application and see how our Simple Persistence example fares under load.
