Saturday, January 30, 2010

Hacking search suggestions

The OpenSearch specification has an extension for search suggestions, and the major search providers expose their own suggestion entry points. After a little digging of my own, I found the entry points for the three major providers:

Bing - http://api.search.live.com/osjson.aspx?query={Search Term}
Google - http://suggestqueries.google.com/complete/search?q={Search Term}&client=firefox
Yahoo - http://ff.search.yahoo.com/gossip?output=fxjson&command={searchTerms}

For the Google entry point, removing the client parameter makes the response also include the number of results, although that format does not follow the OpenSearch standard.
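To experiment with these entry points, here is a minimal Java sketch that percent-encodes the search term and builds the request URL. The endpoints are the historical ones listed above and may have changed since; the class and method names are my own:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SuggestUrl {
    // Endpoint templates from the list above (historical; may have changed).
    static final String GOOGLE = "http://suggestqueries.google.com/complete/search?client=firefox&q=";
    static final String BING   = "http://api.search.live.com/osjson.aspx?query=";
    static final String YAHOO  = "http://ff.search.yahoo.com/gossip?output=fxjson&command=";

    /* Build a suggestion URL with the search term percent-encoded. */
    static String suggestUrl(String base, String term) throws UnsupportedEncodingException {
        return base + URLEncoder.encode(term, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Prints the Google suggestion URL for the query "hello world".
        System.out.println(suggestUrl(GOOGLE, "hello world"));
    }
}
```

Fetching the URL and parsing the JSON array response is then straightforward with any HTTP client.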

Who is more open? Google or Yahoo or Bing?

While looking at the search result page HTML from Google, Yahoo and Bing, I discovered that Google does not add OpenSearch auto-discovery to its result page, whereas Yahoo and Bing do.

That makes me wonder why Yahoo and Bing do not get as much credit as Google for adopting open technologies and standards.

Friday, January 22, 2010

http request pipeline in Erlang

I tried to use Erlang's http module for highly concurrent requests, but it did not perform well due to pipelining and persistent-connection issues. These seem to be fixed in R13. I figured out how to use http profiles to enable pipelining/persistent connections selectively, for one server but not for others [useful when the application sends requests to multiple hosts].

The first step is to create a new http profile. This can be done in two ways. The first is to run a stand-alone http connection manager (httpc_manager).


{ok, Pid} = inets:start( httpc, [{profile, other}] ).


As per the documentation, this is not desirable, as all benefits of the OTP framework are lost:

Dynamically started services will not be handled by application takeover and failover behavior when inets is run as a distributed application. Nor will they be automatically restarted when the inets application is restarted, but as long as the inets application is up and running they will be supervised and may be soft code upgraded. Services started as stand_alone, e.i. the service is not started as part of the inets application, will lose all OTP application benefits such as soft upgrade. The "stand_alone-service" will be linked to the process that started it. In most cases some of the supervision functionality will still be in place and in some sense the calling process has now become the top supervisor


The second method is to run it as part of the inets application via a configuration file.

Create a config file (say inets.config) with the following content:

[{inets,
  [{services, [{httpc, [{profile, server1}]},
               {httpc, [{profile, server2}]}]}]
}].


Then run the Erlang shell as:


erl -config inets.config


This will start three http profiles [server1, server2 and the default].

Now the question is how to use the newly created profiles. Say the application uses two web services, hosted at foo1.example.com and foo2.example.com. The service at foo1.example.com runs on a web server that can support a large number of persistent [keep-alive] connections, while the service at foo2.example.com runs on an ordinary web server that is not optimized for many persistent connections.

In the application, use the server1 profile for connections to foo1.example.com. This can be done by changing the http options listed here.


http:set_options([{max_sessions, 20}, {pipeline_timeout, 20000}], server1).


NOTE: Setting the pipeline timeout is required in order to enable http pipelining.

The profile can also be specified at request time.


http:request( "http://foo1.example.com/v1/get_info/dudefrommangalore", server1).


Neither httpc_manager nor inets provides an interface to find out how many sessions are open to a server. The good news is that the session information is kept in an ets table, which can be queried to get the list of persistent connections.


ets:tab2list(httpc_manager_server1_session_db).


The output looks something like:

{tcp_session,{{"foo1.example.com",80},
              <0.103.0>},
             false,http,#Port<0.1032>,...}


<0.103.0> is the pid of an httpc_handler gen_server process. Its status can be inspected via the standard OTP sys module.


sys:get_status(erlang:list_to_pid("<0.103.0>")).


It is also possible to list the pipelined requests on each persistent connection. For that, get the pid of the httpc_manager via inets:services_info(). This call returns the pid of each httpc_manager:


[{httpc,<0.52.0>,[{profile,server1}]},
{httpc,<0.53.0>,[{profile,server2}]},
{httpc,<0.41.0>,[{profile,default}]}]


From that pid, get the status of the httpc_manager gen_server process.


sys:get_status(erlang:list_to_pid( "<0.52.0>")).


The id of the handler ets table (24596 in this output) appears inside the state record.


15> sys:get_status(erlang:list_to_pid("<0.52.0>")).
{status,<0.52.0>,
{module,gen_server},
[[{'$ancestors',[httpc_profile_sup,httpc_sup,inets_sup,
<0.36.0>]},
{'$initial_call',{httpc_manager,init,1}}],
running,<0.40.0>,[],
[httpc_manager_server1,
{state,[],24596,
{undefined,28693},
httpc_manager_server1_session_db,httpc_manager_server1,
{options,{undefined,[]},
0,2,5,120000,2,disabled,false,inet,default,...}},
httpc_manager,infinity]]}


Dump that ets table to see the pipelined connections:


ets:tab2list(24596).


With these options, an application can tune its http settings to utilize the network bandwidth better and get the most out of the machine and the network.

Tuesday, January 19, 2010

Erlang process mailbox performance

I came across this performance issue in Erlang while doing pattern matching against the mailbox [a.k.a. selective message processing]. Here is the original code:


-module(perf).

-export([start/0]).

start() ->
    S = erlang:now(),
    Pids = spawn_n(fun test/1, 10000, []),
    wait(Pids),
    E = erlang:now(),
    io:format("Total time: ~p~n", [timer:now_diff(E, S)/1000]).

spawn_n(_F, 0, Acc) -> Acc;
spawn_n(F, N, Acc) ->
    Me = self(),
    Pid = spawn(fun() -> F(Me) end),
    spawn_n(F, N-1, [Pid|Acc]).

test(Pid) -> Pid ! {self(), ok}.

wait([]) -> ok;
wait([Pid|Pids]) ->
    receive {Pid, ok} -> ok end,
    wait(Pids).


The run time for perf:start() was about 1.4 seconds:

Erlang (BEAM) emulator version 5.6.5 [source] [smp:2] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.6.5 (abort with ^G)
1> perf:start().
Total time: 1368.038
ok
2>

Now I changed wait(Pids) to wait(lists:reverse(Pids)). After this change, the run time for perf:start() was 83 milliseconds. The reason: spawn_n/3 builds the pid list in reverse spawn order, so wait/1 first waits for the last process spawned. Its message tends to sit at the back of the mailbox, and each selective receive must scan past all the earlier messages, giving quadratic behavior. Reversing the list makes the receive order roughly match the arrival order, so each receive matches near the front of the mailbox.

1> perf:start().
Total time: 83.037
ok

A ~16x improvement just by changing the order of the mailbox scan.

Little things like this are usually overlooked and the language is blamed for the performance issues.

Monday, January 18, 2010

Concurrency in Java - Part 2

In the earlier post I covered the basic cached thread pool.

Another facility offered by Java's concurrency framework is scheduling a task after a certain delay or at a regular interval (like a standard Unix cron job). There are two ways to schedule a task at a regular interval. The first runs the task at a fixed rate, regardless of whether the previous run has finished. The second runs the task, waits for it to finish, and then waits a fixed interval before running it again.

The second is helpful in situations like crawlers, where pages must be downloaded with some politeness factor [wait for some time before downloading another page from the same website].


ScheduledExecutorService executionService = Executors.newScheduledThreadPool(2);


The code snippet above creates a pool of two threads. This service can schedule tasks to run later or at a regular interval.

ScheduledExecutorService provides a schedule method:

Runnable task = new Runnable() {
    public void run() {
        System.out.println("I am responsible for downloading a page");
    }
};
/* TimeUnit is defined within the java.util.concurrent package */
Future future = executionService.schedule(task, 2000, TimeUnit.MILLISECONDS);


The code above schedules a task to run after 2 seconds [2000 milliseconds]. The schedule method returns a future object, which can be used to check the status of the task [the isDone method] or to cancel it [the cancel method]. Read more about the future object and its methods here.
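To see isDone and cancel in action, here is a small self-contained sketch (the class name is mine; it schedules a task far enough out that we can cancel it before it runs):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CancelDemo {
    public static void main(String[] args) {
        ScheduledExecutorService svc = Executors.newScheduledThreadPool(1);

        // Schedule far enough in the future that we can cancel before it runs.
        Future future = svc.schedule(new Runnable() {
            public void run() {
                System.out.println("this never runs");
            }
        }, 10, TimeUnit.SECONDS);

        System.out.println(future.isDone());      // false: still waiting to run
        future.cancel(false);                     // false = don't interrupt if already running
        System.out.println(future.isCancelled()); // true
        System.out.println(future.isDone());      // true: a cancelled task counts as done

        svc.shutdownNow();                        // discard anything still queued
    }
}
```

Note that cancelling a task that has not started yet simply removes it from the queue; the run method is never invoked.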

ScheduledExecutorService also supports repeated execution of a task via scheduleAtFixedRate and scheduleWithFixedDelay.

The scheduleAtFixedRate method schedules the task at a regular interval. It does not check whether the previously scheduled run has finished.

The scheduleWithFixedDelay method is similar to scheduleAtFixedRate, except that it waits for the previous execution to finish, then waits the fixed interval, and only then schedules the task again.
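The crawler politeness case above can be sketched with scheduleWithFixedDelay. This is a toy example under my own names and timings (a 100 ms pretend "download" and a 500 ms politeness delay), not a real crawler:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoliteScheduler {
    /* Run a pretend "download" repeatedly with a fixed delay between runs,
       for roughly windowMillis, and return how many runs started. */
    static int crawlFor(long windowMillis) throws InterruptedException {
        ScheduledExecutorService svc = Executors.newScheduledThreadPool(1);
        final AtomicInteger runs = new AtomicInteger();

        Runnable fetch = new Runnable() {
            public void run() {
                runs.incrementAndGet();
                // Pretend each download takes 100 ms.
                try { Thread.sleep(100); } catch (InterruptedException e) { }
            }
        };

        // The 500 ms delay is measured from the END of the previous run, so
        // consecutive downloads start at least ~600 ms apart (politeness).
        ScheduledFuture handle =
                svc.scheduleWithFixedDelay(fetch, 0, 500, TimeUnit.MILLISECONDS);

        Thread.sleep(windowMillis);  // let a few iterations go by
        handle.cancel(true);         // repeating tasks run until cancelled
        svc.shutdownNow();
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("downloads started in 2s: " + crawlFor(2000));
    }
}
```

With scheduleAtFixedRate instead, the 500 ms period would be measured from the start of each run, so a slow download could cause back-to-back requests.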

Sunday, January 17, 2010

Concurrency in Java

Java 5 introduced the concurrency framework, which makes a developer's life easier when running multiple threads. The framework also takes care of thread caching, thus reducing the number of threads spawned in the system.

When threads were introduced in operating systems, they were considered light-weight processes. As CPU clock speeds and the number of CPU cores available to programs keep increasing, even these light-weight processes are deemed too costly to start, hence the concept of cached threads. Erlang solves this problem with ultra-light-weight processes: millions of them can be spawned within a few seconds. That is not the case with kernel threads; even when kernel threads are reused, there is a context-switching cost to move them from the wait state to the run state.

I am new to the Java concurrency framework, so I am taking baby steps to learn the classes available in Java 5. All the concurrency-related classes are in the java.util.concurrent package.

The first step is the Executors class, which has several static factory methods for creating thread pools.


import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;

ExecutorService threadPool = Executors.newCachedThreadPool();


The code above creates a cached thread pool. Its behavior is documented here:

Creates a thread pool that creates new threads as needed, but will reuse previously constructed threads when they are available. These pools will typically improve the performance of programs that execute many short-lived asynchronous tasks. Calls to execute will reuse previously constructed threads if available. If no existing thread is available, a new thread will be created and added to the pool. Threads that have not been used for sixty seconds are terminated and removed from the cache. Thus, a pool that remains idle for long enough will not consume any resources. Note that pools with similar properties but different details (for example, timeout parameters) may be created using ThreadPoolExecutor constructors


A task can be submitted to the newly created thread pool for execution. A task must be an instance of the Runnable interface. Submitting it to the ExecutorService returns an instance of the Future interface, a wrapper around the submitted task that can be used to check the task for completion or to cancel it.



Future task = threadPool.submit(new Runnable() {
    public void run() {
        System.out.println("Hello world from within the thread pool");
    }
});


Finally, wait for the task to complete:


while (!task.isDone()) {
    // busy-wait; task.get() is the idiomatic blocking alternative
}


Once the task has completed and the thread pool is no longer necessary, shut the pool down to terminate all the threads it created:


threadPool.shutdown();
while (!threadPool.isTerminated()) {
    // or block with threadPool.awaitTermination(timeout, unit) instead of spinning
}


There it is: the first Hello World using the concurrent thread pool in Java. As and when I learn new methods in this framework, I will write about them here.



Until then, happy thread pooling and utilizing all the cores on the system.

Update: Download the source from here
