[Mpiblast-users] blast in 1 day but could not get mpiblast done even in 10 days for the same dataset

ialam

2007-03-08 13:27:59 UTC

Hi Aron,

I wanted to have the benchmark dataset so that I could test mpiblast
performance. Could you please point me to the dataset. In the meantime I am
trying to get mpich running on the cluster.

Many Thanks,

Intikhab
----- Original Message -----
From: "intikhab alam" <***@cs.man.ac.uk>
To: "Aaron Darling" <***@cs.wisc.edu>
Sent: Friday, March 02, 2007 1:20 PM
Subject: Re: [Mpiblast-users] blast in 1 day but could not get mpiblast done
even in 10 days for the same dataset

Hi Aron,
I would like to try out the benchmark dataset, could you point me from
where I could download this?
Intikhab
----- Original Message -----
Sent: Friday, March 02, 2007 6:21 AM
Subject: Re: [Mpiblast-users] blast in 1 day but could not get
mpiblast done even in 10 days for the same dataset
: It sounds like there must be something causing an mpiblast-specific
: communications bottleneck in your system. Anybody else have ideas
: here? If you're keen to verify that, you could run mpiblast on the
: benchmark dataset we were using on Green Destiny and compare
runtimes.
: My latest benchmark data set (dated June 2005) has a runtime of
about 16
: minutes for 64 nodes to search the 300K erwinia query set against
the
: first 14GB of nt using blastn. Each compute node in that machine
was a
: 667Mhz transmeta chip, 640MB ram, connected via 100Mbit ethernet. I
was
: using mpich2-1.0.1, no SCore. Based on paper specs, your cluster
should
: be quicker than that.
: On the other hand, if you've got wild amounts of load imbalance,
: --db-replicate-count=5 may not be enough, and 41 may prove ideal
(where
: 41 = the number of nodes in your cluster). In that case, mpiblast
will
: have effectively copied the entire database to each node, totally
: factoring out load imbalance from the compute time equation. Your
: database is much smaller than each node's core memory, and a single
: fragment is probably much larger than each node's CPU cache, so I
can't
: think of a good reason not to fully distribute the database, apart
from
: the time it takes to copy DB fragments around.
: In any case, keep me posted if you discover anything.
: -Aaron
: > Hi Aaron,
: >
: >
: > --db-replicate-count=5
: >
: > assuming it may help reach the 24hrs mark to complete the job.
: > However, I see that only 6% of the (total estimated) output has
been
: > generated until now(i.e after 4 days (4*24 hrs). If I continue
this
: > way, my mpiblast would finish in 64 days. Any other suggestion to
: > improve the running time?
: >
: > Intikhab
: > ----- Original Message -----
: > Sent: Wednesday, February 21, 2007 1:33 AM
: > Subject: Re: [Mpiblast-users] blast in 1 day but could not get
: > mpiblast done even in 10 days for the same dataset
: >
: >
: > : Hi Intikhab...
: > : > : can take a long time to compute the effective search space
: > required
: > : > for
: > : > : exact e-value calculation. If that's the problem, then you
: > would
: > : > find
: > : > : just one mpiblast process consuming 100% cpu on the rank 0
node
: > for
: > : > : hours or days, without any output.
: > : >
: > : > Is the effective search space calculation done on the master
node?
: > If
: > : > yes, this mpiblast job stayed at the master node for some
hours
: > and
: > : > then all the compute nodes got busy with >90% usage all the
time
: > with
: > : > continued output being generated until the 12th day when I
killed
: > the
: > : > job.
: > : >
: > : yes, the search space calculation is done on the master node and
it
: > : sounds like using the --fast-evalue-approximation command-line
: > switch
: > : would save you a few hours, which is pretty small compared to
the
: > weeks
: > : or months that the rest of the search is taking.
: > : > : The more likely limiting factor is load imbalance on the
: > cluster.
: > : >
: > : >
: > : > In this case, do you think the job should finish on some nodes
: > earliar
: > : > than others? In my case job was running on all the nodes with

90%

: > : > usage and the last output I got was on the last day when I
killed
: > the
: > : > job.
: > : >
: > : It's possible the other nodes may continue running mpiblast
workers
: > : which are waiting to send results back to the mpiblast writer
: > process.
: > : > : If some database fragments happen to have a large number of
hits
: > and
: > : > : others have few, and the database is distributed as one
fragment
: > per
: > : > : node, then the computation may be heavily imbalanced and may
run
: > : > quite
: > : > : slowly. CPU consumption as given by a CPU monitoring tool
may
: > not
: > : > be
: > : > : indicative of useful work being done on the nodes since
workers
: > can
: > : > do a
: > : > : timed spin-wait for new work.
: > : > : I can suggest two avenues to achieve better load balance
with
: > : > mpiblast
: > : > : 1.4.0. First, partition the database into more fragments,
: > possibly
: > : > two
: > : > : or three times as many as you currently have. Second, use
the
: > : >
: > : > You mean more fragments that inturn means to use more nodes?
: > Actually
: > : > at our cluster not more than 44 nodes are allowed for the
parallel
: > : > jobs.
: > : >
: > : no, it's not necessary to run on more nodes when creating more
: > : fragments. mpiblast 1.4.0 needs at least as many fragments as
nodes
: > : when --db-replicate-count=1 (the default value).
: > : when there are more fragments than nodes, mpiblast will happily
: > : distribute the extra fragments among the nodes.
: > : > : --db-replicate-count option to mpiblast. The default value
for
: > the
: > : > : db-replicate-count is 1, which indicates that mpiblast will
: > : > distribute a
: > : > : single copy of your database across worker nodes. For your
: > setup,
: > : > each
: > : > : node was probably getting a single fragment. By setting
: > : >
: > : >
: > : > Is it not right if each single node gets a single fragment of
the
: > : > target database (the number of nodes assigned for mpiblast =
: > number of
: > : > fragments+2) so that the whole query dataset could be searched
: > against
: > : > the fragment (effective search space calculation being done
before
: > : > starting the search for blast comparable evalues) on each
single
: > node?
: > : >
: > : the search space calculation happens on the rank 0 process and
: > totally
: > : unrelated to the number of nodes and number of DB fragments.
The
: > most
: > : basic mpiblast setup has one fragment per node, but when
: > load-balancing
: > : is desirable, as in your case, mpiblast can be configured to use
: > : multiple fragments per node. This will not affect the e-value
: > calculation.
: > : >
: > : > : --db-replicate-count to something like 5, each fragment
would be
: > : > copied
: > : > : to five different compute nodes, and thus five nodes would
be
: > : > available
: > : > : to search fragments that happen to have lots of hits. In
the
: > : > extreme
: > : >
: > : > You mean this way nodes would be busy searching the query
dataset
: > : > against the same fragment on 5 compute nodes? Is this just a
way
: > to
: > : > keep the nodes busy until all the nodes complete the searches?
: > : >
: > : Yes, this will balance the load and will probably speed up your
: > search.
: > : > : case you could set --db-replicate-count equal to the number
of
: > : > : fragments, which would be fine if per-node memory and disk
space
: > is
: > : > : substantially larger than the total size of the formatted
: > database.
: > : >
: > : > Is it possible in mpiblast that for cases where the size of
the
: > query
: > : > dataset is equal to the size of target dataset, the query
dataset
: > : > should be fragmented, the target dataset should be kept in the
: > : > global/shared area and searches are done on single nodes (the
: > number
: > : > of nodes equal to the number of query dataset fragments) and
this
: > way
: > : > there would be no need to calculate the effective search space
as
: > all
: > : > the search jobs get the same size of the target dataset? by
: > following
: > : > this way I managed to complete this job using standard blast
in <
: > : > 24hrs.
: > : >
: > : The parallelization approach you describe is perfectly
reasonable
: > when
: > : the total database size is less than core memory size on each
node.
: > : With a properly configured --db-replicate-count, I would guess
that
: > : mpiblast could approach the 24 hour mark, although may take
slightly
: > : longer since there are various overheads involved with copying
of
: > : fragments and serial computation of the effective search space.
: > : > : In your particular situation, it may also help to randomize
the
: > : > order of
: > : > : sequences in the database to minimize "fragment hotspots"
which
: > : > could
: > : > : result from a database self-search.
: > : >
: > : > I did not get the "fragment hotspots" bit here. By randomizing
the
: > : > order of sequence you mean each node would possibly take
similar
: > time
: > : > to finish the searches? Otherwise it could be possible that
the
: > number
: > : > of hits could be lower for some fragments than others and this
: > ends up
: > : > in different times for the job completion on different nodes?
: > : >
: > : Right, the goal is to get the per-fragment search time more
balanced
: > : through randomization. But after thinking about it a bit more,
i'm
: > not
: > : sure just how much this would save....
: > : >
: > : > : At the moment mpiblast doesn't have
: > : > : code to accomplish such a feat, but I think others (Jason
Gans?)
: > : > have
: > : > : written code for this in the past.
: > : >
: > : > Aaron, do you think Score based mpi communication may be
delaying
: > the
: > : > overall time in running mpiblast searches?
: > : >
: > : It's possible.
: > : The interprocess communication in 1.4.0 was fine-tuned for
default
: > : mpich2 1.0.2 and lam/mpi implementations. We use various
: > combinations
: > : of the non-blocking MPI_Issend(), MPI_Irecv(), and the blocking
: > : send/recv api in mpiblast 1.4.0. I have no idea how it would
: > interact
: > : with SCore.
: > : -Aaron
: >

Aaron Darling

2007-03-08 13:38:03 UTC

Permalink

The query test set is on the mpiblast download archive:
http://www.mpiblast.org/Downloads.Archive.html

Specifically, you're after the 300kb of e. chrysanthemi predicted ORFs:
http://www.mpiblast.org/downloads/files/e.chrysanthemi.fas

As for the nt database, you'll have to download it from NCBI:
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
and siphon off the first 14GB (uncompressed) with dd or something
similar. It may not be identical to what I used in 2005, but it should
be close enough for a cursory runtime check. For extra points, try
using mpiformatdb's ability to read uncompressed fasta databases from
stdin. that should allow you to build a series of unix pipes that saves
plenty of disk I/O.

-Aaron

Post by ialam
Hi Aron,
I wanted to have the benchmark dataset so that I could test mpiblast
performance. Could you please point me to the dataset. In the meantime
I am trying to get mpich running on the cluster.
Many Thanks,
Intikhab
Sent: Friday, March 02, 2007 1:20 PM
Subject: Re: [Mpiblast-users] blast in 1 day but could not get
mpiblast done even in 10 days for the same dataset

Post by ialam
Hi Aron,
I would like to try out the benchmark dataset, could you point me from
where I could download this?
Intikhab
Sent: Friday, March 02, 2007 6:21 AM
Subject: Re: [Mpiblast-users] blast in 1 day but could not get
mpiblast done even in 10 days for the same dataset
: It sounds like there must be something causing an mpiblast-specific
: communications bottleneck in your system. Anybody else have ideas
: here? If you're keen to verify that, you could run mpiblast on the
: benchmark dataset we were using on Green Destiny and compare
runtimes.
: My latest benchmark data set (dated June 2005) has a runtime of
about 16
: minutes for 64 nodes to search the 300K erwinia query set against
the
: first 14GB of nt using blastn. Each compute node in that machine
was a
: 667Mhz transmeta chip, 640MB ram, connected via 100Mbit ethernet. I
was
: using mpich2-1.0.1, no SCore. Based on paper specs, your cluster
should
: be quicker than that.
: On the other hand, if you've got wild amounts of load imbalance,
: --db-replicate-count=5 may not be enough, and 41 may prove ideal
(where
: 41 = the number of nodes in your cluster). In that case, mpiblast
will
: have effectively copied the entire database to each node, totally
: factoring out load imbalance from the compute time equation. Your
: database is much smaller than each node's core memory, and a single
: fragment is probably much larger than each node's CPU cache, so I
can't
: think of a good reason not to fully distribute the database, apart
from
: the time it takes to copy DB fragments around.
: In any case, keep me posted if you discover anything.
: -Aaron
: > Hi Aaron,
: >
: >
: > --db-replicate-count=5
: >
: > assuming it may help reach the 24hrs mark to complete the job.
: > However, I see that only 6% of the (total estimated) output has
been
: > generated until now(i.e after 4 days (4*24 hrs). If I continue
this
: > way, my mpiblast would finish in 64 days. Any other suggestion to
: > improve the running time?
: >
: > Intikhab
: > ----- Original Message ----- : > From: "Aaron Darling"
: > Sent: Wednesday, February 21, 2007 1:33 AM
: > Subject: Re: [Mpiblast-users] blast in 1 day but could not get
: > mpiblast done even in 10 days for the same dataset
: >
: >
: > : Hi Intikhab...
: > : > : can take a long time to compute the effective search space
: > required
: > : > for
: > : > : exact e-value calculation. If that's the problem, then you
: > would
: > : > find
: > : > : just one mpiblast process consuming 100% cpu on the rank 0
node
: > for
: > : > : hours or days, without any output.
: > : >
: > : > Is the effective search space calculation done on the master
node?
: > If
: > : > yes, this mpiblast job stayed at the master node for some
hours
: > and
: > : > then all the compute nodes got busy with >90% usage all the
time
: > with
: > : > continued output being generated until the 12th day when I
killed
: > the
: > : > job.
: > : >
: > : yes, the search space calculation is done on the master node and
it
: > : sounds like using the --fast-evalue-approximation command-line
: > switch
: > : would save you a few hours, which is pretty small compared to
the
: > weeks
: > : or months that the rest of the search is taking.
: > : > : The more likely limiting factor is load imbalance on the
: > cluster.
: > : >
: > : >
: > : > In this case, do you think the job should finish on some nodes
: > earliar
: > : > than others? In my case job was running on all the nodes with

90%