What is BFD?
How many times have you seen a message like the following?
ERROR
Requested document (URL http://j.random.host/file.html)
could not be accessed. The information server either is not
accessible or is refusing to serve the document to you.
The motivation behind BFD is to make it so that you will never
(well, hardly ever) see that message!
BFD stands for Bulk File Distribution. It is a system for
transparently mirroring files between cooperating file servers,
and keeping track of where the mirrored copies are.
The locations of the mirrored copies are stored in a distributed
database which is accessible from a BFD-aware WorldWideWeb (W3) browser.
With such a browser, when you click on a phrase or icon that links to a
particular web page, the client will consult the BFD location database
to see if it knows about any mirrored copies of that page. If so, the
client will access the page from one of the mirrored servers. (If the
first mirrored server is unavailable, it will try the second one, and
so on, until it finds one that works.) If all else fails, the browser
will attempt to fetch the page from the primary server.
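The lookup order described above can be sketched as follows. This is only a minimal illustration, not the BFD client itself: the function name `fetch_with_fallback`, the `fetch` callable, and the `mirror*.example` hosts are hypothetical, and a failed server is assumed to raise `OSError`.

```python
from typing import Callable, Iterable, Optional


def fetch_with_fallback(
    primary_url: str,
    mirror_urls: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Optional[bytes]:
    """Try each mirror in order, then fall back to the primary server.

    `fetch` is a hypothetical callable that returns the page bytes, or
    raises OSError when a server is unreachable or refuses the request.
    """
    for url in list(mirror_urls) + [primary_url]:
        try:
            return fetch(url)
        except OSError:
            continue  # this server failed; move on to the next candidate
    return None  # every mirror and the primary server failed


# Stand-in fetcher for demonstration: only the second mirror is "up".
def demo_fetch(url: str) -> bytes:
    if url == "http://mirror2.example/file.html":
        return b"mirrored copy"
    raise OSError("server unreachable: " + url)


result = fetch_with_fallback(
    "http://j.random.host/file.html",
    ["http://mirror1.example/file.html", "http://mirror2.example/file.html"],
    demo_fetch,
)
```

Here the first mirror fails, the second one answers, and the primary server is never contacted.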
But what if the mirrored copy is out-of-date?
BFD is designed to make that unlikely. If the software used to mirror
one file server to another is also BFD-aware, that software will
update the BFD location database every time it copies a new file.
Also, the browser can find out the date at which the mirrored file was
copied from the primary server, and display that to the user.
Finally, for files that are published using BFD, BFD can
provide reasonable assurances (to within seconds) that the file is
current, and also allow the client to check the integrity of the file
(so you will know that the file isn't corrupted).
Won't BFD slow things down?
In some cases, accessing a file via BFD and a mirror server may be
slower than it would have been to access the file from the location in
its URL. However, the really annoying cases, where you have to
wait several seconds only to find out that a file isn't available,
shouldn't happen nearly as often.
What's more, with BFD it is possible to have a popular web page
mirrored on dozens of servers all around the world. The browser can
then decide (using heuristics) which mirror server is the closest to
you, or which one is most likely to be available, and try that one
first. If the browser can fetch a file from a nearby site,
it's almost certainly faster than fetching it from a server halfway
across the planet.
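The text does not say which heuristics a browser would use, so the following is just one possible sketch. It assumes the browser keeps a small record of past successes, failures, and round-trip times per mirror (the `MirrorStats` type and the scoring rule are hypothetical), and ranks mirrors by round-trip time divided by estimated availability.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MirrorStats:
    url: str
    successes: int     # past requests that succeeded
    failures: int      # past requests that failed
    avg_rtt_ms: float  # average round-trip time observed so far


def rank_mirrors(stats: List[MirrorStats]) -> List[str]:
    """Return mirror URLs ordered from most to least promising."""

    def expected_cost(m: MirrorStats) -> float:
        total = m.successes + m.failures
        # Smoothed availability estimate, so an untried mirror is
        # treated as 50% available instead of dividing by zero.
        availability = (m.successes + 1) / (total + 2)
        return m.avg_rtt_ms / availability

    return [m.url for m in sorted(stats, key=expected_cost)]


ranking = rank_mirrors([
    MirrorStats("http://far.example/", successes=10, failures=0, avg_rtt_ms=200.0),
    MirrorStats("http://flaky.example/", successes=1, failures=9, avg_rtt_ms=20.0),
    MirrorStats("http://near.example/", successes=10, failures=0, avg_rtt_ms=20.0),
])
```

With these made-up numbers the nearby reliable mirror is tried first, then the nearby but flaky one, and the distant one last.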
Finally, with BFD, a popular web site can scale dramatically, by
mirroring its files on several machines which share the load.
How does BFD work?
There are several components to BFD, including:
- Name (URN) Servers
- Location (URL) Servers
- Replication Daemons
- Collection Managers
- File Servers
- Clients
- Other tools
Each of these will be described in turn.
In our implementation, the name (URN) server and the location (URL)
server are combined into the
Resource Catalog (RC) Server.
Name (URN) Servers
A name server manages a database of meta-information about
files that are managed by BFD. Each file managed by BFD is assigned a
Uniform Resource Name (URN) which can be used to refer to the
file. Given a URN, a Web browser can query a name server for that URN
and find out some information about that file.
Among the information stored for each file is a
Location-Independent File Name (LIFN). A LIFN is a name for
a specific sequence of bytes that corresponds to the current
version of the file. Other pieces of information about the file which
might be available from the URN server would include the file's
``catalog information'' (title, author, description, etc.), as well as
``instance information'' such as content-type, size, MD5 fingerprint,
and a cryptographically signed certificate of authenticity.
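As an illustration of how a client might use the MD5 fingerprint from the instance information, here is a minimal sketch using Python's standard `hashlib`. The helper names `md5_fingerprint` and `is_intact` are hypothetical, and the sketch assumes the fingerprint is recorded as a hexadecimal string.

```python
import hashlib


def md5_fingerprint(data: bytes) -> str:
    """Return the MD5 digest of `data` as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def is_intact(data: bytes, expected_fingerprint: str) -> bool:
    """Check a retrieved copy against the fingerprint from the name server."""
    return md5_fingerprint(data) == expected_fingerprint.strip().lower()


# The publisher records the fingerprint once; any mirror's copy can
# later be checked against it.
original = b"contents of the published file"
recorded = md5_fingerprint(original)
```

A copy that differs by even one byte from the published file produces a different fingerprint, so `is_intact` returns `False` for it.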
Name servers can also collect statistics about which URNs are
requested most often, and when, to aid the collection manager in
knowing which files to acquire, which ones to keep on-hand, and which
ones to reap.
Location (URL) Servers
A location server manages a database of LIFN-to-location
(URL) bindings. Every time a file server makes a new file available
via BFD, it informs a location server. If a Web client subsequently
asks for locations for a particular LIFN, the location server will
then respond with a list of URLs where that file can be found.
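The LIFN-to-location bindings can be pictured as a simple mapping. The sketch below is an in-memory stand-in, not the real server or its protocol: the `LocationServer` class, its method names, and the `lifn:example:...` syntax are all assumptions made for illustration.

```python
from typing import Dict, List


class LocationServer:
    """In-memory table of LIFN-to-URL bindings."""

    def __init__(self) -> None:
        self._bindings: Dict[str, List[str]] = {}

    def register(self, lifn: str, url: str) -> None:
        """Called by a file server when it makes a copy available."""
        urls = self._bindings.setdefault(lifn, [])
        if url not in urls:
            urls.append(url)

    def unregister(self, lifn: str, url: str) -> None:
        """Called when a copy is removed from a file server."""
        urls = self._bindings.get(lifn, [])
        if url in urls:
            urls.remove(url)

    def locate(self, lifn: str) -> List[str]:
        """Return every known URL for the file named by `lifn`."""
        return list(self._bindings.get(lifn, []))


server = LocationServer()
server.register("lifn:example:0001", "http://j.random.host/file.html")
server.register("lifn:example:0001", "http://mirror1.example/file.html")
found = server.locate("lifn:example:0001")
```

A client would pass the returned list to whatever logic it uses to pick a server.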
NOTE: The distinction between a URN and a LIFN is subtle but
important. A URN is a long-lasting name that can be used by humans to
refer to some network-accessible resource: be it a Web page, a telnet
session, or a MUD. However, the exact contents of the resource named
by the URN can change. Accessing a URN for ``today's newspaper''
would give you different results today than yesterday.
The use of LIFNs is more restricted. First of all, a LIFN can only
refer to a file; it cannot be used to name other kinds of
network-accessible resources. Second, once a LIFN has been assigned to a
specific sequence of bytes, that LIFN cannot be used to name any other
sequence of bytes. Finally, LIFNs are not really intended to be used
by humans (though this might happen occasionally); they were created
so that the various components of BFD could all have a common,
unambiguous name for every file managed by BFD.
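The difference between the two kinds of name can be made concrete with a small sketch. It assumes two hypothetical in-memory tables: one that lets a URN be re-pointed at a new LIFN whenever the content changes, and one that refuses to bind an existing LIFN to different bytes. The class names and the `urn:example:` and `lifn:example:` syntaxes are illustrative only.

```python
from typing import Dict


class LifnRegistry:
    """Binds each LIFN to exactly one byte sequence, permanently."""

    def __init__(self) -> None:
        self._content: Dict[str, bytes] = {}

    def assign(self, lifn: str, data: bytes) -> None:
        existing = self._content.get(lifn)
        if existing is not None and existing != data:
            raise ValueError("LIFN already names different bytes: " + lifn)
        self._content[lifn] = data

    def content(self, lifn: str) -> bytes:
        return self._content[lifn]


class NameServer:
    """Maps a long-lasting URN to the LIFN of its current version."""

    def __init__(self) -> None:
        self._current: Dict[str, str] = {}

    def publish(self, urn: str, lifn: str) -> None:
        self._current[urn] = lifn  # re-pointing a URN is allowed

    def resolve(self, urn: str) -> str:
        return self._current[urn]


lifns = LifnRegistry()
names = NameServer()

# Yesterday's edition gets one LIFN ...
lifns.assign("lifn:example:0001", b"Monday's edition")
names.publish("urn:example:todays-newspaper", "lifn:example:0001")
monday = names.resolve("urn:example:todays-newspaper")

# ... and today's edition gets a new one, under the same URN.
lifns.assign("lifn:example:0002", b"Tuesday's edition")
names.publish("urn:example:todays-newspaper", "lifn:example:0002")
tuesday = names.resolve("urn:example:todays-newspaper")
```

The URN stays the same across both days, while each edition keeps its own LIFN forever.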
Replication Daemons
A replication daemon performs the task of acquiring new files from
remote servers, deleting files that are no longer wanted, and
informing the location servers of the changes. This function is
similar to that provided by several existing ``mirror'' programs, but
in addition to copying files from one server to another, the
replication daemon also propagates each file's LIFN and any other
information which is needed by the file server.
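The three duties named above (acquire, delete, inform the location servers) can be sketched in a few lines. This is a toy model under stated assumptions: plain dictionaries stand in for the remote server, the local file store, and the location database, and the function name `replicate`, the `lifn-...` identifiers, and the URL layout are hypothetical.

```python
from typing import Dict, Iterable, List


def replicate(
    acquire: Iterable[str],
    reap: Iterable[str],
    remote_files: Dict[str, bytes],
    local_files: Dict[str, bytes],
    locations: Dict[str, List[str]],
    local_base_url: str,
) -> None:
    """Copy wanted files, delete unwanted ones, and update locations."""
    for lifn in acquire:
        local_files[lifn] = remote_files[lifn]  # the copy keeps its LIFN
        url = local_base_url + lifn
        urls = locations.setdefault(lifn, [])
        if url not in urls:
            urls.append(url)  # announce the new copy
    for lifn in reap:
        local_files.pop(lifn, None)  # delete the local copy if present
        url = local_base_url + lifn
        if url in locations.get(lifn, []):
            locations[lifn].remove(url)  # withdraw the old location


remote = {"lifn-0001": b"A", "lifn-0002": b"B"}
local = {"lifn-0002": b"B"}
locations = {
    "lifn-0001": ["http://primary.example/lifn-0001"],
    "lifn-0002": ["http://primary.example/lifn-0002",
                  "http://mirror.example/lifn-0002"],
}
replicate(["lifn-0001"], ["lifn-0002"], remote, local, locations,
          "http://mirror.example/")
```

After the call, the mirror holds only `lifn-0001`, and the location table lists the mirror for that file but no longer for `lifn-0002`.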
The BFD replication daemon is designed to perform its task very
efficiently. Planned features include on-the-wire compression,
checkpoint/restart, multiple file multiplexing (to allow for the
gradual transfer of very large files without pre-empting small ones),
integrity checking, and a protocol which works well over high
bandwidth-delay links.
Collection Managers
A collection manager decides which files to acquire, which ones to
keep, and which ones to throw away. It makes such decisions based on
access statistics (as obtained from the file server or a name server),
and site-specified criteria. The results of such decisions are then
fed to one or more replication daemons. If several file servers are
under control of a single administration, a single collection manager
may make decisions for several of its file servers, and transmit
the instructions to each file server via the network.
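As a rough sketch of such a decision, the code below ranks files by request count and fills a fixed storage budget. The budget stands in for the site-specified criteria. The policy, the function name `plan_collection`, and the `lifn-...` identifiers are assumptions for illustration; the text does not prescribe a particular rule.

```python
from typing import Dict, List, Set


def plan_collection(
    access_counts: Dict[str, int],
    sizes: Dict[str, int],
    held: Set[str],
    capacity_bytes: int,
) -> Dict[str, List[str]]:
    """Decide which files to acquire, keep, and reap.

    Files are considered from most to least requested (ties broken by
    name) and admitted while they still fit in the storage budget.
    """
    wanted: Set[str] = set()
    used = 0
    for lifn in sorted(access_counts, key=lambda k: (-access_counts[k], k)):
        if used + sizes[lifn] <= capacity_bytes:
            wanted.add(lifn)
            used += sizes[lifn]
    return {
        "acquire": sorted(wanted - held),  # wanted but not yet held
        "keep": sorted(wanted & held),     # wanted and already held
        "reap": sorted(held - wanted),     # held but no longer wanted
    }


plan = plan_collection(
    access_counts={"lifn-a": 50, "lifn-b": 30, "lifn-c": 5},
    sizes={"lifn-a": 400, "lifn-b": 500, "lifn-c": 300},
    held={"lifn-b", "lifn-c"},
    capacity_bytes=1000,
)
```

The resulting plan is what would be handed to a replication daemon: here it says to acquire `lifn-a`, keep `lifn-b`, and reap `lifn-c`.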
File Servers
File servers in BFD are essentially ordinary HTTP, Gopher, or FTP
servers which provide file access to BFD-managed files for Web
browsers.
Clients
BFD clients are slightly modified Web browsers which, in addition to
having the capability to retrieve a file given its URL, also have the
capability to retrieve files (by consulting name and/or location
servers) by URN or LIFN.
Since URNs and LIFNs cannot be expected to be widely used for some
time, a transition strategy has been
developed that provides some of the benefits of BFD for files accessed
by a URL.
Other tools
Other tools will be necessary to implement BFD fully. In particular,
there will be a need for tools to help publishers manage their
collections, tools to help authors and editors maintain the catalog
information for their works, and a mechanism to export various kinds
of meta-information to resource discovery systems such as Harvest.
Use of URNs and LIFNs for publishing and accessing files is
illustrated in Figures 1 and 2.
Figure 1:
Figure 2: