4. What?
Do we need a definition of metrics?
The metrics measured by GridMon are:
- Connectivity
- Jitter
- Packet loss
- RTT (Round Trip Time)
- TCP & UDP throughput
Their relevance to network performance is summarised below:
Packet loss
Data is transmitted in packets and, unsurprisingly, packet loss is a measure of how many packets are lost during transport. This includes packets which are discarded because they have arrived with corrupted "transmission" data (separate from the payload/user data part of the packet).
Packets that are discarded or fail to arrive must be re-transmitted, and this quickly causes a "traffic jam" if the packet loss is severe.
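As a minimal sketch of how a loss figure can be derived from packet counts (the function name and the sample counts are illustrative assumptions, not part of the GridMon toolkit):

```python
def packet_loss_percent(sent: int, received: int) -> float:
    """Percentage of sent packets that were lost or discarded in transit."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


# Hypothetical test run: 200 packets sent, 194 arrived intact.
loss = packet_loss_percent(sent=200, received=194)
print(f"{loss:.1f}% packet loss")
```

Packets that arrive but are discarded as corrupt count as lost here, in line with the definition above.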
Connectivity
Is simply an indication of whether you can connect to a remote site/machine, and is closely linked to packet loss. It can be used to identify sudden faults, such as link failures, or more sporadic problems such as loss of connectivity at certain times of the day, perhaps as a result of network congestion occurring during "busy" periods, causing packets to be discarded.
Inter-packet Jitter
Is a measure of the variation in the delay of packets arriving. It is a very important metric for real-time applications such as video conferencing, which require packets to arrive at a steady rate.
This metric is currently only measured for UDP traffic (including multicast) but it is planned to extend this for TCP traffic in the future.
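To make "variation in delay" concrete, the sketch below uses the smoothed interarrival jitter estimator defined for RTP in RFC 3550. This is one common definition; whether the GridMon tools compute jitter in exactly this way is an assumption, and the timestamps are invented for illustration.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Smoothed jitter estimate in the style of RFC 3550 (RTP).

    Only differences in transit time between consecutive packets are
    used, so the sender and receiver clocks need not be synchronised.
    """
    if len(send_times_ms) != len(recv_times_ms):
        raise ValueError("need one receive time per send time")
    transits = [r - s for s, r in zip(send_times_ms, recv_times_ms)]
    jitter = 0.0
    for previous, current in zip(transits, transits[1:]):
        delta = abs(current - previous)
        # Each new sample moves the running estimate 1/16 of the way.
        jitter += (delta - jitter) / 16.0
    return jitter


# Hypothetical stream: one packet every 20 ms.
send = [0, 20, 40, 60]
steady = interarrival_jitter(send, [50, 70, 90, 110])   # constant delay
uneven = interarrival_jitter(send, [50, 75, 90, 118])   # varying delay
print(steady, uneven)
```

A stream with constant delay yields zero jitter however large the delay is; only changes in delay contribute.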
RTT
Round trip time is a measure of the time it takes to send a packet from node x to y, and receive a response back at x.
TCP is a send-acknowledge protocol. A sender may only have a limited amount of unacknowledged data in flight; once that limit is reached, no further data can be sent until receipt of previously transmitted data has been acknowledged. If this acknowledgement takes a long time to arrive (due to a long RTT) then transmission delays are created.
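The effect of RTT can be made concrete with a rough upper bound: if at most W bytes may be outstanding (sent but not yet acknowledged), throughput cannot exceed W divided by the RTT. The sketch below assumes a 64 KiB limit purely for illustration; real limits depend on the hosts' TCP configuration.

```python
def max_tcp_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most window_bytes may be unacknowledged."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds / 1e6


WINDOW = 64 * 1024  # assumed 64 KiB of unacknowledged data allowed

for rtt_ms in (2, 20, 200):
    bound = max_tcp_throughput_mbps(WINDOW, rtt_ms / 1000)
    print(f"RTT {rtt_ms:>3} ms -> at most {bound:.1f} Mbit/s")
```

Under this assumption a tenfold increase in RTT reduces the achievable rate tenfold, regardless of the capacity of the link.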
TCP/UDP Throughput
Throughput is essentially a measure of the rate at which data is, or can be, transferred. However, you must be careful about what kind of throughput you are talking about. For example, are you talking about maximum throughput, network throughput, end-to-end throughput, or throughput on the wire?
The GridMon toolkit monitors network throughput (what the network sees) for TCP and UDP traffic, and end-to-end throughput (what end user applications, and hence human end users see) for TCP traffic.
You can use this data to see what transfer rates you can expect to achieve to other sites, and to identify inefficiency (e.g. if network throughput is significantly better than end-to-end performance).
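One way to spot such inefficiency is to compare the two figures per remote site. The following is a sketch only: the site names, the sample rates and the 0.5 threshold are hypothetical, and GridMon does not necessarily expose its data in this form.

```python
def efficiency(network_mbps: float, end_to_end_mbps: float) -> float:
    """Fraction of the measured network throughput seen by applications."""
    if network_mbps <= 0:
        raise ValueError("network_mbps must be positive")
    return end_to_end_mbps / network_mbps


def inefficient_links(measurements, threshold=0.5):
    """Return sites whose end-to-end rate falls below threshold * network rate.

    measurements maps a site name to (network_mbps, end_to_end_mbps).
    """
    return sorted(
        site
        for site, (network, end_to_end) in measurements.items()
        if efficiency(network, end_to_end) < threshold
    )


# Hypothetical TCP measurements towards two remote sites, in Mbit/s.
sample = {"site-a": (90.0, 80.0), "site-b": (90.0, 30.0)}
print(inefficient_links(sample))
```

A site flagged this way has a network path that performs well, so the shortfall is more likely to lie in the end hosts or applications than in the links between them.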
5. How (1)?

Figure 1 - GridMon Monitoring Architecture
The monitoring is performed by a kit of tools installed on a suitable machine at each e-Science Centre. Performance data is stored locally on that machine, and is published to interested people via a web interface, and to the Grid middleware via a publication service (LDAP, Grid or web). LDAP seems to be popular in the States, with OGSA growing in popularity in Europe. An (OGSA) Grid service is essentially a web service with some Grid specific add-ons/pre-requisites.
Every 30 minutes (90 minutes for bbcp/ftp) each machine performs monitoring between itself and all other e-Science Centres. In this way a mesh of monitoring is created, allowing each centre to build up a picture of the quality of its links to all other centres. The mesh approach is feasible given the relatively low number of sites involved (12-15 in this case). The times at which the individual machines run tests are staggered in an attempt to minimise the disruption one machine’s tests may have on another’s.
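The size of the mesh and a possible staggering scheme can be sketched as follows. The source does not say how the start times are actually staggered, so the evenly spaced offsets here are an assumption, as are the site names.

```python
from itertools import permutations


def mesh_pairs(sites):
    """Every ordered (source, destination) pair in a full monitoring mesh."""
    return list(permutations(sites, 2))


def staggered_offsets(sites, interval_minutes=30):
    """Spread start times evenly across the interval (assumed scheme)."""
    step = interval_minutes / len(sites)
    return {site: index * step for index, site in enumerate(sites)}


# Hypothetical mesh of 12 centres, the lower end of the range quoted above.
sites = [f"centre-{n:02d}" for n in range(1, 13)]
print(len(mesh_pairs(sites)), "directed test paths per cycle")
for site, offset in staggered_offsets(sites).items():
    print(f"{site} starts {offset:.1f} min into each 30-minute cycle")
```

The number of directed paths grows as n(n-1), which is why a full mesh is practical for a dozen or so sites but would not scale to hundreds.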
The monitoring host must obviously meet some requirements, with the most important being that:
- It is a dedicated monitoring machine. There is little point installing the toolkit on your web server, for example, because it will not allow like-for-like performance data to be collected.
- In the same vein, the machine must have similar connectivity to the other networked e-Science resources at your centre. Performance data will not be representative if the tools are, for example, installed on a machine hanging from the primary link to the outside world, while users in the e-Science centre are three or four levels further down the network hierarchy.
5. Monitoring: How (2)?