Monday, November 9, 2009

Internet Indirection Infrastructure

Originally, the Internet was based on a point-to-point communication abstraction, but many applications have since emerged, such as multicast, anycast, and host mobility, that are mismatched with this abstraction: they require a more general abstraction and a layer of indirection that decouples the sending host from the receiving host. However, implementing these general abstractions at the IP layer or the application layer poses many difficulties. This paper proposes a single overlay network, i3, to serve as an indirection infrastructure that combines the generality of IP-layer solutions with the deployability of overlay solutions. It offers best-effort service without reliability or ordered delivery.
i3 provides a rendezvous-based communication abstraction in which a single node in the network is responsible for storing packets sent to all identifiers (ids) that agree in the first k bits. A receiver inserts a trigger (idt, addr) to indicate that all packets whose identifier is a longest-prefix match for idt should be forwarded by the i3 infrastructure to the node identified by addr (a sketch of this matching appears after the list below). i3 achieves general communication abstractions:
i) mobility: end-to-end connectivity is maintained even when both endpoints move simultaneously, simply by updating the receiver's trigger with the new address
ii) multicast: hosts simply maintain triggers with the same identifier
iii) anycast: hosts of an anycast group maintain triggers that are identical in a prefix of length k bits. The remaining m-k bits can also encode application preferences, for example to balance load over the servers
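As a rough illustration of this matching rule, here is a minimal Python sketch of what a single i3 server might do, assuming integer identifiers and triggers stored as (id, addr) pairs; the names and data layout are mine, not the paper's.

```python
M, K = 256, 128  # identifier length and exact-match prefix length

def prefix_len(a: int, b: int) -> int:
    """Length of the common most-significant-bit prefix of two identifiers."""
    x = a ^ b
    return M if x == 0 else M - x.bit_length()

def match(packet_id: int, triggers: list) -> list:
    """Addresses of the best-matching triggers stored on this server.
    triggers is a list of (trigger_id, addr) pairs."""
    scored = [(prefix_len(tid, packet_id), addr) for tid, addr in triggers]
    qualified = [(p, addr) for p, addr in scored if p >= K]  # first K bits must agree
    if not qualified:
        return []
    best = max(p for p, _ in qualified)
    # Several receivers inserting triggers with the same id all match:
    # this is exactly how multicast falls out of the abstraction.
    return [addr for p, addr in qualified if p == best]
```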
Another generalization of i3 replaces an identifier with an identifier stack, a list of identifiers or addresses, so that a source can send a packet through a series of identifiers. This generalization is useful for transforming data at a third-party server, for passing data through a firewall before delivery to the destination, and for enabling heterogeneous receivers to subscribe to the same multicast group.
i3 is implemented on top of the Chord lookup protocol, using a circular identifier space with m = 256 bits and k = 128 bits. Such an overlay network is self-configuring, in that a node can join the i3 infrastructure simply. i3 uses a soft-state approach to be robust against node failures, relying on packet routing and on end hosts periodically refreshing their triggers and identifier stacks. However, routing in the i3 overlay network is less efficient than routing via IP; thus, clients cache the i3 servers' IP addresses, and a receiver should choose triggers whose identifiers are stored on a nearby server.
The paper evaluates i3's efficiency by simulation with randomly distributed identifiers. The big challenge is achieving efficient routing, since an end host does not have good control over the location of its triggers. The ratio of inter-node latency on the i3 network to that on the IP network is used to evaluate i3's performance.

Monday, November 2, 2009

Looking Up Data in P2P Systems

One of the main challenges in large P2P systems is the lookup problem without a centralized server. Many algorithms have been developed based on the DHT interface; such an algorithm must be able to locate the node responsible for storing the data associated with a specific key. This requires each node to maintain the IP addresses of some neighbors, forming an overlay network. In a DHT implementation, keys and nodes are identified by m-bit numbers, and each key's data is stored at the node whose ID is closest to that key. The paper discusses recent scalable algorithms with low latency and balanced distribution of keys among the nodes.
In Chord, each node has a finger table based on powers of two, such that a query for a key k is forwarded to the node in the finger table with the highest ID not exceeding k, covering at least half of the remaining ID-space distance to k. The distance function is unidirectional and asymmetric; it depends on the numeric difference between two IDs.
Pastry, Tapestry, and Kademlia are based on tree-like data structures. In Pastry and Tapestry, each node maintains the location of some node for each prefix, so the distance function is symmetric but not unidirectional. Correctness requires that each node maintain a leaf set (the nodes with the closest smaller and larger IDs). Pastry optimizes forwarding performance by maintaining a routing table used to find a node whose shared prefix with the queried key is longer than (or at least equal to) the current node's.
Kademlia's distance function is the XOR of two IDs, which makes it both symmetric and unidirectional.
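To make the distance-function comparison concrete, here is a small sketch (8-bit IDs for readability, my own toy example) contrasting Chord's clockwise ring distance with Kademlia's XOR distance.

```python
M = 8
MOD = 1 << M

def chord_distance(a: int, b: int) -> int:
    """Clockwise distance from a to b on the ring: unidirectional (exactly
    one point at each distance from a) but asymmetric."""
    return (b - a) % MOD

def xor_distance(a: int, b: int) -> int:
    """Kademlia's metric: symmetric (d(a,b) == d(b,a)) and unidirectional."""
    return a ^ b

assert chord_distance(5, 250) != chord_distance(250, 5)  # asymmetric
assert xor_distance(5, 250) == xor_distance(250, 5)      # symmetric
```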
In CAN, each node is responsible for a rectangular zone and maintains a routing table of its neighbors in the coordinate space. A lookup is performed by forwarding the query along a roughly straight path, neighbor by neighbor, toward the node storing the key.
Each of the aforementioned algorithms includes a procedure to adapt to node arrivals and departures, with the key difference lying in the data structure each uses as a routing table. However, these P2P lookup algorithms have been analyzed under static conditions only. They have many aspects in common as well as differences, and further work is needed to combine their strengths.

Tuesday, October 20, 2009

White Space Networking with Wi-Fi like Connectivity

The paper focuses on setting up a UHF white-space network, WhiteFi, that operates without interfering with the wireless transmissions of primary users and that can handle spectrum fragments of different widths. The UHF white-space band considered consists of 30 channels, each 6 MHz wide.


The authors implemented WhiteFi on the KNOWS platform, which incorporates a Wi-Fi card, a UHF band converter, and an SDR. The platform has two key features:
1) Variable channel widths: the transmitted and received signal can have a bandwidth of 5, 10, or 20 MHz. This imposes a longer time to scan for and discover APs, since scanning must also cover the possible width combinations over the available channels

2) SIFT, which analyzes incoming signals in the time domain to detect transmissions over different channel widths without changing the channel width of the wireless card. It is able to distinguish intermittent data from noise. SIFT is important for reducing the time required to discover APs that have switched to a different part of the spectrum

WhiteFi applies an adaptive spectrum-assignment algorithm that periodically re-evaluates the assignment based on white-space availability at the AP and clients, using the spectrum map and the airtime utilization observed at each client. The AP must pick a channel that is free for all of its clients.
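Conceptually, the assignment step reduces to intersecting spectrum maps. The sketch below is my own simplification, assuming each map is just the set of free 6 MHz segment indices (0-29) and a channel is a run of contiguous segments; the paper's actual algorithm also ranks candidates by expected airtime.

```python
def candidate_channels(ap_map: set, client_maps: list, width_segments: int) -> list:
    """Channels (runs of `width_segments` contiguous 6 MHz segments)
    that are free at the AP and at every associated client."""
    common = ap_map.intersection(*client_maps) if client_maps else ap_map
    channels = []
    for start in range(0, 30 - width_segments + 1):
        segs = set(range(start, start + width_segments))
        if segs <= common:  # every segment in the run is free everywhere
            channels.append((start, width_segments))
    return channels
```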

Monday, October 19, 2009

Interactive WiFi Connectivity For Moving Vehicles

The challenges in a network with vehicular clients are sharp and unpredictable drops in connectivity and the difficulty of estimating the continuously changing channel quality.

The paper presents the design and performance measurement of ViFi, a diversity-based handoff protocol that minimizes disruptions in WiFi connectivity to support interactive applications from vehicular clients. The measurement study covers two commonly used interactive applications, VoIP and short TCP transfers, on two vehicular network testbeds, VanLAN and DieselNet, in different cities. ViFi exploits macrodiversity (i.e., connection with multiple base stations on the same channel simultaneously) and opportunistic receptions by nearby base stations to minimize disruptions for mobile clients. The challenge in designing ViFi is coordinating among the base stations that opportunistically receive packets: inter-BS communication must work within limited bandwidth, with minimal overhead and low per-packet latency.
In ViFi, the client broadcasts the identity of a selected nearby BS as the anchor that provides Internet access, along with the auxiliary set of BSes and the previous anchor. The old anchor transfers any unacknowledged packets that were received from the Internet to the new anchor, which treats these packets as if they had arrived directly from the Internet.
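A toy sketch of the handoff bookkeeping described here (the names are mine; ViFi's actual inter-BS protocol and its probabilistic relaying between auxiliaries are more involved):

```python
class BaseStation:
    def __init__(self):
        self.unacked = []  # Internet packets not yet acknowledged by the client

    def transfer_unacked(self, new_anchor):
        # The new anchor treats these as freshly arrived from the Internet.
        new_anchor.unacked.extend(self.unacked)
        self.unacked.clear()

class Client:
    def __init__(self, anchor, auxiliaries):
        self.anchor, self.auxiliaries, self.prev_anchor = anchor, auxiliaries, None

    def handoff(self, new_anchor, new_auxiliaries):
        """Broadcast the new anchor, auxiliary set, and previous anchor."""
        self.prev_anchor, self.anchor = self.anchor, new_anchor
        self.auxiliaries = new_auxiliaries
        self.prev_anchor.transfer_unacked(self.anchor)
```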

The paper investigates four practical and two ideal handoff policies; five of them are based on a hard-handoff strategy, where the client associates with one BS at a time, chosen according to different factors. Two measurements are used to evaluate these policies: aggregate performance and the period of uninterrupted connectivity. It was found that a multi-BS handoff policy delivers more packets, causes fewer interruptions, and achieves longer uninterrupted sessions. The deployed prototype doubles the number of successful TCP transfers and doubles the length of disruption-free VoIP calls compared to a hard-handoff protocol. However, it performs poorly when the number of auxiliary BSes is high and/or the auxiliary BSes are equidistant from the source and the destination.

Thursday, October 15, 2009

ExOR: Opportunistic Multi-Hop Routing for Wireless Networks


The paper presents the design, implementation, and evaluation of a new link/network-layer routing technique. Based on periodic measurements of the delivery probability between node pairs, a subset of the useful nodes between the source and destination agrees to receive packets broadcast by the source. Each packet contains a prioritized list of candidate forwarders. The node closest to the destination (with the highest priority) forwards the packet by broadcasting it again to a new subset, and so on until it reaches the destination. This requires time scheduling to avoid collisions, and agreement on a single forwarder to avoid duplication. But the agreement overhead is proportional to the number of nodes, so the subset must be chosen carefully to avoid including useless nodes or nodes that force extra transmissions.
The ExOR header is longer than a traditional one, and its size depends on the forwarder-list size. Unlike traditional routing, ExOR can deal with asymmetric links, and also with long, high-loss routes that traditional routing avoids.
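The forwarding decision itself is simple once the prioritized list exists. This is a bare sketch of that step under my own naming, ignoring ExOR's batching, batch maps, and transmission scheduling:

```python
def make_forwarder_list(cost_to_dst: dict) -> list:
    """Order candidate forwarders so that nodes with the lowest expected
    cost (e.g. fewest expected transmissions) to the destination come first."""
    return sorted(cost_to_dst, key=lambda node: cost_to_dst[node])

def choose_forwarder(forwarder_list: list, received_by: set):
    """After a broadcast, the highest-priority node that actually received
    the packet forwards it; lower-priority receivers suppress themselves."""
    for node in forwarder_list:  # highest priority first
        if node in received_by:
            return node
    return None  # nobody received it; the sender must try again
```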

ExOR was evaluated on an outdoor rooftop 802.11 network. ExOR's throughput was found to be higher than with traditional routing over both long and short links, with the largest gains over the longest routes of five to seven hops. It requires fewer transmissions per packet, which results in less interference for other users of the same spectrum.
But if ExOR routing improves throughput, why is it used to transfer only 90% of the packets, with the remaining packets transferred via traditional routing?

Tuesday, October 6, 2009

Architecture and Evaluation of an Unplanned 802.11b Mesh Network

The paper tries to combine the best characteristics of the multi-hop and single-hop network types: an unplanned mesh network with wide coverage and acceptable performance. It evaluates this architecture using Roofnet, a multi-hop 802.11b Internet access network with 37 nodes spread over four square kilometers of a densely populated area. The main characteristics of Roofnet are:

  • Each Roofnet node consists of a PC with an Ethernet port for Internet access, an 802.11b card, an omni-directional antenna, and software: Linux, routing software, and DHCP and Web servers
  • Each node has a unique Roofnet-layer address (assigned automatically by the Roofnet software) and an IP address (assigned via DHCP) that is used inside Roofnet only
  • If a Roofnet node succeeds in connecting to an ISP, it advertises itself as an Internet gateway and shares its Internet connection wirelessly with other nodes using NAT. Otherwise, it acts as the default router for its host and connects to the Internet via the gateway with the best ETT metric (lowest ETT, i.e., highest throughput).
  • ETT uses periodic broadcasts of 1500-byte packets in the forward direction at each available 802.11 bit rate, and 60-byte packets at 1 Mbps in the reverse direction
  • Roofnet's routing protocol, Srcr, runs Dijkstra's algorithm on a partial database of link metrics to find the highest-throughput route whenever a node learns of a changed link metric (see the sketch after this list)
  • Nodes learn link metrics either from dummy queries flooded by the gateway or from forwarded data packets, which include each link's current metric in the packet's source route.
  • Roofnet uses the SampleRate algorithm to select, as it sends data packets, the bit rate that provides the highest throughput given the delivery probability
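The ETT metric referenced above is usually formulated as the expected time to get one packet (and its ACK) across a link; a sketch under that standard formulation, with illustrative constants:

```python
PACKET_BITS = 1500 * 8

def ett(rate_bps: float, df: float, dr: float) -> float:
    """Expected transmission time of a link: the packet's airtime divided
    by the probability that the data (df) and the ACK (dr) both succeed,
    i.e. airtime times the expected number of attempts 1/(df*dr)."""
    if df * dr == 0:
        return float("inf")
    return (PACKET_BITS / rate_bps) / (df * dr)

# Srcr can then run Dijkstra with each link weighted by its best ETT
# (the minimum over the available 802.11 bit rates).
```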
The evaluation of Roofnet found an average inter-node throughput of 627 kbps, decreasing with the number of hops, and an average round-trip delay of 39 ms. Roofnet users reach an Internet gateway with an average throughput of 1395 kbps and an average round-trip delay of 22 ms. It was also found that Srcr effectively selects short links of a few hundred meters and ignores long links.

Sunday, October 4, 2009

Modeling Wireless Links for Transport Protocols

New transport protocols should be designed for good performance over both fixed and wireless links. However, the variable bandwidth, corruption, channel-allocation delay, and asymmetry of wireless links can negatively affect transport protocol performance. This paper presents a model for cellular, WLAN, and satellite wireless links that helps minimize these impacts on transport protocols. It considers single or multiple wireless links located at one end or in the middle of the communication path.

The goal is a traffic model that is realistic (with parameters that accurately portray real-world behavior), easy to use, able to explore a wide parameter space, and detailed enough to evaluate the effect of link-level mechanisms on transport protocol performance for two-way transfers. The performance metric for a wireless link is goodput, which reflects energy and spectrum efficiency, interference, and cost.

The authors discuss several characteristics of wireless links, their effects on transport protocols, and how they can be modeled:

  • Error losses can reduce the sending rate by triggering lengthy retransmission timeouts. At the same time, bursty losses can yield higher throughput than an equivalent number of non-bursty losses. This impact can be modeled by dropping packets at the end of the link (see the sketch after this list)
  • An abrupt increase in link delay can trigger spurious timeouts and retransmissions and reduce the sending rate. Delay can be modeled by deferring data transmission on the link.
  • Packet reordering due to link-level retransmission can incorrectly trigger congestion-control responses in TCP, while such delivery helps decrease delay for unreliable transport protocols. It can be modeled by swapping packets in a queue, or by delaying one packet at a time and letting others pass
  • Resource-allocation delay translates to increased delay for the user. It can be modeled by introducing additional delay when a packet arrives at a queue that has been empty for longer than the channel hold time
  • A decrease in link bandwidth can trigger spurious timeouts in TCP, while an increase in bandwidth can result in inefficient utilization. This oscillation also leads to packet reordering. It can be modeled by changing the link bandwidth
  • Asymmetric bandwidth and latency in the up/downlink directions of a wireless link can cause congestion and limit throughput. Modeling it requires configuring different bandwidth and latency parameters for the uplink and downlink.
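For the bursty-loss point in particular, a standard way to model clustered errors at the end of a link is a two-state (Gilbert-style) Markov chain; the sketch below uses illustrative transition probabilities, not values from the paper.

```python
import random

class GilbertLoss:
    """Packets are dropped mostly while the chain sits in the 'bad' state,
    so losses arrive in bursts rather than independently."""
    def __init__(self, p_good_to_bad=0.01, p_bad_to_good=0.3, p_loss_in_bad=0.5):
        self.bad = False
        self.p_gb, self.p_bg, self.p_loss = p_good_to_bad, p_bad_to_good, p_loss_in_bad

    def drop(self) -> bool:
        if self.bad:
            if random.random() < self.p_bg:
                self.bad = False
        elif random.random() < self.p_gb:
            self.bad = True
        return self.bad and random.random() < self.p_loss
```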

    The paper also discusses
    1) the buffering model and the impact of buffer management and size on transport protocols: while a large buffer leads to poor loss recovery, a small buffer results in low link utilization and packet drops, and Drop-Tail queue management can cause bursty packet loss followed by retransmission timeouts;
    2) the mobility model and its effects: mobility can cause packet loss, introduce delay, and change link bandwidth and latency.

    Finally, the paper suggests that cross-layer communication (between the link layer and the transport layer) could help optimize performance over wireless links. It could be investigated for a transport protocol to inform the link layer that it is tolerant to packet reordering, or of its degree of tolerance to delay or bit errors. It could likewise be investigated for link layers to inform transport end-nodes of bandwidth or delay changes caused by handovers.

Wednesday, September 30, 2009

MACAW: A Media Access Protocol for Wireless LAN’s

CSMA cannot sense a hidden terminal, and it provides information about collisions only at the sender rather than the receiver, so the authors build on a previous approach, MACA, which is based on short, fixed-size RTS/CTS signaling packets. Both RTS and CTS packets contain the length of the proposed data transmission, so that any node that hears the RTS and/or CTS defers its transmission. MACA does not provide an adequate level of fairness within one cell because of its BEB backoff algorithm, and the congestion information calculated at each pad differs.

The paper presents the design of a multiple-access protocol called MACAW for an indoor wireless LAN infrastructure, with the goal of delivering fairness over optimal total throughput. The infrastructure uses a single 256 kHz channel between base stations installed in the ceiling and the pads:
· Pads and base stations transmit at the same power
· The transmission range is 3-4 meters
· The cell around each base station is very small, with a sharp boundary
· Two stations are either in range or out of range of one another
· A station receives a packet correctly if and only if there is exactly one active transmitter within its range
· Pads and base stations learn of the presence of other devices only when they communicate

The authors apply the notion of fairness per stream rather than per station, by allocating a queue for each stream and running the MILD backoff algorithm independently for each queue. The backoff algorithm multiplicatively increases the backoff time after a collision and linearly decreases it after a successful transmission. Simulation with UDP traffic demonstrates the fair allocation of bandwidth.
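As I understand MILD, the rule is a 1.5x multiplicative increase on collision and a decrease by one slot on success, bounded on both sides; a tiny sketch (the bounds are illustrative):

```python
BO_MIN, BO_MAX = 2, 64  # backoff bounds in slots (illustrative values)

def on_collision(backoff: int) -> int:
    return min(int(backoff * 1.5), BO_MAX)  # multiplicative increase

def on_success(backoff: int) -> int:
    return max(backoff - 1, BO_MIN)         # linear decrease
```

The paper additionally has stations copy the backoff value into packet headers, so that all stations in a cell converge on a shared estimate of congestion.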

MACAW also modifies MACA's basic RTS-CTS-DATA exchange: the sender transmits a DS packet to inform other stations of the existence and length of the following data transmission, and the receiver sends an ACK packet after successful reception of the data. However, the additional ACK overhead was not helpful in the simulation at an error rate of 0.001.
The paper describes a scenario of two cells, each within the coverage range of its own base station and of the other, where one of the streams in a cell is negatively affected (its access completely denied) by the transmission to the second pad (which receives all the requested throughput). The solution is to synchronize the information by introducing an additional overhead packet called RRTS: whenever a station receives an RTS to which it cannot respond, it sends an RRTS packet to the sender during the next contention period. However, all of this message exchange does not help in all scenarios for unicast transmission, nor for multicast in general.
The author clearly portrays some challenges in wireless LAN media access, but I think the proposed solution is not very helpful compared to the overhead it introduces.

Tuesday, September 29, 2009

Safe and Effective Fine-grained TCP Retransmission for Datacenter Communication


The paper focuses on the incast observed for synchronized reads and writes. Incast is a communication pattern that occurs when many senders, upon receiving a request from a client, simultaneously transmit a large amount of data: the packets overfill the buffers of the client's port on the bottleneck switch, resulting in many losses, and the client has to wait a minimum of 200 ms before receiving the whole response. This reduces the throughput seen by the application to 1-10% of the link capacity.



Other works offered application-specific solutions, while this paper provides a TCP-level solution:

· Reducing the minimum RTO: in a workload involving a switch with 32 KB of output buffer per port and more than 10 simultaneous servers, with RTT = 100 us, link capacity = 1 Gbps, and the client issuing requests for 1 MB data blocks, maximum effectiveness was achieved by making the minimum RTO close to the network's RTT.
· So, for a datacenter with a 10 Gbps network and 10 us port-to-port latency, avoiding throughput collapse and idle links requires an even smaller RTO. The authors analyzed the behavior of a 10 Gbps Ethernet network with RTT reduced to 100-20 us and a block size of 80 MB, scaling the number of servers into the thousands. They found that removing the lower bound on the RTO improves performance for up to 512 servers only; with more servers, a 50% reduction in application throughput was still recorded, due to synchronized retransmissions and successive timeouts
· Therefore, they add randomization to the RTO timer to desynchronize the repeated flows (see the sketch after this list).
· They stressed the importance of fine-grained TCP retransmissions for wide-area flows with diverse RTTs and RTOs, and verified that eliminating the minimum bound on the RTO and enabling microsecond retransmission timers can avoid incast.
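A sketch of the timer this implies: the standard Jacobson estimator with the 200 ms floor removed and a randomized component added. The exact form of the randomization used in the paper may differ from the uniform factor assumed here.

```python
import random

def update_rto(srtt: float, rttvar: float, sample: float):
    """All times in seconds; with microsecond-scale RTT samples this yields
    microsecond-scale timeouts instead of a 200 ms minimum."""
    err = sample - srtt
    srtt += 0.125 * err
    rttvar += 0.25 * (abs(err) - rttvar)
    rto = srtt + 4 * rttvar                 # note: no minimum-RTO floor
    rto *= 1 + random.uniform(0.0, 0.5)     # desynchronize concurrent senders
    return srtt, rttvar, rto
```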

Sunday, September 27, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks


Incast has been observed in distributed storage, MapReduce, and web-search workloads. Many solutions have been proposed, but they either do not help (eliminating slow start, reducing the RTO), are expensive (increasing switch and router buffer sizes), or are application-level solutions, which in turn require modifying every application that uses TCP.
The paper tries to characterize the incast problem. It uses a workload modeled on distributed storage applications, where the receiver requests a number of blocks from a number of servers that respond with either fixed- or variable-size fragments. Each block is striped across the servers, and the next request has to wait until data from all senders has been received.

The authors demonstrated that the incast pattern is general by replicating it in a simple 1 Gbps testbed with all servers connected through a single switch, so that the incast phenomenon can be observed easily. They observed behaviors different from those reported in other works.
They attempted to understand the phenomenon for both fixed- and variable-size fragment workloads by
· Reducing and randomizing the minimum and/or initial RTO value, and setting a smaller, randomized multiplier for the RTO exponential backoff. These did not improve goodput, since the servers share the same switch buffer
· Examining the impact of the RTO timer resolution and of delayed ACKs, where low timer resolution with delayed ACKs left enabled gave the best result for the fixed-size fragment workload

Wednesday, September 23, 2009

VL2: A Scalable and Flexible Data Center Network

A data center network must achieve low cost and high utilization through the ability to assign any server to any service. Current designs do not satisfy these goals because:
· They don't provide uniform capacity between the connected servers, due to the high cost of the equipment required
· Traffic of one service affects other services
· Forwarding depends on IP addresses and on dividing servers among VLANs, which supports neither VM migration nor reassigning servers to services.

This paper presents the VL2 network architecture, which connects low-cost switches in a Clos topology to achieve the goals of a data center network while avoiding the three problems above:
· It adopts VLB to spread traffic randomly across all available paths, achieving uniform capacity and performance isolation (see the sketch after this list)
· It uses the OSPF protocol to calculate the forwarding tables for the switches.
· It uses a flat addressing scheme: servers' IP addresses are used as names, and a shim layer in each server looks up the location of the destination in a directory system
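The VLB idea is easy to state in code: hash each flow onto a random intermediate switch so that traffic spreads over all paths while the packets of one flow stay on one path (avoiding reordering). A minimal sketch with hypothetical names:

```python
import hashlib

def pick_intermediate(flow_5tuple: tuple, intermediates: list) -> str:
    """Deterministically map a flow to one randomly-placed intermediate."""
    digest = hashlib.md5(repr(flow_5tuple).encode()).digest()
    return intermediates[int.from_bytes(digest[:4], "big") % len(intermediates)]

switches = ["int-%d" % i for i in range(8)]
print(pick_intermediate(("10.0.0.1", "10.0.1.2", 6, 5555, 80), switches))
```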

The architecture was evaluated with TCP flows only. The paper states that VL2's implementation of VLB splits flows evenly and achieves high TCP fairness.

Tuesday, September 22, 2009

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

A data center network is organized as a number of ToR switches connected to one or more aggregation switches, which in turn connect to core switches. It contains a huge number of hosts and employs virtual machine multiplexing, resulting in millions of uniquely addressable end hosts, which poses a further challenge at both Layer 2 and Layer 3. Efficiency, scalability, and fault tolerance are thus all significant concerns in a data center network.

The authors present PortLand to achieve a plug-and-play large-scale network fabric, so that the data center network can be treated as a single unified fabric:

· It views the data center network as a multi-rooted tree; specifically, a small-scale fat-tree topology with three levels of connected switches (edge, aggregation, and core) was considered
· It employs a centralized fabric manager on a dedicated machine that maintains soft state rather than hard state
· It employs the LDP protocol to let switches discover their position in the topology without administrator configuration; switches periodically send LDM messages out all of their ports.
· It assigns 48-bit PMAC addresses to all end hosts, encoding their position in the topology to enable loop-free forwarding and small forwarding tables (see the sketch after this list). The egress switch rewrites PMAC to AMAC on the last hop to the destination host, and the ingress switch rewrites AMAC to PMAC on the first hop from the source host.
· It supports VM migration from one physical machine to another by forwarding an invalidation message to the VM's previous switch, which in turn transmits a unicast ARP to any source still sending to the migrated VM's PMAC address.
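A sketch of the PMAC encoding, assuming the pod.position.port.vmid layout (16, 8, 8, and 16 bits) that I recall from the paper:

```python
def pmac(pod: int, position: int, port: int, vmid: int) -> int:
    """Pack the host's topological position into a 48-bit pseudo MAC."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def pod_of(addr: int) -> int:
    # Core switches can forward on the pod bits alone, which is why
    # forwarding tables stay small.
    return addr >> 32
```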

The efficiency, scalability, and fault tolerance of the implementation were evaluated for both unicast and multicast communication, using several measurements:
· The total time required to re-establish communication for UDP and TCP flows in the presence of failures, which increases slowly with the number of failures; longer times were measured for TCP flows than for UDP flows
· The scalability of the fabric manager for larger topologies, by measuring the control traffic it can handle
· The ability to support VM migration

Thursday, September 17, 2009

Floodless in SEATTLE: A Scalable Ethernet Architecture

The plug-and-play functionality of Ethernet simplifies many aspects of network configuration, and its flat addressing simplifies the handling of topology changes and host mobility. However, Ethernet has many limitations in large networks: flooding to locate an end host, forcing traffic onto a single spanning tree, and broadcasting when end hosts use ARP, DHCP, etc.

The approaches that can improve scalability over conventional Ethernet are:
Hybrid IP/Ethernet networks: interconnecting small Ethernet LANs by routers, which
- Ensures efficient and flexible use of network resources
- Uses a routing table whose size depends on the number of subnets, not the number of hosts
- Creates smaller broadcast domains
but eliminates much of Ethernet's plug-and-play advantage
VLANs:
Dividing a large bridged network into several VLANs can reduce the broadcast overhead imposed on hosts. Compared with IP, VLANs simplify mobility between bridges in the same VLAN. Again, this introduces problems:
- Configuration overhead to trunk each VLAN at each participating bridge
- Insufficient efficiency, since effective use of per-VLAN trees requires periodically moving the roots and rebalancing the trees (a manual process), and interconnection between VLANs goes via IP gateways rather than shortest physical paths.

The authors present SEATTLE, a Scalable Ethernet Architecture for Large Enterprises that maintains the same configuration-free properties as Ethernet bridging yet scales to large networks:
- SEATTLE forwards packets along the shortest path to a host based on the host's MAC address, using a resolution service to map a MAC address to the host's location.
- The path between two hosts in a SEATTLE network does not change as other hosts join and leave the network.
- SEATTLE employs traffic-driven caching of host information, which reduces the amount of state required to deliver packets

SEATTLE uses a one-hop, network-level DHT at the switches to provide these mappings of MAC address to location and of IP address to MAC address, avoiding the control overhead of a separate directory infrastructure. It runs a link-state protocol so that every switch can observe all other switches and route along shortest paths, and it uses a hash function to place host information, such as (MAC address, location) and (IP address, MAC address) tuples, at switches. However, when a SEATTLE network is deployed over a wide area, forwarding lookups to a remote switch increases latency and makes lookups more prone to failure, so SEATTLE can be configured hierarchically by leveraging a multi-level, one-hop DHT.
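The one-hop DHT amounts to consistent hashing over the switch IDs learned from link state. A minimal sketch, with my own names:

```python
import hashlib
from bisect import bisect_left

def h(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")

def resolver(key: str, switch_ids: list) -> int:
    """The switch responsible for a key: the first switch ID clockwise
    from H(key) on the ring. Publishing and lookups both go here."""
    ring = sorted(switch_ids)
    i = bisect_left(ring, h(key))
    return ring[i % len(ring)]

# e.g. resolver("aa:bb:cc:dd:ee:ff", ids) names the switch that stores
# this MAC's location entry; a lookup is one hash plus one unicast.
```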

Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks

The development of tools to help operators diagnose faults has been the subject of much research and commercial activity. Existing diagnostic systems, designed for large networks, focus on performance and reachability issues; they fall short of helping the operators of small networks, who need detailed diagnosis.

The paper presents NetMedic, a system for detailed diagnosis of problems in small enterprise networks. To observe and diagnose many failure modes, it relies on:
i) using many variables to capture different aspects of a component's behavior, rather than a single abstract variable denoting overall health.
ii) linking faulty components to affected components through a chain (an estimated dependency graph), then inferring that a source component is impacting a destination component by searching history for times when the source's state was similar to its current state (as sketched below). If the destination at those times was also in a state similar to its current one, we can infer that the source is impacting the destination; otherwise it is not. The primary difficulty in this estimation is that we do not know a priori how components interact.
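A toy version of that inference step, assuming each component's state is logged as a vector of its variables over time; the similarity measure and the top-k choice are my assumptions, not NetMedic's actual procedure.

```python
import numpy as np

def impact_score(src_hist, dst_hist, src_now, dst_now, k=10) -> float:
    """Find the k past moments when the source looked most like it does now,
    then measure how much the destination at those moments resembles its
    current state; a higher (less negative) score suggests real impact."""
    src_sim = -np.linalg.norm(src_hist - src_now, axis=1)
    nearest = np.argsort(src_sim)[-k:]          # most similar past states
    dst_sim = -np.linalg.norm(dst_hist[nearest] - dst_now, axis=1)
    return float(dst_sim.mean())
```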

NetMedic builds a dependency graph with roughly 1000 components and 3600 edges, with each component represented by roughly 35 state variables. It was evaluated by injecting faults comprising both fail-stop and performance problems; in 80% of the cases, the faulty component was the top identified cause. From their classification of the considered cases, the authors conclude that a diagnosis system should monitor individual applications and track application-specific rather than generic health. They also found that the root causes of most cases lie in some other configuration element of the environment on which the application depends, such as the lower-layer services that are running, the firewall configuration, etc.

Monday, September 14, 2009

Congestion Control for High Bandwidth-Delay Product Networks

The authors first portray the problem that TCP will face in the future Internet: as the bandwidth-delay product increases, TCP becomes oscillatory and prone to instability regardless of the queueing scheme, as well as inefficient and unfair. They therefore developed the eXplicit Control Protocol (XCP), in which routers notify senders of congestion and of its degree. XCP also decouples efficiency control from fairness control, distinguishes error losses from congestion losses, and allows detection of misbehaving sources. It remains stable, efficient, and fair as the bandwidth-delay product and the number of sources increase (i.e., independently of the environment). Efficiency here means high utilization, small queues, and almost no drops.
The novel ideas in this congestion control are:

  • Not relying on a binary feedback signal to indicate congestion, because the signal should reflect the degree of congestion.

  • Adjusting the source's reaction according to the delay of the feedback

  • Making the aggregate traffic independent of the number of flows, while fairness in bandwidth allocation does depend on the number of flows. This leads to decoupling the efficiency control (EC) from the fairness control (FC): EC uses a MIMD law, while FC uses an AIMD law.

With XCP, each packet carries a congestion header informing the router of the flow's state (congestion window and round-trip time) and carrying feedback from the router toward the receiver. To compute the feedback, the router uses an efficiency controller and a fairness controller over the average round-trip time, preventing the queue from building up to the point at which packets must be dropped. Each router on the path compares the aggregate input traffic rate with the link bandwidth to compute positive or negative aggregate feedback, proportional to the spare bandwidth and the queue size, and then apportions this feedback to the individual flows using the AIMD law. When the packet reaches the receiver, the congestion header is copied from the data packet into its acknowledgment.
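The efficiency controller's aggregate feedback has a compact form, computed once per average RTT d; the alpha and beta below are the stability constants given in the paper.

```python
ALPHA, BETA = 0.4, 0.226

def aggregate_feedback(d: float, capacity: float, input_rate: float,
                       queue: float) -> float:
    """Positive when there is spare bandwidth, negative when a persistent
    queue has built up (rates in bytes/sec, queue in bytes)."""
    spare = capacity - input_rate
    return ALPHA * d * spare - BETA * queue

# The fairness controller then divides this aggregate among packets
# AIMD-style: equal additive per-flow shares when positive, decreases
# proportional to each flow's current rate when negative.
```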

Sunday, September 13, 2009

Random Early Detection Gateways for Congestion Avoidance

Different congestion control mechanisms have been suggested and implemented to overcome the performance degradation that comes with the increasing use and size of computer networks, and to keep the network in a region of low delay and high throughput. Congestion detection can occur either in TCP at the source or in the gateway; the gateway is the most effective place for detection, since it can distinguish between propagation delay and persistent queueing delay.

Before describing its own method, the RED gateway, the paper surveys some congestion avoidance methods at the gateway:

  • Random Drop: when a new packet arrives at a gateway whose queue is full, the gateway randomly chooses a packet from the queue to drop.

  • Drop Tail: when the queue overflows, the gateway drops packets from several connections, which results in global synchronization in the network, with many connections reducing their windows at the same time.

  • Early Random Drop: when the queue length exceeds a certain level, the gateway drops each arriving packet with a fixed probability. However, it cannot control misbehaving sources

  • Source Quench messages: the gateway sends a Source Quench message to the sources when the queue size reaches a defined threshold, and can also drop arriving packets when the queue size approaches the maximum level

  • DECbit: a binary feedback bit in the arriving packet's header indicates congestion if the average queue length, calculated over a certain interval, exceeds one. At the source, if half the packets over two round-trip times had the congestion bit set, the window is decreased exponentially; otherwise it is increased linearly

The authors then focus on three design goals for the RED gateway:

  • Congestion avoidance by controlling the average queue size: it should be kept low, while fluctuations in the actual queue size must be allowed, to absorb transient congestion.

  • Avoidance of global synchronization and of a bias against bursty traffic, by using distinct algorithms for detecting congestion and for deciding which connections should be notified of it.

  • Effectiveness even without cooperation from the transport-layer protocol at the source: the gateway then drops packets once the average queue size exceeds the maximum threshold, rather than merely setting a feedback bit.

The RED gateway first calculates the average queue size and compares it with minimum and maximum thresholds. If the average queue size is below the minimum threshold, no action is taken; if it is above the maximum threshold, the gateway marks every arriving packet, either by setting a bit in the packet header or, for uncooperative sources, by dropping the packet. When the average queue size is between the two thresholds, the gateway marks each arriving packet with a probability roughly proportional to that connection's share of the bandwidth at the gateway.
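The per-arrival computation is short enough to sketch directly (parameter values are illustrative, not the paper's recommendations):

```python
W, MIN_TH, MAX_TH, MAX_P = 0.002, 5, 15, 0.02

def update_avg(avg: float, q: int) -> float:
    """Exponentially weighted moving average of the instantaneous queue."""
    return (1 - W) * avg + W * q

def mark_probability(avg: float, count: int) -> float:
    """count = packets since the last mark; dividing by (1 - count*pb)
    spreads marks out evenly instead of letting them cluster."""
    if avg < MIN_TH:
        return 0.0
    if avg >= MAX_TH:
        return 1.0
    pb = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH)
    return min(1.0, pb / max(1 - count * pb, 1e-9))
```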
Simulations showed that the RED gateway meets the above goals. Moreover, it maximizes global power (the ratio of throughput to delay) and fairness, since the RED gateway does not discriminate against particular connections and the fraction of marked packets for each connection is roughly proportional to that connection's share of the bandwidth.

Monday, September 7, 2009

Congestion Avoidance and Control

The paper describes several algorithms for dealing with congestion on the Internet, built around conservation of packets: in stable operation, a new packet is not put into the network until an old packet leaves, i.e., the sender is self-clocked. Since the receiver generates acknowledgments spaced by the minimum packet spacing on the slowest link of the path, the ACKs can be used as a clock at the sender, which then sends packets with the same spacing. However, there are several ways this conservation can fail:
  • The connection is starting, or restarting after a packet loss (i.e., it has not reached equilibrium yet). The slow-start algorithm was developed to start the flow with one packet and gradually increase it until the sender reaches the connection's window size; this takes a time equal to the round-trip time multiplied by the base-2 logarithm of the window size
  • A sender puts a new packet into the network before an old one has departed. Assuming the transport protocol implementation is correct, this results from a failure either in the sender's retransmission timer or in the backoff timer (the spacing between retransmissions). The first is addressed by an algorithm for good estimation of the round-trip time, the second by exponential backoff
  • The network is congested, so some packets are dropped. This is addressed by a congestion avoidance algorithm: the network must inform the transport endpoints that congestion is occurring or about to occur, and the endpoints must adjust their usage or load accordingly

The author argues, against the previous paper, that there is no need for a new bit in the packet header as a feedback signal, since a lost packet or a timeout already indicates congestion in the network. The second part of the algorithm controls the endpoints' usage of resources through:
· Multiplicative decrease: reduce the window size to half after any timeout.
· Additive increase: increase the window size by its inverse (1/window) for each ACK
· Sending at most the minimum of the congestion window and the receiver's advertised window (see the sketch below)
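In code, the sender's rules summarized above look roughly like this (window in packets; a simplified sketch, not the paper's byte-based implementation):

```python
def on_ack(cwnd: float, ssthresh: float) -> float:
    if cwnd < ssthresh:
        return cwnd + 1.0        # slow start: window doubles each RTT
    return cwnd + 1.0 / cwnd     # additive increase: +1 packet per RTT

def on_timeout(cwnd: float):
    ssthresh = max(cwnd / 2.0, 2.0)   # multiplicative decrease
    return 1.0, ssthresh              # restart slow start from one packet

def usable_window(cwnd: float, advertised: int) -> int:
    return int(min(cwnd, advertised))  # never exceed the receiver's offer
```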

Now, while the paper calls this a congestion avoidance algorithm, it seems the algorithm actually recovers from congestion after it occurs, since the signals indicating congestion are packet loss and timeouts. Moreover, the paper itself points out that a lost packet means congestion, which results in retransmission and in applying the slow-start algorithm in addition to the algorithm described above.


On the other hand, the algorithm proposed here does not ensure fair sharing of the network capacity. To satisfy this metric, the author suggests sending an early signal to some end nodes to inform them that they are using more than their fair share. But this signal is exactly the kind of feedback the author criticized the first paper for, while ignoring that the first paper achieved fairness with its congestion avoidance mechanism.

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

Continuous growth of the Internet and the mixing of technologies with different speeds (such as twisted pair and fiber optics) result in congestion problems that degrade network performance in terms of response time and throughput. Past the cliff point, where the load reaches the network capacity, throughput falls rapidly, packets are dropped, and response time increases drastically.

While traditional congestion control schemes recover the network from such a state, this paper describes a congestion avoidance mechanism that guides the network to operate within the optimal range, close to the knee point, where additional load yields only a small increase in throughput with a reasonable increase in response time. The mechanism depends on the users' ability to adjust their window size or sending rate dynamically during communication, and it amounts to an increase/decrease algorithm, a form of dynamic resource management: network resources sense their load, determine whether they are overloaded or underloaded, and feed back a simple binary signal to their users, who adjust the number of packets they transmit. The window size of the transport-layer protocol or a user's transmission rate is typically used to adjust the system load. The binary feedback signal (1 = overload, 0 = underload) is carried in a bit in the packet header and interpreted by the user as 0: increase load, 1: decrease load.

The paper concentrates on analyzing various classes of increase/decrease algorithms, where the user's reaction to the received feedback can be additive with a constant (positive or negative) or multiplicative by a constant (greater than 1 for increase, between 0 and 1 for decrease). It uses several performance metrics, under the assumption that many users share a single resource with synchronous feedback and control loops. These metrics include:
· Efficient usage of the resource, defined by the closeness of the total load to the knee point
· Fair allocation of the resource among the users
· Fast convergence: the time to reach the goal state and the size of the oscillation. Binary feedback prevents convergence to a single steady state and leads to oscillation

It is found that additive increase with multiplicative decrease satisfies these goals.
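The core result is easy to reproduce in a few lines: two users applying AIMD to a shared binary overload signal converge toward the fair share, because the multiplicative decrease shrinks their difference while the additive increase preserves it. A toy simulation:

```python
CAPACITY, A, B = 100.0, 1.0, 0.85   # illustrative constants

x1, x2 = 10.0, 70.0                  # deliberately unfair start
for _ in range(200):
    if x1 + x2 > CAPACITY:           # feedback bit = 1: overloaded
        x1, x2 = B * x1, B * x2      # multiplicative decrease
    else:                            # feedback bit = 0: underloaded
        x1, x2 = x1 + A, x2 + A      # additive increase
print(round(x1, 1), round(x2, 1))    # both oscillate near CAPACITY/2
```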

However, I wonder whether they considered using a feedback signal of two bits instead of one; it is a very simple idea. The first bit, as before, represents the required reaction by the user (increase or decrease its load), and the second bit tells the user whether to apply the defined constant or half of it:
Ex: feedback 10 means decrease load by B/2
feedback 11 means decrease load by B
etc.
I think it would perform better, and the feedback stays simple.

Thursday, September 3, 2009

Understanding BGP Misconfiguration

The paper offers a study of globally visible BGP misconfiguration. BGP is important for the reliability of the Internet, so any defect in its operation, such as route oscillation, increased convergence time, or misconfiguration of the routers that run BGP, results in global connectivity problems, like those caused by the improper route announcements of AS7007 and AS3561.

The main goals of the paper are to determine:
· How frequently misconfiguration occurs
· The impact of misconfiguration on global connectivity and routing load
· The causes of misconfiguration
· Finally, how it can be addressed

The authors analyzed BGP updates from 23 vantage points and found that daily misconfigurations affect 0.2-1% of the global routing table and increase the route update load; however, only 4% of improper announcements affected connectivity negatively. One cause of such misconfiguration is router vendor software.

The authors focused on globally visible misconfiguration because of its wide disruption; it can be identified by short-lived changes in the BGP routing table that last less than a day. It can be classified into:

  • Origin misconfiguration: the unintentional insertion of a route into the BGP routing table, such as an AS mistakenly announcing a summary of an address space
  • Export misconfiguration: exporting a route to a BGP peer against policy; for example, an AS exporting a route received from one provider to another provider, thereby creating transit between its two providers.
    Such misconfiguration increases routing load, on top of the load already growing with the Internet itself. It can lead to partial or global disruption and threatens the AS's policy.

Slips (errors in the execution of a correct plan) and mistakes (errors due to an incorrect plan that is well executed) cause misconfiguration.

Finally, the paper suggests several measures to at least reduce the occurrence of such failures:

  • Designing user interfaces so that human errors are minimized
  • Modifying the BGP protocol to prevent misconfiguration errors; for example, S-BGP, a proposed extension to current BGP, imposes authorization and authentication for announcing a route.

Interdomain Internet Routing


The Internet infrastructure is constructed through cooperation among ISPs with diverse internal structures and scopes, called Tier-3 (local scope), Tier-2 (regional scope), or Tier-1 (global scope). The Internet comprises a number of autonomous systems (ASes) that exchange routing information using the Border Gateway Protocol (BGP), while an IGP such as RIP, OSPF, IS-IS, or EIGRP can be used within each AS.

Forms of AS-AS interconnection:
1) Provider-customer transit is the most common inter-AS relationship: the provider provides access to all destinations in its routing table and charges for forwarding packets. ISPs charge customers based on data rate in one of two ways: fixed pricing, or a price that varies with bandwidth usage.

2) Peering is typically an inter-ISP relationship in which the providers give each other access to a subset of each other's routing tables (their own transit customers and internal ISP addresses). This kind of interconnection is charge-free, is often established between competitors, and emerged for two reasons. A number of Tier-1 ISPs established peering relationships with each other to avoid a single Tier-1 controlling all Internet traffic (as would happen with only provider-customer transit relationships); this way, ASes belonging to different ISPs do not have to pay for forwarding packets between their customers. Also, peering provides a direct path between ASes, which helps achieve better end-to-end performance.

As a result, advertising a route to a neighboring ISP means the ISP agrees to forward packets along that route; thus an ISP must take care not to advertise a route that would force it to act as unpaid transit.

BGP was created to replace the EGP routing protocol, which allowed the Internet to become a decentralized system. It supports Classless Inter-Domain Routing (CIDR) and uses route aggregation to decrease routing table size. Its design goals are:
· Scalability: BGP must ensure that route advertisement scales well as parts of the network go up and down
· It must provide loop-free paths to a reasonable degree
· It must allow ASes to apply their policies in making local decisions, such as filtering and ranking of routes, while keeping each AS's policy undisclosed
· It must provide routing between independently administered systems
Routing Policy
1) Filtering, to save or make money for the ISP:
· It is most important for an ISP to advertise routes to its transit customers to as many other connected ASes as possible.
· If an ISP receives an advertisement from a peering ISP for a route to that peer's transit customer, it will not re-advertise this route to other peering ISPs, because that could cause a peering ISP to use the advertising ISP to reach that destination, which would consume the ISP's resources without generating revenue.

2) Ranking, to decide which routes to import into the routing table
Several attributes are considered when importing routes:
· The LOCAL PREF attribute: priority goes to customer routes (since there is a direct path to the customer), then peer routes, and finally provider routes.
· The AS-path length attribute (at first I thought it meant the number of router hops, but it is the number of ASes the route traverses)
· MED, used when two ASes are connected at multiple locations
· Routes learned via eBGP are preferred over those learned via iBGP; a sketch of the combined decision process follows below
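Putting the ranking together, the standard BGP decision process (simplified; real routers apply more tie-breakers) can be sketched as:

```python
from dataclasses import dataclass

@dataclass
class Route:
    local_pref: int   # set by import policy: customer > peer > provider
    as_path: list     # ASes the route traverses
    med: int          # lower is better; compared for the same neighbor AS
    ebgp: bool        # learned via eBGP rather than iBGP

def better(r1: Route, r2: Route) -> Route:
    if r1.local_pref != r2.local_pref:
        return r1 if r1.local_pref > r2.local_pref else r2
    if len(r1.as_path) != len(r2.as_path):
        return r1 if len(r1.as_path) < len(r2.as_path) else r2
    if r1.med != r2.med:
        return r1 if r1.med < r2.med else r2
    if r1.ebgp != r2.ebgp:
        return r1 if r1.ebgp else r2
    return r1
```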
BGP is unusual in using TCP as its transport protocol. BGP peering sessions are established by manual configuration between routers, creating a TCP session on port 179:
1. Establish a TCP session
2. Routers send OPEN messages
3. Routers exchange their tables (after applying ranking and filtering, of course)
4. Then routers send
· UPDATE messages (containing only the changes since the last update) to add or withdraw routes
· or periodic KEEPALIVE messages to maintain the connection (every 60 seconds by default); if these messages stop arriving for the hold time, the session is terminated
When BGP runs within an AS, it is referred to as iBGP (Interior BGP) and exchanges information about external routes; it is called eBGP (Exterior BGP) when running between ASes. Routers on the boundary of one AS that exchange information with another AS are called border or edge routers. In the Cisco operating system, iBGP routes have an administrative distance of 200, which is less preferred than either external BGP or any interior routing protocol. Other router implementations also prefer eBGP to iBGP, and IGPs to iBGP.
The paper covers an interesting subject, and I wish it had discussed it more deeply.

Monday, August 31, 2009

THE DESIGN PHILOSOPHY OF THE DARPA INTERNET PROTOCOLS

The paper describes some of the objectives of the Internet architecture and their relation to the main features of the TCP/IP protocol suite.
Top-level goal: interconnect the existing networks (different transmission media, multimedia networks, and administrative entities) with a high degree of integration
· Multiplexing: packet switching
· Interconnection technique: gateways that store and forward packets
Second-level goals: desirable network features, listed in order of importance to guide design decisions within the Internet architecture:
· Survivability: the approach was fate sharing, which implies that the state information for transport-level synchronization is kept at the endpoints of the network; this leads to a datagram network
· Types of service with diverse requirements for speed and reliability: this goal resulted in splitting TCP and IP into two layers, and in providing more than one transport service (TCP for reliable service, UDP for datagram service)
· Varieties of networks
· Distributed management
· Cost effectiveness: the long header of Internet packets, and the mechanisms needed to implement desired types of service such as retransmission and acknowledgment strategies, impose costs
· Easy host attachment
· Accountability of resources

Implementation:
- Transporting datagrams across the underlying networks is fundamental to the Internet architecture and serves the first three goals
- The service that can be offered depends on the engineering of the software within hosts and gateways, and on the particular networks that are incorporated
- The designers of the Internet architecture must consider both logical correctness and performance

- In TCP, flow control and acknowledgment are based on byte numbers

Sunday, August 30, 2009

END-TO-END ARGUMENTS IN SYSTEM DESIGN

The paper weighs the end-to-end argument against low-level implementation, with examples from reliable data transmission, guaranteed message delivery, encryption, duplicate suppression, and message sequencing.

For example, consider an application transferring a file between two hosts. There are many threats that can be addressed either by functions at a low level (packet checksums, sequence numbers, retry mechanisms) or at the application level (an end-to-end check and retry). While only the end-to-end check and retry guarantees correct file transmission, the communication system can reduce the frequency of retries by the application program and the delay to correct failures; if the communication system is unreliable, the end-to-end checksum fails frequently. However, internal reliability within the communication system imposes a cost in bandwidth and delay, especially for applications that do not require such enhancement, since the communication system is shared by different application programs.
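The file-transfer example is worth a sketch. The callables below (read_file, send, receive) are hypothetical stand-ins for the application and whatever transport sits underneath; the point is only that the final check and the retry happen at the ends, whatever the lower layers do.

```python
import hashlib

def transfer(read_file, send, receive, max_retries: int = 3) -> bytes:
    data = read_file()
    digest = hashlib.sha256(data).hexdigest()
    for _ in range(max_retries):
        send(data, digest)
        received, claimed = receive()
        # End-to-end check: recompute over what actually arrived, catching
        # errors in the network, in buffers, on disk, or in software alike.
        if hashlib.sha256(received).hexdigest() == claimed == digest:
            return received
    raise IOError("file failed the end-to-end check after retries")
```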

The end-to-end argument was applied in the SWALLOW distributed data storage system, where suppression of duplicate messages and delivery acknowledgment are implemented at the application level; this reduced the number of message transmissions by half.
However, identifying the "ends" of the communication and knowing which application stands at the endpoint are very important for placing functions and for choosing how the argument should be applied.
Finally, the paper lists a number of applications of the end-to-end argument: delivery acknowledgment, encryption, error control and correctness in application systems, etc.