Wednesday, September 30, 2009

MACAW: A Media Access Protocol for Wireless LAN’s

CSMA cannot detect a hidden terminal, since it senses the channel only at the sender while collisions matter at the receiver. The authors therefore build on a previous approach, MACA, which is based on an RTS/CTS exchange of short, fixed-size signaling packets. Both the RTS and CTS packets contain the length of the proposed data transmission, so that any node that hears an RTS and/or CTS defers its own transmission. MACA does not provide an adequate level of fairness within one cell because of its BEB (binary exponential backoff) algorithm, and because the congestion information each pad computes is different.

The paper presents the design of a multiple access protocol called MACAW for an indoor wireless LAN infrastructure, with the goal of delivering fairness over optimal total throughput. The infrastructure uses a single 256 kbps channel between base stations installed in the ceiling and the pads:
· Pads and base stations transmit at the same power
· The transmission range is 3-4 meters
· The cell around each base station is very small and has a sharp boundary
· Two stations are either in range or out of range of one another
· A station receives a packet correctly if and only if there is exactly one active transmitter within its range
· Pads and base stations learn of each other's presence only when they communicate

The authors apply the notion of fairness per stream rather than per station, by allocating a queue for each stream and running the backoff algorithm, MILD, independently for each queue. The backoff algorithm is based on a multiplicative increase of the backoff interval after a collision and a linear decrease after a successful transmission. Using UDP traffic, the simulations show that this yields a fair allocation of the bandwidth.
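To make MILD concrete, here is a minimal Python sketch of a per-stream queue carrying its own backoff counter; the 1.5x increase factor and decrease-by-one step follow the paper's description, while the bounds, class name, and slot mechanics are illustrative assumptions of mine.

    import random

    BO_MIN, BO_MAX = 2, 64   # backoff bounds in slots (illustrative values)

    class StreamQueue:
        """Per-stream queue with its own MILD backoff counter (sketch)."""
        def __init__(self):
            self.backoff = BO_MIN
            self.packets = []

        def on_collision(self):
            # Multiplicative Increase: grow backoff by 1.5x, capped at BO_MAX
            self.backoff = min(int(self.backoff * 1.5), BO_MAX)

        def on_success(self):
            # Linear Decrease: shrink backoff by one slot, floored at BO_MIN
            self.backoff = max(self.backoff - 1, BO_MIN)

        def next_attempt_slot(self, now):
            # Wait a uniformly random number of slots within the backoff window
            return now + random.randint(1, self.backoff)

Running MILD per stream rather than per station is what lets two streams at the same station get independent shares.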

MACAW also modifies MACA's basic RTS-CTS-DATA exchange and its link recovery: the sender transmits a DS (data-sending) packet to inform other stations of the existence and length of the following data transmission, and the receiver returns an ACK packet after successful reception of the data. However, at an error rate of 0.001 the additional ACK overhead did not prove helpful in the simulations.
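The full exchange thus runs RTS-CTS-DS-DATA-ACK. The sketch below merely traces that five-message order; the send callback and station labels are stand-ins of mine, not the paper's API.

    def macaw_exchange(send, data):
        """Sketch of MACAW's five-message handshake; `send(station, msg, **info)`
        is a hypothetical stand-in for the radio layer."""
        n = len(data)
        send("sender",   "RTS",  length=n)   # request channel; overhearers defer
        send("receiver", "CTS",  length=n)   # grant channel; hidden terminals defer
        send("sender",   "DS",   length=n)   # announce data so RTS-only overhearers defer
        send("sender",   "DATA", payload=data)
        send("receiver", "ACK")              # link-level recovery instead of transport timeout

    # Example: trace the exchange
    macaw_exchange(lambda who, msg, **kw: print(who, msg, kw), b"hello")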
The paper then describes a two-cell scenario, with each pad in the coverage range of its own base station and no other, where one stream in a cell is negatively affected (its access completely denied) while the transmission to the second pad receives all the throughput it requests. The solution is to synchronize contention information by introducing an additional overhead packet called RRTS: whenever a station receives an RTS to which it cannot respond, it sends an RRTS packet back to the sender during the next contention period. However, all of this message exchanging still does not help in every unicast scenario, nor for multicast in general.
The authors clearly portray some of the challenges in media access for wireless LANs, but I think the proposed solution is not worth much compared to the overhead it introduces.

Tuesday, September 29, 2009

Safe and Effective Fine-grained TCP Retransmission for Datacenter Communication


The paper focuses on the Incast problem observed for synchronized reads and writes. Incast is a communication pattern that occurs when many senders, upon receiving a request from a client, simultaneously transmit a large amount of data. The bottleneck link and the buffers of the client's port on the switch overfill, causing many losses, and the client must then wait a minimum of 200 ms (the default TCP retransmission timeout) before it can receive the whole response. This reduces the throughput seen by the application to 1-10% of the link's bandwidth capacity.



Other works offered application-specific solutions, while this paper provides a TCP-level solution:

· Reducing the minimum RTO: the workload involves a switch with 32 KB of output buffer per port and more than 10 simultaneous servers, with RTT = 100 µs, link capacity = 1 Gbps, and a client issuing requests for 1 MB data blocks. They found that maximum effectiveness is achieved by making the minimum RTO close to the network's RTT.
· For a datacenter with a 10 Gbps network and 10 µs port-to-port latency, avoiding throughput collapse and idle links therefore requires an even smaller RTO. The authors analyzed the behavior of a 10 Gbps Ethernet network with the RTT reduced to 20-100 µs, a block size of 80 MB, and the number of servers scaled into the thousands. They found that removing the lower bound on the RTO improves performance only up to 512 servers; with more servers, a 50% reduction in application throughput was still recorded, due to synchronized retransmissions and successive timeouts.
· Therefore they add randomization to the RTO timer to desynchronize the repeated flows (see the sketch after this list).
· They also stressed the importance of fine-grained TCP retransmission for wide-area flows with diverse RTTs and RTOs. They verified that eliminating the minimum bound on the RTO and enabling retransmission timers in microseconds can avoid Incast.
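A rough sketch of the idea, assuming the standard srtt + 4*rttvar estimator with the 200 ms floor removed; the exact form of the randomization here is my illustration, not necessarily the paper's formula.

    import random

    def microsecond_rto(srtt_us, rttvar_us):
        """Sketch: fine-grained RTO with no 200 ms lower bound, plus a random
        factor to desynchronize simultaneous retransmissions (illustrative)."""
        rto = srtt_us + 4 * rttvar_us                  # standard RTO estimator
        return rto * (1.0 + random.uniform(0.0, 0.5))  # randomize upward by 0-50%

    print(microsecond_rto(srtt_us=100, rttvar_us=20))  # roughly 180-270 us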

Sunday, September 27, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks


Incast has been observed in distributed storage, MapReduce, and web-search workloads. Many solutions have been proposed, but they either do not help (eliminating slow start, reducing the RTO), are expensive (increasing switch and router buffer sizes), or are application-level solutions that require modifying every application that uses TCP.
This paper tries to characterize the Incast problem. It uses a workload modeled on distributed storage applications, where the receiver requests a number of blocks from a number of servers that respond with either fixed-size or variable-size fragments. Each block is striped across the servers, and the next request is issued only after the data from all senders has been received.
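To illustrate why a single timeout is so costly here, the barrier structure of synchronized reads can be sketched as follows; fetch_fragment and the sizes are hypothetical stand-ins for the testbed's storage reads.

    from concurrent.futures import ThreadPoolExecutor

    def fetch_fragment(server_id, block_id):
        """Stand-in for a network read of one fragment (hypothetical)."""
        return b"x" * 4096

    def synchronized_read(num_servers, num_blocks):
        # Each block is striped across all servers; the next block's requests
        # are issued only after ALL fragments of the current block arrive.
        with ThreadPoolExecutor(max_workers=num_servers) as pool:
            for block in range(num_blocks):
                fragments = list(pool.map(lambda s: fetch_fragment(s, block),
                                          range(num_servers)))
                assert all(fragments)  # barrier: one slow server stalls everything

    synchronized_read(num_servers=8, num_blocks=3)

Because of this barrier, one server stuck in a 200 ms RTO idles the link for every other server in the stripe.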

The authors demonstrated that the Incast pattern is general by replicating it in a simple 1 Gbps testbed in which all servers are connected through a single layer of switching, so that the Incast phenomenon can be observed easily. They observed behaviors different from those reported in other work.
They attempted to understand the phenomenon, for both the fixed- and variable-size fragment workloads, by
· Reducing and randomizing the minimum and/or initial RTO value, and setting a smaller, randomized multiplier for the RTO exponential backoff. However, these did not improve goodput, since the servers share the same switch buffer
· Examining the impact of the RTO timer's resolution and of delayed ACKs, where low timer resolution with delayed ACKs left enabled gave the best result for the fixed-size fragment workload

Wednesday, September 23, 2009

VL2: A Scalable and Flexible Data Center Network

A data center network must achieve low cost and high utilization through the ability to assign any server to any service. Current designs do not satisfy these requirements because:
· They do not provide uniform capacity between the connected servers, due to the high cost of the equipment
· Traffic of one service affects other services
· Forwarding in the network depends on IP addresses and on dividing the servers among VLANs, which supports neither VM migration nor reassigning servers to services.

This paper presents VL2, a network architecture built from low-cost switches connected in a Clos topology, which achieves the goals of a data center network by avoiding the three problems above.
· It adopts VLB (Valiant Load Balancing) to spread traffic randomly across all available paths, so that uniform capacity and performance isolation can be achieved (a sketch follows this list)
· It uses OSPF to compute the forwarding tables for the switches.
· It uses a flat addressing scheme: the servers' IP addresses are used as names, and a shim layer on each server looks up the destination's location in a directory system
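A minimal sketch of how VLB and the directory interact on the sending host, with hypothetical addresses and helper names; VL2's real implementation uses IP-in-IP encapsulation and anycast addresses, which are omitted here.

    import random

    # Hypothetical directory: application address (name) -> ToR locator address
    DIRECTORY = {"20.0.0.5": "10.1.2.3"}
    INTERMEDIATE_SWITCHES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

    def vlb_encapsulate(packet_dst_name):
        """Sketch: the host shim resolves the destination's locator from the
        directory, then bounces the flow off a random intermediate switch."""
        tor_locator = DIRECTORY[packet_dst_name]          # directory lookup
        bounce = random.choice(INTERMEDIATE_SWITCHES)     # random spreading (VLB)
        return {"outer": bounce, "inner": tor_locator, "dst": packet_dst_name}

    print(vlb_encapsulate("20.0.0.5"))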

The architecture was evaluated with TCP flows only. The paper states that VL2's implementation of VLB splits flows evenly and achieves high TCP fairness.

Tuesday, September 22, 2009

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

A data center network is organized as a number of ToR switches connected to one or more aggregation switches, which in turn connect to core switches. It contains a huge number of hosts and employs virtual machine multiplexing, resulting in millions of uniquely addressable end hosts, which poses a challenge at both Layer 2 and Layer 3. Efficiency, scalability, and fault tolerance are therefore all significant concerns in a data center network.

The authors present PortLand to achieve a plug-and-play, large-scale network fabric, such that the data center network can be treated as a single unified fabric:

· It views the data center network as a multi-rooted tree; specifically, a small-scale fat-tree topology is considered, with three levels of connected switches: edge, aggregation, and core.
· It employs a centralized fabric manager on a dedicated machine that maintains soft state rather than hard state
· It employs LDP (Location Discovery Protocol) to enable switches to discover their position in the topology without administrator configuration. The switches periodically send LDM (Location Discovery Message) packets out all of their ports.
· It assigns 48-bit PMAC (pseudo MAC) addresses to all end hosts, encoding their position in the topology, which enables loop-free forwarding with small forwarding tables (see the sketch after this list). The egress switch rewrites PMAC to AMAC (actual MAC) on the last hop to the destination host, and the ingress switch rewrites AMAC to PMAC on the first hop from the source host.
· It supports VM migration from one physical machine to another by forwarding an invalidation message to the VM's previous switch, which in turn transmits a unicast ARP to any source still sending traffic destined to the migrated VM's old PMAC address.
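The PMAC encoding itself is simple enough to sketch; the pod.position.port.vmid layout with 16/8/8/16-bit fields follows the paper, while the helper names are mine.

    def encode_pmac(pod, position, port, vmid):
        """Pack pod.position.port.vmid into a 48-bit PMAC (field widths per the paper)."""
        assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
        return (pod << 32) | (position << 24) | (port << 16) | vmid

    def decode_pmac(pmac):
        return ((pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF,
                (pmac >> 16) & 0xFF, pmac & 0xFFFF)

    pmac = encode_pmac(pod=3, position=1, port=4, vmid=7)
    print(hex(pmac), decode_pmac(pmac))  # switches need only prefix matches on these fields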

The efficiency, scalability, and fault tolerance of the implementation were evaluated for both unicast and multicast communication, using several measurements:
· The total time required to re-establish communication for UDP and TCP flows in the presence of failures, which increases slowly with the number of failures; the time measured for TCP flows is longer than for UDP flows
· The scalability of the fabric manager for larger topologies, measured by the amount of control traffic it can handle
· The ability to support VM migration

Thursday, September 17, 2009

Floodless in SEATTLE: A Scalable Ethernet Architecture

The plug-and-play functionality of Ethernet simplifies many aspects of network configuration, and its flat addressing simplifies the handling of topology changes and host mobility. However, Ethernet has many limitations in large networks, due to flooding to locate end hosts, forwarding over a single spanning tree, and broadcasting when end hosts use ARP, DHCP, etc.

The approaches that can improve scalability over conventional Ethernet are:
Hybrid IP/Ethernet networks: interconnecting small Ethernet LANs by routers, which
- Ensures efficient and flexible use of network resources
- Uses routing tables whose size depends on the number of subnets, not the number of hosts
- Creates smaller broadcast domains
but eliminates much of the plug-and-play advantage of Ethernet.
VLANs:
Dividing a large bridged network into several VLANs can reduce the broadcast overhead imposed on hosts. Compared with IP, VLANs simplify mobility between bridges in the same VLAN. Again, this introduces several problems:
- Configuration overhead, to trunk each VLAN at each participating bridge
- Insufficient efficiency, since effective use of per-VLAN trees requires periodically moving the roots and rebalancing the trees (a manual process), and interconnection between VLANs goes via IP gateways rather than shortest physical paths.

The authors present SEATTLE, a Scalable Ethernet Architecture for Large Enterprises that maintains the same configuration-free properties as Ethernet bridging, yet scales to large networks:
- SEATTLE forwards packets along the shortest path to the host, based on the host's MAC address. It uses a resolution service to map a MAC address to the host's location.
- The path between two hosts in a SEATTLE network does not change as other hosts join and leave the network.
- SEATTLE employs traffic-driven caching of host information, which reduces the amount of state required to deliver packets.

SEATTLE uses a one-hop, network-level DHT at the switches to provide these mappings from MAC address to location and from IP address to MAC address, avoiding the control overhead of a separate directory infrastructure. It runs a link-state protocol so that each switch can see all other switches and route along shortest paths, and it uses a hash function to map host information, such as (MAC address, location) and (IP address, MAC address), onto switches. However, when a SEATTLE network is deployed over a wide area, forwarding lookups to a remote resolver increases latency and makes the lookup more prone to failure, so SEATTLE can be configured hierarchically by leveraging a multi-level, one-hop DHT.
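A minimal sketch of the one-hop lookup, assuming a consistent-hash ring over the live switch list learned from link state; the hash choice and names are illustrative, not SEATTLE's exact functions.

    import hashlib

    def resolver_switch(key, live_switches):
        """Sketch: consistently pick the switch that stores (key -> location),
        so that switch churn remaps only a small fraction of keys."""
        digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        ring = sorted((int(hashlib.sha1(s.encode()).hexdigest(), 16), s)
                      for s in live_switches)
        for point, switch in ring:        # first ring point at or past the key
            if point >= digest:
                return switch
        return ring[0][1]                 # wrap around the ring

    print(resolver_switch("00:1a:2b:3c:4d:5e", ["sw1", "sw2", "sw3"]))

Since every switch knows the full switch list, the lookup takes a single hop: hash locally, forward directly to the resolver.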

Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks

The development of tools to help operators diagnose faults has been the subject of much research and commercial activity. Existing diagnostic systems, designed with large networks in mind, focus on performance and reachability issues; they fall short at helping the operators of small networks, who need detailed diagnosis.

The study in this paper considers detailed diagnosis for problems in small enterprises with a system called NetMedic. To observe and diagnose a wide range of failure modes, it depends on:
i) using many variables to capture different aspects of a component's behavior, rather than a single abstract variable that denotes overall health.
ii) linking faulty components to the affected components through a chain (an estimated dependency graph), then inferring whether a source component is impacting a destination component by looking for moments in history when the state of the source was similar to its current state. If the destination at those times was also in a state similar to its current one, we can infer that the source is impacting the destination; otherwise it is not. The primary difficulty in this estimation is that we do not know a priori how components interact (a sketch follows below).
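A loose sketch of that history-based test, under strong simplifying assumptions (state vectors with identical keys, a made-up similarity measure); NetMedic's actual computation is more elaborate.

    import math

    def similarity(state_a, state_b):
        """Sketch: similarity of two component-state vectors, simplified to
        the inverse of their Euclidean distance over shared variables."""
        dist = math.sqrt(sum((state_a[k] - state_b[k]) ** 2 for k in state_a))
        return 1.0 / (1.0 + dist)

    def edge_impact(src_history, dst_history, src_now, dst_now, top_k=3):
        # Pick the past times when the source was most similar to its current state...
        times = sorted(src_history,
                       key=lambda t: -similarity(src_history[t], src_now))[:top_k]
        # ...and score how similar the destination then was to ITS current state.
        # A high score means src's state plausibly explains dst's state.
        return sum(similarity(dst_history[t], dst_now) for t in times) / top_k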

NetMedic builds a dependency graph with roughly 1,000 components and 3,600 edges, with each component represented by roughly 35 state variables. It was evaluated by injecting faults comprising both fail-stop and performance problems; in 80% of the cases, the faulty component was the top identified cause. From classifying the considered cases, the authors found that a diagnosis system should monitor individual applications and track application-specific health rather than generic health. They also found that the root causes of most cases lie in some other configuration element of the environment on which the application depends, such as the lower-layer services that are running, the firewall configuration, etc.

Monday, September 14, 2009

Congestion Control for High Bandwidth-Delay Product Networks

The authors first portray the problem TCP may face in the future Internet: as the delay-bandwidth product increases, TCP becomes oscillatory and prone to instability regardless of the queueing scheme, as well as inefficient and unfair. They therefore developed the eXplicit Control Protocol (XCP), in which routers notify the sender of congestion and of its degree. XCP also decouples efficiency control from fairness control, distinguishes error losses from congestion losses, and allows detection of misbehaving sources. It remains stable, efficient, and fair as the delay-bandwidth product and the number of sources increase (i.e., independent of the environment). Efficiency here means high utilization, small queues, and almost no drops.
The novel points of this congestion control are:

  • Do not depend on a binary feedback signal to indicate the congestion state, because the feedback should reflect the degree of congestion.

  • The source's reaction should be adjusted according to the delay of the feedback

  • The aggregate traffic should be independent of the number of flows, while fairness in bandwidth allocation does depend on the number of flows. This leads to decoupling the efficiency controller (EC) from the fairness controller (FC): the EC uses a MIMD law, while the FC uses an AIMD law.

With XCP, each packet carries a congestion header that informs the routers of the flow's state (congestion window and round-trip time) and carries the routers' feedback toward the receiver. To compute the feedback, the router uses an efficiency controller and a fairness controller over the average round-trip time, aiming to prevent the queue from building up to the point where packets must be dropped. Each router on the path compares the aggregate input traffic rate with the link bandwidth to compute positive or negative aggregate feedback, proportional to the spare bandwidth and the queue size; it then apportions this feedback to individual flows using an AIMD law. When a packet reaches the receiver, the congestion header is copied from the data packet into its acknowledgment.
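The efficiency controller's aggregate feedback can be written down directly; the formula and the constants alpha = 0.4, beta = 0.226 are from the paper, while the example numbers are arbitrary.

    # Sketch of XCP's per-control-interval computation at a router.
    ALPHA, BETA = 0.4, 0.226

    def aggregate_feedback(capacity, input_rate, persistent_queue, avg_rtt):
        """phi = alpha * d * S - beta * Q : drain both spare bandwidth and queue."""
        spare = capacity - input_rate     # S: spare bandwidth (bits/s)
        return ALPHA * avg_rtt * spare - BETA * persistent_queue

    phi = aggregate_feedback(capacity=1e9, input_rate=0.9e9,
                             persistent_queue=5e5, avg_rtt=0.1)
    print(phi)  # >0: tell senders to speed up; <0: slow down (split AIMD-style per flow)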

Sunday, September 13, 2009

Random Early Detection Gateways for Congestion Avoidance

Different congestion control mechanisms have been suggested and implemented to overcome the performance degradation that comes with the increasing use and size of computer networks, and to keep the network in a region of low delay and high throughput. Congestion detection can occur either in TCP at the source or in the gateway; the latter is the most effective place for detection, since the gateway can distinguish between propagation delay and persistent queueing delay.

The paper surveys some congestion avoidance methods at the gateway before describing its own method, the RED gateway:

  • Random Drop: when a new packet arrives at a gateway whose queue is full, the gateway randomly chooses a packet from the queue to drop.

  • Drop Tail: when the queue overflows, the gateway drops packets from several connections, resulting in global synchronization in the network, where many connections reduce their windows at the same time.

  • Early Random Drop: when the queue length exceeds a certain level, the gateway drops each arriving packet with a fixed drop probability. However, it cannot control misbehaving sources.

  • Source Quench messages: the gateway sends a Source Quench message to the sources when the queue size reaches a defined threshold, and it can also drop arriving packets when the queue size approaches the maximum level.

  • DECbit: a binary feedback bit in the arriving packet's header is used to indicate congestion in the network if the average queue length, calculated over a certain interval, exceeds one. At the source, if half the packets over two round-trip times had the congestion bit set, the window is decreased exponentially; otherwise, the window is increased linearly.

The authors then focus on three design goals for the RED gateway:

  • Congestion avoidance by controlling the average queue size: it should be kept low, while fluctuations in the actual queue size must be allowed in order to absorb transient congestion.

  • Avoidance of global synchronization and of a bias against bursty traffic, by using distinct algorithms for detecting congestion and for deciding which connections should be notified of it.

  • Effectiveness even in the absence of cooperation from the transport-layer protocol at the source: the gateway drops packets when the average queue size exceeds the maximum threshold, rather than merely setting a feedback bit.

The RED gateway first calculates the average queue size and compares it with minimum and maximum thresholds. If the average queue size is below the minimum threshold, no action is taken; if it is above the maximum threshold, the gateway marks every arriving packet, either by setting a bit in the packet header or by dropping the packet in the case of uncooperative sources. When the average queue size is between the two thresholds, the gateway marks each arriving packet with a probability that is roughly proportional to that connection's share of the bandwidth at the gateway (see the sketch below).
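That marking rule is compact enough to sketch; the algorithm's structure follows the paper, though the parameter values here are illustrative, not recommended settings.

    import random

    # Sketch of RED's per-packet marking decision (variable names follow the paper).
    W_Q, MIN_TH, MAX_TH, MAX_P = 0.002, 5, 15, 0.02
    avg, count = 0.0, -1    # EWMA of queue size; packets since last mark

    def on_packet_arrival(queue_len):
        global avg, count
        avg = (1 - W_Q) * avg + W_Q * queue_len            # update average queue size
        if avg < MIN_TH:
            count = -1
            return False                                    # below min threshold: no action
        if avg >= MAX_TH:
            count = 0
            return True                                     # above max threshold: mark everything
        count += 1
        p_b = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH)    # linear in the average queue size
        p_a = p_b / max(1 - count * p_b, 1e-9)              # spreads marks out evenly in time
        if random.random() < p_a:
            count = 0
            return True
        return False

Because marking is per arriving packet, a connection sending more packets is marked proportionally more often, which is where the fairness property below comes from.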
The simulations demonstrate that the RED gateway meets these goals. Moreover, it maximizes global power (the ratio of throughput to delay) and fairness, since the RED gateway does not discriminate against particular connections and the fraction of marked packets for each connection is roughly proportional to that connection's share of the bandwidth.

Monday, September 7, 2009

Congestion Avoidance and Control

The paper describes several algorithms to deal with congestion on the Internet, built around conservation of packets. Conservation of packets means that, in stable operation, a new packet is not put into the network until an old packet leaves it; the sender is thus self-clocked. Since the receiver generates acknowledgments spaced by the minimum packet spacing on the slowest link of the path, the ACKs can be used as a clock at the sender, which then sends packets with the same spacing. However, there are several ways this conservation can fail:
  • The connection is starting, or restarting after a packet loss (i.e., it has not yet reached the self-clocked state). The slow-start algorithm was developed to start data flowing with one packet and gradually increase until the sender reaches the connection's window size. This takes time equal to the round-trip time multiplied by the base-2 logarithm of the window size.
  • A sender puts a new packet into the network before an old one has departed. Assuming the transport protocol's implementation is correct, this can result from a failure either in the sender's retransmission timer or in the backoff timer (the spacing between retransmissions). The first is remedied by an algorithm for good estimation of the round-trip time; the second by using exponential backoff.
  • The network is congested, so some packets are dropped. This is addressed by a congestion avoidance algorithm: it must inform the transport endpoints that congestion is occurring or about to occur, and the endpoints must adjust their usage or load accordingly.

The author argues, against the previous paper, that there is no need for a new bit in the packet header as a feedback signal, since a lost packet or a timeout is already an indication of congestion in the network. The second part of the algorithm is control of the endpoints' resource usage, implemented as follows (a sketch follows this list):
· Multiplicative decrease: reduce the window size by half after any timeout.
· Additive increase: increase the window size by its inverse after each ACK.
· Send no more than the minimum of the advertised window and the congestion window.
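A minimal sketch of these window updates as summarized above (units in packets, names mine); real TCP implementations add details such as ssthresh bookkeeping on timeout.

    def on_ack(cwnd, ssthresh):
        if cwnd < ssthresh:
            return cwnd + 1.0            # slow start: window doubles per RTT
        return cwnd + 1.0 / cwnd         # additive increase: about +1 packet per RTT

    def on_timeout(cwnd):
        return max(cwnd / 2.0, 1.0)      # multiplicative decrease: halve the window

    def usable_window(cwnd, advertised):
        return min(int(cwnd), advertised)  # never exceed the receiver's advertised window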

Now, while the paper calls this a congestion avoidance algorithm, it seems that the algorithm actually recovers from congestion after it occurs, since the signals that trigger it, a lost packet or a timeout, arrive only after the fact. Moreover, the paper itself notes that a lost packet means congestion, which triggers retransmission and the slow-start algorithm in addition to the algorithm described above.


On the other hand, the algorithm proposed here does not ensure fair sharing of the network capacity. To satisfy this metric, the author suggests sending an early signal to some end nodes to inform them that they are using more than their fair share. But this signal is exactly the kind of feedback the author criticizes in the first paper, ignoring that that paper achieved fairness together with congestion avoidance.

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

The continuous growth of the Internet and the mix of technologies with different speeds (such as twisted pair and optical fiber) result in congestion problems that degrade network performance, both response time and throughput. Past the cliff point, where the load reaches the network capacity, throughput falls rapidly, packets are dropped, and response time increases drastically.

While traditional congestion control schemes recover the network from such a state, this paper describes a congestion avoidance mechanism that guides the network to operate in the optimal range close to the knee point, where additional load yields only a small increase in throughput for a reasonable increase in response time. The mechanism depends on the users' ability to adjust their window size or rate dynamically during communication, and it builds on an increase/decrease algorithm, which is a form of dynamic resource management: network resources sense their load, determine whether they are overloaded or underloaded, and feed back a simple binary signal to their users, who adjust the number of packets they transmit. The window size in the transport-layer protocol, or a user's transmission rate, is typically what adjusts the system load. The binary feedback signal (1 = overloaded, 0 = underloaded) is carried in a bit in the packet header and interpreted by the user as 0: increase load, 1: decrease load.

The paper concentrates on analyzing various classes of increase/decrease algorithms, where the user's reaction to the received feedback can be additive by a constant (positive or negative) or multiplicative by a constant (greater than 1 for increase, between 0 and 1 for decrease). It uses several performance metrics, under the assumption that many users share a single resource with a synchronous feedback and control loop. These metrics include:
· Efficient usage of the resource, defined by the closeness of the total load to the knee point
· Fair allocation of the resource among the users
· Fast convergence: the time to reach the goal state and the size of the oscillation around it. Binary feedback prevents convergence to a single steady state and leads to oscillation.

It is found that additive increase with multiplicative decrease satisfies all of these goals.
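A tiny simulation, with arbitrary constants, shows why: starting from unequal loads, AIMD pulls the two users toward the equal-share point while oscillating around full utilization.

    # Two users with AIMD reacting to the same binary feedback (illustrative constants).
    A_INC, B_DEC, CAPACITY = 1.0, 0.5, 100.0
    x, y = 10.0, 70.0                       # unequal initial loads
    for _ in range(200):
        if x + y > CAPACITY:                # feedback 1: overloaded
            x, y = x * B_DEC, y * B_DEC     # multiplicative decrease shrinks the gap
        else:                               # feedback 0: underloaded
            x, y = x + A_INC, y + A_INC     # additive increase preserves the gap
    print(round(x, 1), round(y, 1))         # converges near equal shares, oscillating

Each decrease halves the gap between the users while each increase leaves it unchanged, which is exactly the convergence-to-fairness argument of the paper.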

However, I wonder whether the authors considered using a feedback signal of two bits instead of one; it is a very simple idea.
The first bit, as before, would represent the required reaction by the user (increase or decrease its load), and the second bit would tell the user whether to apply the defined constant or half of its value.
Ex: feedback 10 means decrease load by B/2
feedback 11 means decrease load by B
etc.
I think this would perform better while the feedback stays simple.

Thursday, September 3, 2009

Understanding BGP Misconfiguration

The paper offers a study of globally visible BGP misconfiguration. BGP is important for the reliability of the Internet, so any defect around it, such as route oscillation, increased convergence time, or misconfiguration of the routers that run BGP, can result in global connectivity problems like those caused by the improper route announcements of AS7007 and AS3561.

The main goals of the paper are to portray:
· How frequently misconfiguration occurs
· The impact of misconfiguration on global connectivity and routing load
· The causes of misconfiguration
· Finally, how it can be addressed

The authors analyzed BGP updates from 23 different vantage points and found that, each day, misconfiguration affects 0.2-1% of the global routing table and increases the route update load; however, only 4% of the improper announcements affected connectivity negatively. One source of such misconfiguration is the router vendors' software.

The authors focused on globally visible misconfiguration because of its wide disruption; it can be identified as short-lived changes in the BGP routing table that last less than a day. It can be classified into:

  • Origin misconfiguration: the unintentional insertion of a route into the BGP routing table, such as an AS's mistake in summarizing an address space
  • Export misconfiguration: an export that violates the AS's policy for exporting routes to BGP peers, for example an AS exporting a route received from one provider to another provider, thereby creating transit between its two providers.
    Such misconfiguration increases the routing load, on top of the load already growing with the Internet itself. It can disrupt connectivity partially or globally, and it threatens the AS's policy.

Misconfigurations are caused by slips (errors in the execution of a correct plan) and mistakes (errors due to an incorrect plan that is executed well).

Finally, the paper suggests several procedures to at least reduce the occurrence of such failures:

  • Designing user interfaces so that human errors are minimized
  • Modifying the BGP protocol to prevent misconfiguration errors; for example S-BGP, a proposed extension to the current BGP, imposes authorization and authentication on route announcements.

Interdomain Internet Routing


The Internet infrastructure is built through cooperation among ISPs of diverse internal structure and scope, called Tier-3 (local scope), Tier-2 (regional scope), or Tier-1 (global scope). The Internet comprises a number of different autonomous systems (ASes) that exchange route information using the BGP routing protocol (Border Gateway Protocol), while an IGP such as RIP, OSPF, IS-IS, or EIGRP can be used within each AS.

Forms of AS-AS interconnection
1) Provider-customer transit is the most common inter-AS relationship: the provider provides access to all destinations in its routing table and charges for forwarding packets.
ISPs charge customers based on data rate in two ways: fixed pricing, or a variable price according to bandwidth usage.

2) Peering is typically an inter-ISP relationship in which the providers give access to a subset of each other's routing tables (their own transit customers and internal ISP addresses). This kind of interconnection is charge-free and often established between competitors. It emerged because:
A number of Tier-1 ISPs formed peering relationships with each other to avoid a single Tier-1 controlling all Internet traffic (as would happen if only provider-customer transit relationships existed). This way, ASes belonging to different ISPs do not have to pay for forwarding packets between their customers. Peering also provides a direct path between ASes, which is necessary for better end-to-end performance.

As a result, advertising a route to a neighboring ISP means agreeing to forward packets toward that route, so an ISP must take care not to advertise a route that would force it to act as unpaid transit.

BGP was created to replace the EGP routing protocol, which allowed the Internet to become a decentralized system. It supports Classless Inter-Domain Routing (CIDR) and uses route aggregation to decrease routing table size. Its design goals are:
· Scalability: BGP must ensure that route advertisement scales well while parts of the network go up and down
· It must provide loop-free paths to a reasonable degree
· It must allow ASes to apply their policies in making local decisions, such as filtering and ranking of routes, while keeping each AS's policy undisclosed
· It must provide routing between systems under different administration
Routing Policy
1) Filtering, to save or make money for the ISP (the resulting export rules are sketched after this list):
· The most important thing for an ISP is to advertise the routes of its transit customers to as many other connected ASes as possible.
· If an ISP receives an advertisement from a peering ISP for a route to that peer's transit customer, it will not advertise this route to other peering ISPs, because that might cause a peering ISP to use the advertising ISP to reach that destination, which would consume the advertising ISP's resources without producing revenue.
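Taken together, these rules amount to the usual export policy, which might be sketched as follows; the relationship labels are my shorthand, not configuration syntax.

    def should_export(route_learned_from, neighbor):
        """Sketch of transit/peering export policy: customer routes go to everyone;
        peer- and provider-learned routes go only to customers (no free transit)."""
        if route_learned_from == "customer":
            return True                    # customers pay: advertise their routes widely
        return neighbor == "customer"      # never glue two peers/providers together

    assert should_export("peer", "peer") is False       # avoids unpaid transit
    assert should_export("customer", "provider") is True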

2) Ranking, to select which routes to import into the routing table.
Several attributes are considered when importing routes (a comparator sketch follows this list):
· The LOCAL PREF attribute gives priority to customer routes (since the ISP has a direct path to the customer), then peer routes, and finally provider routes.
· The AS path length attribute: I initially thought this meant the number of router hops, but it actually counts the number of ASes the route traverses.
· MED is used when two ASes are connected at multiple locations
· Routes learned via eBGP are preferred over those learned via iBGP
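This ranking order can be sketched as a sort key, assuming simplified attribute names; the real BGP decision process includes further tie-breakers (e.g., router ID).

    # Sketch of the route-selection order described above.
    def route_rank(route):
        return (
            -route["local_pref"],                        # 1. highest LOCAL PREF wins
            len(route["as_path"]),                       # 2. shortest AS path
            route["med"],                                # 3. lowest MED (same neighbor AS)
            0 if route["learned_via"] == "eBGP" else 1,  # 4. prefer eBGP over iBGP
        )

    routes = [
        {"local_pref": 100, "as_path": [7018, 3356], "med": 0, "learned_via": "iBGP"},
        {"local_pref": 200, "as_path": [3356],       "med": 0, "learned_via": "eBGP"},
    ]
    best = min(routes, key=route_rank)
    print(best["as_path"])  # -> [3356]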
BGP is unusual in using TCP as its transport protocol. BGP peering sessions are established by manual configuration between routers, creating a TCP session on port 179:
1. Establish a TCP session
2. Routers send OPEN messages
3. Routers exchange their tables (after applying ranking and filtering, of course)
4. Then routers send
· UPDATE messages (containing only the changes to the table since the last update) to add or withdraw routes
· or KEEPALIVE messages, sent periodically to maintain the connection (every 60 seconds by default); if these messages stop arriving, the session is terminated once the hold time expires
When BGP runs within an AS it is referred to as iBGP (interior BGP), used to exchange information about external routes; when it runs between ASes it is called eBGP (exterior BGP). Routers on the boundary of an AS that exchange information with another AS are called border or edge routers. In the Cisco operating system, iBGP routes have an administrative distance of 200, which is less preferred than either external BGP or any interior routing protocol. Other router implementations also prefer eBGP to iBGP, and IGPs to iBGP.
The paper focuses on an interesting subject, and I wish it had discussed it in more depth.