TCP evaluation: Round table discussion notes
From WiL
Notes taken during meeting on scenarios for evaluating TCP.
Introduction and scope
9:00-9:30 Nov 8, Lachlan presents Lit Review of TCP Evaluation Core Framework (See agenda)
Lachlan:
- Try to describe 1 or 2 scenarios (being very specific, though doing it broadly is OK) so that people can use them independently. This includes a set of metrics: e.g. distribution of RTTs, rate (and how to measure rate, e.g. averaging interval), cross traffic (defined by inter-arrival times etc.)
- Try to "scope out" the space of other tests for further work.
- To sort and discuss contentious issues, but put limits on how long discussion lasts and identify points of contention.
- Reproducibility vs realism. Qualitative reproducibility, not necessarily quantitative. Require simple tests to draw hypotheses, tested with realistic tests.
Larry: Perhaps there is a need for gating criteria of basic tests so you can trust their environment. Suggest very standard tests as a light-weight validation of the experimenter's test set-up. E.g., Doug and Injong (PFLDnet'06) had done nominally the same experiment, with qualitatively different results. Had to do some work to find that this was due to different "socket buffers."
Sally: Suggested a "basic set of tests" that needs to be considered when reviewing a paper. Can also have extra tests and an explanation of why "this" scenario is not important. There needs to be a wide-ranging test suite, using set parameters. Suggested some discussion of broader needs and parameterization to be included in this meeting. Qualitative similarity of results is acceptable, if getting quantitatively identical results would be "too much" harder.
Lars: Outcome: why tests chosen, as well as what tests. Open testbed. Trust issues, support resources, etc. Testing algorithms vs testing bus bandwidth within end systems. Dummynet on sender: data never goes out of kernel. Makes a difference.
- "The way I do measurements affects the results". Tuning scenarios.
- Testbed/simulation setup: doesn't have an impact, or we understand why it does. Moore's law means links grow as fast as server speeds.
- Bandwidth growing faster than CPU. Memory Bus a problem in end systems. Quantum jumps in bandwidth. CPU in software routers. Power issues; leakage current important because of small feature sizes.
Simulation Parameters
9:40am Cesar leads discussion on specific parameters and discusses papers from his lit-review (see agenda)
Sally: lobbies for dumbbell topology w/ 3 access links on each side; 6 different access link delays, as a concrete scenario
Lars (and Grenville): should add GPRS or UMTS to the scenarios, which might have varying RTTs. Even wired Ethernet, if you do power saving, can have varying bandwidth. Grenville and other DSL folks observe that asymmetric bandwidths are relevant (since DSL links often look like this).
- bandwidths: use list-3, and add 400 Mbps; delays: (cf. Sally above)
- buffers: (did discussion of what cisco routers do, etc.)
- outcomes (from this meeting or the next): PFLDnet paper, or ID (Internet-Draft)
Sally: wants ID
Romaric?: In France they use aggregation level; kind of demand-profile; e.g. 2 vs. 20 flows competing for bottleneck;
Lachlan: pacing effects - flows from lots of low-bw links are different than flows from a few high-bw links
Dunn: What's the prevalence of RED turned on, and where, in "the NET"?
Sally: Wishlist
- Different bandwidths/scenarios for different RTTs.
- Within university: High bandwidth, short RTT.
- Satellite: something else, large RTT. Perhaps each being dumbbell. Different loss, ...
e.g, If aim is RTT fairness, could increase AI rate proportional to RTT. Would be very aggressive on satellite links.
- One parameter to vary for all scenarios: level of congestion, caused by number of flows. 0.1%, 1%, 10% loss rates, or analogous for non-loss-based.
- What distribution of background sizes?
Bandwidths
Lars runs discussion on Measure of Utilization.
Bandwidths increasing, sensor networks going to low rates. E.g., in Africa, there are 100 mobile phones per PC so GSM is very important. Nokia trying to make phone TCP stacks better, better use of hot spots. Core group.
- Encourage people to do smallest (56k), and largest possible, and then some from the selected list.
- Ethernet power saving: big jumps in bandwidth on packet-by-packet basis.
- GPRS: cell shared, causing problems.
- Wireless: changes RTT and bandwidth rapidly. Want Linux-based phones.
David Wei: DSL too. Very asymmetric.
Lars: These special links on to-do list. Not clear how to model a 3G link.
Lachlan: Cable modems have delay when underloaded. Those on [3] plus 400M. 400M there to keep below end-system hardware limitations.
Lars: link table fills up causing delay
Larry: In the ideal world, find where "non-linear" effects come in, and the related cause. Test suite should explain what to expect.
RTTs
1.40pm Sally presents slides (see agenda).
- Look at response function: analytically, or sim or testbed (like they did w/ HS-TCP)
- Look at scenarios we've discussed, with a 50/50 split of flows between standard (NewReno, w/ or w/o SACK and/or ECN) and new-CC; plot x = #sessions, y = aggregate throughput of standard, with a separate line for new-CC; over a range of RTTs
- Look at different congested links, e.g. campus, WAN/trans-oceanic, satellite, to get different mixes of RTT distributions, and small/medium/large size flows
- As there will be staggered start times, ignore first 1/2 of simulation
- For sims, thinking of running with both queues in bytes and queues in pkts (e.g. Cisco routers). TFRC-SP (small pkts): behavior is a lot different for drop-tail, doing bytes vs. packets, so sometimes this is a sensitive parameter; so is AQM sometimes
- Param: bandwidth "stolen" from TCP. Sometimes standard couldn't use the b/w anyway; don't penalize new-CC for stuff standard couldn't use anyway. So run with groupA and groupB both NewReno, and then A = NewReno, B = new-CC
Wishlist (beyond basics):
- distribution of flow completions times (important if testing new slow-start algorithms, to see if they mess up flow-completion distribution of standard)
- friendliness of slower-reacting mechanisms (like tfrc)- want to evaluate how it reacts to transient congestion events- cf. tfrc paper
Lachlan: this works against having a standard set of tests that everything is evaluated on, right? (i.e. customizing the test suite for each new proposal). Quicker-start, slow-response-function, small-pkts (pkt/byte mode, w/ and w/o AQM), things-for-delay-based (whatever results in its worst-case behavior) are the 3-5 typical examples.
Hence, would like dimensionality reduction, so one can compare algorithms with a small set of numbers, i.e. likes a "benchmark" format.
Sally: prefers "BCP test scenarios",
Lachlan: or "standardized test scenarios". Impact on multimedia: probably reduce it to delay and drop rates, as impact on multimedia might be quantified in this way. Also had in mind, w.r.t. "cross-traffic", not standard vs. new, but short vs. long flows, either all-new-CC or mixed scenarios. Maybe flow-completion time is a good metric here; it indicates whether it is hard for a new flow to start.
Cesar: back to lit-review; discussion of perhaps plots of distributions, with x = file size and y = completion time
Sally: Range of RTTs. Distribution of per-packet RTTs. Measurements not necessarily for congested links.
Larry: Suggests concrete outcome, i.e., someone tests the proposed RTTs to ensure they match different scenarios. Dumbbell, three access links on each side, 9 different RTTs. To liaise with someone else about achieving it with WAN-in-Lab constraints. Cross traffic RTTs or "target flow" RTTs?
Romaric: 3 ranges of RTTs: local, metro, international may be enough.
Larry: 2D thing: distribution of RTTs for background, distribution for new protocol.
Ihsan: Cross traffic agreed to be heavy tailed. How is it changing? To be discussed tomorrow afternoon.
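As a rough illustration of the dumbbell suggestion above (three access links on each side, nine RTTs), the sketch below enumerates the round-trip propagation delay for every left/right pair. The delay values and the function name are hypothetical, picked only so that the nine RTTs come out distinct; they are not values agreed at the meeting.

```python
from itertools import product

def dumbbell_rtts_ms(left_delays_ms, right_delays_ms, bottleneck_delay_ms):
    """Round-trip propagation delay (ms) for each (left, right) access-link pair."""
    rtts = {}
    for (i, dl), (j, dr) in product(enumerate(left_delays_ms),
                                    enumerate(right_delays_ms)):
        one_way = dl + bottleneck_delay_ms + dr   # access + bottleneck + access
        rtts[(i, j)] = 2 * one_way                # same path assumed in both directions
    return rtts

# Hypothetical one-way access-link delays (ms); NOT the meeting's values.
left = [0, 12, 25]
right = [2, 37, 75]
rtts = dumbbell_rtts_ms(left, right, bottleneck_delay_ms=1)

for pair, rtt in sorted(rtts.items(), key=lambda kv: kv[1]):
    print(f"left {pair[0]} -> right {pair[1]}: RTT {rtt} ms")
```

A check such as `len(set(rtts.values())) == 9` is one way to confirm that a candidate set of six access delays really yields nine different RTTs.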
Buffers
Larry: Cisco target has been BDP. Know bandwidth (if not virtualized). Typically assume 100ms.
- Got into trouble (6500s) aimed at LANs. Cheaper than routers, do layer 3 switching. Didn't have BDP buffers. Cisco was blamed
- Appenzeller/McKeown core links don't need such large buffers.
- Refuted by GA Tech
- RED vs TailDrop makes a big difference
- Hardware people want to use this to justify small buffers.
What deployment of RED / ECN?
Lars: More per-packet complexity reduces need for buffers. Where is tradeoff point?
Larry will ask how many links have RED/ECN enabled.
Target "most buffering is output buffer"
Are buffers packet or byte?
- Buffers carve into three classes of size. Pools of each size. Re-carve when physical MTU changes
- QoS/DiffServ makes big differences
Small packets (may be) less likely to be dropped, as can use large buffer if no small buffers.
Sally: was told Juniper: Buffer for packet headers, main packet stored elsewhere. Larry: Varies a lot
Should we test byte-based or packet based buffers? Both in simulator. Whichever hardware does in testbed.
ToDo: Test individual capabilities of testbed.
Conclusions: 100ms and 10ms buffers.
Different scenarios for DSL links.
Sally: Measurement study says 26% have some AQM.
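To make the "100ms and 10ms buffers" conclusion concrete, here is a minimal sketch that converts a buffer expressed in milliseconds of line rate into bytes and packets. The link rates, the 1500-byte packet size, and the function name are assumptions for illustration only.

```python
def buffer_size(link_bps, buffer_ms, packet_bytes=1500):
    """Return (bytes, whole packets) needed to hold buffer_ms of traffic at link_bps."""
    # bits/s * ms / (8 bits per byte * 1000 ms per s)
    buf_bytes = link_bps * buffer_ms // 8000
    return buf_bytes, buf_bytes // packet_bytes

# Hypothetical link rates; 1500-byte packets are an assumption.
for link_bps in (100_000_000, 1_000_000_000):
    for buffer_ms in (100, 10):
        size_bytes, size_pkts = buffer_size(link_bps, buffer_ms)
        print(f"{link_bps // 1_000_000} Mbps, {buffer_ms} ms: "
              f"{size_bytes} bytes, {size_pkts} packets")
```

Reporting both units matters here because, as noted above, byte-based and packet-based queues can behave differently.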
Duration
Long enough that long-lived flows not in slow-start, take statistics after that. Want multiple congestion epochs.
Choose in advance.
David: Choose duration based on file size. Cesar: DVD size
Deriving duration: For regular TCP, congestion epoch is W/2 RTTs, could be an hour.
Guideline: 10 mins for 2.5G networks, shorter for others.
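The "W/2 RTTs" figure can be turned into wall-clock time with a short calculation. The sketch below assumes a single standard-TCP flow whose peak window W equals the bandwidth-delay product, 1500-byte packets, and a 100 ms RTT; all of these are illustrative assumptions, as is the function name.

```python
def reno_epoch_seconds(link_bps, rtt_s, packet_bytes=1500):
    """Approximate time between losses for one standard-TCP flow filling the link."""
    window_pkts = link_bps * rtt_s / (8 * packet_bytes)  # W, taken as the BDP in packets
    epoch_rtts = window_pkts / 2                         # grow from W/2 back to W, 1 pkt per RTT
    return epoch_rtts * rtt_s

# Assumed link rates and RTT, for illustration only.
for link_bps in (155e6, 2.5e9, 10e9):
    seconds = reno_epoch_seconds(link_bps, rtt_s=0.1)
    print(f"{link_bps / 1e9:g} Gbps, 100 ms RTT: {seconds / 60:.1f} min per epoch")
```

Under these assumptions a 10 Gbps, 100 ms path gives an epoch of roughly 69 minutes, which is in line with the "could be an hour" remark.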
Measures of utilization
Why is link utilization important at all? What relation between router metrics and user metrics? Correlation between packet loss and RTT etc?
Sally: always measure delay and pkt-drop-rates when we measure link-utilization. Need to match up router metrics vs. user-metrics; Should always give measure of all.
Romaric: Study aggregation level. Number of flows causing the congestion. Grid context has fewer flows.
Q: Is the sum of the receive rates of the flows using each link adequate? How about utilization fluctuation?
Injong: you can get high utilization by just being very aggressive; how do we measure OK-ness?
Sally: Aggregate utilization of new protocol vs Standard may be useful.
Lars: How to condense per-flow metrics into single measures, rather than start with utilization?
Scenario with transients. Fluctuations during transients, or "steady state"?
Variance in sending rate of individual flow. Coefficient of variation. Compare with variation of aggregate rates.
One option: coefficient of variation.
Is fluctuation of aggregate actually useful, or are "implications" like loss or jitter more important? General feeling is that aggregate utility isn't needed.
Is "stability" important? Yes.
Metrics: Per flow variation in sending rate (cf smoothness), variation in queue size, loss rate, mean delay.
Transients also important (coming and going).
Multicast. Increasing, with IPTV (within a carrier). Can set goals for multicast, but can't quantify.
Aggregate throughput: from second half(?), per algorithm throughput.
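A minimal sketch of the per-flow versus aggregate variation comparison discussed above, using the coefficient of variation (standard deviation divided by mean) and only the second half of each trace. The rate samples are made up, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean."""
    m = mean(samples)
    return pstdev(samples) / m if m else float("nan")

def second_half(samples):
    """Drop the first half of a trace to skip start-up transients."""
    return samples[len(samples) // 2:]

# Made-up per-interval sending rates (Mbps) for two flows sharing a link.
flows = {
    "flow_a": [0, 2, 4, 6, 4, 6, 4, 6],
    "flow_b": [0, 8, 6, 4, 6, 4, 6, 4],
}

per_flow_cov = {name: coefficient_of_variation(second_half(rates))
                for name, rates in flows.items()}
aggregate = [sum(rates) for rates in zip(*flows.values())]
aggregate_cov = coefficient_of_variation(second_half(aggregate))

print(per_flow_cov, aggregate_cov)
```

In this toy trace each flow oscillates while the aggregate stays flat, which is the kind of case where per-flow smoothness and aggregate fluctuation give different pictures.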
Impact of new congestion control on Cross-traffic
Sally's proposal:
- Response functions: rate vs packet drop (analytically). Doesn't capture everything.
- Take previous scenarios. Half traffic standard TCP, half new congestion control. Individual experiments, with n times the amount of traffic.
- Aggregate throughput of standard vs non-standard.
- Range of RTTs, different kinds of congested link: in-campus, satellite. Friendliness may depend on RTT.
- Range of connection sizes, not just large and small. Add to burstiness.
- Stagger start times, ignore first half(?) of simulation.
- Simulate with queues in packets and queues in bytes. Testbed: what we have. TFRC radically different for queues in bytes vs queues in packets. If it is sensitive, must look at both.
- "Bandwidth stolen", used in HS-TCP.
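For reference, an analytic response function for standard TCP can be sketched with the widely used simplified square-root approximation, window ≈ sqrt(3/(2p)) packets per RTT. This is a swapped-in textbook model, not necessarily the one Sally had in mind; it ignores timeouts, so it is optimistic at high loss rates such as 10%. The RTT and packet size are assumptions.

```python
from math import sqrt

def reno_window_pkts(loss_rate):
    """Average window (packets) under the simplified square-root model."""
    return sqrt(3.0 / (2.0 * loss_rate))

def reno_rate_bps(loss_rate, rtt_s, packet_bytes=1500):
    """Average sending rate in bits/s: one window of packets per RTT."""
    return reno_window_pkts(loss_rate) * packet_bytes * 8 / rtt_s

# Loss rates from the congestion levels mentioned earlier; 100 ms RTT is assumed.
for p in (0.001, 0.01, 0.1):
    print(f"p={p}: {reno_window_pkts(p):.1f} pkts/RTT, "
          f"{reno_rate_bps(p, rtt_s=0.1) / 1e6:.2f} Mbps")
```

Plotting a new algorithm's measured rate against this curve over a range of drop rates is one way to read off its aggressiveness relative to standard TCP.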
Wish list:
- Flow completion times when traffic uses some fraction of new vs old.
- Mechanisms for "friendliness" of slowly-responding algorithms.
- Look at transient events, e.g. route change: standard TCP halves rate, slower things reduce more slowly.
- Different worst case for different proposals.
Be clear about what the tests don't cover: faster slow start, slowly responding, delay based / hybrid, small packet protocols.
Impact on multimedia: Do we want multi-media or just average delay/drop rates? Also smoothness of flow available. Timeouts, jitter.
Distributions of delay etc: Mean, 5th and 95th percentiles?
To compare: keep everything fixed except buffer size. Plot average throughput vs average delay.
Settings: All flows using new algorithm. Half use standard alg, half new.
Vary traffic intensity. Delay/throughput tradeoff with variable queue size.
Also vary proportion of old/new algorithms. Also different bandwidths. RTT distributions (WAN/campus/satellite).
Impact of UDP. Voice calls, "next gen bittorrent", people who "don't play nice".
Suggest: Make "all" data available online, whenever we do data reduction.
Decision: 5th, 25th, 50th, 75th, 95th percentiles of file completion times.
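The agreed percentile summary could be computed along the following lines. This sketch uses Python's `statistics.quantiles` with the "inclusive" interpolation method; the interpolation rule and the sample data are assumptions, since the notes do not specify either.

```python
from statistics import quantiles

PERCENTILES = (5, 25, 50, 75, 95)

def completion_time_percentiles(times_s, percentiles=PERCENTILES):
    """Map each requested percentile to its value in the completion-time sample."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(times_s, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in percentiles}

# Made-up file completion times in seconds (1, 2, ..., 101).
sample_times = list(range(1, 102))
summary = completion_time_percentiles(sample_times)
print(summary)
```

Publishing the raw completion times alongside such a summary would fit the suggestion above to make all data available whenever data reduction is done.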
Convergence times
Metrics
- Delta fair: time to go from shares of 1/101 and 100/101 to shares of (1+delta)/2 and (1-delta)/2 of the rate.
- Epsilon fair: go from "maximally unfair state" 1, 0 (drop first of slow start, so one pk per RTT) to within epsilon of the other.
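A minimal sketch of how a delta-fair convergence time might be extracted from two measured rate traces. It assumes the two flows start near the 100/101 and 1/101 split and reports the first sample at which the larger share is at most (1+delta)/2; whether the flows must then stay within that bound is left open in the notes and is not checked here. The traces and the function name are hypothetical.

```python
def delta_fair_convergence_time(times, rate_a, rate_b, delta=0.1):
    """Time from the first sample until the larger share first drops to (1+delta)/2."""
    threshold = (1 + delta) / 2
    for t, a, b in zip(times, rate_a, rate_b):
        total = a + b
        if total <= 0:
            continue                      # no traffic in this interval
        if max(a, b) / total <= threshold:
            return t - times[0]
    return None                           # never converged within the trace

# Hypothetical traces: flow A starts with 100/101 of the rate, flow B with 1/101.
times = [0, 1, 2, 3, 4]
rate_a = [100, 80, 65, 54, 50]
rate_b = [1, 21, 36, 47, 51]
print(delta_fair_convergence_time(times, rate_a, rate_b, delta=0.1))
```

The time unit is whatever the trace uses; expressing it in RTTs rather than seconds is one way to keep the RTT dependence noted below explicit.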
Convergence with slow start, and without. (E.g., drop second packet, set slow start threshold to 2)
Capture effect of RTT. No goal independent of RTT.
Coupled with completion times.
Is it repeatable? Butterfly effect. Perhaps specify queue size when flow starts.
What time period to measure over: over one RTT, less than
- RTT different
- Measured
- Always gives "meaningful" value (What does that mean? When is "infinity" meaningful)
"non-linear" -- want to measure in a wide range of situations. Flow completion times.
Impact on responsiveness
What does it mean? Completion times of finite sized flows, against single infinitely long flow.
Protocols more or less friendly to less aggressive flows.
Doug: flow completion time vs size. Sally: 10pk, 100pk, 1000pk, 10000pk
Mix of RTTs for completing flows. How do we scale with bandwidth?
Chart completion times 0.01, .1, 1, 10 RTT?
10% background traffic. Details to be discussed tomorrow.
Averages over multiple runs
Needed? To avoid if possible.
With slow start, small changes in start time make big differences. Drop pattern also stochastic. Average over either a long time or many runs.
Tests with Matlab? Clean environment, where everything can be controlled. Consequence of non-linearity. Test tools as well as algorithms.
(Control of experimental conditions needs to be improved. Implementation of stacks etc. still a problem.)
Aim: scalability. How do fairness, convergence time etc scale. Should we discuss how properties perform as bandwidth scales?
Scenarios
- Generic scenario --
- regular TCP cross traffic -- Cesar
- Delay/throughput tradeoff as fn of queue size -- Sally
- Transients: sudden release of bandwidth, sudden arrival of many flows -- Romaric
- Convergence times: completion time of one flow -- Lachlan
- intra-protocol fairness / inter-RTT fairness -- Sangtae
Each to write up by email: what the test is and why we do it, what they think we have agreed upon or can agree upon, what might be controversial, and what is on the to-do list.
Pseudo-random number generator available.
- Efficiency, stability, interprotocol fairness, ... (See Cesar's second PDF)