[net2-wg] ctp forwader bug
Omprakash Gnawali
gnawali at usc.edu
Sun Aug 16 16:46:54 PDT 2009
On Sat, Aug 15, 2009 at 11:12 AM, Omprakash Gnawali<gnawali at usc.edu> wrote:
> On Sat, Aug 15, 2009 at 6:47 AM, Philip Levis<pal at cs.stanford.edu> wrote:
>> On Aug 14, 2009, at 12:17 AM, Omprakash Gnawali wrote:
>>
>>> - ctp bugs/cleanup
>>
>> (written on a plane to London...)
>>
>> I'm sending some details because I won't be able to test code for the next
>> week (I have no motes). So I'm putting down what I've figured out. Maybe
>> this will be enough for someone else to figure it out, or I can work on it
>> in two weeks.
>>
>> Looking through the logs Om sent, it seems the problem is acknowledgments.
>> Specifically, acks aren't working so nodes are retransmitting many, many
>> times. For example, here's part of a log from node 3:
>>
>> 1250272888.261593 (241): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.277591 (242): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.309590 (243): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.341088 (244): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.373085 (245): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.389085 (246): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.436582 (247): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.453081 (248): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.564706 (249): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.644066 (250): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.644138 (251): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.671064 (252): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.671125 (253): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.671204 (254): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.671263 (255): NET_C_TREE_NEW_PARENT 31 30 43 86
>>
>> (brief referesher: WAITACK 25 11 3 0 means this is a unacked send done,
>> which is code 25, for packet #11 from node 3, with a next hop of zero, the
>> root)
>>
>> Meanwhile, the root has:
>>
>> 1250272888.249094 (1031): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.281088 (1032): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.313086 (1033): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.329088 (1034): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.361084 (1035): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.393081 (1036): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.424080 (1037): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.495139 (1040): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.548071 (1041): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.548307 (1043): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.563073 (1045): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.592218 (1047): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.634663 (1049): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272892.631463 (1081): NET_C_FE_RCV_MSG 21 12 3 2
>>
>> So the root is receiving all of these transmissions but for some reason
>> they're not being acked.
>>
>> Here is the relevant code from sendTask (lines 440-461):
>>
>> else {
>> /* The basic forwarding/sending case. */
>> call CtpPacket.setEtx(qe->msg, gradient);
>> call CtpPacket.clearOption(qe->msg, CTP_OPT_ECN | CTP_OPT_PULL);
>> if (call PacketAcknowledgements.requestAck(qe->msg) == SUCCESS) {
>> setState(ACK_PENDING);
>> }
>> else {
>> clearState(ACK_PENDING);
>> }
>> if (hasState(QUEUE_CONGESTED)) {
>> call CtpPacket.setOption(qe->msg, CTP_OPT_ECN);
>> clearState(QUEUE_CONGESTED);
>> }
>>
>> subsendResult = call SubSend.send(dest, qe->msg, payloadLen);
>> if (subsendResult == SUCCESS) {
>> // Successfully submitted to the data-link layer.
>> setState(SENDING);
>> dbg("CtpForwarder", "%s: subsend succeeded with %p.\n",
>> __FUNCTION__, qe->msg);
>> return;
>> }
>>
>> Here's the relevant code from sendDone (lines 501-520):
>>
>> else if (hasState(ACK_PENDING) && !call
>> PacketAcknowledgements.wasAcked(msg)) {
>> /* No ack: if countdown is not 0, retransmit, else drop the packet. */
>> call LinkEstimator.txNoAck(call AMPacket.destination(msg));
>> call CtpInfo.recomputeRoutes();
>> if (--qe->retries) {
>> dbg("CtpForwarder", "%s: not acked, retransmit\n", __FUNCTION__);
>> call CollectionDebug.logEventMsg(NET_C_FE_SENDDONE_WAITACK,
>> call
>> CollectionPacket.getSequenceNumber(msg),
>> call
>> CollectionPacket.getOrigin(msg),
>> call AMPacket.destination(msg));
>> startRetxmitTimer(SENDDONE_NOACK_WINDOW, SENDDONE_NOACK_OFFSET);
>> } else {
>> /* Hit max retransmit threshold: drop the packet. */
>> call SendQueue.dequeue();
>> clearState(SENDING);
>> startRetxmitTimer(SENDDONE_OK_WINDOW, SENDDONE_OK_OFFSET);
>>
>> sendComplete(qe, msg, FALSE);
>> }
>> }
>>
>> What I would do to dig deeper is first determine if the ack failure is on
>> the send or sendDone. That is, is the ack bit of the transmitted packet not
>> set, such that the receiver does not generate an ack, or is that an ack is
>> generated but for some reason wasAcked isn't returning the right thing.
>>
>> Or there could be some trivial bug in the code above that I can't find
>> (e.g., an || instead of an &&, etc.).
>>
>> Hope this helps...
>>
>> Om, I would recommend reverting to 1.18 until we figure this out.
>
>
> Walked through 18-22. Reverted to 21 because that seems to be working;
> the bug was introduced in 22.
With 22, I noticed that the network runs into the fail ack send
problem after a set amount of time. It happens simultaneously across
the network:
http://enl.usc.edu/~om_p/net2/ctpnew-1.22b/
http://enl.usc.edu/~om_p/net2/ctpnew-1.22/
- om_p
More information about the net2-wg
mailing list