[net2-wg] ctp forwader bug

Omprakash Gnawali gnawali at usc.edu
Sun Aug 16 16:46:54 PDT 2009


On Sat, Aug 15, 2009 at 11:12 AM, Omprakash Gnawali<gnawali at usc.edu> wrote:
> On Sat, Aug 15, 2009 at 6:47 AM, Philip Levis<pal at cs.stanford.edu> wrote:
>> On Aug 14, 2009, at 12:17 AM, Omprakash Gnawali wrote:
>>
>>> - ctp bugs/cleanup
>>
>> (written on a plane to London...)
>>
>> I'm sending some details because I won't be able to test code for the next
>> week (I have no motes). So I'm putting down what I've figured out. Maybe
>> this will be enough for someone else to figure it out, or I can work on it
>> in two weeks.
>>
>> Looking through the logs Om sent, it seems the problem is acknowledgments.
>> Specifically, acks aren't working so nodes are retransmitting many, many
>> times. For example, here's part of a log from node 3:
>>
>> 1250272888.261593 (241): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.277591 (242): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.309590 (243): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.341088 (244): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.373085 (245): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.389085 (246): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.436582 (247): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.453081 (248): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.564706 (249): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.644066 (250): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.644138 (251): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.671064 (252): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.671125 (253): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.671204 (254): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>> 1250272888.671263 (255): NET_C_TREE_NEW_PARENT 31 30 43 86
>>
>> (brief referesher: WAITACK 25 11 3 0 means this is a unacked send done,
>> which is code 25, for packet #11 from node 3, with a next hop of zero, the
>> root)
>>
>> Meanwhile, the root has:
>>
>> 1250272888.249094 (1031): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.281088 (1032): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.313086 (1033): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.329088 (1034): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.361084 (1035): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.393081 (1036): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.424080 (1037): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.495139 (1040): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.548071 (1041): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.548307 (1043): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.563073 (1045): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.592218 (1047): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272888.634663 (1049): NET_C_FE_RCV_MSG 21 11 3 1
>> 1250272892.631463 (1081): NET_C_FE_RCV_MSG 21 12 3 2
>>
>> So the root is receiving all of these transmissions but for some reason
>> they're not being acked.
>>
>> Here is the relevant code from sendTask (lines 440-461):
>>
>>     else {
>>        /* The basic forwarding/sending case. */
>>        call CtpPacket.setEtx(qe->msg, gradient);
>>        call CtpPacket.clearOption(qe->msg, CTP_OPT_ECN | CTP_OPT_PULL);
>>        if (call PacketAcknowledgements.requestAck(qe->msg) == SUCCESS) {
>>          setState(ACK_PENDING);
>>        }
>>        else {
>>          clearState(ACK_PENDING);
>>        }
>>        if (hasState(QUEUE_CONGESTED)) {
>>          call CtpPacket.setOption(qe->msg, CTP_OPT_ECN);
>>          clearState(QUEUE_CONGESTED);
>>        }
>>
>>        subsendResult = call SubSend.send(dest, qe->msg, payloadLen);
>>        if (subsendResult == SUCCESS) {
>>          // Successfully submitted to the data-link layer.
>>          setState(SENDING);
>>          dbg("CtpForwarder", "%s: subsend succeeded with %p.\n",
>> __FUNCTION__, qe->msg);
>>          return;
>>        }
>>
>> Here's the relevant code from sendDone (lines 501-520):
>>
>>   else if (hasState(ACK_PENDING) && !call
>> PacketAcknowledgements.wasAcked(msg)) {
>>     /* No ack: if countdown is not 0, retransmit, else drop the packet. */
>>     call LinkEstimator.txNoAck(call AMPacket.destination(msg));
>>     call CtpInfo.recomputeRoutes();
>>     if (--qe->retries) {
>>       dbg("CtpForwarder", "%s: not acked, retransmit\n", __FUNCTION__);
>>       call CollectionDebug.logEventMsg(NET_C_FE_SENDDONE_WAITACK,
>>                                         call
>> CollectionPacket.getSequenceNumber(msg),
>>                                         call
>> CollectionPacket.getOrigin(msg),
>>                                        call AMPacket.destination(msg));
>>       startRetxmitTimer(SENDDONE_NOACK_WINDOW, SENDDONE_NOACK_OFFSET);
>>     } else {
>>        /* Hit max retransmit threshold: drop the packet. */
>>        call SendQueue.dequeue();
>>       clearState(SENDING);
>>       startRetxmitTimer(SENDDONE_OK_WINDOW, SENDDONE_OK_OFFSET);
>>
>>        sendComplete(qe, msg, FALSE);
>>     }
>>   }
>>
>> What I would do to dig deeper is first determine if the ack failure is on
>> the send or sendDone. That is, is the ack bit of the transmitted packet not
>> set, such that the receiver does not generate an ack, or is that an ack is
>> generated but for some reason wasAcked isn't returning the right thing.
>>
>> Or there could be some trivial bug in the code above that I can't find
>> (e.g., an || instead of an &&, etc.).
>>
>> Hope this helps...
>>
>> Om, I would recommend reverting to 1.18 until we figure this out.
>
>
> Walked through 18-22. Reverted to 21 because that seems to be working;
> the bug was introduced in 22.


With 22, I noticed that the network runs into the fail ack send
problem after a set amount of time. It happens simultaneously across
the network:
http://enl.usc.edu/~om_p/net2/ctpnew-1.22b/
http://enl.usc.edu/~om_p/net2/ctpnew-1.22/

- om_p



More information about the net2-wg mailing list