[net2-wg] ctp forwader bug

Philip Levis pal at cs.stanford.edu
Mon Aug 17 06:49:04 PDT 2009


On Aug 17, 2009, at 1:46 AM, Omprakash Gnawali wrote:

> On Sat, Aug 15, 2009 at 11:12 AM, Omprakash Gnawali<gnawali at usc.edu>  
> wrote:
>> On Sat, Aug 15, 2009 at 6:47 AM, Philip Levis<pal at cs.stanford.edu>  
>> wrote:
>>> On Aug 14, 2009, at 12:17 AM, Omprakash Gnawali wrote:
>>>
>>>> - ctp bugs/cleanup
>>>
>>> (written on a plane to London...)
>>>
>>> I'm sending some details because I won't be able to test code for  
>>> the next
>>> week (I have no motes). So I'm putting down what I've figured out.  
>>> Maybe
>>> this will be enough for someone else to figure it out, or I can  
>>> work on it
>>> in two weeks.
>>>
>>> Looking through the logs Om sent, it seems the problem is  
>>> acknowledgments.
>>> Specifically, acks aren't working so nodes are retransmitting  
>>> many, many
>>> times. For example, here's part of a log from node 3:
>>>
>>> 1250272888.261593 (241): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.277591 (242): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.309590 (243): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.341088 (244): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.373085 (245): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.389085 (246): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.436582 (247): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.453081 (248): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.564706 (249): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.644066 (250): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.644138 (251): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.671064 (252): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.671125 (253): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.671204 (254): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.671263 (255): NET_C_TREE_NEW_PARENT 31 30 43 86
>>>
>>> (brief referesher: WAITACK 25 11 3 0 means this is a unacked send  
>>> done,
>>> which is code 25, for packet #11 from node 3, with a next hop of  
>>> zero, the
>>> root)
>>>
>>> Meanwhile, the root has:
>>>
>>> 1250272888.249094 (1031): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.281088 (1032): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.313086 (1033): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.329088 (1034): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.361084 (1035): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.393081 (1036): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.424080 (1037): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.495139 (1040): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.548071 (1041): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.548307 (1043): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.563073 (1045): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.592218 (1047): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.634663 (1049): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272892.631463 (1081): NET_C_FE_RCV_MSG 21 12 3 2
>>>
>>> So the root is receiving all of these transmissions but for some  
>>> reason
>>> they're not being acked.
>>>
>>> Here is the relevant code from sendTask (lines 440-461):
>>>
>>>     else {
>>>        /* The basic forwarding/sending case. */
>>>        call CtpPacket.setEtx(qe->msg, gradient);
>>>        call CtpPacket.clearOption(qe->msg, CTP_OPT_ECN |  
>>> CTP_OPT_PULL);
>>>        if (call PacketAcknowledgements.requestAck(qe->msg) ==  
>>> SUCCESS) {
>>>          setState(ACK_PENDING);
>>>        }
>>>        else {
>>>          clearState(ACK_PENDING);
>>>        }
>>>        if (hasState(QUEUE_CONGESTED)) {
>>>          call CtpPacket.setOption(qe->msg, CTP_OPT_ECN);
>>>          clearState(QUEUE_CONGESTED);
>>>        }
>>>
>>>        subsendResult = call SubSend.send(dest, qe->msg, payloadLen);
>>>        if (subsendResult == SUCCESS) {
>>>          // Successfully submitted to the data-link layer.
>>>          setState(SENDING);
>>>          dbg("CtpForwarder", "%s: subsend succeeded with %p.\n",
>>> __FUNCTION__, qe->msg);
>>>          return;
>>>        }
>>>
>>> Here's the relevant code from sendDone (lines 501-520):
>>>
>>>   else if (hasState(ACK_PENDING) && !call
>>> PacketAcknowledgements.wasAcked(msg)) {
>>>     /* No ack: if countdown is not 0, retransmit, else drop the  
>>> packet. */
>>>     call LinkEstimator.txNoAck(call AMPacket.destination(msg));
>>>     call CtpInfo.recomputeRoutes();
>>>     if (--qe->retries) {
>>>       dbg("CtpForwarder", "%s: not acked, retransmit\n",  
>>> __FUNCTION__);
>>>       call CollectionDebug.logEventMsg(NET_C_FE_SENDDONE_WAITACK,
>>>                                         call
>>> CollectionPacket.getSequenceNumber(msg),
>>>                                         call
>>> CollectionPacket.getOrigin(msg),
>>>                                        call  
>>> AMPacket.destination(msg));
>>>       startRetxmitTimer(SENDDONE_NOACK_WINDOW,  
>>> SENDDONE_NOACK_OFFSET);
>>>     } else {
>>>        /* Hit max retransmit threshold: drop the packet. */
>>>        call SendQueue.dequeue();
>>>       clearState(SENDING);
>>>       startRetxmitTimer(SENDDONE_OK_WINDOW, SENDDONE_OK_OFFSET);
>>>
>>>        sendComplete(qe, msg, FALSE);
>>>     }
>>>   }
>>>
>>> What I would do to dig deeper is first determine if the ack  
>>> failure is on
>>> the send or sendDone. That is, is the ack bit of the transmitted  
>>> packet not
>>> set, such that the receiver does not generate an ack, or is that  
>>> an ack is
>>> generated but for some reason wasAcked isn't returning the right  
>>> thing.
>>>
>>> Or there could be some trivial bug in the code above that I can't  
>>> find
>>> (e.g., an || instead of an &&, etc.).
>>>
>>> Hope this helps...
>>>
>>> Om, I would recommend reverting to 1.18 until we figure this out.
>>
>>
>> Walked through 18-22. Reverted to 21 because that seems to be  
>> working;
>> the bug was introduced in 22.
>
>
> With 22, I noticed that the network runs into the fail ack send
> problem after a set amount of time. It happens simultaneously across
> the network:
> http://enl.usc.edu/~om_p/net2/ctpnew-1.22b/
> http://enl.usc.edu/~om_p/net2/ctpnew-1.22/

There seems to be a discrepancy between the logs and the plots. E.g.,  
for ctpnew-1.22, I see a few WAIT_ACK events for node 5 at the  
beginning of the plot, then 10 (t=600) minutes in the number spikes.  
But the log only has 37 seconds of data....

aeolus /tmp/ctpnew-1.22 -11-> head -1 messages-5.txt
1250455707.087995 (237): NET_C_FE_SENDDONE_WAITACK 25 6 5 0
aeolus /tmp/ctpnew-1.22 -12-> tail -1 messages-5.txt
1250455744.260685 (474): NET_C_FE_SENDQUEUE_EMPTY 50 0 0 0

What am I missing here?

Phil


More information about the net2-wg mailing list